infra-nimbus/ansible/roles/nimbus-stats
Miran 2b079f1774
fix various typos in comments and roles
2022-07-04 15:07:55 +02:00
..
defaults
files fix various typos in comments and roles 2022-07-04 15:07:55 +02:00
handlers
tasks
templates
README.md

README.md

Description

This role defined a simple service which lists the current state of the nodes in the fleet. It does it by collecting most recent messages from the logs stored in ElasticSearch cluster and saves them in a JSON file which is then hosted publicly via Nginx.

Configuration

The only setting that should be usually changed is the domain:

nimbus_stats_domain: my-amazing-domain.example.org
# To query for ElasticSearch Load Blancer to query
consul_service_name: elasticsearch-lb
consul_service_tag: my-logs-cluster-name

Script

The collect.py script queries ElasticSearch for the logs we want to publish.

This script is by default put under /usr/local/bin/collect_nimbus_stats.py and is used by the /usr/local/bin/save_nimbus_stats.py in a cron job.

Usage: collect.py [options]

This script collects latest log entries for provided messages from all nodes
in a Nimbus fleet

Options:
  -h, --help            show this help message and exit
  -i ES_INDEX, --index=ES_INDEX
                        Patter for matching indices. (logstash-2019.04.18)
  -m MESSAGES, --messages=MESSAGES
                        Messages to query for. (['Fork chosen', 'Attestation
                        received', 'Slot start'])
  -H ES_HOST, --host=ES_HOST
                        ElasticSearch host. (localhost)
  -P ES_PORT, --port=ES_PORT
                        ElasticSearch port. (9200)
  -p PROGRAM, --program=PROGRAM
                        Program to query for. (*beacon-node-*)
  -s SINCE, --since=SINCE
                        Period for which to query logs. (now-15m)
  -S PAGE_SIZE, --page-size=PAGE_SIZE
                        Size of results page. (10000)
  -f FLEET, --fleet=FLEET
                        Fleet to query for. (nimbus.test)
  -t TIMEOUT, --timeout=TIMEOUT
                        Connection timeout in seconds. (120)
  -l LOG_LEVEL, --log-level=LOG_LEVEL
                        Logging level. (INFO)
  -o OUTPUT_FILE, --output-file=OUTPUT_FILE
                        File to which write the resulting JSON.

Example: collect -i logstash-2019.03.01 output.json

Timer

The script runs on a systemd timer which can be checked with:

 $ sudo systemctl list-timers -a nimbus-stats.timer
NEXT                         LEFT     LAST                         PASSED       UNIT               ACTIVATES
Wed 2020-02-19 10:50:00 UTC  37s left Wed 2020-02-19 10:45:00 UTC  4min 21s ago nimbus-stats.timer nimbus-stats.service

Which triggers the nimbus-stats service:

 $ sudo systemctl status nimbus-stats.service    
● nimbus-stats.service - Generates stats for Nimbus cluster.
   Loaded: loaded (/lib/systemd/system/nimbus-stats.service; static; vendor preset: enabled)
   Active: inactive (dead) since Wed 2020-02-19 10:47:24 UTC; 2min 33s ago
     Docs: https://github.com/status-im/infra-role-systemd-timer
  Process: 24950 ExecStart=/usr/local/bin/nimbus-stats (code=exited, status=0/SUCCESS)
 Main PID: 24950 (code=exited, status=0/SUCCESS)

Feb 19 10:47:21 master-01.aws-eu-central-1a.nimbus.test systemd[1]: Starting Generates stats for Nimbus cluster....
Feb 19 10:47:22 master-01.aws-eu-central-1a.nimbus.test nimbus-stats[24950]: [INFO]: Querying fleet: nimbus.test
Feb 19 10:47:24 master-01.aws-eu-central-1a.nimbus.test nimbus-stats[24950]: [INFO]: Found matching logs: 10000
Feb 19 10:47:24 master-01.aws-eu-central-1a.nimbus.test nimbus-stats[24950]: [INFO]: Saving to file: /var/www/nimbus/nimbus_stats.json
Feb 19 10:47:24 master-01.aws-eu-central-1a.nimbus.test systemd[1]: Started Generates stats for Nimbus cluster..

Context

For more details see: https://github.com/status-im/infra-nimbus/issues/1