Sunbird Monitoring

Monitoring System overview

High level architecture diagram

What resources are monitored? What metrics are scraped?

  • CPU usage of Pods and Nodes

  • Memory usage of Pods and Nodes

  • Disk usage of Pods and Nodes

  • Network usage of Pods and Nodes

  • Traffic and API metrics such as latency, requests per second, and request / response size

  • Kafka consumer lag metrics

  • Cassandra metrics such as heap, compactions, reads and writes

  • Process metrics on the nodes

  • Service endpoints and their health

For an exhaustive list of what is monitored, refer to the grafana dashboards.

What alert rules are configured

  • Many alert rules are configured, such as

    • High cpu usage on nodes

    • High memory usage on nodes

    • High disk usage on nodes

    • Increasing API latencies

For an exhaustive list of alert rules, take a look at this helm chart - https://github.com/project-sunbird/sunbird-devops/tree/master/kubernetes/helm_charts/monitoring/alertrules
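
To illustrate what a rule such as "High cpu usage on nodes" can look like, here is a minimal Python sketch that builds a PrometheusRule manifest for the prometheus operator. The rule name, threshold, duration, namespace and severity label are illustrative assumptions and are not taken from the Sunbird alertrules chart; the expression relies on the node_cpu_seconds_total metric exposed by node exporter.

```python
import json


def high_cpu_rule(threshold_percent: int = 80, duration: str = "10m") -> dict:
    """Build one alerting rule for sustained high CPU usage on a node.

    All values here are assumptions for illustration only.
    """
    # Busy CPU percentage per instance = 100 minus the idle percentage.
    expr = (
        '100 - (avg by (instance) '
        '(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) '
        f"> {threshold_percent}"
    )
    return {
        "alert": "HighNodeCpuUsage",  # hypothetical rule name
        "expr": expr,
        "for": duration,  # condition must hold this long before firing
        "labels": {"severity": "warning"},
        "annotations": {"summary": f"Node CPU usage above {threshold_percent}%"},
    }


def prometheus_rule(name: str, rules: list, namespace: str = "monitoring") -> dict:
    """Wrap alerting rules in a PrometheusRule custom resource."""
    return {
        "apiVersion": "monitoring.coreos.com/v1",
        "kind": "PrometheusRule",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {"groups": [{"name": f"{name}.rules", "rules": rules}]},
    }


if __name__ == "__main__":
    # JSON is valid YAML, so the printed manifest can be saved as a file.
    print(json.dumps(prometheus_rule("node-alerts", [high_cpu_rule()]), indent=2))
```

Whether the operator picks up such a resource depends on the rule selector labels configured for the Prometheus instance, so compare the output against the existing rules in the chart before using it.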

What notifications are configured

  • The alert rules described in the section above are configured to send notifications

  • The notifications are sent by email and to a slack channel
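
As a rough sketch of how email and slack delivery fit together, the Python snippet below builds an alertmanager configuration with one receiver that has both a slack and an email target. The receiver name and all argument values are hypothetical placeholders; the actual values in Sunbird come from the private repo variables, and authentication settings for the smtp server are omitted here.

```python
import json


def alertmanager_config(slack_channel: str, slack_webhook_url: str,
                        email_to: str, email_from: str, smarthost: str) -> dict:
    """Build a minimal alertmanager config with one combined receiver."""
    receiver_name = "team-notifications"  # hypothetical receiver name
    return {
        # Every alert is routed to the single receiver defined below.
        "route": {"receiver": receiver_name, "group_by": ["alertname"]},
        "receivers": [
            {
                "name": receiver_name,
                "slack_configs": [
                    {
                        "channel": slack_channel,
                        "api_url": slack_webhook_url,
                        "send_resolved": True,
                    }
                ],
                "email_configs": [
                    {"to": email_to, "from": email_from, "smarthost": smarthost}
                ],
            }
        ],
    }


if __name__ == "__main__":
    config = alertmanager_config(
        slack_channel="#alerts",                         # placeholder
        slack_webhook_url="https://example.invalid/hook",  # placeholder
        email_to="oncall@example.invalid",               # placeholder
        email_from="alertmanager@example.invalid",       # placeholder
        smarthost="smtp.example.invalid:587",            # placeholder
    )
    print(json.dumps(config, indent=2))
```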

Code base structure

Monitoring chart is present here - https://github.com/project-sunbird/sunbird-devops/tree/master/kubernetes/helm_charts/monitoring

additional-scrape-configs

  • This helm chart contains the prometheus scrape configuration, labels, interval and timeout

alertrules

  • This helm chart contains the alert rules

azure-ambari-prometheus-exporter

  • This helm chart is used to install the ambari exporter, which monitors hadoop clusters such as HDInsight

blackbox-exporter

  • This helm chart is used to monitor service or http(s) endpoints and check whether they are healthy

cassandra-jmx-exporter

  • This helm chart is used to monitor cassandra clusters

dashboards

  • This helm chart contains the grafana dashboards

elasticsearch-exporter

  • This helm chart is used to monitor elasticsearch cluster

ingestion-kafka-exporter

  • This helm chart is used to monitor the ingestion kafka cluster

json-path-exporter

  • This helm chart is used to scrape remote jsons and convert them to prometheus metrics
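
The idea of turning a remote json document into prometheus metrics can be sketched in a few lines of Python. This is not the exporter's implementation; the function name, the mapping format (metric name to a list of keys) and the choice to expose every value as a gauge are assumptions made for illustration.

```python
import json


def json_to_metrics(payload: str, mapping: dict) -> str:
    """Convert selected json fields to the prometheus text exposition format.

    mapping maps a metric name to the list of keys that leads to its value.
    """
    data = json.loads(payload)
    lines = []
    for metric, path in mapping.items():
        value = data
        for key in path:  # walk down the nested json objects
            value = value[key]
        lines.append(f"# TYPE {metric} gauge")
        lines.append(f"{metric} {float(value)}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    sample = '{"queue": {"depth": 3}, "workers": 4}'  # made-up payload
    print(json_to_metrics(sample, {"queue_depth": ["queue", "depth"],
                                   "worker_count": ["workers"]}), end="")
```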

kafka-exporter

  • This helm chart is used to monitor kafka clusters

kafka-lag-exporter

  • This helm chart is used to monitor kafka topic / group lag
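
Consumer lag is the gap between the latest offset written to a partition and the offset the consumer group has committed. The Python sketch below shows that arithmetic on plain dictionaries; the function name and the choice to treat a missing committed offset as 0 are assumptions, and the exporter itself obtains these offsets from the kafka cluster.

```python
def consumer_lag(end_offsets: dict, committed_offsets: dict) -> dict:
    """Return the lag per (topic, partition) for one consumer group.

    end_offsets:       (topic, partition) -> latest offset in the log
    committed_offsets: (topic, partition) -> offset committed by the group
    """
    lag = {}
    for tp, end in end_offsets.items():
        # Assumption: a partition with no committed offset counts from 0.
        committed = committed_offsets.get(tp, 0)
        # Clamp at 0 in case the two offsets were read at different moments.
        lag[tp] = max(end - committed, 0)
    return lag


if __name__ == "__main__":
    ends = {("telemetry", 0): 100, ("telemetry", 1): 50}   # made-up offsets
    committed = {("telemetry", 0): 90, ("telemetry", 1): 50}
    per_partition = consumer_lag(ends, committed)
    print(per_partition, "total:", sum(per_partition.values()))
```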

kafka-topic-exporter

  • This helm chart is used to monitor kafka topics

oauth2-proxy

  • This helm chart installs oauth2 proxy in the monitoring namespace

processing-kafka-exporter

  • This helm chart is used to monitor the processing kafka cluster

prometheus-operator

  • This helm chart installs the prometheus operator along with grafana, node exporter, kube state metrics and alertmanager

prometheus-redis-exporter

  • This helm chart is used to monitor redis nodes

statsd-exporter

  • This helm chart is used to monitor kong api metrics

ansible role

Overriding the specs

Defining additional specs

Deploying the monitoring stack

  • Use the jenkins job named Monitoring under the Deploy/Kubernetes folder.

  • Fill in the variables that the monitoring stack requires. They are defined in the private repo template under mandatory and optional. For example, set the slack channel name and smtp configuration if you plan to use the alerting capabilities.

Service Monitoring

Backup and Restore

  • There are no jobs or processes available to perform backup and restore

  • Take a look at this page to understand how to perform backup and restore manually - Prometheus PV Backup and Restore on Kubernetes

  • Backup and restore can be automated in the future

How-tos

Monitoring new resources

  • If a new node needs to be monitored then

    • Install node exporter on the VM using the Opsadminstration/Bootstrap Jenkins job

    • Add the IP of the node under the node-exporter ansible group in the inventory

    • Deploy the Kubernetes/Monitoring jenkins job with the tag set to all

  • If a new service within Kubernetes needs to be monitored (which has the capability to directly emit prometheus metrics), then

    • Add a service monitor file in the helm chart (already covered in the previous section)

    • Redeploy the service
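
For the Kubernetes case, a service monitor file is a ServiceMonitor resource that tells the prometheus operator which service endpoints to scrape. The Python sketch below builds a minimal one; the resource name, the app label, the port name, the interval and the namespace are hypothetical and should be matched to the actual service and to the conventions of the existing charts.

```python
import json


def service_monitor(name: str, app_label: str, port_name: str,
                    path: str = "/metrics", interval: str = "30s",
                    namespace: str = "monitoring") -> dict:
    """Build a minimal ServiceMonitor for a service that emits metrics."""
    return {
        "apiVersion": "monitoring.coreos.com/v1",
        "kind": "ServiceMonitor",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            # Selects the Kubernetes Service objects carrying this label.
            "selector": {"matchLabels": {"app": app_label}},
            # port refers to the *name* of a port on the Service.
            "endpoints": [
                {"port": port_name, "path": path, "interval": interval}
            ],
        },
    }


if __name__ == "__main__":
    # JSON is valid YAML, so the output can be stored as a chart template.
    manifest = service_monitor("my-service", "my-service", "http")  # placeholders
    print(json.dumps(manifest, indent=2))
```

If the service runs in a different namespace than the ServiceMonitor, a namespaceSelector is typically needed in the spec as well.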

Scraping new metrics

  • Prometheus automatically scrapes new metrics when a target is added to the configuration

  • A target can be

    • An exporter endpoint (example: node exporter endpoint)

    • A service monitor endpoint (example: see service monitor file)

Adding new dashboards

Modifying existing dashboards

How to add alert rules?

How to add new service monitor?

How to add notification endpoints (mail, slack, etc.)?