Sunbird Monitoring
Monitoring System overview
We are using prometheus operator on Kubernetes to monitor the infrastructure and APIs. For more information on the operator, read here https://github.com/prometheus-operator/prometheus-operator
The monitoring system system includes many components such as
Promethus
Grafana
Exporters such as node exporter, process exporter, jmx exporter, blackbox exporter etc,.
Alertmanager and alert rules
Sunbird monitoring helm chart is located here - https://github.com/project-sunbird/sunbird-devops/tree/master/kubernetes/helm_charts/monitoring
The system runs on Kubernetes and monitors both kubernetes and non kubernetes components
Sunbird monitoring ansible role is located here - https://github.com/project-sunbird/sunbird-devops/tree/master/kubernetes/ansible/roles/sunbird-monitoring
High level architecture diagram
What resources are monitored? What metrics are scraped?
CPU usage of Pods and Nodes
Memory usage of Pods and Nodes
Disk usage of Pods and Nodes
Network usage of Pods and Nodes
Traffic and API metrics such as latency, request per second, request / response size
Kafka consumer lag metrics
Cassandra metrics such as heap, compactions, read and write etc
Process metrics on the nodes
Service endpoints and their health
For an exhaustive list of what all is being monitored, refer to the grafana dashboards.
What alert rules are configured
Many alert rules are configured such as
High cpu usage on nodes
High memory usage on nodes
High disk usage on nodes
Increasing API latencies etc.
For an exhaustive list of alert rules, take a look at this helm chart - https://github.com/project-sunbird/sunbird-devops/tree/master/kubernetes/helm_charts/monitoring/alertrules
What notifications are configured
The above section (alert rules) are configured as notifications
The notifications are sent to email and slack channel
Code base structure and explain what is what
Monitoring chart is present here - https://github.com/project-sunbird/sunbird-devops/tree/master/kubernetes/helm_charts/monitoring
additional-scrape-configs
This helm chart contains the prometheus scrape configuration, labels, interval and timeout
alertrules
This helm chart contains the alert rules
azure-ambari-prometheus-exporter
This helm chart is used to install ambari exporter which monitors the hadoop cluster like HDInsights
blackbox-exporter
This helm chart is used to monitor service or http(s) endpoints and check if they are healthy or not
cassandra-jmx-exporter
This helm chart is used to monitor cassandra clusters
dashboards
This helm chart contains the grafana dashboards
elasticsearch-exporter
This helm chart is used to monitor elasticsearch cluster
ingestion-kafka-exporter
This helm chart is used to monitor ingestion kaka cluster
json-path-exporter
This helm chart is used to scrape remote jsons and convert them to prometheus metrics
kafka-exporter
This helm chart is used to monitor kafka clusters
kafka-lag-exporter
This helm chart is used to monitor kafka topic / group lag
kafka-topic-exporter
This helm chart is used to monitor kafka topics
oauth2-proxy
This helm chart installs oauth2 proxy in the monitoring namespace
processing-kafka-exporter
This helm chart is used to monitor processing kaka cluster
prometheus-operator
This helm chart installs the prometheus operator along with grafana, node exporter, kube state metrics and alertmanager
prometheus-redis-exporter
This helm chart is used to morning redis nodes
statsd-exporter
This helm chart is used to monitor kong api metrics
ansible role
The anible role install each of the helm chart in a loop
The ansible role has a template file for each of the helm chart which is used to override environment specific values as well as overriding values specifically for sunbird as the charts are taken from upstream sources
The role is located here - https://github.com/project-sunbird/sunbird-devops/tree/master/kubernetes/ansible/roles/sunbird-monitoring
Overriding the specs
To override the spec, define the values defined in the template files in your private repo
For example if you want to change the name of the monitoring stack, you can close the repo, change the value shown here as per your liking - https://github.com/project-sunbird/sunbird-devops/blob/master/kubernetes/ansible/roles/sunbird-monitoring/templates/prometheus-operator.yaml#L9
Deploy using the Jenkins job from your fork
Defining additional specs
You can make use of the provided optional variables to define additional parameters
For example if you want to have a PV for grafana, you can use this variable and define it in your private repo - https://github.com/project-sunbird/sunbird-devops/blob/master/kubernetes/ansible/roles/sunbird-monitoring/templates/prometheus-operator.yaml#L290-L292
To understand how to define these variable, refer to the subchart which it will be used. For example in case of the grafana persistence, you can find the definition here - https://github.com/project-sunbird/sunbird-devops/blob/master/kubernetes/helm_charts/monitoring/prometheus-operator/charts/grafana/values.yaml#L196-L207
Take a look at the other set of variables in the template files to understand the usage and capabilities
Deploying the monitoring stack
Use the jenkins job named Monitoring under the Deploy/Kubernetes directory folder.
The variables defined in the private repo template under mandatory and optional should be filled which are required for the monitoring stack. Example - slack channel name, smtp configurations etc., if you plan to use the alerting capabilities
Service Monitoring
Service monitoring is done using the service monitor files. Example - https://github.com/project-sunbird/sunbird-devops/blob/master/kubernetes/helm_charts/core/analytics/templates/serviceMonitor.yaml
Service monitor is a CRD which is setup as part of the prometheus operator chart
Service monitor uses the named port concept of Kubernetes to identify the port numbers
Backup and Restore
There are no jobs or processed that are available to perform the backup and restore
Take a look at this page to understand how you can manually do backup and restore - Prometheus PV Backup and Restore on Kubernetes
This can be automated in future to perform automated backup and restore
How to's?
Monitoring new resources
If a new node needs to be monitored then
Install node exporter on the VM using the Opsadminstration/Bootstrap Jenkins job
Add the IP of the node under the
node-exporter
ansible group in the inventoryDeploy the Kubernetes/Monitoring jenkins job with tag as
all
If a new service within Kubernetes needs to be monitored (which has the capability to directly emit prometheus metrics), then
Add a service monitor file in the helm chart (Already covered in previous section)
Redeploy the service
Scrapping new metrics
Prometheus automatically scrapes new metrics when the target is added in the configuration
A target can be
An exporter endpoint (example: node exporter endpoint)
A service monitor endpoint (example: see service monitor file)
Adding new dashboards
To add a new dashboard
Go to grafana and create the dashboard
Export the dashboard as a json
Commit the json in this folder - https://github.com/project-sunbird/sunbird-devops/tree/master/kubernetes/helm_charts/monitoring/dashboards/dashboards
Add the dashboard name in values.yaml file under a
dashboards[0-9]
dictionary. Example - https://github.com/project-sunbird/sunbird-devops/blob/master/kubernetes/helm_charts/monitoring/dashboards/values.yaml#L401-L437Redeploy the monitoring stack using the jenkins job with the tag
dashboards
Restart grafana pod manually if you don't see the changes reflected
Modify existing dashboard
To modify an existing dashboard
Export the existing dashboard as a json
Import it back to grafana with a new name and a new uuid
Modify the dashboard as per your requirements
Export the modified dashboard as a json
Rename the dashboard file name to same as what is available on github
Edit the title to same as original dashboard - https://github.com/project-sunbird/sunbird-devops/blob/master/kubernetes/helm_charts/monitoring/dashboards/dashboards/api-manager.json#L2526
Commit the modified dashboard
Redeploy the monitoring stack using the jenkins job with the tag
dashboards
Restart grafana pod manually if you don't see the changes reflected
How to add alert rules?
Add new rules files or edit the existing rules file to include your new alert
Alert rule files are here - https://github.com/project-sunbird/sunbird-devops/tree/master/kubernetes/helm_charts/monitoring/alertrules/templates
How to add new service monitor?
Add the service monitor file in the respective service helm chart templates folder
See here for an example - https://github.com/project-sunbird/sunbird-devops/blob/master/kubernetes/helm_charts/core/analytics/templates/serviceMonitor.yaml
How to add notification endpoints ? Mail, slack etc
These can be added under the alert manager configurations.
For email, see here - https://github.com/project-sunbird/sunbird-devops/blob/master/kubernetes/ansible/roles/sunbird-monitoring/templates/prometheus-operator.yaml#L71-L77
For slack, see here - https://github.com/project-sunbird/sunbird-devops/blob/master/kubernetes/ansible/roles/sunbird-monitoring/templates/prometheus-operator.yaml#L132-L137