Process General Details

Process Name

Infrastructure monitoring process

Process Description

This document describes the standard operating procedures the NOC team follows in the areas listed below while monitoring and managing the DIKSHA production environment.

Service down
5xx errors
4xx errors
Telemetry Pipeline errors
Publish Pipeline errors
Daily Backups
Monthly Restorations
Other critical services

Process Diagram

This section is optional. However, creating a visual representation for your process helps recall and compliance.

Refer to the sample templates here.

1. Service Down

Area of Coverage

Which Alert to be Monitored

Course of Action

Service uptime

The following are monitored:

VMs
Docker nodes
DBs
Managed services

All resources in the environment should be covered.

Service down

An email alert or Slack notification is sent when a service goes down. The affected service(s) will also start returning 5xx errors.

Alerts are configured in Prometheus or Monit to fire when a service is down, and an automatic restart is attempted. The Docker Swarm master maintains the configured number of nodes for each service. On VMs, Monit tries to restart the service twice before giving up.

If the automatic restart succeeds and the service is back to normal, send an email to notify@teamdiksha.org.

If the service keeps restarting or cannot start, get access to the server**, then debug* and try to resolve the problem. If the service comes back up, send an email to notify@teamdiksha.org.

If the service is still down, raise a ticket in DIKSHA Jira with severity S1, assign it to the DIKSHA Infra Owner, and send an email to notify@teamdiksha.org.

Record the incident in the ops log.

* Refer to the Cook Book / Jira / SOP documents for the history of similar issues.

** Get approval for server access from the DIKSHA Infra Owner via DIKSHA Jira.
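
The escalation path above can be summarised as a small decision function. This is a minimal sketch for illustration only, not part of the DIKSHA tooling: the `ServiceState` fields and the action strings are hypothetical names, and it assumes the ops log is updated for every outcome.

```python
from dataclasses import dataclass


@dataclass
class ServiceState:
    """Hypothetical snapshot of a service after a 'service down' alert."""
    auto_restart_succeeded: bool  # the automatic restart brought it back
    manual_fix_succeeded: bool    # NOC debugging on the server resolved it


def service_down_actions(state: ServiceState) -> list[str]:
    """Return the SOP actions, in order, for a 'service down' alert."""
    actions = []
    if not state.auto_restart_succeeded and not state.manual_fix_succeeded:
        # Still down after debugging: escalate as S1.
        actions.append("raise S1 ticket in DIKSHA Jira, assign to DIKSHA Infra Owner")
    # Every outcome ends with a notification and an ops log entry (assumption).
    actions.append("email notify@teamdiksha.org")
    actions.append("update ops log")
    return actions
```

For example, `service_down_actions(ServiceState(False, False))` lists the S1 ticket first, followed by the notification email and the ops log update.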

2. 5xx Errors

Area of Coverage

How is it Monitored

Course of Action

Service degradation

All services running in the application, including APIs, DBs, etc.

Reference link here

Monitoring raises an alert when 5xx errors exceed 0.1%.

This can also be observed on the Grafana nginx dashboard.

Identify which service is returning the 5xx errors. Get server access** and debug* to find the root cause and resolve the issue. Once the issue is resolved, send an email to notify@teamdiksha.org.

If the issue cannot be resolved, raise a ticket in DIKSHA Jira with severity S1, assign it to the DIKSHA Infra Owner, and send an email to notify@teamdiksha.org.

Record the incident in the ops log.

* Start from Grafana to identify the API or service returning 5xx errors, check Kibana and the log server for details of the error, and refer to the CookBook / Jira / SOP documents for the history of similar issues.

** Get approval for server access from the DIKSHA Infra Owner via DIKSHA Jira.
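
As an illustration of the 0.1% threshold, the check below computes the 5xx share from a count of responses per status code. It is a sketch under assumptions: the input shape (a dict of status code to count) and the function names are hypothetical, and the real alert is evaluated by the monitoring stack, not by this code.

```python
FIVE_XX_THRESHOLD_PERCENT = 0.1  # alert threshold stated in this SOP


def error_rate_percent(error_count: int, total_count: int) -> float:
    """Errors as a percentage of all responses; 0.0 when there is no traffic."""
    if total_count == 0:
        return 0.0
    return 100.0 * error_count / total_count


def breaches_5xx_threshold(status_counts: dict[int, int]) -> bool:
    """True when the share of 5xx responses is above the SOP threshold."""
    total = sum(status_counts.values())
    errors = sum(n for status, n in status_counts.items() if 500 <= status <= 599)
    return error_rate_percent(errors, total) > FIVE_XX_THRESHOLD_PERCENT
```

With 20 errors in 10,000 responses the rate is 0.2%, which breaches the threshold; with 5 errors it is 0.05%, which does not.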

3. 4xx Errors

Area of Coverage

How is it Monitored

Course of Action

Client side errors

All services running in the application, including APIs, DBs, etc.

Reference link here

Monitoring raises an alert when 4xx errors exceed 1%.

This can also be observed on the Grafana nginx dashboard.

Identify which service is returning the 4xx errors. Get server access** and debug* to find the root cause and resolve the issue. Once the issue is resolved, send an email to notify@teamdiksha.org.

If the issue cannot be resolved, raise a ticket in DIKSHA Jira with severity S1 for 499*** errors and S2 for any other 4xx error, assign it to the DIKSHA Infra Owner, and send an email to notify@teamdiksha.org.

Record the incident in the ops log.

* Start from Grafana to identify the API or service returning 4xx errors, check Kibana and the log server for details of the error, and refer to the CookBook / Jira / SOP documents for the history of similar issues.

** Get approval for server access from the DIKSHA Infra Owner via DIKSHA Jira.

*** A 499 error means the client closed the connection (for example, by closing the browser) before the response was sent; this is generally an early sign of service degradation. Other 4xx errors should be investigated to identify the sources and patterns of client-side errors.
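
The severity rule for unresolved 4xx issues can be written as a one-line mapping. This is only a sketch of the rule stated above; the function name is hypothetical.

```python
def severity_for_4xx(status: int) -> str:
    """Map a 4xx status code to the Jira severity given in this SOP."""
    if not 400 <= status <= 499:
        raise ValueError(f"not a 4xx status code: {status}")
    # 499 (client closed the request) is treated as an early sign of
    # service degradation, so it is escalated as S1; everything else is S2.
    return "S1" if status == 499 else "S2"
```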

4. Telemetry Pipeline

Area of Coverage

How is it Monitored

Course of Action

Telemetry pipeline

* This has a direct impact on reports.

Dashboard link here

Lag monitoring link here

Samza jobs status

Redis cache server

Lag monitoring between each topic in pipeline

Samza jobs and Redis cache servers have alerts configured; a monitoring alert is sent if either service is down.
Lag must be monitored via the Grafana dashboard, and checked manually if the dashboards are out of sync or in doubt.

Identify which service is producing the error. Get server access** and debug* to find the root cause and resolve the issue. Once the issue is resolved, send an email to notify@teamdiksha.org.

If the issue cannot be resolved, raise a ticket in DIKSHA Jira with severity S1, assign it to the DIKSHA Infra Owner, and send an email to notify@teamdiksha.org.

Record the incident in the ops log.

* Start from Grafana to identify the API or service producing errors, check Kibana and the log server for details of the error, and refer to the CookBook / Jira / SOP documents for the history of similar issues.

** Get approval for server access from the DIKSHA Infra Owner via DIKSHA Jira.
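
This document does not state how lag is measured. Assuming the pipeline topics expose per-partition offsets (as Kafka topics do), a manual lag check could look like the sketch below. The function name and input shapes are hypothetical, and the offsets would have to be read from the cluster by other means.

```python
def topic_lag(end_offsets: dict[int, int], committed_offsets: dict[int, int]) -> int:
    """Total messages written to a topic but not yet consumed.

    Assumption: both arguments map partition number to an offset, where
    end_offsets is the latest offset written and committed_offsets is the
    consumer's last committed position.
    """
    lag = 0
    for partition, end in end_offsets.items():
        # A partition with no committed offset is counted from offset 0.
        committed = committed_offsets.get(partition, 0)
        lag += max(end - committed, 0)
    return lag
```

Comparing this value for each topic in the pipeline shows which stage is falling behind.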

5. Publish Pipeline

Area of Coverage

How is it Monitored

Course of Action

Publish pipeline

* This has a direct impact on content creation in the application.

Dashboard link here

Samza jobs status

Redis cache server

# Lag monitoring between each topic in the pipeline

Samza jobs and Redis cache servers have alerts configured; a monitoring alert is sent if either service is down.
Grafana dashboards are enabled for publish pipeline metrics.
# Lag monitoring is not yet enabled.

Identify which service is producing the error. Get server access** and debug* to find the root cause and resolve the issue. Once the issue is resolved, send an email to notify@teamdiksha.org.

If the issue cannot be resolved, raise a ticket in DIKSHA Jira with severity S1, assign it to the DIKSHA Infra Owner, and send an email to notify@teamdiksha.org.

Record the incident in the ops log.

* Start from Grafana to identify the API or service producing errors, check Kibana and the log server for details of the error, and refer to the CookBook / Jira / SOP documents for the history of similar issues.

** Get approval for server access from the DIKSHA Infra Owner via DIKSHA Jira.

6. Daily Backups

Area of Coverage

How is it Monitored

Course of Action

Database backups

Reference link here

All backups are run via Jenkins jobs; backups of managed services are handled through the services' own configuration.

All backups are stored in Azure Blob storage.

Check that the backup files are created as per the schedule.

If a backup file does not exist, that is an incident.

Identify which backup did not run. Check the Jenkins logs, or get server access** and debug* to find the root cause and resolve the issue. The backup can also be run manually. Once the issue is resolved, send an email to notify@teamdiksha.org.

If the issue cannot be resolved, raise a ticket in DIKSHA Jira with severity S1, assign it to the DIKSHA Infra Owner, and send an email to notify@teamdiksha.org.

Record the incident in the ops log.

* Start from the Jenkins backup job logs to find the details of the error, and refer to the CookBook / Jira / SOP documents for the history of similar issues.

** Get approval for server access from the DIKSHA Infra Owner via DIKSHA Jira.
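
The daily check for missing backup files can be sketched as a comparison between the databases expected to be backed up and the file names found in the blob store. This is an illustration under assumptions: the naming convention (file name contains the database name and a `YYYY-MM-DD` date), the database names in the usage note, and the function name are all hypothetical, and listing the blob store is left out.

```python
from datetime import date


def missing_backups(expected_databases: list[str], blob_names: list[str], day: date) -> list[str]:
    """Return the databases that have no backup file for the given day.

    Assumption: a backup file name contains both the database name and
    the date formatted as YYYY-MM-DD.
    """
    stamp = day.strftime("%Y-%m-%d")
    missing = []
    for database in expected_databases:
        found = any(database in name and stamp in name for name in blob_names)
        if not found:
            missing.append(database)
    return missing
```

For instance, with expected databases `["cassandra", "postgres"]` and a single file `cassandra_backup_2024-01-15.tar.gz`, the result for 15 January 2024 is `["postgres"]`, which would be treated as an incident.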

7. Monthly Backups Restoration

Area of Coverage

How is it Monitored

Course of Action

Restorations of backed up databases

This activity is carried out once a month.

Reference link here

Pick the previous day's backups for the restoration.

Backups are restored onto pre-allocated servers chosen by the type and size of the database.

If any backup cannot be restored, that is an issue.

Get server access** for this activity.

Identify which database cannot be restored and debug* to find the root cause and resolve the issue. Once the issue is resolved, send an email to notify@teamdiksha.org.

If the issue cannot be resolved, raise a ticket in DIKSHA Jira with severity S1, assign it to the DIKSHA Infra Owner, and send an email to notify@teamdiksha.org.

Record the incident in the ops log.

* Check the logs to find the details of the error, and refer to the CookBook / Jira / SOP documents for the history of similar issues.

** Get approval for server access from the DIKSHA Infra Owner via DIKSHA Jira.

8. Other Critical Services

Area of Coverage	How is it Monitored	Course of Action
SMS Gateway Limit. Frequency: twice a week. Reference link here	MSG91 account. SMS credits are held in this account; alert if the remaining credits fall below 20% of the total allocated credits	If the account credits are running out, raise an S1 and assign it to the DIKSHA Infra Owner; more credits need to be purchased
Email Gateway Limit. Frequency: twice a week. Reference link here	SendGrid service on Azure. Check whether usage is exceeding the monthly limit	Raise an S1 and assign it to the DIKSHA Infra Owner; the subscription may need to be upgraded to the next tier to increase the limit
Video streaming service	Azure Media Service is used for streaming. Reference link here	If there are failures, raise an S1 issue and assign it to the DIKSHA Infra Owner
Report jobs	Reference link here	If the jobs have not run for the day, raise an S1 issue and assign it to the DIKSHA Infra Owner

PROCESS IN DIKSHA JIRA

Issues raised in Jira must be given severity S1 or S2, based on the issue type as defined above.

The ticket must be assigned to the DIKSHA Infra Owner or, in the Owner's absence, to the SPOC the Owner has nominated.

When an S1 is created, call the DIKSHA Infra Owner and have the ticket prioritised immediately. Once it is prioritised as P1, the NOC team must follow up every 30 minutes or every hour, depending on the type of issue and its impact on end users.

If, after discussion with the DIKSHA Infra Owner / SPOC, the issue is moved to the L3 team, create a bug in Sunbird Jira with S1 severity, assign it to a Release Management team member, and tag the DIKSHA Infra Owner.

Once the ticket is with L3, a hotfix will be released, which must be tested in pre-production before being moved to production. If an S1/P1 is created in the last week of a release (the UAT phase), the fix goes along with that release and must be tested and pushed with it.

All such fixes must be thoroughly verified and validated in pre-production, and are moved to production only after explicit sign-off.

Exceptions

This section is mandatory. If there are no exceptions to the process, mention it as such.

Process Metrics

Average percentage of 5xx errors per day - indicates whether there are server-side errors from our services, components, or infrastructure
Max TPS per day - indicates the current load against the benchmark level for the infra scale set
Top 10 API response times (95th percentile) - indicate service response times from the server
Average percentage of 4xx errors per day - indicates client-side errors, and indirectly indicates errors from our mobile app or any other known source
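
The response-time metric can be illustrated with a short calculation. This document does not say which percentile method is used, so the sketch below assumes the nearest-rank method; the function names and the input shape (API name mapped to a list of response times) are hypothetical.

```python
def percentile_95(samples: list[float]) -> float:
    """95th percentile by the nearest-rank method (an assumption)."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # Integer form of ceil(0.95 * n), which avoids floating-point rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def slowest_apis(latencies_by_api: dict[str, list[float]], n: int = 10) -> list[tuple[str, float]]:
    """The n APIs with the highest 95th-percentile response time."""
    p95_by_api = {api: percentile_95(values) for api, values in latencies_by_api.items()}
    ranked = sorted(p95_by_api.items(), key=lambda item: item[1], reverse=True)
    return ranked[:n]
```

Ranking by the 95th percentile rather than the average keeps occasional slow responses visible, which is usually what an end user notices first.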
