Functional Monitoring of the system
This page details the monitoring capabilities enabled for the sourcing flows. The aim is to enable the system to predict possible issues so that they can be addressed proactively.
User Actions to monitor
Any object type (collection, content, question set etc.) is opened for editing
Any object type (collection, content, question set etc.) is saved as draft
Any object type (collection, content, question set etc.) is sent for review
Any object type (collection, content, question set etc.) is published
Dashboards
Error Dashboard
At any given point of time: Number of objects in “FAILED” state
Total Count, Count based on each primary content category
For any given time period: Number of error events from the front-end (raised whenever a user is shown an error for any of the above user actions):
Total count
Count based on each event type: Open for Edit, Save, Send for Review, Publish
Count based on each primary content category
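The error-event counts above can be sketched as a simple aggregation over a time window. This is only an illustration: the ErrorEvent record, its field names, and the summarize_errors function are hypothetical, and the real telemetry event schema may differ.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class ErrorEvent:
    """Hypothetical shape of one front-end error event."""
    object_id: str
    action: str            # "Open for Edit", "Save", "Send for Review" or "Publish"
    primary_category: str  # primary content category of the object
    occurred_at: datetime


def summarize_errors(events, start, end):
    """Count error events with start <= occurred_at < end."""
    in_window = [e for e in events if start <= e.occurred_at < end]
    return {
        "total": len(in_window),
        "by_action": Counter(e.action for e in in_window),
        "by_category": Counter(e.primary_category for e in in_window),
    }
```

The half-open window (start inclusive, end exclusive) is an assumption chosen so that consecutive periods do not count the same event twice.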
Search Index Sync Dashboard
For any given time period
Search Index lag
Number of updates that occurred
% of updates that have synced with search index
Average time taken to complete the sync
Number of updates that have errors in sync
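A minimal sketch of how the search index lag figures above could be derived, assuming each update is recorded with the time it happened and, if applicable, the time it reached the search index. The UpdateRecord type, its fields, and sync_summary are hypothetical names, not part of the existing system.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class UpdateRecord:
    """Hypothetical record of one object update and its sync outcome."""
    object_id: str
    updated_at: datetime
    synced_at: Optional[datetime] = None  # None while the sync is still pending
    sync_error: bool = False


def sync_summary(records, start, end):
    """Summarize search index sync for updates with start <= updated_at < end."""
    window = [r for r in records if start <= r.updated_at < end]
    synced = [r for r in window if r.synced_at is not None and not r.sync_error]
    lags = [(r.synced_at - r.updated_at).total_seconds() for r in synced]
    return {
        "updates": len(window),
        "synced_pct": 100.0 * len(synced) / len(window) if window else 0.0,
        "avg_sync_seconds": sum(lags) / len(lags) if lags else None,
        "sync_errors": sum(1 for r in window if r.sync_error),
    }
```

Here the average sync time is taken over successfully synced updates only, which is an assumption; pending and errored updates are reported separately through the percentage and the error count.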
Publish pipeline Lag Dashboard
For any given time period
Publish lag
Number of objects sent for publish in that period
% of objects sent for publish in that period that have completed processing
Average time taken to complete the processing
Alerts to be triggered
Failed object alert
Trigger condition: At least one object is in “FAILED” state at the time of the check
Frequency: Every 2 hours
Details in the alert: List of object ids of the objects in failed state
Action to be taken:
Investigate the failed objects and rectify them to unblock users
Identify root cause and possible actions to prevent it
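The failed object check could look roughly like the sketch below, intended to run on the 2-hour schedule. The input shape (a list of dicts with "id" and "status" keys) and the function name are assumptions made for illustration.

```python
def failed_object_alert(objects):
    """Return an alert payload if any object is in FAILED state, else None.

    `objects` is assumed to be an iterable of dicts with "id" and "status" keys.
    """
    # Case-insensitive match, since the state may be written "Failed" or "FAILED".
    failed_ids = sorted(o["id"] for o in objects if o["status"].upper() == "FAILED")
    if not failed_ids:
        return None
    return {"alert": "Failed object alert", "object_ids": failed_ids}
```

Returning None when nothing has failed lets the scheduler skip sending a notification for that run.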
Error events alert
Trigger condition: At least 10% of the user events triggered during the given time duration have errors
Frequency: Every 2 hours, evaluated over the data from the last 2 hours
Details in the alert: For each error event
Object id, User action that triggered the error, Error detail
Action to be taken:
Identify root cause and possible actions to prevent it
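As an illustration of the 10% trigger condition, the sketch below evaluates the user events collected for one 2-hour window. The event dict keys ("object_id", "action", "error") and the function name are hypothetical.

```python
def error_events_alert(events, threshold_pct=10):
    """Return alert details if at least threshold_pct % of events have errors.

    `events` is assumed to be a list of dicts with "object_id", "action" and
    "error" keys, where "error" is None for successful user actions.
    """
    if not events:
        return None
    errors = [e for e in events if e["error"] is not None]
    # Integer comparison avoids floating-point edge cases exactly at the threshold.
    if len(errors) * 100 < threshold_pct * len(events):
        return None
    return {
        "alert": "Error events alert",
        "error_events": [
            {"object_id": e["object_id"], "action": e["action"], "error": e["error"]}
            for e in errors
        ],
    }
```

A window with no user events produces no alert in this sketch; whether an empty window should itself be flagged is left open.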
Publish lag alert
Trigger condition: At least 5% of the objects sent for review or for publish have not completed processing within 4 hours
Frequency: Every 4 hours, evaluated over the data from the last 4 hours
Details in the alert: The following details for both Review and Publish lag
Total number of objects sent for processing
Number of objects completed processing
Number of objects not completed processing
Average time taken for processing
Object id of each object that has not completed processing within 4 hours, along with its current state
Action to be taken:
Investigate the unprocessed objects and process them to unblock users
Identify root cause and possible actions to prevent it
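The publish lag alert can be sketched as follows, under stated assumptions: each submission is a dict with "id", "state", "sent_at" and "completed_at" keys (hypothetical names), and an object counts as late if its processing took longer than 4 hours or it is still unprocessed more than 4 hours after being sent. The same function could be run once for review submissions and once for publish submissions.

```python
from datetime import datetime, timedelta


def publish_lag_alert(submissions, now, sla=timedelta(hours=4), threshold_pct=5):
    """Return alert details if at least threshold_pct % of submissions are late."""
    if not submissions:
        return None
    completed = [s for s in submissions if s["completed_at"] is not None]
    # For unfinished objects, measure the elapsed time up to `now`.
    late = [s for s in submissions if (s["completed_at"] or now) - s["sent_at"] > sla]
    if len(late) * 100 < threshold_pct * len(submissions):
        return None
    durations = [(s["completed_at"] - s["sent_at"]).total_seconds() for s in completed]
    return {
        "total_sent": len(submissions),
        "completed": len(completed),
        "not_completed": len(submissions) - len(completed),
        "avg_processing_seconds": sum(durations) / len(durations) if durations else None,
        "late_objects": [(s["id"], s["state"]) for s in late],
    }
```

Passing `now` explicitly rather than reading the clock inside the function keeps the check deterministic and easy to test.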