Functional Monitoring of the system

This page details the monitoring capabilities enabled for the sourcing flows. The goal is to detect potential issues early so that they can be addressed proactively.

User Actions to monitor

  1. Any object type (collection, content, question set etc.) is opened for editing

  2. Any object type (collection, content, question set etc.) is saved as draft

  3. Any object type (collection, content, question set etc.) is sent for review

  4. Any object type (collection, content, question set etc.) is published

Dashboards

Error Dashboard

  1. At any given point in time: Number of objects in the “FAILED” state

    1. Total Count, Count based on each primary content category

  2. For any given time period: Number of error events from the front end (raised whenever a user is shown an error for any of the above user actions):

    1. Total count

    2. Count based on each event type: Open for Edit, Save, Send for Review, Publish

    3. Count based on each primary content category
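The front-end error counts above can be sketched as a simple aggregation. This is a minimal illustration only: the `ErrorEvent` type and its field names are hypothetical and do not come from the actual telemetry schema.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorEvent:
    """Hypothetical shape of one front-end error event."""
    object_id: str
    action: str            # "Open for Edit", "Save", "Send for Review" or "Publish"
    primary_category: str  # primary content category of the object


def summarize_errors(events):
    """Return the total count plus counts per user action and per category."""
    return {
        "total": len(events),
        "by_action": dict(Counter(e.action for e in events)),
        "by_category": dict(Counter(e.primary_category for e in events)),
    }
```

The caller is expected to pass only the events that fall inside the selected time period.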

Search Index Sync Dashboard

  1. For any given time period

    1. Search Index lag

      1. Number of updates that occurred

      2. % of updates that have synced with search index

      3. Average time taken to complete the sync

      4. Number of updates that have errors in sync
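As a rough sketch, the four search index lag figures could be derived from per-update records as follows. The `UpdateRecord` type, its fields, and the choice to exclude errored updates from the synced percentage are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class UpdateRecord:
    """Hypothetical record of one object update and its index sync."""
    updated_at: float           # epoch seconds when the update happened
    synced_at: Optional[float]  # epoch seconds when the index caught up, if it did
    sync_error: bool = False    # True if the sync attempt failed


def sync_metrics(records):
    """Compute the search index lag figures for the given updates."""
    total = len(records)
    # Assumption: an update counts as synced only if it has a sync time and no error.
    synced = [r for r in records if r.synced_at is not None and not r.sync_error]
    errors = sum(1 for r in records if r.sync_error)
    pct_synced = 100.0 * len(synced) / total if total else 0.0
    avg_seconds = (
        sum(r.synced_at - r.updated_at for r in synced) / len(synced)
        if synced else None
    )
    return {
        "updates": total,
        "pct_synced": pct_synced,
        "avg_sync_seconds": avg_seconds,
        "sync_errors": errors,
    }
```

The same shape of calculation would apply to the publish lag figures, with "sent for publish" and "completed processing" timestamps in place of the update and sync times.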

Publish pipeline Lag Dashboard

  1. For any given time period

    1. Publish lag

      1. Number of objects sent for publish in that period

      2. % of objects sent for publish in that period that have completed processing

      3. Average time taken to complete the processing

Alerts to be triggered

Failed object alert

  1. Trigger condition: At least one object is in the “FAILED” state at the time of the check

  2. Frequency: Every 2 hours

  3. Details in the alert: List of object ids of the objects in failed state

  4. Action to be taken:

    1. Investigate the failed objects and rectify them to unblock users

    2. Identify root cause and possible actions to prevent it
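A minimal sketch of the failed object check, assuming the current state of each object is available as a mapping from object id to state string. The function name and the case-insensitive comparison are illustrative assumptions, not part of the system.

```python
def failed_object_alert(object_states):
    """Return the alert payload, or None when no object is in the failed state.

    object_states: hypothetical mapping of object id -> current state string.
    """
    # Assumption: compare case-insensitively so "Failed" and "FAILED" both match.
    failed = sorted(
        oid for oid, state in object_states.items() if state.upper() == "FAILED"
    )
    return {"object_ids": failed} if failed else None
```

A scheduler would call this every 2 hours and send the alert only when the result is not `None`.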

Error events alert

  1. Trigger condition: At least 10% of the user events triggered during the given time duration have errors

  2. Frequency: Every 2 hours, evaluated over data from the last 2 hours

  3. Details in the alert: For each error event

    1. Object id, User action that triggered the error, Error detail

  4. Action to be taken:

    1. Identify root cause and possible actions to prevent it
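The 10% trigger condition can be sketched as below. The `UserEvent` type and its fields are hypothetical; the comparison uses integer arithmetic so that a rate of exactly 10% reliably triggers the alert.

```python
from dataclasses import dataclass
from typing import Optional

ERROR_THRESHOLD_PCT = 10  # from the trigger condition above


@dataclass(frozen=True)
class UserEvent:
    """Hypothetical record of one monitored user action."""
    object_id: str
    action: str                         # e.g. "Save", "Publish"
    error_detail: Optional[str] = None  # None when the action succeeded


def error_events_alert(events, threshold_pct=ERROR_THRESHOLD_PCT):
    """Return one entry per error event if the error rate meets the threshold."""
    if not events:
        return None
    errors = [e for e in events if e.error_detail is not None]
    # Integer comparison: errors / events < threshold_pct / 100
    if len(errors) * 100 < threshold_pct * len(events):
        return None
    return [
        {"object_id": e.object_id, "action": e.action, "error": e.error_detail}
        for e in errors
    ]
```

The caller passes the user events recorded in the last two hours.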

Publish lag alert

  1. Trigger condition: At least 5% of the objects sent for review or for publish have not completed processing within 4 hours

  2. Frequency: Every 4 hours, evaluated over data from the last 4 hours

  3. Details in the alert: The following details, for both review lag and publish lag

    1. Total number of objects sent for processing

    2. Number of objects completed processing

    3. Number of objects not completed processing

    4. Average time taken for processing

    5. Object id of each object that has not completed processing within 4 hours, along with its current state

  4. Action to be taken:

    1. Investigate the unprocessed objects and process them to unblock users

    2. Identify root cause and possible actions to prevent it
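A sketch of the publish lag check under stated assumptions: the `ProcessingRecord` type and its fields are hypothetical, and an object is treated as overdue when it is still unprocessed more than 4 hours after it was sent. The function would be run once for review and once for publish.

```python
from dataclasses import dataclass
from typing import Optional

SLA_SECONDS = 4 * 60 * 60  # 4 hours, from the trigger condition above
LAG_THRESHOLD_PCT = 5


@dataclass(frozen=True)
class ProcessingRecord:
    """Hypothetical record of one object sent for review or publish."""
    object_id: str
    sent_at: float                 # epoch seconds
    completed_at: Optional[float]  # None while processing is pending
    state: str                     # current state of the object


def publish_lag_alert(records, now, sla=SLA_SECONDS, threshold_pct=LAG_THRESHOLD_PCT):
    """Return the alert details if enough objects are overdue, else None."""
    if not records:
        return None
    done = [r for r in records if r.completed_at is not None]
    # Assumption: overdue means still unprocessed after the SLA has elapsed.
    overdue = [r for r in records if r.completed_at is None and now - r.sent_at > sla]
    if len(overdue) * 100 < threshold_pct * len(records):
        return None
    avg = sum(r.completed_at - r.sent_at for r in done) / len(done) if done else None
    return {
        "total": len(records),
        "completed": len(done),
        "not_completed": len(records) - len(done),
        "avg_processing_seconds": avg,
        "overdue": [(r.object_id, r.state) for r in overdue],
    }
```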