
Introduction:

This document describes the design for generating dashboard data from Druid and exporting it to cloud storage. It consists mainly of the following modules:

  1. Configure Report API - used to request a new report.
  2. Disable Report API - used to disable report generation.
  3. Report Data Generator - a Spark job that runs on a schedule, generates data from Druid, and exports it to cloud storage for each enabled request.

Configure Report API:

  • Input:
  1. Channel Id
  2. Query Time Interval
  3. Report Config Json
  4. Query Json

Request Object

{
    "id": "sunbird.analytics.report.submit",
    "ver": "1.0",
    "ts": "2019-03-07T12:40:40+05:30",
    "params": {
        "msgid": "4406df37-cd54-4d8a-ab8d-3939e0223580",
        "client_key": "analytics-team"
    },
    "request": {
        "channel_id": "",
        "interval": "Last 7 days",
        "report_config": {
            "id": "usage",
            "label": "Diksha Usage Report",
            "title": "Diksha Usage Report",
            "description": "The report provides a quick summary of the data analysed by the analytics team to track progess of Diksha across states. This report will be used to consolidate insights using various metrics on which Diksha is currently being mapped and will be shared on a weekly basis. The first section of the report will provide a snapshot of the overall health of the Diksha App. This will be followed by individual state sections that provide state-wise status of Diksha",
            "dataSource": "/usage/report.json",
            "charts": [
                {
                    "datasets": [
                        {
                            "dataExpr": "data.Number_of_downloads",
                            "label": "# of downloads"
                        }
                    ],
                    "labelsExpr": "data.Date",
                    "chartType": "line"
                },
                {
                    "datasets": [
                        {
                            "dataExpr": "data.Number_of_succesful_scans",
                            "label": "# of successful scans"
                        }
                    ],
                    "labelsExpr": "data.Date",
                    "chartType": "bar"
                }
            ],
            "table": {
                "columnsExpr": "key",
                "valuesExpr": "tableData"
            },
            "downloadUrl": "report_id1/report.csv"
        },
        "query": {

        }
    }
}
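
For illustration, below is a minimal sketch of how the Configure Report API could deserialize this request into a typed model. It assumes a JSON library such as json4s is available; the class and object names are invented here and only mirror the field names in the sample request above.

import org.json4s._
import org.json4s.jackson.JsonMethods.parse

// Hypothetical model mirroring the sample request body; the class names are assumptions.
case class ReportParams(msgid: String, client_key: String)
case class ReportRequestBody(channel_id: String,
                             interval: String,
                             report_config: JValue,
                             query: JValue)
case class ConfigureReportRequest(id: String,
                                  ver: String,
                                  ts: String,
                                  params: ReportParams,
                                  request: ReportRequestBody)

object RequestParser {
  implicit val formats: Formats = DefaultFormats

  // Deserialize the raw JSON body of a Configure Report API call.
  def parseRequest(body: String): ConfigureReportRequest =
    parse(body).extract[ConfigureReportRequest]
}
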
  • Output:
    1. Saves the report-id and job_name as "druid-reports" in the cassandra platform_db.job_request table with status "ENABLED"
    2. Saves "report-id.json" to azure
    3. Takes the druid query from the request if provided, or builds the druid query from the other inputs
    4. Saves the druid query. We have the following 2 approaches:
      Approach 1: Save individual report queries separately in azure

      druid/queries/report-id-1.txt
      druid/queries/report-id-2.txt
      

      Approach 2: Save all queries in one XML file like the one below (a sketch of run-time parameter substitution follows this section)

      <?xml version = "1.0" encoding = "utf-8"?>
        <reports>
           <report name="report1" id="report_1" enabled="true">
             <query>
               <![CDATA[
                 {
                   "queryType": "groupBy",
                   "dataSource": "telemetry-events",
                   "granularity": "day",
                   "dimensions": ["eid"],
                   "aggregations": [
                     { "type": "count", "name": "context_did", "fieldName": "context_did" }
                   ],
                   "intervals": [ "$report_interval" ]
                   "filter": {
                     "type": "and",
                     "fields": [
                       { "type": "selector", "dimension": "eid", "value": "IMPRESSION" },
                       { "type": "selector", "dimension": "edata_type", "value": "detail" },
                       { "type": "selector", "dimension": "edata_pageid", "value": "collection-detail" },
                       { "type": "selector", "dimension": "context_pdata_id", "value": "prod.diksha.app" }
                     ]
                   },
                   "postAggregations" : [{
                     "type"   : "arithmetic",
                     "name"   : "avg__edata_value",
                     "fn"     : "/",
                     "fields" : [
                       { "type" : "fieldAccess", "name" : "total_edata_value", "fieldName" : "total_edata_value" },
                       { "type" : "fieldAccess", "name" : "rows", "fieldName" : "rows" }
                     ]
                   }]
                 }
               ]]>
             </query>
             <query_params>
               <param name="datasource" value="telemetry-events" optional="false"/>
               <param name="granularity" value="day" optional="false"/>
               <param name="report_interval" value="LAST_7_DAYS" 
               <param name="start_date" value="2019-03-01" optional="true"/>
               <param name="start_date" value="2019-03-07" optional="true"/>
             </query_params>
           </report>
         </reports>
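
A minimal sketch, assuming Approach 2, of how the stored XML could be parsed and the query parameters substituted at run time. The LAST_7_DAYS convention and the substitution of "$<param name>" placeholders are assumptions based on the sample above, not a defined contract.

import java.time.LocalDate
import scala.xml.XML

object QueryBuilder {

  // Resolve a symbolic interval value (assumed convention) into a Druid ISO-8601 interval string.
  def resolveInterval(value: String, today: LocalDate = LocalDate.now()): String = value match {
    case "LAST_7_DAYS" => s"${today.minusDays(7)}/$today"
    case explicit      => explicit // e.g. an explicit interval such as 2019-03-01/2019-03-07
  }

  // Read the reports XML (Approach 2) and return report id -> fully substituted Druid query.
  def loadQueries(xmlString: String): Map[String, String] = {
    val root = XML.loadString(xmlString)
    (root \ "report")
      .filter(report => (report \ "@enabled").text == "true")
      .map { report =>
        val id    = (report \ "@id").text
        val query = (report \ "query").text
        val params = (report \ "query_params" \ "param").map { p =>
          (p \ "@name").text -> (p \ "@value").text
        }.toMap
        val substituted = params.foldLeft(query) {
          case (q, ("report_interval", value)) => q.replace("$report_interval", resolveInterval(value))
          case (q, (name, value))              => q.replace("$" + name, value)
        }
        id -> substituted
      }.toMap
  }
}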
      

Disable Report API:

  • Input:
    report-id
  • Output:
    Updates the report-id and job_name "druid-reports" entry in the cassandra platform_db.job_request table to status "DISABLED" (a rough sketch follows below)
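
As a rough sketch only (not the actual implementation), the status change could be a single update against the platform_db.job_request table. The driver usage is standard DataStax Java driver, but the column names (request_id, status) are assumptions about the table schema.

import com.datastax.driver.core.Cluster

object DisableReport {

  // Mark a report request as DISABLED; column names are assumed for illustration.
  def disable(reportId: String, cassandraHost: String): Unit = {
    val cluster = Cluster.builder().addContactPoint(cassandraHost).build()
    val session = cluster.connect()
    try {
      session.execute(
        "UPDATE platform_db.job_request SET status = 'DISABLED' " +
          "WHERE job_name = 'druid-reports' AND request_id = ?",
        reportId)
    } finally {
      session.close()
      cluster.close()
    }
  }
}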

Report Data Generator Data Product:

  • Input:
    1. Gets the report list from config; if it is null/empty, searches the cassandra platform_db.job_request table for job_name = "druid-reports" and status = "ENABLED"
    2. Reading from config makes it possible to re-generate reports only for selected requests.
      Example: If only report-1 & report-2 need to be re-run -
      Step 1: Update the data-product config with the above report ids.
      Step 2: Run the spark job.
  • Output:
    Uploads report output data to azure
  • Algorithm:
    1. Reads the druid query from azure and executes it for each report in the list
    2. Updates the complete report data URL path in the report config json in azure (a sketch of this flow follows below).
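
A high-level sketch of the generator's main loop corresponding to the algorithm above. The helper functions (fetchEnabledReports, loadDruidQuery, executeDruidQuery, uploadToAzure, updateReportConfig) are placeholders for the steps described in this document, not existing APIs.

object ReportDataGenerator {

  // Placeholder signatures for the steps above; bodies are intentionally left unimplemented.
  def fetchEnabledReports(configuredIds: Seq[String]): Seq[String] = ???          // config list, else cassandra lookup
  def loadDruidQuery(reportId: String): String = ???                              // read the stored query from azure
  def executeDruidQuery(query: String): Seq[Map[String, Any]] = ???               // POST the query to the Druid broker
  def uploadToAzure(reportId: String, rows: Seq[Map[String, Any]]): String = ???  // returns the uploaded data URL
  def updateReportConfig(reportId: String, dataUrl: String): Unit = ???           // rewrite report-id.json with the data URL

  def main(args: Array[String]): Unit = {
    val configuredIds = args.toSeq                        // report ids passed via config for selective re-runs
    val reportIds     = fetchEnabledReports(configuredIds)

    reportIds.foreach { reportId =>
      val query   = loadDruidQuery(reportId)
      val rows    = executeDruidQuery(query)
      val dataUrl = uploadToAzure(reportId, rows)
      updateReportConfig(reportId, dataUrl)
    }
  }
}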