Introduction:

This document describes the design for generating portal dashboard data from the Druid OLAP data store and exporting the report data to cloud storage. The design mainly consists of the following modules:

  1. Create Report API - This API will be used to submit a request to configure a new report.
  2. Disable Report API - This API will mark an existing report as disabled so that it is excluded from the list of reports to be executed.
  3. Report Data Generator - The report data generator will be a Spark job that generates the report data file by executing the query defined in the report configuration against the Druid data store. The report data file is then exported to cloud storage to complete the report execution.

Configure Report API:

Input

  1. ReportName
  2. ChannelId
  3. ReportInterval (e.g. LAST_7_DAYS, LAST_MONTH, CUSTOM_DATES etc.)
  4. JsonQuery

Request Object

{
   "id":"sunbird.analytics.report.submit",
   "ver":"1.0",
   "ts":"2019-03-07T12:40:40+05:30",
   "params":{
      "msgid":"4406df37-cd54-4d8a-ab8d-3939e0223580",
      "client_key":"analytics-team"
   },
   "request":{
      "channel_id":"in.ekstep",
      "report_name":"avg_collection_downloads"
      "report_interval":"LAST_7_DAYS",
      "query_json":{
         "queryType":"groupBy",
         "dataSource":"telemetry-events",
         "granularity":"day",
         "dimensions":[
            "eid"
         ],
         "aggregations":[
            { "type":"count", "name":"context_did", fieldName":"context_did" }
         ],
         "filter":{
            "type":"and",
            "fields":[
               { "type":"selector", "name":"eid", fieldName":"IMPRESSION" },
               { "type":"selector", "name":"edata_type", fieldName":"detail" },
               { "type":"selector", "name":"edata_pageid", fieldName":"collection-detail" },
               { "type":"selector", "name":"context_pdata_id", fieldName":"prod.diksha.app" }
            ]
         },
         "postAggregations":[
            {
               "type":"arithmetic",
               "name":"avg__edata_value",
               "fn":"/",
               "fields":[
                  { "type":"fieldAccess", "name":"total_edata_value", "fieldName":"total_edata_value" },
                  { "type":"fieldAccess", "name":"rows", "fieldName":"rows" }
               ]
            }
         ],
         "intervals":[
            "2019-02-20T00:00:00.000/2019-01-27T23:59:59.000"
         ]
      }
   }
}

Approach 1:

The individual report configurations can be saved to a Cassandra table. The Druid query JSON will be saved to Azure blob storage, and the report configuration table will contain the fields listed below (a sketch of the corresponding record model follows the list). The report generator Spark job will iterate through the list of enabled reports, insert a record into the platform_db.job_request table, generate the report from Druid and update the execution status once the execution is complete. We can also include the report execution frequency if necessary.

  1. report_id
  2. report_name
  3. is_enabled
  4. report_interval
  5. report_configuration_location
  6. report_output_filename
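
A minimal sketch of how a row from this table could be modelled inside the Spark job, assuming Scala; the class and field names below are illustrative, not part of the design:

case class ReportConfig(
  reportId: String,                     // report_id
  reportName: String,                   // report_name
  isEnabled: Boolean,                   // is_enabled
  reportInterval: String,               // report_interval, e.g. "LAST_7_DAYS"
  reportConfigurationLocation: String,  // report_configuration_location: blob storage path of the Druid query JSON
  reportOutputFilename: String          // report_output_filename
)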

Pros:

  1. Easier to edit the report query as it doesn't require any deployments.
  2. Managing individual report definitions is easier as they are kept separate.

Cons:

  1. As the number of reports increases, the number of Druid query JSON files to manage in blob storage also grows.
  2. The report configurations need to be managed in a separate Cassandra table, and new fields such as execution_frequency will require changes to the table schema.

Approach 2:

A single report configuration XML file stores the individual report configurations. When a new report is created, the configuration XML is read and the new report is appended to the existing list. Each report section is self-sufficient, with all the parameters required for the report being part of the XML configuration file. A sketch of how this file could be parsed is given after the example below.

<?xml version="1.0" encoding="utf-8"?>
<reports>
  <report name="report1" id="report_1" enabled="true" output_filename="report1_output.json">
    <query>
      <![CDATA[
        {
           "queryType":"groupBy",
           "dataSource":"telemetry-events",
           "granularity":"day",
           "dimensions":[
              "eid"
           ],
           "aggregations":[
              { "type":"count", "name":"context_did", "fieldName":"context_did" }
           ],
           "intervals": [
              "$report_interval"
           ],
           "filter":{
              "type":"and",
              "fields":[
                 { "type":"selector", "dimension":"eid", "value":"IMPRESSION" },
                 { "type":"selector", "dimension":"edata_type", "value":"detail" },
                 { "type":"selector", "dimension":"edata_pageid", "value":"collection-detail" },
                 { "type":"selector", "dimension":"context_pdata_id", "value":"prod.diksha.app" }
              ]
           },
           "postAggregations":[{
                 "type":"arithmetic",
                 "name":"avg__edata_value",
                 "fn":"/",
                 "fields":[
                    { "type":"fieldAccess", "name":"total_edata_value", "fieldName":"total_edata_value" },
                    { "type":"fieldAccess", "name":"rows", "fieldName":"rows" }
                 ]
            }]
        }
      ]]>
    </query>
    <query_params>
      <param name="datasource" value="telemetry-events" optional="false"/>
      <param name="granularity" value="day" optional="false"/>
      <param name="report_interval" value="LAST_7_DAYS"/>
      <param name="start_date" value="2019-03-01" optional="true"/>
      <param name="end_date" value="2019-03-07" optional="true"/>
    </query_params>
  </report>
</reports>
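
A minimal sketch of how the Spark job might read this configuration and substitute the $report_interval placeholder, assuming Scala's built-in XML support; the object name, the date arithmetic and the handling of CUSTOM_DATES are illustrative assumptions:

import java.time.LocalDate
import scala.xml.XML

object ReportConfigParser {

  // Resolve a symbolic interval such as LAST_7_DAYS into a Druid interval string.
  def resolveInterval(interval: String, today: LocalDate = LocalDate.now()): String =
    interval match {
      case "LAST_7_DAYS" => s"${today.minusDays(7)}T00:00:00.000/${today.minusDays(1)}T23:59:59.000"
      case "LAST_MONTH"  => s"${today.minusMonths(1)}T00:00:00.000/${today.minusDays(1)}T23:59:59.000"
      case other         => other // CUSTOM_DATES etc. would be resolved from the start_date/end_date params
    }

  // Returns (report id, resolved Druid query JSON) for every enabled report in the file.
  def enabledQueries(configPath: String): Seq[(String, String)] = {
    val reports = XML.loadFile(configPath) \ "report"
    reports
      .filter(r => (r \ "@enabled").text == "true")
      .map { r =>
        val params = (r \ "query_params" \ "param")
          .map(p => (p \ "@name").text -> (p \ "@value").text)
          .toMap
        val interval = resolveInterval(params.getOrElse("report_interval", "LAST_7_DAYS"))
        val query = (r \ "query").text.trim.replace("$report_interval", interval)
        (r \ "@id").text -> query
      }
  }
}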

Pros:

  1. All report configurations are in one place and hence easier to maintain.
  2. It is easier to apply changes to report parameters or add new parameters to an existing report.

Cons:

  1. Readability becomes an issue as the XML file grows large.
  2. The entire report configuration file needs to be read into memory each time the job scheduler executes the reports.

Report Data Generator Data Product

  1. Input
    • Cassandra table where the reports are configured or the XML file with the report configurations.
  2. Algorithm
    • Obtain the list of enabled reports and extract the Druid query from each configuration.
    • Execute the Druid query for each of the reports within the Spark job (a sketch of the job follows this list).
  3. Output
    • Uploads the generated report with the configured filename to Azure blob storage.
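
A minimal sketch of the overall job flow, assuming Scala and the JDK 11 HTTP client for querying the Druid broker; the endpoint, the helper functions and their signatures are illustrative assumptions, not the final implementation:

import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}

object ReportDataGenerator {

  // Druid broker endpoint; in practice this would come from the job configuration.
  val druidBrokerUrl = "http://localhost:8082/druid/v2/"

  // POST the configured query JSON to the Druid broker and return the raw response.
  def executeDruidQuery(queryJson: String): String = {
    val client = HttpClient.newHttpClient()
    val request = HttpRequest.newBuilder(URI.create(druidBrokerUrl))
      .header("Content-Type", "application/json")
      .POST(HttpRequest.BodyPublishers.ofString(queryJson))
      .build()
    client.send(request, HttpResponse.BodyHandlers.ofString()).body()
  }

  def main(args: Array[String]): Unit = {
    // 1. Read the enabled report configurations (Cassandra table or XML file).
    val reports: Seq[(String, String, String)] = loadEnabledReports() // (reportId, queryJson, outputFilename)

    reports.foreach { case (reportId, queryJson, outputFilename) =>
      // 2. Record the execution start, e.g. by inserting into platform_db.job_request.
      // 3. Execute the configured Druid query.
      val reportData = executeDruidQuery(queryJson)
      // 4. Upload the generated report with the configured filename to Azure blob storage.
      uploadToBlobStorage(outputFilename, reportData)
      // 5. Update the execution status of the report.
    }
  }

  // Placeholders for the configuration source and the storage client.
  def loadEnabledReports(): Seq[(String, String, String)] = ???
  def uploadToBlobStorage(filename: String, data: String): Unit = ???
}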

Disable Report API:

  1. Input
    • report_id
  2. Output
    • The report will be disabled either in the Cassandra table or the XML report configuration file.
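
For illustration, a possible request body for the Disable Report API, following the same envelope as the Configure Report request above; the api id and the report_id value shown here are assumptions:

{
   "id":"sunbird.analytics.report.disable",
   "ver":"1.0",
   "ts":"2019-03-07T12:40:40+05:30",
   "params":{
      "msgid":"4406df37-cd54-4d8a-ab8d-3939e0223580",
      "client_key":"analytics-team"
   },
   "request":{
      "report_id":"report_1"
   }
}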