
This document describes the design for generating the data for the portal dashboards from the Druid OLAP data store and exporting the report data to cloud storage. It mainly consists of the following modules:

  1. Configure Report API - This API will be used to submit a request for configuration of a new report.
  2. Job Scheduler Engine - This scheduler will submit the reports for execution based on their execution frequency.
  3. Disable Report API - This API will mark an existing report as disabled so that it is excluded from the list of reports to be executed.
  4. Report Data Generator - The report data generator will be a Spark job which generates the report data file by executing the query configured in the report configuration against the Druid data store. The report data file will then be exported to cloud storage to complete the report execution.



Configure Report API:

  • Input:

- ReportName

- QueryEngine - druid/cassandra/ES

- ExecutionFrequency - Daily/Weekly/Monthly

- ChannelId


- ReportInterval

- Query

- OutputFormat - json/csv

  • Request Object

    Code Block
     {
        "id":"sunbird.analytics.report.submit",
        "ver":"1.0",
        "ts":"2019-03-07T12:40:40+05:30",
        "params":{
           "msgid":"4406df37-cd54-4d8a-ab8d-3939e0223580",
           "client_key":"analytics-team"
        },
        "request":{
           "channel_id":"in.ekstep",
           "report_name":"avg_collection_downloads",
           "query_engine":"druid",
           "execution_frequency":"Daily",
           "report_interval":"LAST_7_DAYS",
           "output_format":"json",
           "query_json":{
              "queryType":"groupBy",
              "dataSource":"telemetry-events",
              "granularity":"day",
              "dimensions":[
                 "eid"
              ],
              "aggregations":[
                 { "type":"count", "name":"context_did", "fieldName":"context_did" }
              ],
              "filter":{
                 "type":"and",
                 "fields":[
                    { "type":"selector", "dimension":"eid", "value":"IMPRESSION" },
                    { "type":"selector", "dimension":"edata_type", "value":"detail" },
                    { "type":"selector", "dimension":"edata_pageid", "value":"collection-detail" },
                    { "type":"selector", "dimension":"context_pdata_id", "value":"prod.diksha.app" }
                 ]
              },
              "postAggregations":[
                 {
                    "type":"arithmetic",
                    "name":"avg__edata_value",
                    "fn":"/",
                    "fields":[
                       { "type":"fieldAccess", "name":"total_edata_value", "fieldName":"total_edata_value" },
                       { "type":"fieldAccess", "name":"rows", "fieldName":"rows" }
                    ]
                 }
              ],
              "intervals":[
                 "2019-02-20T00:00:00.000/2019-02-27T23:59:59.000"
              ]
           }
        }
     }


  • Output:

The individual report configurations can be saved to a Cassandra table. The Druid query JSON will be saved to Azure blob storage. The following will be the fields in the report configuration table:

  1. report_id
  2. report_name
  3. is_enabled
  4. report_interval
  5. report_configuration_location
  6. report_output_filename

Pros:

  1. Easier to edit the report query as it doesn't require any deployments.
  2. Managing individual report definitions is easier as they are separated out.

Cons:

  1. As the number of reports increases, there will be too many Druid query JSON files in blob storage.
  2. The report configurations need to be managed in a separate Cassandra table. New fields like execution_frequency will require changes to the Cassandra table.


An alternative is a single report configuration XML file which stores the individual report configurations. The report configuration XML needs to be read and new reports appended to the existing list. Each report section will be self-sufficient, with all the required parameters for the report being part of the XML configuration file.

Code Block
<?xml version = "1.0" encoding = "utf-8"?>
<reports>
   <report name="report1" id="report_1" enabled="true" output_filename="report1_output.json">
      <query>
         <![CDATA[
         {
            "queryType":"groupBy",
            "dataSource":"telemetry-events",
            "granularity":"day",
            "dimensions":[
               "eid"
            ],
            "aggregations":[
               { "type":"count", "name":"context_did", "fieldName":"context_did" }
            ],
            "intervals":[
               "$report_interval"
            ],
            "filter":{
               "type":"and",
               "fields":[
                  { "type":"selector", "dimension":"eid", "value":"IMPRESSION" },
                  { "type":"selector", "dimension":"edata_type", "value":"detail" },
                  { "type":"selector", "dimension":"edata_pageid", "value":"collection-detail" },
                  { "type":"selector", "dimension":"context_pdata_id", "value":"prod.diksha.app" }
               ]
            },
            "postAggregations":[{
               "type":"arithmetic",
               "name":"avg__edata_value",
               "fn":"/",
               "fields":[
                  { "type":"fieldAccess", "name":"total_edata_value", "fieldName":"total_edata_value" },
                  { "type":"fieldAccess", "name":"rows", "fieldName":"rows" }
               ]
            }]
         }
         ]]>
      </query>
      <query_params>
         <param name="datasource" value="telemetry-events" optional="false"/>
         <param name="granularity" value="day" optional="false"/>
         <param name="report_interval" value="LAST_7_DAYS"/>
         <param name="start_date" value="2019-03-01" optional="true"/>
         <param name="end_date" value="2019-03-07" optional="true"/>
      </query_params>
   </report>
</reports>

Pros:

  1. All report configurations are in one place and hence easier to maintain.
  2. It is easier to apply changes to report parameters or add new parameters to an existing report.

Cons:

  1. Readability becomes an issue in case of huge XML files.
  2. The entire report configuration needs to be read into memory when the job scheduler executes the reports every day.

The schema of the platform_db.druid_report_config report configuration table, referenced by the Job Scheduler Engine and Report Data Generator below:

Code Block
-- Schema of table
CREATE TABLE platform_db.druid_report_config (
   report_id text,
   report_name text,
   report_interval text,
   query_engine text,
   execution_frequency text,
   is_enabled boolean,
   report_query_location text,
   report_output_format text,
   report_output_location text,
   report_output_filename text,
   PRIMARY KEY (report_id)
);

Report Data Generator Data Product

  1. Input
    • Cassandra table where the reports are configured or the XML file with the report configurations.
  2. Algorithm
    • Obtain the list of enabled reports and extract the Druid query from each report's configuration.
    • Execute the Druid query for each report within the Spark job (a sketch follows this list).
  3. Output
    • Uploads the generated report with the configured filename to Azure blob storage.
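
A minimal sketch of the generator flow. This assumes the scalaj-http client for posting the query to the Druid broker, an illustrative broker URL, and hypothetical helper names (fetchEnabledReports, saveToAzure) for the Cassandra read and Azure upload described elsewhere in this document:

Code Block
import scalaj.http.Http

case class ReportConfig(reportId: String, reportName: String, queryJson: String,
                        outputFormat: String, outputFilename: String)

object ReportDataGenerator {

  // Assumed Druid broker endpoint; the actual URL would come from deployment config.
  val druidBrokerUrl = "http://druid-broker:8082/druid/v2/"

  // Execute the configured Druid query over HTTP and return the raw JSON response.
  def executeDruidQuery(queryJson: String): String =
    Http(druidBrokerUrl)
      .postData(queryJson)
      .header("Content-Type", "application/json")
      .asString
      .body

  def main(args: Array[String]): Unit = {
    val reports: Seq[ReportConfig] = fetchEnabledReports()
    reports.foreach { report =>
      val data = executeDruidQuery(report.queryJson)
      saveToAzure(report.reportId, report.outputFilename, data)
    }
  }

  // Placeholder: read platform_db.druid_report_config where is_enabled = true.
  def fetchEnabledReports(): Seq[ReportConfig] = Seq.empty

  // Placeholder: upload the report data file to /druid-reports/<report-id>/.
  def saveToAzure(reportId: String, filename: String, data: String): Unit =
    println(s"uploading $filename for report $reportId")
}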



  • Output location and file format of query:

    Code Block
    /druid-reports/query/druid/report-id.json
    /druid-reports/query/cassandra/report-id.cql


  • Report Intervals (an interval-resolution sketch follows the list):

    Code Block
    YESTERDAY
    LAST_7_DAYS
    LAST_30_DAYS
    LAST_MONTH
    LAST_6_MONTHS
    LAST_YEAR
    StartDate(DD/MM/YYYY)-EndDate(DD/MM/YYYY)
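
One possible resolution of these interval keywords into the ISO-8601 interval strings used in the Druid query's "intervals" field; an illustrative sketch, not the platform's actual implementation:

Code Block
import java.time.{LocalDate, ZoneOffset}
import java.time.format.DateTimeFormatter

object ReportInterval {

  private val fmt = DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss.SSS")

  // Format a date range as a Druid interval string,
  // e.g. "2019-02-20T00:00:00.000/2019-02-27T23:59:59.000".
  private def span(start: LocalDate, end: LocalDate): String =
    s"${start.atStartOfDay.format(fmt)}/${end.atTime(23, 59, 59).format(fmt)}"

  def resolve(interval: String, today: LocalDate = LocalDate.now(ZoneOffset.UTC)): String =
    interval match {
      case "YESTERDAY"     => span(today.minusDays(1), today.minusDays(1))
      case "LAST_7_DAYS"   => span(today.minusDays(7), today.minusDays(1))
      case "LAST_30_DAYS"  => span(today.minusDays(30), today.minusDays(1))
      case "LAST_MONTH"    => span(today.minusMonths(1).withDayOfMonth(1),
                                   today.withDayOfMonth(1).minusDays(1))
      case "LAST_6_MONTHS" => span(today.minusMonths(6), today.minusDays(1))
      case "LAST_YEAR"     => span(LocalDate.of(today.getYear - 1, 1, 1),
                                   LocalDate.of(today.getYear - 1, 12, 31))
      case custom          =>
        // Custom "StartDate(DD/MM/YYYY)-EndDate(DD/MM/YYYY)" form, e.g. "01/03/2019-07/03/2019".
        val df = DateTimeFormatter.ofPattern("dd/MM/yyyy")
        val Array(s, e) = custom.split("-")
        span(LocalDate.parse(s, df), LocalDate.parse(e, df))
    }
}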
     


Job Scheduler Engine:

  • Input:

- druid_report_config Cassandra table

  • Output:

- Based on the execution_frequency in platform_db.druid_report_config, the scheduler submits a request for execution by inserting a record into the platform_db.job_request Cassandra table (a sketch follows below).
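
A sketch of the scheduler's core loop, assuming the DataStax Java driver (3.x) and an assumed shape for the platform_db.job_request table; only the status and job_name columns are named in this document, so request_id and request_data are illustrative:

Code Block
import com.datastax.driver.core.Cluster
import java.util.UUID
import scala.collection.JavaConverters._

object JobSchedulerEngine {
  def main(args: Array[String]): Unit = {
    val cluster = Cluster.builder().addContactPoint("localhost").build()
    val session = cluster.connect()

    // Pick up enabled reports; a fuller version would also match execution_frequency
    // (Daily/Weekly/Monthly) against the current date before submitting.
    val reports = session.execute(
      "SELECT report_id, execution_frequency FROM platform_db.druid_report_config " +
      "WHERE is_enabled = true ALLOW FILTERING")

    reports.asScala.foreach { row =>
      // Submit the report for execution by inserting a job request record.
      session.execute(
        "INSERT INTO platform_db.job_request (request_id, job_name, status, request_data) " +
        "VALUES (?, ?, ?, ?)",
        UUID.randomUUID().toString, "druid-reports", "SUBMITTED", row.getString("report_id"))
    }
    cluster.close()
  }
}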

Disable Report API:

  • Input:

- report_id

  • Output:

- The report will be disabled in the platform_db.druid_report_config Cassandra table (a sketch follows below).
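
For illustration, disabling a report could be a single update on the configuration table; a sketch under the same driver assumption as above:

Code Block
// Mark the report as disabled so the scheduler excludes it from execution.
// reportId comes from the Disable Report API request.
def disableReport(session: com.datastax.driver.core.Session, reportId: String): Unit =
  session.execute(
    "UPDATE platform_db.druid_report_config SET is_enabled = false WHERE report_id = ?",
    reportId)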

Report Data Generator Data Product:

  • Input:

Set of requests, i.e. all records in platform_db.job_request where status=SUBMITTED and job_name=druid-reports

  • Output:

- The report data file will be saved in Azure in the specified format.
- The platform_db.job_request table will be updated with the job status, and the output file details will be updated in platform_db.druid_report_config.

  • Output location and file format in Azure:

Once a request has been submitted and processing is complete, the report data file (named with the report name suffixed with the generation date) is saved under the following paths (a filename sketch follows):

Code Block
   /druid-reports/report-id/report-name-mmddyyyy.csv
   /druid-reports/report-id/report-name-mmddyyyy.json
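
A small sketch of how the output path could be derived from the report configuration, assuming the mmddyyyy suffix shown above:

Code Block
import java.time.LocalDate
import java.time.format.DateTimeFormatter

object ReportOutput {
  // Build the Azure blob path for a generated report file,
  // e.g. /druid-reports/report_1/avg_collection_downloads-03072019.json
  def reportOutputPath(reportId: String, reportName: String, outputFormat: String,
                       generatedOn: LocalDate = LocalDate.now()): String = {
    val suffix = generatedOn.format(DateTimeFormatter.ofPattern("MMddyyyy"))
    s"/druid-reports/$reportId/$reportName-$suffix.$outputFormat"
  }
}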