This document describes the design for generating the data for the portal dashboards from the Druid OLAP data store and exporting the report data to cloud storage. It mainly consists of the following modules:
- Configure Report API - This API will be used to submit a request to configure a new report.
- Job Scheduler Engine - This Scheduler will submit the reports for execution based on execution frequency.
- Disable Report API - This API will mark an existing report as disabled so that it is excluded from the list of reports to be executed.
- Report Data Generator - The report data generator will be a Spark job that generates the report data file by executing the query configured in the report configuration against the Druid data store. The report data file will then be exported to cloud storage to complete the report execution.
Configure Report API:
- Input:
  - ReportName
  - QueryEngine - druid/cassandra/ES
  - ExecutionFrequency - Daily/Weekly/Monthly
  - ChannelId
  - ReportInterval
  - Query
  - OutputFormat - json/csv
Request Object
```json
{
    "id":"sunbird.analytics.report.submit",
    "ver":"1.0",
    "ts":"2019-03-07T12:40:40+05:30",
    "params":{
        "msgid":"4406df37-cd54-4d8a-ab8d-3939e0223580",
        "client_key":"analytics-team"
    },
    "request":{
        "channel_id":"in.ekstep",
        "report_name":"avg_collection_downloads",
        "query_engine":"druid",
        "execution_frequency":"Daily",
        "report_interval":"LAST_7_DAYS",
        "output_format":"json",
        "query_json":{
            "queryType":"groupBy",
            "dataSource":"telemetry-events",
            "granularity":"day",
            "dimensions":[ "eid" ],
            "aggregations":[
                { "type":"count", "name":"context_did", "fieldName":"context_did" }
            ],
            "filter":{
                "type":"and",
                "fields":[
                    { "type":"selector", "dimension":"eid", "value":"IMPRESSION" },
                    { "type":"selector", "dimension":"edata_type", "value":"detail" },
                    { "type":"selector", "dimension":"edata_pageid", "value":"collection-detail" },
                    { "type":"selector", "dimension":"context_pdata_id", "value":"prod.diksha.app" }
                ]
            },
            "postAggregations":[
                {
                    "type":"arithmetic",
                    "name":"avg__edata_value",
                    "fn":"/",
                    "fields":[
                        { "type":"fieldAccess", "name":"total_edata_value", "fieldName":"total_edata_value" },
                        { "type":"fieldAccess", "name":"rows", "fieldName":"rows" }
                    ]
                }
            ],
            "intervals":[
                "2019-02-20T00:00:00.000/2019-02-27T23:59:59.000"
            ]
        }
    }
}
```
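As a sketch of how the API might validate an incoming request, the following checks the request fields against the allowed values listed above. The function name, error format, and the set of required fields are illustrative assumptions, not part of the actual API.

```python
# Sketch of server-side validation for the Configure Report API request.
# Allowed values come from the field list above; names are illustrative.

VALID_ENGINES = {"druid", "cassandra", "ES"}
VALID_FREQUENCIES = {"Daily", "Weekly", "Monthly"}
VALID_FORMATS = {"json", "csv"}

REQUIRED_FIELDS = [
    "channel_id", "report_name", "query_engine",
    "execution_frequency", "report_interval", "output_format", "query_json",
]

def validate_report_request(request: dict) -> list:
    """Return a list of validation errors; an empty list means valid."""
    errors = [f"missing field: {f}" for f in REQUIRED_FIELDS if f not in request]
    if not errors:
        if request["query_engine"] not in VALID_ENGINES:
            errors.append("query_engine must be one of druid/cassandra/ES")
        if request["execution_frequency"] not in VALID_FREQUENCIES:
            errors.append("execution_frequency must be Daily/Weekly/Monthly")
        if request["output_format"] not in VALID_FORMATS:
            errors.append("output_format must be json/csv")
    return errors
```

A valid request (such as the sample above) yields an empty error list; missing or out-of-range fields are reported individually.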
- Output:
The individual report configurations can be saved to a Cassandra table. The Druid query JSON will be saved to Azure blob storage, and the following will be the fields in the report configuration table:
- report_id
- report_name
- is_enabled
- report_interval
- report_configuration_location
- report_output_filename
Pros:
- Easier to edit the report query as it doesn't require any deployments.
- Managing individual report definition is easier as they are separated out.
Cons:
- As the number of reports increases, there will be too many Druid query JSON files in blob storage.
- The report configurations need to be managed in a separate Cassandra table, and new fields like execution_frequency will require changes to the Cassandra table.
An alternative is a single report configuration XML file that stores the individual report configurations. The XML file is read and new reports are appended to the existing list. Each report section is self-sufficient, with all the parameters required for the report being part of the XML configuration file.
```xml
<?xml version="1.0" encoding="utf-8"?>
<reports>
    <report name="report1" id="report_1" enabled="true" output_filename="report1_output.json">
        <query>
            <![CDATA[
            {
                "queryType":"groupBy",
                "dataSource":"telemetry-events",
                "granularity":"day",
                "dimensions":[ "eid" ],
                "aggregations":[
                    { "type":"count", "name":"context_did", "fieldName":"context_did" }
                ],
                "intervals":[ "$report_interval" ],
                "filter":{
                    "type":"and",
                    "fields":[
                        { "type":"selector", "dimension":"eid", "value":"IMPRESSION" },
                        { "type":"selector", "dimension":"edata_type", "value":"detail" },
                        { "type":"selector", "dimension":"edata_pageid", "value":"collection-detail" },
                        { "type":"selector", "dimension":"context_pdata_id", "value":"prod.diksha.app" }
                    ]
                },
                "postAggregations":[
                    {
                        "type":"arithmetic",
                        "name":"avg__edata_value",
                        "fn":"/",
                        "fields":[
                            { "type":"fieldAccess", "name":"total_edata_value", "fieldName":"total_edata_value" },
                            { "type":"fieldAccess", "name":"rows", "fieldName":"rows" }
                        ]
                    }
                ]
            }
            ]]>
        </query>
        <query_params>
            <param name="datasource" value="telemetry-events" optional="false"/>
            <param name="granularity" value="day" optional="false"/>
            <param name="report_interval" value="LAST_7_DAYS"/>
            <param name="start_date" value="2019-03-01" optional="true"/>
            <param name="end_date" value="2019-03-07" optional="true"/>
        </query_params>
    </report>
</reports>
```

Schema of the platform_db.druid_report_config table:

```sql
CREATE TABLE platform_db.druid_report_config (
    report_id text,
    report_name text,
    report_interval text,
    query_engine text,
    execution_frequency text,
    is_enabled boolean,
    report_query_location text,
    report_output_format text,
    report_output_location text,
    report_output_filename text,
    PRIMARY KEY (report_id)
);
```
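The substitution of query parameters such as $report_interval into the CDATA query body can be sketched with Python's standard library. The helper name and return shape are illustrative assumptions; the element and attribute names follow the sample XML.

```python
import xml.etree.ElementTree as ET
from string import Template

# Sketch: load the report configuration XML and substitute query_params
# values (e.g. $report_interval) into each report's CDATA query body.

def load_enabled_reports(xml_text: str) -> list:
    """Return (report_id, resolved_query) pairs for all enabled reports."""
    reports = []
    root = ET.fromstring(xml_text)
    for report in root.findall("report"):
        if report.get("enabled") != "true":
            continue  # disabled reports are skipped
        query_template = Template(report.findtext("query"))
        params = {
            p.get("name"): p.get("value")
            for p in report.find("query_params").findall("param")
        }
        # safe_substitute leaves unknown placeholders intact
        reports.append((report.get("id"), query_template.safe_substitute(params)))
    return reports
```

This keeps the whole configuration in memory only while resolving it, which matches the trade-off noted in the cons below.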
Pros:
- All report configurations are in one place and hence easier to maintain.
- It is easier to apply changes to report parameters or add new parameters to an existing report.
Cons:
- Readability becomes an issue for large XML files.
- The entire report configuration needs to be read into memory when the job scheduler executes the report every day.
Report Data Generator Data Product
- Input
- Cassandra table where the reports are configured or the XML file with the report configurations.
- Algorithm
- Obtain a list of reports which are enabled and extract the druid query from the configuration.
- Execute the druid query for each of the reports within the Spark job.
- Output
- Uploads the generated report with the configured filename to Azure blob storage.
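The algorithm above can be sketched as a plain loop. The query executor and upload callables stand in for the Druid HTTP call and the Azure blob upload, both of which are assumptions here; in the actual design this logic runs inside the Spark job.

```python
import json

# Sketch of the Report Data Generator loop: run each enabled report's
# query and upload the result file. Callables are injected so the Druid
# and Azure specifics stay outside this sketch.

def generate_reports(report_configs, execute_query, upload):
    """report_configs: dicts with report_id, is_enabled, query_json,
                       report_output_filename.
    execute_query:  callable(query_json) -> list of result rows.
    upload:         callable(path, content) -> None.
    Returns the list of uploaded paths."""
    generated = []
    for config in report_configs:
        if not config.get("is_enabled"):
            continue  # disabled reports are excluded from execution
        rows = execute_query(config["query_json"])
        path = "/druid-reports/{}/{}".format(
            config["report_id"], config["report_output_filename"])
        upload(path, json.dumps(rows))
        generated.append(path)
    return generated
```

Injecting the two callables also makes the loop easy to unit test without a live Druid cluster or storage account.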
Output location and file format of query:

```
/druid-reports/query/druid/report-id.json
/druid-reports/query/cassandra/report-id.cql
```
Report Intervals:

```
YESTERDAY
LAST_7_DAYS
LAST_30_DAYS
LAST_MONTH
LAST_6_MONTHS
LAST_YEAR
StartDate(DD/MM/YYYY)-EndDate(DD/MM/YYYY)
```
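A sketch of resolving these interval tokens into concrete date ranges. Only a subset of the tokens is shown, and the inclusive-end boundary semantics are an assumption, not part of the specification above.

```python
from datetime import date, timedelta

# Sketch: map report interval tokens to (start_date, end_date) pairs.
# LAST_MONTH / LAST_6_MONTHS / LAST_YEAR are omitted for brevity.

def resolve_interval(token: str, today: date) -> tuple:
    """Return (start_date, end_date) for a report interval token."""
    yesterday = today - timedelta(days=1)
    if token == "YESTERDAY":
        return (yesterday, yesterday)
    if token == "LAST_7_DAYS":
        return (today - timedelta(days=7), yesterday)
    if token == "LAST_30_DAYS":
        return (today - timedelta(days=30), yesterday)
    if "-" in token:
        # Explicit StartDate(DD/MM/YYYY)-EndDate(DD/MM/YYYY) range
        start, end = token.split("-")
        to_date = lambda s: date(*reversed([int(x) for x in s.split("/")]))
        return (to_date(start), to_date(end))
    raise ValueError("unsupported interval token: " + token)
```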
Job Scheduler Engine:
- Input:
  - The platform_db.druid_report_config Cassandra table.
- Output:
  - Based on the execution_frequency in platform_db.druid_report_config, submits a request for execution by inserting a record into the platform_db.job_request Cassandra table.
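The scheduler's frequency check can be sketched as follows. The choice of Monday for Weekly runs and the 1st of the month for Monthly runs is an assumption; the design only specifies the three frequency values.

```python
from datetime import date

# Sketch of the Job Scheduler Engine: decide which enabled reports are
# due today and insert a job_request record for each.

def is_due(execution_frequency: str, today: date) -> bool:
    """Return True if a report with this frequency should run today."""
    if execution_frequency == "Daily":
        return True
    if execution_frequency == "Weekly":
        return today.weekday() == 0   # assumed: Mondays
    if execution_frequency == "Monthly":
        return today.day == 1         # assumed: first day of the month
    return False

def submit_due_reports(configs, today, insert_job_request):
    """Insert a SUBMITTED job_request record for each due, enabled report."""
    submitted = []
    for config in configs:
        if config.get("is_enabled") and is_due(config["execution_frequency"], today):
            insert_job_request({"job_name": "druid-reports",
                                "status": "SUBMITTED",
                                "report_id": config["report_id"]})
            submitted.append(config["report_id"])
    return submitted
```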
Disable Report API:
- Input:
  - report-id
- Output:
  - The report will be disabled in the platform_db.druid_report_config Cassandra table.
Report Data Generator Data Product:
- Input:
  - Set of requests, i.e. all records in platform_db.job_request where status=SUBMITTED and job_name=druid-reports.
- Output:
  - The report data file will be saved to Azure in the specified format. The platform_db.job_request table will be updated with the job status, and the output file details will be updated in platform_db.druid_report_config.
Output location and file format in Azure:

Once a request has been submitted and processing is complete, the report data file, with the file name being the report name suffixed with the generation date, is saved under:

```
/druid-reports/report-id/report-name-mmddyyyy.csv
/druid-reports/report-id/report-name-mmddyyyy.json
```
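A sketch of building this output path from a report's configuration. The mmddyyyy suffix follows the format above; the function name is illustrative.

```python
from datetime import date

# Sketch: construct the Azure output path, with the file name being the
# report name suffixed with the generation date in mmddyyyy form.

def report_output_path(report_id: str, report_name: str,
                       output_format: str, generated_on: date) -> str:
    """Return /druid-reports/report-id/report-name-mmddyyyy.<ext>."""
    suffix = generated_on.strftime("%m%d%Y")
    return "/druid-reports/{}/{}-{}.{}".format(
        report_id, report_name, suffix, output_format)
```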