Page Comparison

Introduction:

This document describes the design to generate the data for the portal dashboards from Druid OLAP data store and export the report data to cloud storage. This mainly consists of following modules

Configure Report API - which is This API will be used to submit a request for configuration of a new report.
Job Scheduler Engine - This Scheduler will submit the reports for execution based on execution frequency.
Disable Report API - which is used to disable report generationThis API will mark an existing report as disabled and will be excluded from the list of reports to be executed.
Report Data Generator - is The report data generator will be a spark job runs as per schedule time, generates data out of druid and exports data to cloud storage for each requests enabled.

Configure Report API:

Input:

Channel Id
Query Time Interval
Report Config Json
Query Jsonwhich will generate report data file by executing the query configured in the report configuration against the druid data store. The report data file will then be exported to cloud storage to complete the report execution.

Image Added

Configure Report API:

Input parameters

Parameter	Mandatory	Description	Comments
report_name	Yes	Name of the report
query_engine	Yes	Data Source	DRUID, CASSANDRA, ELASTICSEARCH
execution_frequency	Yes	Report generation frequency	DAILY, WEEKLY, MONTHLY
report_interval	Yes	Date range for queries	YESTERDAY LAST_7_DAYS, LAST_WEEK, LAST_30_DAYS, LAST_MONTH, LAST_QUARTER, LAST_3_MONTHS, LAST_6_MONTHS LAST_YEAR
query	Yes	Query to be executed
output_format	Yes	Output format of the report	json, csv
output_file_pattern	No	Report output filename pattern	report_id and end_date from the interval are used by default {report_id}-{end_date}.{output_format} Other Supported Placeholders are: report_name timestamp
output_field_names	Yes	Output field names used in report output
group_by_fields	No	Fields by which reports are grouped by	channel_id, device_id

Request Object
Code Block
language
theme RDark
borderStyle solid
linenumbers true
collapse false
{

...

"id":

...

"sunbird.analytics.report.submit",

...

"ver":

...

"1.0",

...

"ts":

...

"2019-03-07T12:40:40+05:30",

...

"params":

...

{

...

"msgid":

...

"4406df37-cd54-4d8a-ab8d-3939e0223580",

...

"client_key":

...

"analytics-team"

...

},

...

"request":

...

{

...

"channel_id":

...

"in.ekstep",

...

"report_

...

name":

...

"avg_collection_downloads",

...

"

...

query_engine": "

...

druid",

...

"execution_frequency": "

...

DAILY",

...

"

...

report_interval":"LAST_7_DAYS", "

...

output_format": "

...

json",

...

"output_field_names": ["Average Collection Downloads"],

...

"query_json":{

...

"queryType":

...

"

...

groupBy",

...

"dataSource":

...

"

...

telemetry-events",

...

"

...

granularity":

...

"day", "

...

Output:
Saves report-id and job_name as "druid-reports" in cassandra platform_db.job_request table as "ENABLED"
Save "report-id.json" to azure
Takes druid query from request if any or build druid query from other inputs

Save druid query: We have following 2 approaches:
Approach 1: Save indivisual report queries seperatly in azure

Code Block

language

theme

RDark

borderStyle

solid

linenumbers

true

collapse

false

druid/queries/report-id-1.txt
druid/queries/report-id-2.txt

Approach 2: Save all queries in one xml file like this

Code Block

language

theme

RDark

borderStyle

solid

linenumbers

true

collapse

false

<?xml version = "1.0" encoding = "utf-8"?> <reports> <report name="report1" id="report_1" enabled="true"> <query> <![CDATA[

dimensions":[

{

"queryType": "groupBy",

"eid"

"dataSource": "telemetry-events", "granularity": "day", "dimensions": ["eid"

],

"aggregations":

{ "type":

"count", "name":

"context_did",

"

fieldName":

"context_did" }

],

"intervals": [ "$report_interval" ]

"filter":

"type":

"and",

"fields":

{ "type":

"selector", "

dimension

name":

"eid", fieldName"

value"

"IMPRESSION" },

{ "type":

"selector", "

dimension

name":

"edata_type", fieldName"

value"

"detail" },

{ "type":

"selector", "

dimension

name":

"edata_pageid", fieldName"

value"

"collection-detail" },

{ "type":

"selector", "

dimension

name":

"context_pdata_id", fieldName"

value"

"prod.diksha.app" }

]
        },

},

    "postAggregations":[

"postAggregations"

:

[

"type"

"arithmetic",

"name"

"avg__edata_value",

"fn"

"/",


              "fields"

[
                 { "type"

"fieldAccess", "name"

"total_edata_value", "fieldName"

"total_edata_value" },
                 { "type"

"fieldAccess", "name"

"rows", "fieldName"

"rows" }

]

],

}

      "intervals":[

]]>

</query>

"2019-02-20T00:00:00.000/2019-01-27T23:59:59.000"

<query_params>

 ]
     }

<param

name="datasource" value="telemetry-events" optional="false"/> <param name="granularity" value="day" optional="false"/>

Output:

The individual report configurations can be saved to a Cassandra table. The druid query JSON will be saved to Azure blob storage and the following will be the fields in the report configuration table.

Code Block

theme	FadeToGrey
borderStyle	solid

   # Schema of table
   TABLE platform_db.reports_configuration (
     report_id text, // hash

...

of

...

report_name and report_interval

...

     report_config text, // Entire JSON from

...

request
     status text,

...

report_output_location text,
     report_last_generated timestamp,

...

   PRIMARY KEY

...

(report_id) );

...

Job Scheduler Engine:

Image Added

Input:
- A list of reports in reports_configuration Cassandra table with the cron_expression which falls within the current day of execution and with status as ENABLED.
Algorithm:

- Data availability check has following 2 criteria:

1. Kafka indexing lag: check for 0 lag in druid ingestion.

...

2. Druid segments count: Segments should have been created for previous day.

- Reports based on telemetry-events will be submitted for execution upon satisfying both the criteria.

- Reports based on summary-events will be submitted for execution upon satisfying only 2nd criteria or check for files in azure.

- If report_last_generated is not equals to previous day for a report, submits the same report for all the pending dates.

Output:saves report-id

- The list of reports are submitted for execution into the platform_db.job_request Cassandra table with the status=SUBMITTED and job_name

...

=druid-reports

...

-<report-id>.

Disable Report API:

Input:

- report-id

Output:

- The report will be DISABLED in the platform_db.

...

reports_

...

configuration Cassandra table

Report Data Generator Data Product:

Input:Get reports list from config, if null/empty search through cassandra

- Set of Requests - i.e All records in platform_db.job_request

...

where status=SUBMITTED and job_name

...

starts with druid-reports

...

Output:
Uploads report output data to azure

...

Output:

- Report data file will be saved in Azure with specified format
- platform_db.job_request table will be updated with job status and output file details will be updated in platform_db.reports_configuration

Output location and file format in Azure:

Once a request has been submitted and processing complete, the report data file with the name of the file being the report id suffixed with report_interval end-date saved under :

Code Block

theme	RDark
borderStyle	solid

   /druid-reports/report_id-yyyy-mm-dd.csv
   /druid-reports/report_id-yyyy-mm-dd.json

Regenerate Report API:

Input:

- report-list

- replay-interval

Algorithm:

- Loops through replay interval, find report_output_filename with each date and resubmits to job_request table for execution.

- Backup and delete older report data files from azure.

- report-list can be empty in case of regenerating all enabled reports.

Output:

- The list of reports are submitted for execution into the platform_db.job_request Cassandra table with the status=SUBMITTED and job_name=druid-reports-<report-id>.

Versions Compared

Old Version 4

New Version Current

Key

Introduction:

Configure Report API:

Configure Report API:

Input parameters

Job Scheduler Engine:

Disable Report API:

Report Data Generator Data Product:

Regenerate Report API: