Data & Grammar Processing

Objective

This document describes how grammar and data processing are carried out using specific CLI commands. It explains the process of ingesting data into the dataset tables that the UI later uses to visualise the data.

Purpose: This document details how to generate dataset tables and ingest data into them. It also covers dimension grammar processing and data ingestion into the dimension tables.

Grammar Processing using processing-ms CLI commands

processing-ms provides several commands to process the grammars and ingest the data into the dataset tables. The following commands carry out this functionality.

yarn cli ingest: This command executes the following steps:

  • It reads the config.json file for the specific state, present inside the ingest folder.

  • It then processes all the dimension grammars present in the dimensions folder.

  • The dimension grammars are stored in the spec."dimensionGrammar" table, and the dimension tables are created in the dimensions schema.

  • It also looks for the data files corresponding to each dimension grammar file name and ingests all the dimension data into the respective tables (see the folder layout sketch after this list).

  • After the dimensions are ingested, the programs array present in config.json is read and the event grammars are processed from the corresponding <program-name> folder. The event grammars are stored in the spec."EventGrammars" table.

  • The dataset grammars are stored in the spec."datasetGrammars" table, and the dataset tables are created based on the combinations of timeDimension, dimension, and metric present in the event grammars.

  • In addition to the combinations generated above, the user can specify exactly which dataset combinations should be created via the whitelisted array; with globals.onlyCreateWhitelisted set to true (as in the config below), only the whitelisted combinations are created.
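
For reference, here is a sketch of how the ingest folder for a state might be laid out, based on the paths in the config.json below. The dimension data file name follows the configured fileNameFormat; the grammar file name shown is a hypothetical placeholder, so follow the naming convention used in your repository:

    ingest/
      JH/
        config.json
        dimensions/
          district.grammar.csv                (hypothetical grammar file name)
          district.1.dimensions.data.csv      (matches ${dimensionName}.${index}.dimensions.data.csv)
        programs/
          diksha/
            ... event grammar and data files for the DIKSHA program ...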

Config.json file

{ "globals": { "onlyCreateWhitelisted": true }, "dimensions": { "namespace": "dimensions", "fileNameFormat": "${dimensionName}.${index}.dimensions.data.csv", "input": { "files": "./ingest/JH/dimensions" } }, "programs": [ { "name": "DIKSHA", "namespace": "diksha", "description": "DIKSHA", "shouldIngestToDB": true, "input": { "files": "./ingest/JH/programs/diksha" }, "./output": { "location": "./output/programs/diksha" }, "dimensions": { "whitelisted": [ "state,grade,subject,medium,board", "textbookdiksha,grade,subject,medium", "textbookdiksha,grade,subject,medium" ], "blacklisted": [] } }, { "name": "School Attendance", "namespace": "sch_att", "description": "School Attendance", "shouldIngestToDB": true, "input": { "files": "./ingest/JH/programs/school-attendance" }, "./output": { "location": "././output/programs/school-attendance" }, "dimensions": { "whitelisted": [ "gender,district", "gender,block", "gender,cluster", "school,grade", "gender,school", "gender,school,grade", "schoolcategory,district", "schoolcategory,block", "schoolcategory,cluster" ], "blacklisted": [] } }, { "name": "PM Poshan", "namespace": "pm_poshan", "description": "PM Poshan", "shouldIngestToDB": true, "input": { "files": "./ingest/JH/programs/pm-poshan" }, "./output": { "location": "./output/programs/pm-poshan" }, "dimensions": { "whitelisted": [ "district,categorypm" ], "blacklisted": [] } }, { "name": "NAS", "namespace": "nas", "description": "NAS", "shouldIngestToDB": true, "input": { "files": "./ingest/JH/programs/nas" }, "./output": { "location": "./output/programs/nas" }, "dimensions": { "whitelisted": [ "district,lo,subject,grade", "state,lo,subject,grade" ], "blacklisted": [] } }, { "name": "UDISE", "namespace": "udise", "description": "UDISE", "shouldIngestToDB": true, "input": { "files": "./ingest/JH/programs/udise" }, "./output": { "location": "./output/programs/udise" }, "dimensions": { "whitelisted": [ "district,categoryudise", "state,categoryudise" ], "blacklisted": [] } }, { "name": "PGI", "namespace": "pgi", "description": "PGI", "shouldIngestToDB": true, "input": { "files": "./ingest/JH/programs/pgi" }, "./output": { "location": "./output/programs/pgi" }, "dimensions": { "whitelisted": [ "state,district,categorypgi", "state,categorypgi" ], "blacklisted": [] } }, { "name": "NISHTHA", "namespace": "nishtha", "description": "NISHTHA", "shouldIngestToDB": true, "input": { "files": "./ingest/JH/programs/nishtha" }, "./output": { "location": "./output/programs/nishtha" }, "dimensions": { "whitelisted": [ "state,district,programnishtha", "state,programnishtha,coursenishtha", "state,programnishtha", "district,programnishtha" ], "blacklisted": [] } }, { "name": "Student Progression", "namespace": "student_progression", "description": "Student Progression", "shouldIngestToDB": true, "input": { "files": "./ingest/JH/programs/student-progression" }, "./output": { "location": "./output/programs/student-progression" }, "dimensions": { "whitelisted": [ "school,academicyear" ], "blacklisted": [] } }, { "name": "School Infrastructure", "namespace": "school_infra", "description": "School Infrastructure", "shouldIngestToDB": true, "input": { "files": "./ingest/JH/programs/school-infra" }, "./output": { "location": "./output/programs/school-infra" }, "dimensions": { "whitelisted": [ "school,academicyear" ], "blacklisted": [] } }, { "name": "Student Assessment", "namespace": "assessment", "description": "Student Assessment", "shouldIngestToDB": true, "input": { "files": "./ingest/JH/programs/student-assessment" }, 
"./output": { "location": "./output/programs/student-assessment" }, "dimensions": { "whitelisted": [ "exam,grade,academicyear,subject,lo,school", "state,lo,subject,grade", "district,subject,grade" ], "blacklisted": [] } } ] }


yarn cli ingest-data: This command ingests the data into the dataset tables for all the programs. It also provides an option to ingest the data for a particular program:

yarn cli ingest-data --filter=<program_name>

The program name passed must be the same as the namespace specified in the config file.
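
For example, to ingest data only for the School Attendance program, pass its namespace as defined in the config.json above:

yarn cli ingest-data --filter=sch_att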

  • It reads all the programs present in the config.json file.

  • It checks for the data files inside each program's folder and processes the data.

  • A dataset update request is then created.

  • The transformer processes the request and updates the data in the dataset tables created earlier. These datasets are used for visualisation in the UI.
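
To confirm that the data has landed, you can inspect the generated tables directly in the database. The following is a minimal sketch that assumes a PostgreSQL deployment and that the dataset tables live in a datasets schema (mirroring the dimensions schema used for dimension tables); adjust the schema name to match your deployment:

-- List the dataset tables created during ingestion
-- (the schema name 'datasets' is an assumption, not confirmed by this document)
SELECT table_name
FROM information_schema.tables
WHERE table_schema = 'datasets'
ORDER BY table_name;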

Delete the data in the database

yarn cli nuke-db: This command deletes all the data present in the database. Use it with caution; after running it, the grammars and data must be ingested again.
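
Combining the commands described above, a typical wipe-and-rebuild sequence looks like this:

yarn cli nuke-db      # remove all previously ingested grammars and data
yarn cli ingest       # re-process the grammars and dimension data
yarn cli ingest-data  # re-ingest the event data into the dataset tables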