Request traceability across multiple sub-systems

Introduction:

This document describes how to trace/map a request across the multiple sub-systems involved in the process of serving/showing the details.

Background:

Jira Issue https://project-sunbird.atlassian.net/browse/SB-17028

Problem Statement:

  1. How to analyze the time spent to serve the request?

  2. How to analyze in which sub-system the request is failing in the workflow?

Key design problems:

  1. How to associate/relate the request across multiple sub-systems?

Workflow of QR-Code scan:

Solution 1:

Use the existing telemetry events, carrying a common trace identifier either inside cdata or as a top-level traceid field:

{ ... "cdata": [ ... { "type": "Trace" "id": "2d4fd89a-5ed9-44ed-8efd-9fabedbe0f03" //UUID } ], ... }

or

{ ... "eid": "IMPRESSION" "traceid": "2d4fd89a-5ed9-44ed-8efd-9fabedbe0f03" //UUID ... }

Pros:

  1. Easy to implement, because each sub-system is already logging the telemetry events

  2. With minimal effort, visualization can be achieved in Druid/Superset

Cons:

  1. The timestamps must be consistent across all sub-systems (epoch time)

  2. Finding the origin/root module of the trace depends on the timestamp alone; any mismatch will lose data

  3. Additional information required for the request trace cannot be sent

  4. There is a chance of losing traceability if any sub-system fails to send the unique trace id

  5. The code change may impact the existing traceability

  6. Workflows need to be defined to determine where each trace ends

Solution 2:

Define a new telemetry event for trace information (similar to the LOG telemetry event), so that, along with the traceID, additional information needed to track the request can be logged.

The Trace object should contain the following (the Span object should be similar to the OpenTracing spec):

"edata": { "id": {UUID}, // TraceId. Common for all sub-systems(this is to track entire workflow front-end to back-end) "name": "qr-scan" // service/module name to identify "span" { "traceID": {UUID}, "operationName": {string} //ex: "qr-scan" for the qr scan workflow. Use some unique string for each workflow/span "spanID": {worflowId}, // UniqueId for this workflow. Always same for multiple scans "parentSpanId": {span.id} // Optional: This is to track who is the parent of the request. Helps to create the tree structure of the trace "context": [{ "type":"", "key":"", "value":""}] // Context of the workflow, ex: "dialcode": "4XJG3F" on scan of this dial code workflow started "tags": [{"key":"", "type":"", "value":""}] // https://github.com/opentracing/specification/blob/master/semantic_conventions.md#span-tags-table } }

 

OpenTracing API “SPAN” spec:

Use the OpenTracing API to generate Trace & Span objects. Tracers such as Jaeger and Zipkin are built on top of these APIs.
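
A minimal sketch, assuming the opentracing npm package linked below, of generating a trace with a root span and a child span. The operation names and the tag are illustrative:

    import * as opentracing from 'opentracing';

    const tracer = opentracing.globalTracer();

    // Root span: starts the trace for the workflow.
    const rootSpan = tracer.startSpan('qr-scan');
    rootSpan.setTag('dialcode', '4XJG3F');          // workflow context as a tag

    // Child span: a step inside the same trace.
    const childSpan = tracer.startSpan('search-api-call', { childOf: rootSpan });
    childSpan.finish();
    rootSpan.finish();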

Sample Data to Analyze: https://docs.google.com/spreadsheets/d/1v96dZ0m21OVHswfZU84frOcymYd1kVqD7SqYLCyguFE/edit?usp=sharing

Front-end implementation:

Use the OpenTracing JavaScript NPM module for the front-end:

https://github.com/opentracing/opentracing-javascript

 

A singleton service/module holds only one trace workflow at any time. Angular components use the service below to start & end the trace by passing the action/workflow name. A trace can contain multiple span objects (which would be an enhancement of the single-span implementation sketched below).
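
A minimal sketch of such a singleton TraceService, assuming Angular and the opentracing npm module; the method names (startTrace/endTrace) and the single-active-trace behaviour are assumptions based on the description above:

    import { Injectable } from '@angular/core';
    import * as opentracing from 'opentracing';

    // Singleton service holding at most one active trace workflow at a time.
    @Injectable({ providedIn: 'root' })
    export class TraceService {
      private activeSpan?: opentracing.Span;
      private tracer = opentracing.globalTracer();

      // Start a new trace for the given action/workflow name (e.g. 'qr-scan').
      startTrace(workflowName: string): void {
        this.activeSpan?.finish();                 // only one trace at any time
        this.activeSpan = this.tracer.startSpan(workflowName);
      }

      // End the currently active trace, optionally tagging the outcome as an error.
      endTrace(error?: boolean): void {
        if (!this.activeSpan) { return; }
        if (error) { this.activeSpan.setTag(opentracing.Tags.ERROR, true); }
        this.activeSpan.finish();
        this.activeSpan = undefined;
      }

      // Expose the active span so outgoing HTTP calls can attach trace headers.
      get currentSpan(): opentracing.Span | undefined {
        return this.activeSpan;
      }
    }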

 

When the user scans/searches with a dial code, call the TraceService.startTrace('qr-scan') method from the respective Angular component. This starts a new trace object.

After getting the response from the API, call the trace service's end method to close the trace, as in the usage sketch below.
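
For example, assuming the TraceService sketched above; the component, SearchService, and searchByDialcode names are illustrative:

    import { Component } from '@angular/core';
    import { TraceService } from './trace.service';     // sketched above (assumed path)
    import { SearchService } from './search.service';   // hypothetical API wrapper

    @Component({ selector: 'app-dialcode-scan', template: '' })
    export class DialcodeScanComponent {
      constructor(private traceService: TraceService, private searchService: SearchService) {}

      onDialcodeScan(dialcode: string): void {
        this.traceService.startTrace('qr-scan');            // start the trace on scan
        this.searchService.searchByDialcode(dialcode).subscribe({
          next: () => this.traceService.endTrace(),         // end the trace on success
          error: () => this.traceService.endTrace(true),    // end the trace, tagging the error
        });
      }
    }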

 

Use the decorator design pattern to add trace information to the outgoing requests; one possible approach is sketched below.
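
One way to realise this decorator, assuming Angular's HttpInterceptor mechanism and the TraceService sketched above (file paths and class names are illustrative):

    import { Injectable } from '@angular/core';
    import { HttpEvent, HttpHandler, HttpInterceptor, HttpRequest } from '@angular/common/http';
    import { Observable } from 'rxjs';
    import * as opentracing from 'opentracing';
    import { TraceService } from './trace.service';   // the service sketched above (assumed path)

    // Decorates every outgoing HTTP request with the trace headers of the active span.
    @Injectable()
    export class TraceInterceptor implements HttpInterceptor {
      constructor(private traceService: TraceService) {}

      intercept(req: HttpRequest<any>, next: HttpHandler): Observable<HttpEvent<any>> {
        const span = this.traceService.currentSpan;
        if (!span) { return next.handle(req); }

        // The tracer decides the concrete header names when injecting the span context.
        const headers: { [name: string]: string } = {};
        opentracing.globalTracer().inject(span.context(), opentracing.FORMAT_HTTP_HEADERS, headers);

        return next.handle(req.clone({ setHeaders: headers }));
      }
    }

Registering such an interceptor via Angular's HTTP_INTERCEPTORS token would apply it to every outgoing request, so individual components would not need to add trace headers themselves.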

Back-end implementation:

Refer to the best practices for an OpenTracing back-end implementation: https://opentracing.io/docs/best-practices/

When any API is called from the front-end, pass the trace details in the request headers (see the server-side handling below).

OpenTracing API implementation in Java

Tracing Server Endpoints

When a server wants to trace execution of a request, it generally needs to go through these steps:

  1. Attempt to extract a SpanContext that’s been propagated alongside the incoming request (in case the trace has already been started by the client), or start a new trace if no such propagated SpanContext could be found.

  2. Store the newly created Span in some request context that is propagated throughout the application, either by application code, or by the RPC framework.

  3. Finally, close the Span using span.finish() when the server has finished processing the request.

Extracting a SpanContext from an Incoming Request

Let’s assume that we have an HTTP server, and the SpanContext is propagated from the client via HTTP headers, accessible via request.headers:
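The following sketch is written in TypeScript against an Express-style request object rather than the Java client named above, but the extract/startSpan/finish calls have direct counterparts in io.opentracing; the operation name is illustrative:

    import * as opentracing from 'opentracing';
    import { Request, Response } from 'express';   // assumed Express-style server

    const tracer = opentracing.globalTracer();

    function handleRequest(req: Request, res: Response): void {
      // 1. Try to extract the SpanContext propagated by the client in the HTTP headers.
      const parentContext = tracer.extract(opentracing.FORMAT_HTTP_HEADERS, req.headers);

      // Start a new span, continuing the client's trace if a context was found,
      // or starting a new trace otherwise.
      const span = tracer.startSpan('serve-request', { childOf: parentContext || undefined });

      try {
        // 2. ... process the request, propagating `span` in the request context ...
        res.send('ok');
      } finally {
        // 3. Close the span once the server has finished processing the request.
        span.finish();
      }
    }
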

Here we use the headers map as the carrier. The Tracer object knows which headers it needs to read in order to reconstruct the tracer state and any Baggage.

Pros:

  1. The time taken by each sub-module can easily be found based on the SpanId

  2. The time spent in each sub-system will be available directly in the event itself

  3. In sub-systems, we can add multiple spans

Cons:

  1. More effort is required compared to Solution 1.

  2. Multiple telemetry events will be logged for the same workflow (e.g. impression & trace)

Solution 3:

Integrate with external distributed tracing tools like Zipkin, Jaeger, etc.

JAEGER:

Architecture diagram:

Pros:

  1. Simple configuration/code changes will handle the request traceability.

  2. The Jaeger UI can be used to visualize the trace events.

Cons:

  1. Workflows can’t be traced; each API call will be considered a separate trace.

  2. Connecting multiple applications/servers to a single Jaeger instance to trace the workflow across APIs still has to be explored.

Reference links:

  • Jaeger related link

Jaeger Client node:
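
For reference, a minimal sketch of initialising a tracer with the Jaeger Node.js client (the jaeger-client npm package); the service name, sampler, and agent settings are assumed example values:

    import { initTracer } from 'jaeger-client';

    // Assumed example configuration; in practice these values come from env/config.
    const config = {
      serviceName: 'sunbird-portal',
      sampler: { type: 'const', param: 1 },          // sample every request (dev setting)
      reporter: { logSpans: true, agentHost: 'localhost', agentPort: 6832 },
    };

    const options = {
      logger: {
        info: (msg: string) => console.log('INFO', msg),
        error: (msg: string) => console.error('ERROR', msg),
      },
    };

    // Returns an OpenTracing-compatible tracer that reports spans to the Jaeger agent.
    const tracer = initTracer(config, options);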