Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.
Table of Contents

Introduction:

This document describes how to trace/map the request across multiple sub-systems involved in process of serving/showing the details.

Background:

Jira Issue https://project-sunbird.atlassian.net/browse/SB-17028

Problem Statement:

  1. How to analyze the time spent to serve the request?

  2. How to analyze in which sub-system request is failing in the workflow?

Key design problems:

  1. How to associate/relate the request across multiple sub-systems?

Workflow of QR-Code scan:

...

Solution 1:

Use existing telemetry events

...

Code Block
{
  ...
  "eid": "IMPRESSION"
  "traceid": "2d4fd89a-5ed9-44ed-8efd-9fabedbe0f03" //UUID
  ...
}

Pros:

  1. Easy to implement, because each sub-system is already logging the telemetry events

  2. With minimal efforts, we achieve visualization in Druid/Superset

Cons:

  1. The timestamp should be the same from all sub-systems (epoch time)

  2. The finding of origin/root module of the trace depends on timestamp only. If any mismatch will lose the data.

  3. Can’t send additional information required for request trace

  4. Chances of losing traceability if any sub-system is failed to send trace unique id

  5. Code change may impact on the existing traceability

  6. Defining workflows, where the trace is going to end.

Solution 2:

Define a new telemetry event for trace information(similar to the LOG telemetry event). This is because along with traceID we should able to log additional information to track the request.

...

Sample Data to Analyze: https://docs.google.com/spreadsheets/d/1v96dZ0m21OVHswfZU84frOcymYd1kVqD7SqYLCyguFE/edit?usp=sharing

font-end implementation:

Use Javascript NPM module for front-end:

...

Use the decorator design pattern to add trace information in the requests. Call the

...

Back-end implementation:

refer: Best practices of open-tracing back-end implementation https://opentracing.io/docs/best-practices/

...

Code Block
"traceId": { UUID }
"name": "qr-code" // for QR code scane. Name of the workflow to trace. 
"context": [{ "type": "key", "value":""}]

Tracing Server Endpoints

When a server wants to trace execution of a request, it generally needs to go through these steps:

  1. Attempt to extract a SpanContext that’s been propagated alongside the incoming request (in case the trace has already been started by the client), or start a new trace if no such propagated SpanContext could be found.

  2. Store the newly created Span in some request context that is propagated throughout the application, either by application code, or by the RPC framework.

  3. Finally, close the Span using span.finish() when the server has finished processing the request.

Extracting a SpanContext from an Incoming Request

Let’s assume that we have an HTTP server, and the SpanContext is propagated from the client via HTTP headers, accessible via request.headers:

...

Here we use the headers map as the carrier. The Tracer object knows which headers it needs to read in order to reconstruct the tracer state and any Baggage.

Pros:

  1. Can able to find the time taken on each sub-modules easily based on SpanId

  2. Time spent of each sub-system will be available directly in the event itself

  3. In sub-systems, we can add multiple spans

Cons:

  1. More efforts required compared to Solution1.

  2. Multiple telemetry events will be logged for the same workflow(like impression & trace)

Solution 3:

Integrate with external distributed tracing tools like Zipkin, Jaeger etc..

JAEGER:

Architecture diagram:

...

Pros:

  1. The simple configuration code changes will handle the request traceability.

  2. Jaegar UI can be used to visualize the trace events.

Cons:

  1. Workflows can’t be traceable. Each API will be considered as one trace.

  2. Connecting multiple applications/servers to the single jaeger instance to trace the workflow of the API has to explore.

Reference links:

...

Jaeger Client node: https://github.com/jaegertracing/jaeger-client-node

...

Trace ability of request across subsystems

...