/
Scaling Sunbird Infra

Scaling Sunbird Infra

Why are we scaling?

  • To serve incoming requests quickly
  • Ability to handle increase in requests compared to current number of requests.
  • Avoid single point of failure and ensure availability of services 99.9% time

How are we scaling?

  • We are scaling horizontally wherever possible. This is the main area of focus.
  • In components where horizontal scaling is not possible, we would scale it vertically

Horizontal Scale components

  • Docker swarm
  • Elasticsearch cluster
  • Consumption and creations API's
    • Development team will make a change to send / receive such requests to / from elasticsearch or Redis
    • This would reduce the stress on the database for such API's and also the response would be faster
  • YARN cluster   
  • Cassandra cluster
  • Learning and Search servers

Vertical Scale components

  • Neo4j
  • Kafka
  • Influx DB

Multi-swarm Cluster for Resiliency

Present Arch:

  • Single docker swarm cluster with multiple managers and nodes to run up to 256 containers in a overlay network. We have approx. 150 containers running to support 600 tps load

Cons:

  • Network hiccups on single swarm cluster brings down site for while
  • No automation for Infra for scaling
  • Service downtime during upgrade


Future Arch:

  • Multi-swarm cluster attaching to load-balancer
  • Ability to ad/remove cluster without down-time 
  • Blue-Green deployments are possible with Multi-swarm cluster
  • Multiple backend pools to azure load balancer
  • Each swarm cluster will be of same capacity 
  • Auto-scaling of services (increasing replicas) can be done by polling prometheus data (cadvisor metrics) 

Logging

  • Kibana will be moved to an individual VM
  • Service label will be changed to identify the log
  • Logs data from Multiple swarm clusters will be pushed to Log ElasticSearch Cluster
  • Log aggregation for KP and DP services

Monitoring

  • Upgrade Prometheus version to support hierarchical federation and snapshots
  • Enable hierarchical federation
  • Configure grafana dashboard to view metrics from each swarm cluster

CDN

Load Testing

  • A production replica infrastructure would be created (without production data) with multi-swarm and other scaled up components.
  • Various type of load testing would be performed on this infrastructure to test the capacity and stability
  • Once the data is validated and meets our standards, the same would be planned for production.

Related content

Sunbird Ed Appliancification
Sunbird Ed Appliancification
More like this
Core
Core
More like this
Basic Info
Basic Info
More like this
Core Services
Core Services
More like this
About Sunbird and Pre-requisite
About Sunbird and Pre-requisite
More like this
Running Sunbird the new way
Running Sunbird the new way
More like this