Scaling Sunbird Infra

Why are we scaling?

  • To serve incoming requests quickly
  • Ability to handle increase in requests compared to current number of requests.
  • Avoid single point of failure and ensure availability of services 99.9% time

How are we scaling?

  • We are scaling horizontally wherever possible. This is the main area of focus.
  • In components where horizontal scaling is not possible, we would scale it vertically

Horizontal Scale components

  • Docker swarm
  • Elasticsearch cluster
  • Consumption and creations API's
    • Development team will make a change to send / receive such requests to / from elasticsearch or Redis
    • This would reduce the stress on the database for such API's and also the response would be faster
  • YARN cluster   
  • Cassandra cluster
  • Learning and Search servers

Vertical Scale components

  • Neo4j
  • Kafka
  • Influx DB

Multi-swarm Cluster for Resiliency

Present Arch:

  • Single docker swarm cluster with multiple managers and nodes to run up to 256 containers in a overlay network. We have approx. 150 containers running to support 600 tps load

Cons:

  • Network hiccups on single swarm cluster brings down site for while
  • No automation for Infra for scaling
  • Service downtime during upgrade


Future Arch:

  • Multi-swarm cluster attaching to load-balancer
  • Ability to ad/remove cluster without down-time 
  • Blue-Green deployments are possible with Multi-swarm cluster
  • Multiple backend pools to azure load balancer
  • Each swarm cluster will be of same capacity 
  • Auto-scaling of services (increasing replicas) can be done by polling prometheus data (cadvisor metrics) 

Logging

  • Kibana will be moved to an individual VM
  • Service label will be changed to identify the log
  • Logs data from Multiple swarm clusters will be pushed to Log ElasticSearch Cluster
  • Log aggregation for KP and DP services

Monitoring

  • Upgrade Prometheus version to support hierarchical federation and snapshots
  • Enable hierarchical federation
  • Configure grafana dashboard to view metrics from each swarm cluster

CDN

Load Testing

  • A production replica infrastructure would be created (without production data) with multi-swarm and other scaled up components.
  • Various type of load testing would be performed on this infrastructure to test the capacity and stability
  • Once the data is validated and meets our standards, the same would be planned for production.