Architecting Large Scale Software Systems

These are notes on architecting large scale systems. (Is architecting even a word?) The notes came out of a series of recent conversations with trusted colleagues.

Large software systems have several axes to manage

amount of requests -> millions of requests per second
amount of data -> millions of data records to process
amount of code -> millions of lines of code to organize
amount of staff -> millions of communication channels

Monitoring

is the system healthy?
keep it healthy

Monitoring > Logging

log everything
keep logs for about a year
do not log confidential data such as passwords
include 1. correlation id and 2. the entire request/response payload
structured logging is a nice-to-have

Monitoring > Notifications

monitor every important part of the system
ping employees via phone, email, text
have clear plans of action; e.g. restart the service
Kubernetes has some out-of-the-box monitoring: live, healthy.

Deployment

who can deploy?
the machine needs approval from two people:
... one person (e.g. a developer) starts a deployment,
... another person must approve the deployment,
... the machine does the actual deployment via a script.
this is more secure and prevents accidents (e.g. deleted the database, oops)

Scaling

scale out is the trend
it is about load balancing
compared to one large machine, multiple small machines are
... less expensive
... more flexible
... more resilient (w/ redundancies)

Distributed Transactions

                       HasStarted    HasCompleted

database1.someTable        1              1
database2.someTable        1              0 // problem!
database3.someTable        1              1

for each distributed database table involved in a distributed transaction,
... store the transaction state: HasStarted, HasCompleted...
also include a supervisor service to check for transaction integrity
--
some transaction will fail
fail fast and on purpose:
... if something in the system fails, shoot the flare,
... sound the alarm immediately, do not keep processing

Event Sourcing

do not store state
rather, record every event that would otherwise change state,
and determine the state at a given time by replaying each event.

Redis

supports clustering for a reliable, distributed cache
multiple services read/write to a single distributed cache
more than just a cache
... supports queue and bus features
... pub-sub, in which services subscribe to a key
... consistent index increment operations (plus one)
limitation: only supported serialized data

Lots of Data - Paging, Filtering

Back 1 2 3 4 5 6 7 8 9 10 .. 11,320 Next // no!
Back 1 2 3 4 5 6 7 8 9 10 Next // yes

exposing the total number of available pages in the UI is expensive,
because counting the total number of records entails scanning the entire data store,
and it's non-trivial to cache the count on a dynamic query
a better alternative is to expose a limited number of pages with back/next
on the other hand, filtering is cheap, especially with DB indexing

Message Queue

we mostly read them
uses a pull not a push model
usually support message persistence after something has handled the message

Message Bus

this is pub/sub (publish/subscribe)
multiple listeners/subscribers
usually do not support message persistence
compared to object-oriented event handling,
... publishing to a message bus can be harder to debug at runtime,
... because of difficultly finding the message handler(s) and stepping into code;
... static analysis does not support "find all message bus subscribers"

Managing Thousands of Lines of Code

reduce coupling (cliched, but true)
define clean lines among functional areas,
even though we own the entire code base,
... break code out into libraries;
... pretend we are creating a set of public APIs
... focus on opaque APIs that hide their implementation details.
in a multi-layered/multi-tiered architecture,
... avoid depending on services/libraries that are more than one layer away,
... strive for dependency structures that look like linked lists

Traffic Localization, Distributed Systems

direct the user to the closest server
e.g Azure Traffic Manager does this; CloudFlare (probably) provides it too
when the request arrives
... ping available servers
... look at the latency
... direct the request to the correct server
this is basically a performance enhancing DNS service

Challenge: region specific transactions

e.g. payment processing
... payments in Australia
... payments in United Kingdom
... payments in United States
localize the algorithms for payment process,
scale payment servers based on regional demand,
reuse shopping cart and invoicing systems for all regions,
doing this is is harder when payment/cart/invoicing are coupled
one viable approach: break out a service per payment region

MonoRepos vs MicroRepos

the mono repositories require more investment in tooling
without that, it's too easy to do stupid things in a mono repository

Politics in Large Organizations

as the system grows, change becomes more difficult,
the corporate politics can become reactionary,
the code base's technical debt can become unwieldy and brittle,
focus on making small, impactful changes,
large systems are sobering - it's like steering an enormous ship
mitigation:
- avoid coupling from the start
- the harder it is to introduce coupling, the less likely people will do it
- so, make coupling painful and more difficult:
  - separate repositories
  - code analysis
  - dependency structure tests
  - code inspection
- one choice is among:
  - an above average system that prevents bad patterns
  - an above average team that prevents bad patterns
  - a mix of both is probably the best