Image by janeb13 at https://pixabay.com/en/europa-sailboat-fs-senator-brockes-1185800/
These are notes on architecting large scale systems. (Is architecting even a word?) The notes came out of a series of recent conversations with trusted colleagues.
Large software systems have several axes to manage
- amount of requests -> millions of requests per second
- amount of data -> millions of data records to process
- amount of code -> millions of lines of code to organize
- amount of staff -> millions of communication channels
Monitoring
- is the system healthy?
- keep it healthy
Monitoring > Logging
- log everything
- keep logs for about a year
- do not log confidential data such as passwords
- include 1. correlation id and 2. the entire request/response payload
- structured logging is a nice-to-have
Monitoring > Notifications
- monitor every important part of the system
- ping employees via phone, email, text
- have clear plans of action; e.g. restart the service
- Kubernetes has some out-of-the-box monitoring: live, healthy.
Deployment
- who can deploy?
- the machine needs approval from two people:
- ... one person (e.g. a developer) starts a deployment,
- ... another person must approve the deployment,
- ... the machine does the actual deployment via a script.
- this is more secure and prevents accidents (e.g. deleted the database, oops)
Scaling
- scale out is the trend
- it is about load balancing
- compared to one large machine, multiple small machines are
- ... less expensive
- ... more flexible
- ... more resilient (w/ redundancies)
Distributed Transactions
HasStarted HasCompleted
database1.someTable 1 1
database2.someTable 1 0 // problem!
database3.someTable 1 1
- for each distributed database table involved in a distributed transaction,
- ... store the transaction state: HasStarted, HasCompleted...
- also include a supervisor service to check for transaction integrity
- --
- some transaction will fail
- fail fast and on purpose:
- ... if something in the system fails, shoot the flare,
- ... sound the alarm immediately, do not keep processing
Event Sourcing
- do not store state
- rather, record every event that would otherwise change state,
- and determine the state at a given time by replaying each event.
Redis
- supports clustering for a reliable, distributed cache
- multiple services read/write to a single distributed cache
- more than just a cache
- ... supports queue and bus features
- ... pub-sub, in which services subscribe to a key
- ... consistent index increment operations (plus one)
- limitation: only supported serialized data
Lots of Data - Paging, Filtering
Back 1 2 3 4 5 6 7 8 9 10 .. 11,320 Next // no!
Back 1 2 3 4 5 6 7 8 9 10 Next // yes
- exposing the total number of available pages in the UI is expensive,
- because counting the total number of records entails scanning the entire data store,
- and it's non-trivial to cache the count on a dynamic query
- a better alternative is to expose a limited number of pages with back/next
- on the other hand, filtering is cheap, especially with DB indexing
Message Queue
- we mostly read them
- uses a pull not a push model
- usually support message persistence after something has handled the message
Message Bus
- this is pub/sub (publish/subscribe)
- multiple listeners/subscribers
- usually do not support message persistence
- compared to object-oriented event handling,
- ... publishing to a message bus can be harder to debug at runtime,
- ... because of difficultly finding the message handler(s) and stepping into code;
- ... static analysis does not support "find all message bus subscribers"
Managing Thousands of Lines of Code
- reduce coupling (cliched, but true)
- define clean lines among functional areas,
- even though we own the entire code base,
- ... break code out into libraries;
- ... pretend we are creating a set of public APIs
- ... focus on opaque APIs that hide their implementation details.
- in a multi-layered/multi-tiered architecture,
- ... avoid depending on services/libraries that are more than one layer away,
- ... strive for dependency structures that look like linked lists
Traffic Localization, Distributed Systems
- direct the user to the closest server
- e.g Azure Traffic Manager does this; CloudFlare (probably) provides it too
- when the request arrives
- ... ping available servers
- ... look at the latency
- ... direct the request to the correct server
- this is basically a performance enhancing DNS service
Challenge: region specific transactions
- e.g. payment processing
- ... payments in Australia
- ... payments in United Kingdom
- ... payments in United States
- localize the algorithms for payment process,
- scale payment servers based on regional demand,
- reuse shopping cart and invoicing systems for all regions,
- doing this is is harder when payment/cart/invoicing are coupled
- one viable approach: break out a service per payment region
MonoRepos vs MicroRepos
- the mono repositories require more investment in tooling
- without that, it's too easy to do stupid things in a mono repository
Politics in Large Organizations
- as the system grows, change becomes more difficult,
- the corporate politics can become reactionary,
- the code base's technical debt can become unwieldy and brittle,
- focus on making small, impactful changes,
- large systems are sobering - it's like steering an enormous ship
- mitigation:
- avoid coupling from the start
- the harder it is to introduce coupling, the less likely people will do it
- so, make coupling painful and more difficult:
- separate repositories
- code analysis
- dependency structure tests
- code inspection
- one choice is among:
- an above average system that prevents bad patterns
- an above average team that prevents bad patterns
- a mix of both is probably the best