Distributed Systems


Overview

A distributed system is a network of computers that are working together towards a common goal, appearing to the end-user as a single computer. The computers are peers, working together to make the system more tolerant to failures and more available.

Many distributed systems today are used for data-intensive applications, where the quantity of data, its complexity, or the speed at which it changes would be a bottleneck for a single machine. Three ways to describe a distributed system are in terms of its reliability, scalability, and maintainability.

Reliability

Reliability means that your system will continue to work correctly even in the face of hardware faults, software faults, or human error. For a system to be reliable, it should:

  • Function as the user expects
  • Tolerate user mistakes or unexpected ways of using the software
  • Perform well enough for its use case under the expected load and data volume
  • Prevent unauthorized access and abuse

A reliable system should be fault-tolerant, meaning it anticipates faults and handles them. Faults can be anything from exceptions in your code to hardware failures. A fault is different from a failure, in which the system as a whole stops providing service. It is impossible to anticipate every fault, so it's usually best to focus on the faults that could lead to failures.
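
As a rough sketch of this idea, a client can anticipate a transient fault (here, a flaky network call) and handle it with retries and a degraded fallback so the fault never becomes a failure. The function names and failure rate below are invented for illustration:

    import random
    import time

    def fetch_profile(user_id):
        # Hypothetical remote call that sometimes raises a transient fault.
        if random.random() < 0.3:
            raise ConnectionError("node unreachable")
        return {"id": user_id, "name": "example"}

    def fetch_profile_tolerant(user_id, retries=3, backoff=0.1):
        # Anticipate the fault: retry with exponential backoff, then fall
        # back to a degraded default so the caller still gets an answer.
        for attempt in range(retries):
            try:
                return fetch_profile(user_id)
            except ConnectionError:
                time.sleep(backoff * 2 ** attempt)
        return {"id": user_id, "name": None}  # degraded, but not a failure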

Classes of Faults

  • Hardware Faults: These can typically be handled by introducing redundancy into the system (i.e., backup hard drives in a RAID configuration)
  • Systematic Errors: These can be hard to anticipate and may be correlated across nodes. An example is a bug in a shared dependency, such as the Linux kernel, that introduces a fault into every node running that version. These bugs often lie dormant for a long time until they're triggered by unusual circumstances.

How to Make Systems More Reliable

  • Use the Chaos Monkey approach of deliberately introducing faults into your system to see where it fails, then fix the weaknesses it exposes (a toy sketch follows this list).
  • Design systems that minimize the opportunity for human error, like providing a fully featured non-production sandbox environment for people to explore safely without affecting users.
  • Test thoroughly from unit tests to integration tests to manual tests.
  • Roll out new code gradually so that unexpected bugs only impact small numbers of users.
  • Make it very easy and quick to roll back changes when needed to recover faster.
  • Set up detailed and clear monitoring, such as performance metrics and error rates.
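
A toy version of the fault-injection idea from the first bullet, assuming a Python service; the decorator and failure rate are illustrative, not a real Chaos Monkey API:

    import functools
    import random

    def chaos(failure_rate=0.1):
        # Wrap a function so a fraction of calls fail on purpose,
        # forcing callers to exercise their error-handling paths.
        def decorator(func):
            @functools.wraps(func)
            def wrapper(*args, **kwargs):
                if random.random() < failure_rate:
                    raise RuntimeError(f"injected fault in {func.__name__}")
                return func(*args, **kwargs)
            return wrapper
        return decorator

    @chaos(failure_rate=0.2)
    def charge_card(amount):
        # Stand-in for a real operation; callers must tolerate the
        # injected RuntimeError just as they would a real fault.
        return f"charged {amount}"

Enable something like this only in a test or sandbox environment, and only once monitoring is in place to see what breaks.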

Scalability

As the system grows in data volume, traffic, or complexity, it should be able to cope with that growth. You can measure scalability in terms of:

  • Latency: The duration a request spends waiting to be handled
  • Response Time: What the client actually observes: the time to process the request plus network and queueing delays
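
Response times vary from one request to the next, so a single average hides the slow tail; percentiles describe what users actually experience. A minimal measurement sketch (the handler is a stand-in for a real request):

    import statistics
    import time

    def measure_response_times(handler, requests):
        # Time each request from the client's point of view.
        times = []
        for req in requests:
            start = time.perf_counter()
            handler(req)
            times.append(time.perf_counter() - start)
        return sorted(times)

    times = measure_response_times(lambda r: sum(range(10_000)), range(100))
    print("p50:", statistics.median(times))
    print("p95:", times[int(0.95 * len(times)) - 1])  # rough percentile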

Maintainability

Many engineers over time should be able to work productively on the system.

A maintainable system is:

  • Easy for the operations team to keep running
  • Easy for new engineers to understand
  • Easy for engineers to make changes to

Operations

An operations team working on a distributed system is responsible for:

  • Monitoring the health of the system and quickly restoring service if it is in a bad state (a minimal health-check sketch follows this list)
  • Tracking down the causes of system failures or degraded performance
  • Keeping software and platforms up to date, like security patches
  • Seeing how different systems affect each other
  • Anticipating future problems before they occur
  • Establishing good practices and tools for deployment, configuration management, and more
  • Performing complex maintenance tasks
  • Maintaining system security
  • Defining processes to make operations predictable and keep the production environment stable
  • Documenting and sharing knowledge about the system
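
For the first responsibility above, a common starting point is a health endpoint that monitoring systems can poll. A minimal sketch using only Python's standard library; the path and checks are illustrative:

    import json
    from http.server import BaseHTTPRequestHandler, HTTPServer

    def dependencies_ok():
        # Illustrative check; a real service would ping its database,
        # queues, and downstream services here.
        return True

    class HealthHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path != "/health":
                self.send_response(404)
                self.end_headers()
                return
            healthy = dependencies_ok()
            body = json.dumps({"status": "ok" if healthy else "degraded"})
            self.send_response(200 if healthy else 503)
            self.send_header("Content-Type", "application/json")
            self.end_headers()
            self.wfile.write(body.encode())

    if __name__ == "__main__":
        # Monitoring polls /health and alerts when it stops returning 200.
        HTTPServer(("", 8080), HealthHandler).serve_forever()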

Complexity

Complexity is accidental when it isn't inherent to the problem your software is trying to solve but arises from the implementation (Moseley and Marks). The ease with which you can modify a system is linked to how simple it is: simple systems are easier to understand and modify than complex ones. Symptoms of complexity in your system include:

  • Tight coupling of modules
  • Tangled dependencies
  • Inconsistent naming and terminology
  • Hacks to solve performance problems
  • Special cases to work around issues

You can deal with complexity by using abstractions that hide implementation details behind a clean, simple-to-understand interface.
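
As a small illustration, an interface can hide which storage backend a system uses; callers depend only on the interface, so the implementation can change without rippling through the codebase. The names here are invented for the example:

    from abc import ABC, abstractmethod

    class KeyValueStore(ABC):
        # The clean interface: callers only ever see get and put.
        @abstractmethod
        def get(self, key: str): ...

        @abstractmethod
        def put(self, key: str, value: str) -> None: ...

    class InMemoryStore(KeyValueStore):
        # One hidden implementation; it could be swapped for a disk-backed
        # or distributed store without touching any calling code.
        def __init__(self):
            self._data = {}

        def get(self, key):
            return self._data.get(key)

        def put(self, key, value):
            self._data[key] = value

    def remember_login(store: KeyValueStore, user: str) -> None:
        # Caller code depends on the abstraction, not the implementation.
        store.put(f"last_login:{user}", "now")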
