The biggest challenges in distributed systems

6 min readMay 12, 2021

Distributed Systems is a not a simple concept based on software architectures (especially micro services). Physically, a distributed system is an ensemble of physical machines that communicate over network links. In other words, a distributed system is composed of software processes that communicate via IPC mechanisms and are hosted on machines. If you focus only on the implementation then you need to change your perspective a little bit more. it wouldn’t be wrong to say like “a distributed system is a set of loosely-coupled components that can be deployed and scaled independently called services”

In fact, a distributed system actually tries to deal with five different, unique and cruel challenges.

Communication
Coordination
Scalability
Resiliency
Maintenance

Communication

Our nodes, machines or services need to communicate over the network with each other.

First of all, you must know that how OSI works. In fact, the OSI model does not perform any functions in the networking process. It is a conceptual framework so you can better understand the complex interactions that are happening.

Network protocols are arranged in a stack, where each layer builds on the abstraction provided by the layer below, and lower layers are closer to the hardware.

Models / Strategies

If you want to build a distributed systems, you must build reliable and secure communication model. Also, you must consider your communication style between your clients and your services. You should decide your strategy consider your own business. You can use direct or indirect (or both) communication model(s).

API

Your services exposes their own operations to its consumers via a set of interface implemented by its business logic. Your clients can not access these operations directly. You must translate your messages received from IPC mechanisms to interface calls.

Choose your IPC technologies carefully.

Coordination

You need to define precisely what can and can’t happen in a distributed systems. On the other hand, You should build a system model that is encodes expectation about the behavior of processes, communication links, and timing. You have only nine model to build a distributed system. You must known every model to decide your own.

Communication Link Models

fair-loss link model : Channels delivers any message sent with non-zero probability (no network partitions)
reliable link model: Channels that deliver any message sent exactly once
authenticated-reliable link model: same as reliable link model, but additionally assumes that the receiver can authenticate sender.

Process failures Models

arbitrary-fault model: also known as “Byzantine” model, assumes that a process can deviate from its algorithm in arbitrary ways, leading to crashed or unexpected behavior due to bugs or malicious activity.
crash-recovery model: assumes that a process does not deviate from its algorithm, but can crash and restart at any time, losing its in-memory state.
crash-stop model: assumes that a process does not deviate from its algorithm, but if it crashes it never comes back on line.

Timing Models

synchronous model: assumes that sending a message or executing operation never takes over a certain amount of time. (unrealistic)
a-synchronous model: assumes that sending a message or executing an operation on a process can take an unbounded amount of time. (has a lot of problems)
partially synchronous model: assumes that the system behaves synchronously most of the time, but occasionally it can regress to an a-synchronous model. (practical)

Scalability

A scalable service or application can increase its capacity as its load increases.The simple way to do that is by scaling up and running the service or application on more expensive hardware, but that only brings you so for since the application will eventually reach a performance ceiling.

The alternative to scaling up is scaling out by distributing the load over multiple nodes. You can use scaling out with only three way; functional decomposition, partitioning and duplication.

it should not be forgotten that the performance of a distributed system represents how efficiently it handles load, and it’s generally measured with throughput and response time. As you know, Throughput is the number of operations processed per second, and response time is the total elapsed time for each request.

Resiliency

A distributed system is resilient when it can continue to do its job even when failures happen. Any failure that can happen will eventually occur. Every component of a system has a probability of failing. No matter how small that probability is, the more components there are, and the more operations the system performs, the higher the absolute number of failures becomes.

If the system is not resilient to failures, which only increase as the application scales out to handle more load, its availability will inevitably drop. The name of failure is not important. (Hardware failures, software crashes, memory leaks or whatever.) You must guarantee at least just two nines.

Most Common Causes of Failures

Single point of failure
Unreliable network
Slow processes
Unexpected load
Cascading failures

Recipes

Downstream resiliency : Timeout(s), Retry Mechanisms, Circuit Breaker
Upstream resiliency: Load shedding, Load leveling, Rate-Limiting, Bulkhead, Health Endpoints, Watch-dogs

Maintenance

Actually Maintenance is a more difficult problem than the others.

You must believe test pyramid first. The longer it takes to detect a bug, the more expensive it becomes to fix it. Testing is all about catching bugs as early as possible, allowing developers to change the implementation with confidence the existing functionality won’t break, increasing the speed of refactorings, shipping new features, and other changes.

Also, you should focus on continuous delivery and deployment. I do not digging CI/CD in this article as this topic is too long.

The other key point is continuous monitoring, it is primarily used to detect failures that impact users in production and trigger notifications(alerts) sent to human operators responsible for mitigating them. The other critical use-case for monitoring is to provide a high-level overview of the system’s health through dashboards.

p.s. do not use black-box monitoring. Also you must consider the following topics;

metrics: is a numeric representation of information measured over a time interval and represented as a time-series, like the number of requests handled by a service. High-cardinality metrics make it easy to slice and dice the data, and eliminate the instrumentation cost of manually creating a metric for each label combination.
service-level indicators: (SLI) is a metric that measures one aspect of the level of service provided by a service to its users, like the response time, error rate, or tps. SLIs are typically aggregated over a rolling time window and represented with a summary statistic, like average or percentile.
service-level objects: (SLO) defines a range of acceptable values for an SLI within which the service is considered to be in a healthy state. (also you can set expectations)
alerts
dashboards
observability: a distributed system is never healthy (100%) at any given time. So, You must need logs and traces to makes an hypothesis and tries to validate it.

In conclusion,

You can be sure that you will face the challenges that I mentioned the above when designing a distributed system. The depth and types of these problems and their solutions vary according to your approach, your business domain and your boundaries.

You can get more detailed information from the links below.

“http://lass.cs.umass.edu/~shenoy/courses/spring07/lectures/Lec06.pdf”

“http://lass.cs.umass.edu/~shenoy/courses/spring18/lectures/Lec08_notes.pdf”

“https://users.cs.northwestern.edu/~fabianb/classes/msit-p2p-w08/lectures/01-4-IPC.pdf”

“https://www.ida.liu.se/~TDDD25/lectures/lect3.pdf”

“https://people.kth.se/~johanmon/courses/id2201/lectures/coordination-I.pdf”

“https://cs.gmu.edu/~setia/cs571-F02/slides/lec11.pdf”

“http://www.inf.unibz.it/~nutt/Teaching/DSs0910/DSsSlides/7-coordinationAndAgreement-2.pdf”

“https://www.math.unipd.it/~abujari/fis1819/lecSlides/scalability.pdf”

“https://insights.sei.cmu.edu/blog/system-resilience-part-5-commonly-used-system-resilience-techniques/”

“https://www.researchgate.net/publication/260914470_Resilience_in_Large_Scale_Distributed_Systems”

“https://www.researchgate.net/publication/334374473_Distributed_Systems_Maximizing_Resilience”