I was introduced to distributed systems through my operating systems course, and I found it fascinating how the power of multiple computers can be harnessed to carry out large-scale computations and to store data across many servers. Through this multi-part blog series, I hope to share some of what I have learned about the topic in a digestible form.
So what are distributed systems?
Distributed systems are defined as “a collection of autonomous computing elements that appears to its users as a single coherent system.” [1] At a high level, large-scale computation can be classified into two approaches: centralized or distributed.
Distributed systems do not typically share memory or clocks (some exceptions exist), making it essential for their components to communicate over a network. With improvements in network speeds and architectures, distributed computing has become mainstream in large businesses and organizations, where high uptime and fast computation are crucial to the success of their products and services.
Why are distributed systems important?
Distributed systems power almost all of the websites and services we use today. Whether you are placing an order on Amazon or scrolling through a social media feed, distributed systems provide the infrastructure that lets thousands of users access a service simultaneously without bringing it down.
What are some concepts that distributed systems rely on?
Communication via Remote Procedure Calls
Distributed systems rely on communication between computers to accomplish a given task. For multiple computers to communicate reliably, there needs to be a standard communication protocol; without one, every programmer would have to spend time writing bespoke code just to package data and move it between different machines. Out of this need for an abstraction for running code on other machines, the remote procedure call (RPC) came into the mainstream. The goal of RPC is to make executing code on a remote machine as simple as calling a local function. [4]
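To make that concrete, here is a minimal sketch using Python's built-in xmlrpc module; the add function, host, and port are arbitrary choices for illustration. Notice that the remote call on the last line reads exactly like a local function call:

```python
import threading
import xmlrpc.client
from xmlrpc.server import SimpleXMLRPCServer

def add(a, b):
    return a + b

# Server side: register an ordinary function and serve it over RPC.
# (Run in a background thread here only to keep the example one file.)
server = SimpleXMLRPCServer(("localhost", 8000), logRequests=False)
server.register_function(add, "add")
threading.Thread(target=server.serve_forever, daemon=True).start()

# Client side: the proxy forwards the call over the network.
proxy = xmlrpc.client.ServerProxy("http://localhost:8000/")
print(proxy.add(2, 3))  # executes on the server, prints 5
```

The proxy object here is playing the part of a client stub, which the next paragraphs describe.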
There are two components of RPC: a stub generator and a runtime library. The stub generator takes as input the methods to be executed remotely and generates stub code for the client and the server. It also generates the code that communicates across the network and the infrastructure for packing and unpacking the messages sent over it.
The runtime library is the collection of routines and services that handle the network communication supporting the RPC mechanism [4]. Here is a brief overview of how communication between client and server works through RPC:
Figure 1: Representation of Remote Procedure Call
The client makes a local procedure call to the client stub, passing it the call's parameters. The stub packages these parameters into a message; this packaging process is called marshalling (see the sketch after these steps).
The stub then makes a system call to send the message to the server machine. The server's operating system receives the message and hands it to the server stub, where the parameters sent by the client are unpacked; this process is called unmarshalling.
The server then performs the computation or processing the client requested and packages the result just as described above, sending it back to the client along the same path.
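Below is a rough illustration of what the stubs do, using JSON as a stand-in wire format and skipping the network hop entirely; real RPC frameworks generate this code and use their own encodings, so treat it as a sketch only:

```python
import json

def marshal(method, params):
    # Client stub: package the call into bytes suitable for the network.
    return json.dumps({"method": method, "params": params}).encode()

def unmarshal(message):
    # Server stub: unpack the bytes back into a method name and parameters.
    request = json.loads(message.decode())
    return request["method"], request["params"]

# Simulate one round trip (no real network, purely for illustration).
wire_bytes = marshal("add", [2, 3])
method, params = unmarshal(wire_bytes)
handlers = {"add": lambda a, b: a + b}  # the server's dispatch table
result = handlers[method](*params)
reply = marshal("result", result)       # the result is marshalled the same way
print(json.loads(reply.decode()))       # {'method': 'result', 'params': 5}
```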
Fault Tolerance
Fault tolerance is defined as the ability of a system to continue operating without interruption despite the failure of one or more of its components [7]. For a distributed system to be fault tolerant, it must first detect which component has failed and then have a plan for recovering from that failure.
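Detection itself is commonly implemented with heartbeats: each component periodically reports that it is alive, and silence beyond a timeout marks it as suspect. A minimal sketch, with made-up node names and an arbitrary timeout:

```python
import time

HEARTBEAT_TIMEOUT = 3.0  # seconds; an assumed threshold for this sketch

# Timestamps of the last heartbeat received from each node (node-2 went
# quiet ten seconds ago, so it should be flagged).
last_heartbeat = {"node-1": time.time(), "node-2": time.time() - 10}

def suspected_failures(now):
    # Any node that has not been heard from recently is suspected failed.
    return [node for node, seen in last_heartbeat.items()
            if now - seen > HEARTBEAT_TIMEOUT]

print(suspected_failures(time.time()))  # ['node-2']
```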
For systems to be fault tolerant, they also need high reliability and availability. But how do systems achieve this? One way of maintaining high reliability is redundancy, which can be defined as the duplication of critical components like data or services so that another component is ready to take over in case of a failure [7]. A redundant setup keeps backup infrastructure resources available to take over from checkpoints whenever a main component fails.
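As a toy illustration, the sketch below reads from a list of hypothetical replicas in order and fails over when one is down; the replica names and the fetch_from stand-in are invented for this example:

```python
REPLICAS = ["replica-a", "replica-b", "replica-c"]  # redundant copies
DOWN = {"replica-a"}  # pretend this replica has crashed

def fetch_from(replica):
    # Stand-in for a real network read from one redundant copy.
    if replica in DOWN:
        raise ConnectionError(f"{replica} is down")
    return f"value served by {replica}"

def fetch_with_failover(replicas):
    # Try each redundant copy in turn; give up only if all are down.
    for replica in replicas:
        try:
            return fetch_from(replica)
        except ConnectionError:
            continue  # this component failed, so fail over to the next
    raise RuntimeError("all replicas are down")

print(fetch_with_failover(REPLICAS))  # value served by replica-b
```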
The major types of redundancy are hardware, data, time, and software redundancy. Hardware fault tolerance refers to the ability to continue computing despite component failures; different distributed systems tackle this problem differently, but they ground their designs in models of hardware reliability such as reliability functions, failure rates, and Markov models. [3] With data redundancy, multiple copies of the same data are maintained in different databases to ensure that no data is lost to corruption or human error.
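Data redundancy can be sketched in the same spirit: every write goes to several stores, so a lost or corrupted copy can be recovered from the others. The in-memory dictionaries below stand in for separate databases:

```python
# Three in-memory "databases" standing in for separate database servers.
stores = [{}, {}, {}]

def replicated_write(key, value):
    # Write the same record to every copy so no single failure loses it.
    for store in stores:
        store[key] = value

def redundant_read(key):
    # Read from the first copy that still holds the record.
    for store in stores:
        if key in store:
            return store[key]
    raise KeyError(key)

replicated_write("order:42", {"item": "book", "qty": 1})
stores[0].clear()                  # simulate corruption of one database
print(redundant_read("order:42"))  # the record survives in another copy
```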
Scalability
Scalable systems can accommodate a larger number of users than originally planned without requiring significant modifications to how the service operates. [5] Systems can be made more scalable through replication and through optimization of individual components. A more formal definition of scalability is “a situation in which the throughput changes roughly in proportion to the change in the number of units or the size of inputs.” [6]
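One common way to scale through replication is to put identical workers behind a load balancer, so that adding workers raises throughput roughly in proportion, as in the formal definition above. The round-robin balancer below is an illustrative toy, not a production design:

```python
import itertools

class RoundRobinBalancer:
    """Spread requests evenly across identical replicated workers."""

    def __init__(self, workers):
        self._workers = itertools.cycle(workers)

    def handle(self, request):
        # Each request goes to the next worker in the rotation.
        worker = next(self._workers)
        return f"{worker} handled {request}"

balancer = RoundRobinBalancer(["worker-1", "worker-2", "worker-3"])
for i in range(5):
    print(balancer.handle(f"request-{i}"))
```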
The different types of scalability are load, space, space-time, and structural scalability. Load scalability is the ability of a system to function without excessive resource consumption or delay under light, moderate, or heavy loads. Space scalability means the system is efficient with its storage: the memory consumed by the items it stores should ideally grow sub-linearly with the number of items. [6]
Space-time scalability is similar to space scalability, with the added requirement that the system continue to execute its functions quickly even as the number of stored items grows by orders of magnitude. Structural scalability means that the system's own structure does not restrict its growth. For example, a country's phone numbers should not run out of digits; if they did, the entire numbering system would need a revamp, or a new system would have to coexist with the old one.
Conclusion
Distributed systems are not only about carrying out tasks across different computers but also about fulfilling characteristics like scalability, fault tolerance, coordination, and replication. An ideal distributed system performs well on all of those fronts, but as we all know, ideal behavior is hard to achieve in practice. An engineer designing a practical distributed system makes sure it checks a set of boxes: proper logging and checkpoints, replication of data, reducing load on the system, and scaling gracefully through periods of high traffic or usage. After covering all these fronts, the engineer can only really hope for the best. Much of the use of distributed systems is recent, concentrated in roughly the last three decades with the rise of the internet, so the field still involves a lot of trying new things and inventing solutions as you go.
References
[1] M. van Steen and A. S. Tanenbaum, “A brief introduction to distributed systems,” Computing, vol. 98, no. 10, pp. 967–1009, Oct. 2016, doi: 10.1007/s00607-016-0508-7.
[2] I. Gorton, “Introduction to Scalable Systems,” in Foundations of Scalable Systems. O'Reilly Media. Accessed: Dec. 17, 2023. [Online]. Available: https://learning.oreilly.com/library/view/foundations-of-scalable/9781098106058/ch01.html
[3] R. H. Arpaci-Dusseau and A. C. Arpaci-Dusseau, Operating Systems: Three Easy Pieces. Accessed: Nov. 11, 2023. [Online]. Available: https://pages.cs.wisc.edu/~remzi/OSTEP/
[4] P. Krzyzanowski, “Remote Procedure Calls.” Accessed: Dec. 18, 2023. [Online]. Available: https://people.cs.rutgers.edu/~pxk/417/notes/rpc.html
[5] “Scalable definition by The Linux Information Project.” Accessed: Dec. 18, 2023. [Online]. Available: https://www.linfo.org/scalable.html
[6] A. B. Bondi, “Characteristics of scalability and their impact on performance,” in Proceedings of the 2nd International Workshop on Software and Performance, Ottawa, ON, Canada: ACM, Sep. 2000, pp. 195–203, doi: 10.1145/350391.350432.
[7] M. van Steen and A. S. Tanenbaum, Distributed Systems, 4th ed. distributed-systems.net, 2023.