Complexity
In most of the good design reviews I am a part of, a common question gets asked - how complex is this? Sometimes you will hear engineers debate about how simple or complex something is. Sometimes you will hear about trading off one type of complexity for another. But seldom do you hear what their definition of complexity is.
"Everything should be made as simple as possible, but no simpler." - Albert Einstein
Einstein puts us in a conundrum. How do we know when things are simple enough? Amazon has a leadership principle - Invent and Simplify. But it does not put limits or constraints on what simple means. How can we quantify simple? Amazon attempts to define complex problems in their promotion criteria. Dumbed down, it's essentially: complex problems have visible constraints (compliance/financial/performance), and significantly complex problems have substantial constraints that conflict. Hmm. Quantifying simplicity or complexity is a complex task by definition.
Before diving too much deeper, let's look at a few examples of what I consider complex systems I have worked on, and some of the tradeoffs we had to make.
Example 1: A team is evaluating the benefits of a monolith versus microservices. On one hand, putting all of our code on one machine is not great. On the other hand, having too many moving pieces - say, every function coded as its own Lambda service - is similarly not great. Somewhere in that spectrum is the sweet spot. I have worked on a handful of monoliths that we had to split up for various reasons. Some because we could not physically buy larger servers. Others because deploying to a fleet of multiple thousand servers causes problems - you can no longer blue/green deploy, so you need to roll changes out incrementally - but now your software needs to gracefully handle multiple versions of itself running concurrently. Having hundreds of containers to monitor, alarm on, and scale is similarly challenging. At least in my experience - it's the scaling one that will get you.
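To make the "multiple versions running concurrently" problem concrete, here is a minimal sketch of a handler that tolerates two payload schemas during a rolling deploy. The field names and version numbers are hypothetical, not from any real system:

```python
def parse_event(payload: dict) -> dict:
    """Tolerate two concurrent schema versions during a rolling deploy.

    Hypothetical scenario: v1 of the service emitted a 'ts' field, v2
    renamed it to 'timestamp'. While both versions are live, every
    consumer must accept both shapes.
    """
    version = payload.get("version", 1)
    if version == 1:
        # Old writers: translate the legacy field name.
        return {"timestamp": payload["ts"], "body": payload["body"]}
    # New writers already use the current shape.
    return {"timestamp": payload["timestamp"], "body": payload["body"]}

old = parse_event({"ts": 100, "body": "hello"})
new = parse_event({"version": 2, "timestamp": 200, "body": "hello"})
```

This is the hidden tax of incremental rollouts: every schema change becomes a two-step migration instead of a single atomic switch.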
Example 2: A team is trying to measure accuracy. Imagine you have a system that you are trying to replicate, but for various reasons (scale/cost/performance) you do not want to replicate everything. How do you make sure that you replicated everything you wanted to? Well, that's fairly simple with sequence markers. How do you make sure that you replicated it correctly? Hmm. At first we relied on strict testing, which was great while the problem space was small. Then we relied on sampling, but how can you measure accuracy with sampling? We looked into other approaches as well, such as formal reasoning, static analysis, and invariants. One really funny thing happened with sampling - we commonly found bugs in the code we used to measure accuracy - so does that mean there could be a bug on both sides that cancels out? We had a complex system to begin with - and a really complex system once we tried to measure it.
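The sampling approach can be sketched in a few lines. This is a hypothetical illustration, not the team's actual tooling - `source_get` and `replica_get` stand in for lookups against the real system and its replica:

```python
import random

def sample_accuracy(source_get, replica_get, keys, sample_size, seed=0):
    """Estimate replication accuracy from a random sample of keys.

    Any record where the replica disagrees with the source counts
    against accuracy. A fixed seed makes a run reproducible for debugging.
    """
    rng = random.Random(seed)
    sample = rng.sample(list(keys), min(sample_size, len(keys)))
    matches = sum(1 for k in sample if source_get(k) == replica_get(k))
    return matches / len(sample)

# Hypothetical in-memory stores standing in for the two systems.
source = {i: i * 2 for i in range(1000)}
replica = dict(source)
replica[7] = -1  # one silently corrupted record
acc = sample_accuracy(source.get, replica.get, source.keys(), 100)
```

Note the trap described above: the comparison code itself can be wrong, and a bug in `sample_accuracy` can mask a bug in the replica.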
Example 3: A team is trying to launch a new product to a customer. One option is to use existing systems that were not really built to support this use case, but you can add this use case into those systems. The difficulty is you need a month or two of work in each of a dozen systems - a total of, say, 18 months of engineering effort split across those dozen teams. Alternatively, you can bypass all of those systems, at the cost of recreating some of the functionality that exists in them. This alternative only takes ~9 months because your requirements do not need all of the bells and whistles those components offer. Which option do you choose? If effort were not a factor, would that change your opinion?
I have not found a compelling source that attempts to outline the different types of complexity and their tradeoffs. Here is my attempt at a new standard.
Systemic complexity. A measure of how complex a given system is from the outside. Service Oriented Architecture dictates that a system is a self-contained black box with a contract. A good measure of systemic complexity is how simple your API contract is. Can users understand it without reading tons of documentation? Do users need to interface with dozens of APIs to perform simple things? Do users need to handle strange errors and edge conditions from your API? Do users need to interact with other APIs to understand things about your API? Some refer to the converse as encapsulated complexity - because you are deciding how best to encapsulate your complexity away from a consumer.
Architectural complexity. A measure of how complex a given system is from the inside. If your service is self-contained and does not rely on any external dependencies, then it is simple. If your service relies on hundreds of dependencies, then it is complex. If you are using a language like JavaScript it is really easy to npm install that new package - but it comes with a price: what are its dependencies? In the distributed systems world, it can get even more complex. Sure, you can integrate with Service A, but Service A uses Service B and Service C - so your system just took on (potentially unknown) new dependencies. And it gets worse when Service B has its own dependencies.
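The Service A / Service B / Service C chain is really a graph reachability problem: your true dependency set is everything transitively reachable from your direct integrations. A minimal sketch, with a made-up dependency graph:

```python
from collections import deque

# Hypothetical dependency graph: integrating with service_a quietly
# pulls in everything reachable from it.
DEPS = {
    "service_a": ["service_b", "service_c"],
    "service_b": ["service_d"],
    "service_c": [],
    "service_d": [],
}

def transitive_deps(root, graph):
    """Return every dependency reachable from root (breadth-first walk)."""
    seen, queue = set(), deque(graph.get(root, []))
    while queue:
        dep = queue.popleft()
        if dep not in seen:
            seen.add(dep)
            queue.extend(graph.get(dep, []))
    return seen

deps = transitive_deps("service_a", DEPS)
```

You asked for Service A; you also got service_d, which appears nowhere in your design doc. The same walk applies to npm packages.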
Functional complexity. A measure of how easy it is to explain what your system does. Can you concretely define your system in a sentence or two? Can you draw a nice flowchart that shows where your system fits in? Or is your system more like a conglomerate - where you do a little bit of everything? Well, we sell books, but we also sell book accessories - and, well, actually any physical good - we also sell some digital goods, and we have a streaming service for some of those digital goods, oh - and we sell software as a service, and ...
This is not some magical CAP theorem where you can only pick two. But it is a lens to help balance some tradeoffs. My recommendation for an engineering team is to balance these three forms of complexity. You need to weigh how complex something is from the outside as well as from the inside. A really easy contract from the outside that requires an impossible algorithm that magically knows everything is not reasonable. Conversely, keeping your system simple and forcing that complexity onto your customers is similarly not reasonable. Taking on new requirements that are close to your core offerings should not introduce too much architectural or systemic complexity; but taking on a new requirement outside your area will probably increase all three. Find a healthy balance and push the complexity off where you can. Need to design a leader election algorithm? Don't. It's hard. Harder than you think. Use an atomic lock with a distributed system like DynamoDB or Redis, and push that complexity onto someone that solved that problem before you.
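The atomic-lock trick rests on one primitive: an atomic set-if-absent with an expiry (in Redis, `SET key value NX EX ttl`; in DynamoDB, a conditional write). A minimal sketch of that semantics with an in-memory stand-in, since the point is the shape of the operation rather than any particular store:

```python
import time

class LockStore:
    """In-memory stand-in for an atomic set-if-absent with expiry.

    In Redis this whole check-and-set is the single atomic command
    SET key value NX EX ttl - that atomicity is what makes the
    election safe without writing your own consensus algorithm.
    """
    def __init__(self):
        self._data = {}  # key -> (owner, expires_at)

    def set_nx_ex(self, key, owner, ttl, now=None):
        now = time.monotonic() if now is None else now
        current = self._data.get(key)
        if current is None or current[1] <= now:  # absent or expired
            self._data[key] = (owner, now + ttl)
            return True  # caller is now the leader until the TTL lapses
        return False  # someone else already holds leadership

store = LockStore()
# Two candidates race for leadership; only the first acquire succeeds.
a_won = store.set_nx_ex("leader", "node-a", ttl=30, now=0.0)
b_won = store.set_nx_ex("leader", "node-b", ttl=30, now=1.0)
# After the TTL lapses (node-a presumably died), leadership moves on.
b_retry = store.set_nx_ex("leader", "node-b", ttl=30, now=31.0)
```

The production subtleties - clock skew, renewing the TTL, fencing tokens so a paused ex-leader cannot act on stale leadership - are exactly the complexity you are paying Redis or DynamoDB (and the literature behind them) to own.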
One example of something done right is the internet (web2). The internet is very complex on many different levels, but it excels in all three forms of complexity. Using the internet is relatively simple - buy a modem, connect to an ISP - and you are done. Under the hood there is a lot more going on, from ISPs, to fiber, to routing, to timing, etc. - but the interface is simple. The inside of the internet seems relatively simple too - DNS servers pointing at actual servers. Under the hood it is more complex - those servers are often not physical servers but CDNs and load balancers - yet it still feels simple. It's also really easy to explain - it connects everything together. This was not always the case, but my guess is that the internet excelled because it simplified all forms of complexity.
https://xkcd.com/927/ https://en.wikipedia.org/wiki/Turtles_all_the_way_down