End of November Rambling
November 27th, 2021, by Jeffrey M. Barber

I’m in a wandering mood with my writing. This document is less informational and more of a conversation with myself about designing toward an actual milestone… Getting unstuck from the mud is hard, and my recent burst of productivity is due to writing this out.

The core question that emerges is: what am I optimizing for? Am I optimizing to build out my own product? Am I optimizing to make this software integrate easily with existing infrastructure? Am I optimizing for my own joyful coding? Am I optimizing for my timeline or your timeline? These questions are worth pondering, and I think I need to optimize for finishing something that I can use. OK…

Adama is at a phase where I need to start buying some stuff since I probably shouldn’t invent yet another streaming protocol, among other things (I want to, but I need to focus). I love to invent, start new ideas to make marginal improvements everywhere, and just do things for the lolz, but finishing is the hard part that requires discipline and focus. My mission here is to figure out the gap of things to build so I can simply use Adama for the next thing I build.

The first issue is how I want to handle persistence of the state backing a game/app. Of course, I want to build a Raft-based logger myself, but I need to focus. Buying now requires either tight coupling or adding an abstraction layer. For instance, if I just made it work with DynamoDB, Aurora, RDS, or S3, then I could move much faster thanks to AWS. It’s tempting, but that tight a coupling feels wrong. It makes me feel icky. Ironically, just using AWS may be the right thing to move on quickly. However, an interesting question is how to make it easy to bring this into existing infrastructure.

I think we can all agree that tight coupling is not ideal, so this will be part of how I structure the core of Adama. Adama will simply use interfaces for all external needs. Scaling up will require the use of network resources, so I should ensure that all interfaces are asynchronous by default. As I applied this transformation to the core of the document, it revealed tricky issues with respect to threading and what the guarantees are. This is an area to focus on since threads are hard.
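To make the shape of this concrete, here is a minimal sketch of what an asynchronous, callback-driven interface could look like; the names (DocumentStore, Callback, and the methods) are illustrative inventions of mine rather than Adama’s actual API.

```java
// A hypothetical asynchronous persistence interface; these names sketch the
// idea and are not Adama's real API.
public interface DocumentStore {

  /** Every operation reports back through a callback so no caller blocks a thread. */
  interface Callback<T> {
    void success(T value);
    void failure(Exception reason);
  }

  /** Load the latest state of a document by key. */
  void get(String key, Callback<String> callback);

  /** Append a change (delta) to a document's log. */
  void append(String key, String delta, Callback<Void> callback);
}
```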

Continuing the focus on durability and persistence, I could create various implementations of the interface using the raw disk (which is partially implemented… poorly), sqlite, an RDBMS like MySQL or PostgreSQL, or some cloud solution from AWS or Azure. There are so many options! While the Java-level interface ensures isolation of Adama’s core from any particular implementation, the question then is what those interfaces look like and how developers should implement them for their situation.
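As a toy illustration of what implementing such an interface involves, here is an in-memory version of the hypothetical DocumentStore sketched above; a real integrator would replace the map with sqlite, MySQL, or a cloud store.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// A toy, in-memory implementation of the hypothetical DocumentStore; it only
// shows the shape an integrator would fill in with a real store.
public class InMemoryDocumentStore implements DocumentStore {
  private final Map<String, StringBuilder> documents = new ConcurrentHashMap<>();
  private final ExecutorService executor = Executors.newSingleThreadExecutor();

  @Override
  public void get(String key, Callback<String> callback) {
    executor.execute(() -> {
      StringBuilder log = documents.get(key);
      if (log == null) {
        callback.failure(new Exception("document not found: " + key));
      } else {
        callback.success(log.toString());
      }
    });
  }

  @Override
  public void append(String key, String delta, Callback<Void> callback) {
    executor.execute(() -> {
      documents.computeIfAbsent(key, k -> new StringBuilder()).append(delta);
      callback.success(null);
    });
  }
}
```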

A thought I had was: what if I simply used gRPC to implement the interface and then let developers stand up endpoints to handle it? Conceptually, this feels simple, especially since I’m already handling asynchronous failures. However, the concept of an “endpoint” is deceptively simple. Admittedly, this is an inevitable destination for many companies since micro-service orchestration is a trend now, and it does have benefits around how we organize people to build software; containerization is here to stay. However, it is not clear which platform will become a standard, if such a standard ever emerges.

Since I’m not looking to add to my resume, I am going to ignore the containers and focus on JVM fundamentals.

While containers are an inevitable destination for many companies, this suddenly pulls way too much into scope. For example, let’s just consider what happens when you split the failure domain of a single process into two. Your error rate now has the potential to double, and the only way to handle that is to scale both sides of the split. Chances are, one side is stateless and the other side is stateful. Fortunately, the stateless side can scale up a great deal, whilst the stateful side requires sticky routing along with replication. With luck, you can simply buy the stateful side from the cloud or some existing package. Sadly, buying is a commitment, which brings us back to the original reason why I’m thinking about all this shit in the first place. On this path, it feels like there is marginal technical value in separating my service from the actual persistence tier with a stateless proxy.

The answer for persistence is therefore to keep the interface open and not try to build an all-encompassing service. Instead, I need to layer the packages such that an integrator (like yourself?) can easily bring in the core and integrate it into their ecosystem with minimum fuss. That is, the core is mostly monolithic in design as a library with none of the networking or operational burden. This also makes it useful outside of a service and within an application, so that is rather neat as well.
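In that layered world, bringing the core into an existing ecosystem could look roughly like the sketch below; AdamaCore is a hypothetical stand-in name for the library entry point, not an actual class.

```java
// A rough sketch of how an integrator might wire the library core into their
// own stack; AdamaCore is a hypothetical stand-in for the library entry point.
public class MyServiceBootstrap {
  public static void main(String[] args) {
    // Pick whichever DocumentStore implementation fits the environment:
    // in-memory for tests, MySQL or an AWS-backed store in production.
    DocumentStore store = new InMemoryDocumentStore();

    // The core only sees interfaces, so swapping implementations is a
    // one-line change with no networking or operational baggage attached.
    AdamaCore core = new AdamaCore(store /*, metrics, authenticator, ... */);
    core.start();
  }
}
```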

All of this thinking to answer a simple question. I’ll admit that I have a bit of analysis paralysis in deciding a direction. As I seek to build alone for a few years, a bit of solid thinking can help me maintain a proper focus. The same analysis done for persistence also applies to other needs like emitting metrics or authenticating requests. I build the core as this agnostic thing, and then I build a sample integration using AWS to minimize my execution and opportunity cost at the price of a questionable operational cost.

At this point, I could build a durable and operable service node using AWS and move on to the next issue, which is handling requests directly from browsers. This boils down to building or buying some kind of WebSocket solution. Either I bake a WebSocket server into my service (like it is now), or I use a gateway or message broker, or I build a separate proxy. In complete honesty, the available market of things to buy feels scant and incompatible with my goals. Interestingly enough, this feels like an opportunity.

The big issue making buying hard is finding a solution with reasonable guarantees around stickiness. That is, Adama is a stateful service, and the challenge now is matching requests within a WebSocket connection to the appropriate service host. There are other complications as well, but this is the biggest one at the moment since the emerging application protocol within the WebSocket connection multiplexes many Adama streams. OK, fair enough, I’ll just build a WebSocket proxy to sit in front of Adama. This has the nice benefit of reducing the memory pressure on Adama, as browser sockets are not exactly cheap. Similar to the Adama service, I’ll make this new web proxy driven by a variety of interfaces which people can leverage later for their particular needs.
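For a sense of what that multiplexing means, each frame on the socket could carry a stream id so the proxy and Adama know which logical stream it belongs to; the field names below are guesses for illustration, not the actual wire protocol.

```java
// A sketch of a multiplexed frame on the WebSocket connection; the field
// names are hypothetical and not the actual wire protocol.
public final class Frame {
  public final int streamId;    // which Adama stream this frame belongs to
  public final String kind;     // e.g. "connect", "send", "data", "disconnect"
  public final String payload;  // JSON body for that stream

  public Frame(int streamId, String kind, String payload) {
    this.streamId = streamId;
    this.kind = kind;
    this.payload = payload;
  }
}
```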

The remaining challenge is the matchmaking between this WebSocket fleet and the Adama fleet. If I scale the Adama fleet vertically with a single machine, then the matchmaking is a piece of cake. I can simply leverage gRPC with streaming capabilities and have a multitude of clients coalescing many WebSocket connections into a few gRPC streams. This also has the benefit of offloading authentication and access control, and absorbing denial of service attacks. There are a bunch of features that could be built into this proxy, but the focus should be on simply building it along with a simple authentication mechanism, as I will be the only consumer… for now.
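The coalescing idea, in sketch form: the proxy tags each browser’s traffic with an id and funnels everything down one shared upstream stream. The tagged-string message format and the observer types are simplifications for illustration, not the real protocol.

```java
import io.grpc.stub.StreamObserver;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// A sketch of coalescing many browser connections onto one shared upstream
// gRPC stream; the tagged-string message format is purely illustrative.
public class StreamCoalescer {
  private final AtomicLong idGenerator = new AtomicLong(0);
  private final Map<Long, StreamObserver<String>> browsers = new ConcurrentHashMap<>();
  private final StreamObserver<String> upstream; // one shared stream to an Adama host

  public StreamCoalescer(StreamObserver<String> upstream) {
    this.upstream = upstream;
  }

  /** Register a browser connection and return the id used to tag its traffic. */
  public long attach(StreamObserver<String> browser) {
    long id = idGenerator.incrementAndGet();
    browsers.put(id, browser);
    return id;
  }

  /** Forward a browser message upstream, tagged so it can be demultiplexed. */
  public void forward(long id, String message) {
    upstream.onNext(id + ":" + message);
  }

  /** Route a tagged upstream message back to the right browser connection. */
  public void deliver(long id, String message) {
    StreamObserver<String> browser = browsers.get(id);
    if (browser != null) {
      browser.onNext(message);
    }
  }
}
```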

However, I find myself back at that pesky problem of separating failure domains and all that containerization business. The key difference, though, is that I really only need two tiers, which feels manageable.

What is great about the many-WebSockets-to-one-Adama arrangement is that it mirrors a classical web stack, and the growth of Adama can follow that pattern. The key trade-off is that the reliability of the entire service depends on that single host. This conflicts with my #1 goal in life of great sleep. There are three ways I could address this situation.

The first is to have a singular host responsible for a shard, and then manage the mapping of games to hosts. This has the benefit of simplicity, but the reliability of an experience depends entirely on a single host. The aggregate reliability then depends on the ratio of good hosts to total hosts. So, if a host experiences a restart, then the user experience will suffer with latency. If a host dies for good, then the experience is stuck until a human intervenes. The really hard problem at hand is how to detect the host failure and then migrate the experiences to minimize the latency. A host that appears to be dead may actually be a zombie still doing work internally, so we must be mindful of this.

Ultimately, trying to maintain a single machine for a game is going to be hard, but it is a great way to start. The second way to address this is to shoot for a single machine but have the protocol between Adama and the durability store handle contention with a concurrent “compare and set” operator like “append this data change only if the sequencer is 100”. This would ensure that a single machine wins, and the losers can redirect traffic. The key failure mode manifests as latency, which is preferable to a dead experience. This would enable a second machine to stand up and take over should a host fail permanently.
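Against an RDBMS, that compare-and-set append could look something like the following; the table and column names are made up, and the SQL is MySQL-flavored.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

// A sketch of a "compare and set" append against a relational durability
// store; table and column names are hypothetical. The append only succeeds
// if the caller holds the expected sequencer, so a stale host loses the race
// and knows to redirect traffic.
public class ConditionalAppend {
  public static boolean append(Connection db, String docKey, int expectedSeq, String delta)
      throws SQLException {
    String sql = "UPDATE documents SET log = CONCAT(log, ?), seq = seq + 1 " +
                 "WHERE doc_key = ? AND seq = ?";
    try (PreparedStatement stmt = db.prepareStatement(sql)) {
      stmt.setString(1, delta);
      stmt.setString(2, docKey);
      stmt.setInt(3, expectedSeq);
      return stmt.executeUpdate() == 1; // zero rows means another machine won
    }
  }
}
```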

It is worth noting that this places the “single host” burden on the durability tier, and this is an inescapable problem when consistency is required. Since my goal is to focus, this may be the right approach. The tricky bit is appropriately handling the failures in a coherent way.

The third way is to build the durability layer myself with replication. Now, this is obviously very expensive in terms of my time and focus. However, it provides insight into a potential optimization on the second method. Since contention on the durability tier will create user pain, it may be worth keeping an Adama host primed as a backup. Here, we leverage chain replication, where the first Adama host replicates to a second Adama host, which then leverages the durability solution. This adds latency to every action, but it minimizes the risk of a large latency spike during expected failures. This kind of optimization allows us to be less conservative with failure detection and to remain resilient during normal operations, which yields great sleep.
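Expressed against the hypothetical DocumentStore from earlier, a chain link could look like this: each link applies the change locally and forwards it down the chain, with the durability solution sitting at the tail.

```java
// A sketch of chain replication in terms of the hypothetical DocumentStore:
// each link applies a change locally and then forwards it down the chain;
// the tail of the chain is the durability solution itself.
public class ChainLink implements DocumentStore {
  private final DocumentStore local; // this host's primed copy
  private final DocumentStore next;  // the next link, or the durable store at the tail

  public ChainLink(DocumentStore local, DocumentStore next) {
    this.local = local;
    this.next = next;
  }

  @Override
  public void get(String key, Callback<String> callback) {
    local.get(key, callback);
  }

  @Override
  public void append(String key, String delta, Callback<Void> callback) {
    // Apply locally first, then forward; the ack only returns once the rest
    // of the chain (ultimately the durable store) has accepted the change.
    local.append(key, delta, new Callback<Void>() {
      @Override public void success(Void unused) {
        next.append(key, delta, callback);
      }
      @Override public void failure(Exception reason) {
        callback.failure(reason);
      }
    });
  }
}
```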

With all this thinking, my focus for the first year means I really only need a single host. Yet I must ensure the contract between Adama and the persistence store is sound with transactions, as I may need to react to shenanigans.

Since I am splitting up the proxy tier and the Adama tier, it makes sense to add a static mapping between them. This is a low-effort way to scale the system up, and I only put myself into the real fight for great sleep when machines start to fail left and right. I also feel confident in open sourcing all this work as I’m not building a cluster management system. Getting the contract correct between Adama and the persistence layer is an important step for great sleep, but such matters will come to a head at the appropriate time.
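A static mapping can be as dumb as hashing the game’s key against a fixed list of hosts; the hostnames below are made up for illustration.

```java
// A sketch of a static mapping from games to Adama hosts; the hostnames are
// hypothetical, and the list would live in configuration.
public class StaticRouting {
  private static final String[] ADAMA_HOSTS = {
      "adama-1.internal:8080",
      "adama-2.internal:8080"
  };

  /** Pick the Adama host responsible for a given game key. */
  public static String pickHost(String gameKey) {
    int index = Math.floorMod(gameKey.hashCode(), ADAMA_HOSTS.length);
    return ADAMA_HOSTS[index];
  }
}
```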

All this wandering and thinking has been useful for my focus. It shifts my execution strategy towards some kind of public launch: slim down the current offering, use AWS for all the hard problems, and bias against sleep and great reliability by assuming a single machine is responsible for a game until hard restarts happen. My focus is to build the platform that I need to get to the next project.