Microservice Architecture at Medium

In the monolithic storage model, the recommendation service has direct access to the same persistent storage that the monolithic app does. This is a bad idea because:

Caching can be tricky. If the recommendation service shares the same cache as the monolithic app, we will have to duplicate the cache implementation details in the recommendation service as well; if the recommendation service uses its own cache, we won’t know when to invalidate its cache when the monolithic app updates the post data.
If the monolithic app decides to switch from DynamoDB to RDS for storing post data, we will have to reimplement the logic in the recommendation service and in every other service that accesses the post data.
The monolithic app has complex logic to interpret the post data, e.g., how to decide if a post should not be viewable to a given user. We have to reimplement that logic in the recommendation service. Once the monolithic app changes the logic or adds to it, we need to make the same changes everywhere else.
The recommendation service is stuck with DynamoDB even if it is the wrong option for its own data access pattern.

In the decoupled storage model, the recommendation service does not have direct access to the post data, and neither does any other new service. The implementation details of post data are retained in only one service. There are different ways of achieving this.

Ideally, there should be a Post Service that owns the post data and other services can only access post data through the Post Service’s APIs. However, it could be an expensive upfront investment to build new services for all core data models.
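To make the idea concrete, here is a minimal sketch of what "only the Post Service touches post data" could look like. Everything in it is an assumption made for illustration: the class name PostService, the method names, the Post fields, and the rule that drafts are visible only to their author are not taken from Medium's codebase, and a plain dict stands in for the real database.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Post:
    # Hypothetical fields; the real post model is certainly richer.
    post_id: str
    author_id: str
    title: str
    is_draft: bool = False


class PostService:
    """Hypothetical owner of post data; callers never touch storage directly."""

    def __init__(self) -> None:
        # Storage is a private detail. A dict stands in for DynamoDB, RDS, etc.
        self._posts: dict[str, Post] = {}

    def save_post(self, post: Post) -> None:
        self._posts[post.post_id] = post

    def get_viewable_post(self, post_id: str, viewer_id: str) -> Optional[Post]:
        """Return the post only if the viewer is allowed to see it."""
        post = self._posts.get(post_id)
        if post is None:
            return None
        # The visibility rule lives here and nowhere else. Illustrative rule:
        # a draft is visible only to its author.
        if post.is_draft and post.author_id != viewer_id:
            return None
        return post
```

Because callers only see get_viewable_post, swapping the dict for another database or changing the visibility rule would not require touching the recommendation service.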

There are a couple of more pragmatic approaches when staffing is limited, and depending on the data access pattern they may actually be better. In Option B, the monolithic app lets the recommendation service know when relevant post data is updated. Usually, this doesn’t have to happen immediately, so we can offload it to the queuing system. In Option C, an ETL pipeline generates a read-only copy of the post data for the recommendation service, plus potentially other data that is useful for recommendations. In both options, the recommendation service owns its data completely, so it has the flexibility to cache the data or use whichever database technology fits best.
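Option B can be sketched roughly as follows. This is an illustration under stated assumptions, not a description of Medium's pipeline: the event shape, the names RecommendationStore and drain, and the use of Python's in-process queue.Queue in place of a real queuing system are all hypothetical.

```python
import queue
from typing import Optional


class RecommendationStore:
    """Hypothetical store owned entirely by the recommendation service."""

    def __init__(self) -> None:
        self._posts: dict[str, dict] = {}

    def apply_event(self, event: dict) -> None:
        # Each event carries the full state the recommender needs,
        # so applying the same event twice is harmless.
        if event["type"] == "post_deleted":
            self._posts.pop(event["post_id"], None)
        else:
            self._posts[event["post_id"]] = event["data"]

    def get(self, post_id: str) -> Optional[dict]:
        return self._posts.get(post_id)


def drain(events: "queue.Queue[dict]", store: RecommendationStore) -> int:
    """Apply every queued event to the store and return how many were applied."""
    applied = 0
    while True:
        try:
            event = events.get_nowait()
        except queue.Empty:
            return applied
        store.apply_event(event)
        applied += 1


# The monolithic app publishes; the recommendation service consumes later.
events: "queue.Queue[dict]" = queue.Queue()
events.put({"type": "post_updated", "post_id": "p1", "data": {"title": "v1"}})
events.put({"type": "post_updated", "post_id": "p1", "data": {"title": "v2"}})
events.put({"type": "post_deleted", "post_id": "p2"})

store = RecommendationStore()
drain(events, store)
```

The recommendation service never reads the monolithic app's storage here. It only sees events, so it is free to keep its copy in whatever form suits its own access pattern.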

Decouple “Building a Service” and “Running Services”

If building microservices is hard, running services is often even harder. It slows the engineering teams down when running services is coupled with building each service and teams have to keep reinventing the ways of doing it. We want to let each service focus on its own work and not worry about the complex matter of how to run services, including networking, communication protocols, deployment, observability, etc. The service management should be completely decoupled from each individual service’s implementation.

The strategy of decoupling “building a service” and “running services” is to make running-services tasks service-technology-agnostic and opinionated, so that app engineers can fully focus on each service’s own business logic.

Thanks to recent advances in containerization, container orchestration, service mesh, application performance monitoring, etc., decoupling “running services” has become more achievable than ever.

Networking. Networking (e.g., service discovery, routing, load balancing, traffic routing, etc.) is a critical part of running services. The traditional approach is to provide libraries for every platform/language. It works but is not ideal because applications still need a non-trivial amount of work to integrate and maintain the libraries. More often than not, applications still need to implement some of the logic separately. The modern solution is to run services in a service mesh. At Medium, we use Istio, with Envoy as the sidecar proxy. Application engineers who build services don’t need to worry about the networking at all.

Communication Protocol. No matter which tech stacks or languages you choose to build microservices, it is extremely important to start with a mature RPC solution that is efficient, typed, cross-platform and requires the minimum amount of development overhead. RPC solutions that support backward-compatibility also make it safer to deploy services even with dependencies among them. At Medium, we chose gRPC.

A common alternative is REST+JSON over HTTP, which has been the blessed solution for server communication for a long time. However, although that stack is great for browsers talking to servers, it is inefficient for server-to-server communication, especially when we need to send a large number of requests. Without automatically generated stubs and boilerplate code, we have to implement the server/client code manually, and a reliable RPC implementation is more than just a wrapper around a network client. In addition, REST is “opinionated”, but it can be difficult to get everyone to agree on every detail, e.g., is this call really REST, or just an RPC? Is this thing a resource or is it an operation?

Deployment. Having a consistent way to build, test, package, deploy and manage services is very important. All of Medium’s microservices run in containers. Currently, our orchestration system is a mix of AWS ECS and Kubernetes, but we are moving towards Kubernetes only.

We built our own system to build, test, package and deploy services, called BBFD. It strikes a balance between working consistently across services and giving each individual service the flexibility to adopt a different technology stack. Each service provides some basic information, e.g., the port to listen on and the commands to build, test and start the service, and BBFD takes care of the rest.
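As a rough illustration of that division of labor, a service could describe itself with a small declarative spec that a shared tool expands into the same steps for every service. This sketch is not BBFD: the ServiceSpec fields, the pipeline_steps function, and the example commands are hypothetical, since the post does not show BBFD's actual format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceSpec:
    """Hypothetical descriptor; the field names are not BBFD's real schema."""
    name: str
    port: int
    build_cmd: str
    test_cmd: str
    start_cmd: str


def pipeline_steps(spec: ServiceSpec) -> list[str]:
    """Expand a spec into the same ordered steps for every service."""
    if not 1 <= spec.port <= 65535:
        raise ValueError(f"invalid port for {spec.name}: {spec.port}")
    return [
        f"build: {spec.build_cmd}",
        f"test: {spec.test_cmd}",
        f"package: container image for {spec.name}",
        f"deploy: run '{spec.start_cmd}' listening on port {spec.port}",
    ]


# Example commands are placeholders for whatever the service's stack uses.
spec = ServiceSpec("recommendations", 8080, "npm ci", "npm test", "node server.js")
steps = pipeline_steps(spec)
```

The service only states what is specific to it. The ordering and mechanics of the pipeline stay in one place, which is what keeps the process consistent across technology stacks.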

Thorough and Consistent Observability

Observability includes the processes, conventions, and tooling that let us understand how the system is working and triage issues when it isn’t working. It covers logging, performance tracking, metrics, dashboards, and alerting, and it is critical for a microservice architecture to succeed.

When we move from one single service to a distributed system with many services, two things can happen:

We lose observability because it becomes harder to do or easier to be overlooked.
Different teams reinvent the wheel and we end up with fragmented observability, which is essentially low observability because it is hard to use fragmented data to connect the dots or triage any issues.

It is very important to have good and consistent observability from the beginning, so our DevOps team came up with a strategy for consistent observability and built tools in support of achieving that. Every service gets detailed DataDog dashboards, alerts, and log search automatically, which are also consistent across all services. We also heavily use LightStep to understand the performance of the systems.
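One way to get consistency by default is to have every service report through the same thin wrapper, so that metric fields are identical everywhere. The sketch below illustrates that idea only. The observed decorator, the field names, and the in-memory METRICS list are assumptions; a real setup would forward these records to a metrics backend instead of a list.

```python
import time
from functools import wraps
from typing import Any, Callable

# In-memory sink standing in for a real metrics backend (assumption).
METRICS: list[dict] = []


def observed(service: str) -> Callable:
    """Record one metric per call with the same fields for every service."""
    def decorator(func: Callable) -> Callable:
        @wraps(func)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            start = time.perf_counter()
            status = "ok"
            try:
                return func(*args, **kwargs)
            except Exception:
                status = "error"
                raise
            finally:
                # Runs on success and on failure, so no call goes unrecorded.
                METRICS.append({
                    "service": service,
                    "endpoint": func.__name__,
                    "status": status,
                    "duration_s": time.perf_counter() - start,
                })
        return wrapper
    return decorator


@observed("recommendations")
def get_recommendations(user_id: str) -> list[str]:
    # Hypothetical endpoint used only to exercise the decorator.
    if not user_id:
        raise ValueError("user_id is required")
    return ["p1", "p2"]
```

Since every record shares the same shape, a single dashboard or alert definition can be reused across services instead of being rebuilt by each team.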

Not Every New Service Needs to be Built from Scratch

In microservice architecture, each service does one thing and does it really well. Notice that it has nothing to do with how to build a service. If you migrate from a monolithic service, keep in mind that a microservice doesn’t always have to be built from scratch if you can peel it off from the monolithic app.

Here we take a pragmatic approach. Whether we should build a service from scratch depends on two factors: (1) how well Node.js is suited for the task and (2) how much it costs to reimplement in a different tech stack.

If Node.js is a good technical option and the existing implementation is in good shape, we peel the code off from the monolithic app and create a microservice with it. Even with the same implementation, we still get all the benefits of microservice architecture.

Our monolithic Node.js app was architected in a way that makes it relatively easy for us to build separate services with the existing implementation. We will discuss how to properly architect a monolithic app later in this post.

Respect Failures Because They Will Happen

In a distributed environment, more things can fail, and they will. Failures of mission-critical services, when not handled well, could be catastrophic. We should always think about how to test failures and gracefully handle failures.

First and foremost, we should expect everything will fail at some point.
For RPC calls, put extra effort into handling failure cases.
Make sure we have good observability (mentioned above) into failures when they happen.
Always test failures when bringing a new service online. It should be part of the new-service checklist.
Build auto-recovery if possible.
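As one example of handling RPC failure cases and testing them, the sketch below retries a failing call with exponential backoff and then degrades to a fallback. It is an illustration, not Medium's implementation: call_with_retry, its parameters, and the FlakyRpc test double are hypothetical names, and ConnectionError stands in for whatever transient error a real RPC client raises.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retry(
    rpc: Callable[[], T],
    fallback: Callable[[], T],
    attempts: int = 3,
    base_delay_s: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Try rpc up to `attempts` times, then degrade to fallback."""
    for attempt in range(attempts):
        try:
            return rpc()
        except ConnectionError:
            # Back off exponentially between attempts, but not after the last.
            if attempt < attempts - 1:
                sleep(base_delay_s * (2 ** attempt))
    return fallback()


class FlakyRpc:
    """Test double that fails a fixed number of times before succeeding."""

    def __init__(self, failures: int) -> None:
        self.failures = failures
        self.calls = 0

    def __call__(self) -> list[str]:
        self.calls += 1
        if self.calls <= self.failures:
            raise ConnectionError("simulated outage")
        return ["p1", "p2"]
```

Passing sleep as a parameter lets a test inject failures and check the backoff schedule without actually waiting. Retrying like this is only safe for calls that are idempotent, which is an assumption the caller has to verify.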

Avoid Microservice Syndromes from Day One

Microservice architecture is not a panacea: it solves some problems but creates others, which we call “microservice syndromes”. If we don’t think about them from day one, things can get messy fast, and it costs more to take care of them later. Here are some of the common symptoms.

Poorly modeled microservices cause more harm than good, especially when you have more than a couple of them.
Allowing too many different choices of languages/technologies, which increases the operational cost and fragments the engineering organization.
Coupling running services with building services, which dramatically increases the complexity of each service and slows the team down.
Overlooking data modeling and ending up with microservices on top of monolithic data storage.
Lack of observability, which makes it difficult to triage performance issues or failures.
When facing a problem, teams tend to create a new service instead of fixing the existing one even though the latter may be a better option.
Even though the services are loosely coupled, lack of a holistic picture of the whole system could be problematic.