A wealth of interesting technologies and methodologies has emerged in recent years under the “Cloud Native” umbrella, and their impact on our lives as developers has been profound.
We were once used to big monolithic applications, hosted on enterprise application servers and deployed on virtualized (and frequently expensive) hardware. Now we have containers, cheap cloud computing/storage/everything, better implementations of agile methodologies, powerful architectural patterns like microservices, and wonderful schedulers like Kubernetes. We even have those “unicorn-like” creatures we call DevOps Engineers (or should we call them SREs? That’s another story), and so on.
All these technologies/methodologies allow us to build software that is extremely flexible, scalable, powerful, available, maintainable in ways that we couldn’t predict years ago.
But as the saying goes, “power is nothing without control”, and control is something we really need: service meshes to the rescue!
What is a Service Mesh?
Before introducing the concept of Service Meshes, let’s take a small step back and talk about microservices.
This architectural pattern is founded on very solid principles that draw inspiration from Service Oriented Architectures (SOA), and ultimately allows developers to write software that is highly decoupled, maintainable and resilient.
On the other hand, there is also a great challenge that teams must deal with: how to ensure robust, observable, measurable, secure, actionable communication between microservices?
In older architectures the data flow was constrained within application tiers (for example the UI, business and data layers of a classic 3-tier application) or completely absorbed by a monolith. The use of microservices along with containers and schedulers has made this field extremely dynamic: containers (and the microservices they run) come and go, can have performance problems, can go offline or work in a degraded networking mode, and can be spawned in entirely different data centers across different cloud vendors. All the while, our application must respond within an agreed level of service, without any downtime.
Let’s face it: making distributed applications is hard because communication happens on networks that are unreliable by their very nature.
In response to all these needs, service meshes have entered the Cloud Native space to offer a solution that allows us to deeply understand, measure, and act upon the communication layer of our microservice architecture. Let’s quote a definition:
“A service mesh is a dedicated infrastructure layer that adds features to a network between services. It allows to control traffic and gain insights throughout the system. […] In contrast to libraries, which are used for similar functionality, a service mesh does not require code changes. Instead, it adds a layer of additional containers that implement the features reliably and agnostic to technology or programming language.”
We can go further in our definition by saying that a service mesh is composed of two layers, the data plane and the control plane.
The data plane is made of a number of service proxies deployed alongside every microservice, following what is called the “sidecar pattern”. Those proxies manage a wide range of cross-cutting concerns, like traffic control, monitoring, observability and security, and they do so on behalf of the microservices, almost without touching the business code.
The control plane, on the other hand, is a separate layer that manages the configuration of the service proxies and can also gather the telemetry data they emit. It ensures that any change we apply to the mesh behaviour is automatically distributed to the service proxies, which then behave accordingly (see Figure 1).
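To make the split between the two planes more concrete, here is a minimal Python sketch. It is a toy model and not how any real mesh is implemented: the `SidecarProxy` and `ControlPlane` classes, the `timeout_s` setting and the service names are all hypothetical, chosen only to show a configuration change flowing from the control plane to every proxy.

```python
from dataclasses import dataclass, field


@dataclass
class SidecarProxy:
    """Data plane: one proxy per service instance (hypothetical model)."""
    service: str
    config: dict = field(default_factory=dict)

    def apply(self, config):
        # Keep a private copy: new settings only arrive through a push.
        self.config = dict(config)

    def handle(self, request):
        # The proxy, not the business code, is where mesh policy lives.
        timeout = self.config.get("timeout_s", "unset")
        return f"{self.service} handled {request!r} (timeout_s={timeout})"


class ControlPlane:
    """Control plane: holds the desired config and pushes it to all proxies."""

    def __init__(self):
        self.proxies = []
        self.config = {}

    def register(self, proxy):
        self.proxies.append(proxy)
        proxy.apply(self.config)

    def update(self, **changes):
        self.config.update(changes)
        for proxy in self.proxies:
            proxy.apply(self.config)


mesh = ControlPlane()
orders = SidecarProxy("orders")
payments = SidecarProxy("payments")
mesh.register(orders)
mesh.register(payments)

# A single change on the control plane reaches every sidecar.
mesh.update(timeout_s=2)
print(orders.handle("GET /orders/42"))
```

The point of the sketch is that neither `orders` nor `payments` contains any policy of its own: both simply behave according to whatever the control plane last pushed.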
Benefits/Drawbacks of using a service mesh
A service mesh, as we briefly discussed in the previous section, brings undeniable value to a software architecture, as it can greatly improve the control, security, observability and reliability of the services.
This is especially true for microservice architectures, as a service mesh embraces their distributed nature and handles networking concerns separately from business concerns. A service mesh allows you to:
- Manage the traffic between services with advanced routing capabilities
- Set up secure mutual TLS communication by providing a dedicated CA infrastructure
- Set up RBAC authentication/authorization strategies by ensuring service identity
- Test your overall architecture’s integrity by injecting failures or delays
- Ensure resilience by using modern patterns like rate limiting, circuit breaking, retries/timeouts
- Integrate with modern observability platforms (like Prometheus or Jaeger/Zipkin) by generating metrics/traces and setting alarms on them
- Enhance your delivery capabilities by easing the adoption of strategies like canary releases, A/B testing or progressive delivery (with tools like Argo Rollouts, Flagger or Iter8)
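As an illustration of one of the resilience patterns listed above, the following Python sketch shows the idea behind circuit breaking. It is a simplified stand-alone model, not the algorithm used by any specific proxy: the `CircuitBreaker` class, its `max_failures` and `reset_after_s` parameters and the `flaky` function are hypothetical names introduced for the example.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying the upstream."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after_s=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self.clock = clock          # injectable, which keeps the sketch testable
        self.failures = 0
        self.opened_at = None       # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                raise CircuitOpenError("upstream temporarily ejected")
            # Cool-down elapsed: allow one trial call ("half-open" state).
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        self.failures = 0
        return result


def flaky():
    raise ConnectionError("upstream unavailable")


breaker = CircuitBreaker(max_failures=2, reset_after_s=10.0)
for attempt in range(3):
    try:
        breaker.call(flaky)
    except CircuitOpenError as exc:
        print(f"attempt {attempt}: rejected fast ({exc})")
    except ConnectionError as exc:
        print(f"attempt {attempt}: upstream error ({exc})")
```

In a mesh this logic lives in the sidecar proxy and is driven by configuration, so the microservice itself does not need to carry code like this.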
All these benefits come at a cost, however, as service meshes also bring some drawbacks to the equation:
- they introduce operational complexity because they require a change in your infrastructure
- they bring new technology/concepts that must be assimilated by teams and add a new layer of “cognitive” challenges
- they use proxies that add latency and consume CPU/memory, which may not be negligible in your architecture
- they require code modification if you want to properly implement tracing and logging, as every microservice is treated agnostically as a black box and cannot expose “business” data without code modification
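The last drawback deserves a small example. A proxy can observe each hop, but only the application knows which outgoing call was caused by which incoming request, so the service usually has to copy trace headers from one to the other. The Python sketch below assumes the Zipkin-style B3 headers plus Envoy's `x-request-id`; the exact set depends on the tracer you configure, and the helper name `propagate_trace_headers` is hypothetical.

```python
# Assumption: B3-style propagation; adjust the list to your tracing backend.
TRACE_HEADERS = (
    "x-request-id",
    "x-b3-traceid",
    "x-b3-spanid",
    "x-b3-parentspanid",
    "x-b3-sampled",
    "x-b3-flags",
)


def propagate_trace_headers(incoming, outgoing=None):
    """Return the outgoing headers plus any trace headers found on the inbound request."""
    # HTTP header names are case-insensitive, so normalise before matching.
    lowered = {name.lower(): value for name, value in incoming.items()}
    result = dict(outgoing or {})
    for name in TRACE_HEADERS:
        if name in lowered:
            result[name] = lowered[name]
    return result


inbound = {"X-B3-TraceId": "463ac35c9f6413ad", "Cookie": "session=abc"}
print(propagate_trace_headers(inbound, {"accept": "application/json"}))
```

Only the trace headers are forwarded; unrelated headers such as `Cookie` stay where they are. This is a small change, but it is still a change to the business code of every service in the call chain.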
Last but not least, we should remember that service meshes somewhat overlap with other techniques, for example plain software libraries or API gateways.
In conclusion, it’s fair to say that the benefits of introducing a service mesh generally outweigh the drawbacks, especially in modern microservice architectures. Although there are many considerations to weigh, service meshes are here to stay and to become the long-term companions of microservice architectures.
Service Mesh Landscape
The current offering of service mesh implementations is becoming more and more diverse. A wide range of solutions exists, and each of them has its strengths and weaknesses.
To counter the downsides of fragmentation and to ensure healthy collaboration, major actors in the service mesh space have recently made an important effort that gave birth to the Service Mesh Interface (or SMI) specification.
Quoting its definition, Service Mesh Interface provides:
- A standard interface for service meshes on Kubernetes
- A basic feature set for the most common service mesh use cases
- Flexibility to support new service mesh capabilities over time
- Space for the ecosystem to innovate with service mesh technology
The introduction of SMI is an important milestone that should allow the service mesh technology to evolve in a healthy way.
Istio, or a Tale of Service Mesh selection
When we decided at Radicalbit to pick a service mesh to improve our microservice architecture, we considered many aspects:
- Number of features
- Performance results
- Community/Technical support
- Widespread usage
- Compatibility with Service Mesh Interface
After a thorough review, we decided to pick Istio, which in our opinion was the most balanced choice across the aforementioned aspects.
Istio is arguably the most popular and feature-rich service mesh implementation. It was initially developed by Google, IBM and Lyft and (like Kubernetes itself) it’s an open source project that has benefited from the expertise Google developed in its internal infrastructure.
Istio’s architecture, visible in Figure 2, closely matches the definition of service mesh introduced earlier (we focus here only on the Kubernetes usage of Istio, although it has been designed to adapt to other types of deployments). It uses Envoy, a very powerful and widely adopted service proxy, as the “sidecar container” installed in every Kubernetes Pod. Together, the Envoy proxies implement the data plane, so they govern the flow of data between microservices.
The control plane is instead served by a single process (called istiod) that communicates with the Envoy proxies to distribute configuration, receive recorded network traffic and telemetry data, and manage certificates issued by Istio’s own internal Certificate Authority.
Istio embodies all the great features that a service mesh should have:
- telemetry reporting, with custom dedicated dashboards and alerting
- tracing/logging features
- traffic routing/mirroring features
- resiliency features (like circuit breaking, timeouts and retries)
- mTLS and service identity support for authentication and authorization
- a platform-independent design, so it can also manage “legacy” infrastructure like virtual machines
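To give a feel for what the traffic routing features do, here is a minimal Python sketch of weighted routing between two versions of a service, the mechanism behind a canary release. It is an illustration only: a real proxy typically picks a destination at random according to the weights, whereas this sketch swaps in a hash of the request ID so that the result is reproducible. The function name `pick_subset`, the subset names `v1`/`v2` and the rule that weights are percentages summing to 100 are assumptions made for the example.

```python
import hashlib


def pick_subset(request_id, weights):
    """Map a request to a subset (e.g. "v1", "v2") according to percentage weights."""
    if sum(weights.values()) != 100:
        raise ValueError("weights must add up to 100")
    # Hash the request ID into a stable bucket in the range 0..99.
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    upper = 0
    for subset, weight in weights.items():
        upper += weight
        if bucket < upper:
            return subset
    raise AssertionError("unreachable when weights sum to 100")


# Send roughly 10% of the traffic to the canary version.
weights = {"v1": 90, "v2": 10}
counts = {"v1": 0, "v2": 0}
for i in range(10_000):
    counts[pick_subset(f"req-{i}", weights)] += 1
print(counts)
```

Shifting more traffic to the new version is then only a matter of changing the weights, with no redeployment of the services involved.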
Another important area that Istio covers is infrastructure governance, as it adds integration with well established applications like:
- Prometheus, Grafana and Alertmanager to collect, plot and alert on telemetry emitted by the Envoy proxies
- Jaeger backend and UI to help introduce tracing capabilities
- Kiali, a powerful service mesh console that lets you view the generated dependency graphs of microservices, as well as a lot of useful information like latencies, traffic rates and the overall health of the services
How Radicalbit uses Istio
The adoption of Istio in Radicalbit has steadily proceeded in recent months.
We are using the service mesh in our production Kubernetes clusters to measure performance and observe traffic in our network, and we are also experimenting with Istio to improve our delivery process.
We’ll publish some “deep-dives” on our architecture in a future blog post, so…stay tuned!