At Intenseye, we follow trends and hyped technologies, and apply best practices while using them. We run hundreds of microservices on Kubernetes, written in Scala, Go, Python etc., and most of them use gRPC.
gRPC is a modern, open source, high-performance remote procedure call (RPC) framework that uses HTTP/2 for transport. HTTP/2 supports making multiple requests over a single TCP connection to reduce the number of round trips. This is where the problem emerges: load balancing. A connection-level (L4) load balancer picks a destination only when the connection is opened, so once the connection is established, all requests get pinned to a single destination pod and the load is not balanced. We needed an L7-aware load balancer instead of an L4 one. You can read the details of the problem here later on.
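To make the pinning effect concrete, here is a minimal simulation sketch. It is not how Kubernetes or any real balancer is implemented; the function names, pod names, and client counts are all hypothetical, and round-robin stands in for "some per-request policy". It only contrasts choosing a pod once per connection with choosing a pod once per request.

```python
import random
from collections import Counter


def connection_level(pods, clients, requests_per_client, rng):
    """L4-style: a pod is chosen once, when the connection is opened.

    Every request multiplexed over that HTTP/2 connection lands on the same pod.
    """
    hits = Counter({pod: 0 for pod in pods})
    for _ in range(clients):
        pod = rng.choice(pods)  # decided at connection time only
        hits[pod] += requests_per_client
    return hits


def request_level(pods, clients, requests_per_client):
    """L7-style: a pod is chosen for every request (plain round-robin here)."""
    hits = Counter({pod: 0 for pod in pods})
    total = clients * requests_per_client
    for i in range(total):
        hits[pods[i % len(pods)]] += 1
    return hits


if __name__ == "__main__":
    pods = [f"pod-{i}" for i in range(10)]  # hypothetical pod names
    rng = random.Random(7)
    print("L4:", dict(connection_level(pods, 3, 1000, rng)))
    print("L7:", dict(request_level(pods, 3, 1000)))
```

With three long-lived client connections and ten pods, the connection-level variant can never use more than three pods, no matter how many requests are sent, while the request-level variant spreads the same traffic across all ten.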
We were also seeking a solution for another problem: secure transport between microservices. We have tens of components running hundreds of microservices in total. Configuring TLS between all of them one by one was intimidating and would have been time consuming.
We also needed a monitoring system and traffic metrics from all of these components and microservices. We wanted to observe success/failure rates, the RPS of pods, which service talks to which and how often, etc.
There was a single solution for these 3 problems: a service mesh.
What is a Service Mesh?
A service mesh is a tool for adding observability, security, and reliability features to applications by inserting these features at the platform layer rather than the application layer. (source)
The service mesh is typically implemented as a scalable set of network proxies deployed alongside application code, a pattern called a sidecar. These proxies handle the communication between the microservices and make it possible to control traffic and gain insights throughout the system. A service mesh offers great features, such as traffic metrics, circuit breaking, mTLS, traffic split, retry & timeout, A/B routing etc.
We started digging into the details of service meshes and evaluated which features were important to us and how we could benefit from them. Since service meshes add latency and resource consumption, these disadvantages had to be measured too. We had 500+ pods, so the resource cost would be 500 times the footprint of a single sidecar. Plus, we were racing against time, so the added latency had to be minimal.
After some research and a PoC, we chose Linkerd2 over Istio and Consul. I must say, servicemesh.es helped us a lot to gain knowledge about service meshes and to compare their features.
We chose Linkerd2 over Istio and Consul for 3 reasons, aside from the features we were seeking (L7 LB, mTLS, traffic metrics etc.):
- Lightweight (low CPU & memory consumption)
- Low latency
- Latency aware LB
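As an illustration of what "latency aware" can mean, the sketch below keeps an exponentially weighted moving average (EWMA) of observed latency per endpoint and, for each request, compares two randomly sampled endpoints and sends the request to the one with the lower estimate. This is a simplified sketch under assumptions, not Linkerd2's actual proxy code: the class name, the `alpha` smoothing factor, and the rule that unobserved endpoints score 0 so they get tried first are all choices made here for illustration.

```python
import random


class LatencyAwareBalancer:
    """Illustrative latency-aware balancer: EWMA latency + pick-the-better-of-two."""

    def __init__(self, endpoints, alpha=0.3, rng=None):
        if len(endpoints) < 2:
            raise ValueError("need at least two endpoints to compare")
        self.alpha = alpha  # weight given to the newest latency sample (assumed value)
        self.rng = rng or random.Random()
        # None means "no latency observed yet" for that endpoint.
        self.ewma_ms = {endpoint: None for endpoint in endpoints}

    def _score(self, endpoint):
        # Unobserved endpoints score 0 so that they are tried at least once.
        estimate = self.ewma_ms[endpoint]
        return 0.0 if estimate is None else estimate

    def pick(self):
        # Sample two distinct endpoints and keep the one with the lower estimate.
        first, second = self.rng.sample(list(self.ewma_ms), 2)
        return first if self._score(first) <= self._score(second) else second

    def observe(self, endpoint, latency_ms):
        # Fold a new latency sample into the endpoint's moving average.
        previous = self.ewma_ms[endpoint]
        if previous is None:
            self.ewma_ms[endpoint] = float(latency_ms)
        else:
            self.ewma_ms[endpoint] = (
                self.alpha * latency_ms + (1 - self.alpha) * previous
            )
```

A caller would invoke `pick()` before each request and `observe()` with the measured latency afterwards, so a pod that slows down gradually receives less traffic without being removed outright.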
Istio had a great deal of nice features (thanks to the Envoy proxy), but we did not need all of them. Also, its sidecar proxy's CPU & memory consumption was high compared to Linkerd2. Consul uses the same sidecar proxy, so we eliminated it as well. Here is the detailed explanation of why Linkerd2 uses its own proxy instead of Envoy. Plus, Linkerd2 was very easy to use, while the Istio documentation was overwhelming.
Linkerd rhymes with “Cardi B”.
The “d” is pronounced separately, as in “Linker-DEE”. (source)
Problem #1: gRPC Load Balance
As you can see in the graph, before the mesh some pods were scapegoats taking most of the traffic while others were sloths sitting nearly idle. After meshing, the load was spread evenly.
Problem #2: mTLS
Thanks to the mTLS feature of Linkerd2, we secured internal communication between microservices with a snap of the fingers, just as Thanos did. Linkerd2 automatically rotates certificates every 24 hours. You can also use cert-manager to rotate the issuer certificate and private key.
Problem #3: Traffic Monitoring
Linkerd2 is bundled with Prometheus and Grafana, yet you can bring your own instances and configure them by following the official documentation. We followed the documentation and started using our existing instances. Now we have great metrics from each meshed pod and better observability over the cluster.
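For readers unfamiliar with how numbers such as success rate and RPS fall out of proxy counters, here is a small sketch. It assumes each pod exposes cumulative success and failure counters that are scraped periodically; the `CounterSample` type and its field names are hypothetical and do not correspond to Linkerd2's actual metric names.

```python
from dataclasses import dataclass


@dataclass
class CounterSample:
    """One scrape of a pod's cumulative response counters (hypothetical shape)."""
    timestamp: float  # seconds since some fixed reference point
    success: int      # cumulative successful responses
    failure: int      # cumulative failed responses


def success_rate_and_rps(earlier, later):
    """Derive success rate and requests per second between two scrapes."""
    window = later.timestamp - earlier.timestamp
    if window <= 0:
        raise ValueError("later sample must be newer than earlier sample")
    ok = later.success - earlier.success
    failed = later.failure - earlier.failure
    total = ok + failed
    # With no traffic in the window, the success rate is undefined.
    rate = ok / total if total else None
    return rate, total / window
```

For example, two scrapes 60 seconds apart with 570 new successes and 30 new failures yield a success rate of 0.95 at 10 requests per second.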
Thanks to Linkerd2, we solved our problems and are now living happily ever after. The documentation was very clear, and the getting started page was easy to follow (plus, they had a demo application). Of course, not everything was bright and shiny. We faced a few issues while meshing pods and after the mesh was in place, but we solved them as well. We even opened an issue on GitHub and got help.
So this post is the first part of our service mesh journey, covering "What is a service mesh and why did we choose Linkerd2?" In the second part, we will talk about the problems we faced and how we solved them.