In the first part of our service mesh journey, we covered "What is Service Mesh and why we chose Linkerd2?". In this second part, we talk about the problems we faced and how we solved them.
Problem #1: Apache ZooKeeper Leader Election
At Intenseye, we use Apache Pulsar instead of Apache Kafka, the more traditional queuing system.
Apache Pulsar is a cloud-native, multi-tenant, high-performance distributed messaging and streaming platform originally created at Yahoo!
Apache Pulsar uses Apache ZooKeeper for metadata storage, cluster configuration, and coordination. After we meshed ZooKeeper with Linkerd2, K8S restarted the pods one by one, but they got stuck in "CrashLoopBackOff".
We checked the logs and saw that ZooKeeper could not communicate with the other cluster members. We dug deeper and found that the ZooKeeper nodes were unable to elect a leader because of the mesh. ZooKeeper servers listen on three ports: 2181 for client connections, 2888 for follower connections (if the server is the leader), and 3888 for other server connections during the leader election phase.
Then we checked the documentation and found the "skip-inbound-ports" and "skip-outbound-ports" parameters of the Linkerd2 sidecar. Once we added ports 2888 and 3888 to the inbound and outbound skip lists, the quorum was established. Since these ports are only used for internal communication between ZooKeeper pods, it was OK to skip the mesh for them.
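As a rough illustration, the skip lists can be set through annotations on the pod template. The sketch below builds such a patch as a plain Python dict. The annotation keys "config.linkerd.io/skip-inbound-ports" and "config.linkerd.io/skip-outbound-ports" are Linkerd2's; the helper name and the idea of applying the result as a patch to the ZooKeeper workload are assumptions made for the example, not our exact setup.

```python
import json

# ZooKeeper ports used only for server-to-server traffic (see above).
ZK_FOLLOWER_PORT = 2888
ZK_ELECTION_PORT = 3888


def skip_ports_patch(ports):
    """Build a pod-template patch that makes the Linkerd2 proxy bypass `ports`.

    Hypothetical helper: only the two annotation keys come from Linkerd2.
    """
    value = ",".join(str(p) for p in ports)
    return {
        "spec": {
            "template": {
                "metadata": {
                    "annotations": {
                        "config.linkerd.io/skip-inbound-ports": value,
                        "config.linkerd.io/skip-outbound-ports": value,
                    }
                }
            }
        }
    }


# Port 2181 (client connections) is left out on purpose, so it stays meshed.
patch = skip_ports_patch([ZK_FOLLOWER_PORT, ZK_ELECTION_PORT])
print(json.dumps(patch, indent=2))
```

The printed JSON could then be merged into the ZooKeeper workload's manifest, or applied with whatever tooling you already use to patch workloads.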
This is a common problem for all service meshes when you run an application that uses leader election, such as Pulsar or Kafka, and this is the workaround.
Problem #2: Fail-Fast Logs
We started meshing our applications one by one. Everything was fine until we meshed one of our AI services; let's call it application-a. We have another application running as 500+ lightweight pods, let's call it application-b, which makes gRPC requests to application-a.
After 1 or 2 minutes of running successfully in the mesh, we started seeing hundreds of "Failed to proxy request: HTTP Logical service in fail-fast" errors in application-b. We checked the source code in the linkerd-proxy repository and found the place where this log is printed, but we couldn't understand the error message. I mean, what is "HTTP Logical service"?
So we opened an issue on the Linkerd2 GitHub repository. The maintainers took a close interest in the issue and tried to help us solve it; they even released a special package to debug it.
After all the discussions, it turned out that the value of "max_concurrent_streams" set on application-a, which was 10, was not enough to handle the requests. Linkerd2 made it visible. We increased the value from 10 to 100. No more fail-fast errors.
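For readers wondering where such a limit lives, here is a minimal sketch that assumes a Python gRPC server. This post does not say which language or framework application-a uses, so treat it only as an illustration. "grpc.max_concurrent_streams" is the gRPC channel argument for the per-connection HTTP/2 stream limit; the helper name is hypothetical.

```python
# gRPC channel argument for the maximum number of concurrent HTTP/2 streams
# a server accepts on a single connection.
MAX_CONCURRENT_STREAMS_ARG = "grpc.max_concurrent_streams"


def server_options(max_concurrent_streams):
    """Return gRPC server options carrying the given stream limit.

    Hypothetical helper, kept free of the grpc import so it runs anywhere.
    """
    if max_concurrent_streams < 1:
        raise ValueError("max_concurrent_streams must be positive")
    return [(MAX_CONCURRENT_STREAMS_ARG, max_concurrent_streams)]


old_options = server_options(10)   # the limit that led to fail-fast errors
new_options = server_options(100)  # the raised limit

# With the grpcio package installed, the options would be passed like this
# (assumed usage, not application-a's actual code):
#
#   import grpc
#   from concurrent import futures
#   server = grpc.server(futures.ThreadPoolExecutor(), options=new_options)
print(new_options)
```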
Special thanks to @olix0r and @hawkw from the Linkerd2 team.
Problem #3: Outbound Connection Before Sidecar Initialization
We have a few applications that make an HTTP call during start-up, because they need to fetch some information before serving requests. These applications were trying to make an outbound connection before the Linkerd2 sidecar was initialized, so the call failed. K8S then restarted the application container (not the sidecar container), and by that time the sidecar was ready. So the application ran fine after one application container restart.
Again, this is another common problem for all service meshes. There is no elegant solution for this. The very simple solution is putting a "sleep" in the start-up sequence. There is an open issue on GitHub and a solution offered by the Linkerd2 folks, which I believe requires much more work than putting in a "sleep".
We left it as is. A single automatic restart of the application container already solved the problem.
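If you would rather not rely on the restart, a small step up from a fixed "sleep" is to poll until the sidecar reports ready before making the first outbound call. The sketch below is not what we run. It assumes the Linkerd2 proxy's admin server is reachable on localhost port 4191 and answers "/ready" with HTTP 200 once it can handle traffic; the function names and retry values are made up for the example.

```python
import time
import urllib.request

# Assumption: default admin port of the Linkerd2 proxy and its readiness path.
PROXY_READY_URL = "http://127.0.0.1:4191/ready"


def proxy_is_ready(url=PROXY_READY_URL, timeout=1.0):
    """Return True if the readiness endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # urllib.error.URLError and socket timeouts are OSError subclasses.
        return False


def wait_for_proxy(probe=proxy_is_ready, attempts=30, delay=1.0, sleep=time.sleep):
    """Poll `probe` until it succeeds or `attempts` run out."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            sleep(delay)  # no need to wait after the final attempt
    return False
```

An application would call wait_for_proxy() once at the top of its start-up code and decide for itself what to do when it returns False, for example exit and let K8S restart the container as before.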
Problem #4: Prometheus
Prometheus is an open-source cloud-native application used for monitoring and alerting. It records real-time metrics in a time series database with flexible queries and real-time alerting.
Linkerd2 stores its metrics in a Prometheus instance that runs as an extension. Even though Linkerd2 comes with its own Prometheus instance, we were already using Prometheus in our platform, and we wanted to reuse it.
Linkerd2 has a beautifully documented tutorial that shows how to bring your own Prometheus instance. We followed it and everything worked fine, until we meshed an application that uses "PushGateway" from Prometheus to push our own internal metrics, aside from the ones produced by Linkerd2.
PushGateway is an intermediary service which allows you to push metrics from jobs which cannot be scraped/pulled.
After meshing, the 500+ lightweight pods started to push metrics through the sidecar proxy. We started having memory issues on the PushGateway side, so we skipped the mesh for port 9091 (the PushGateway port) on those 500+ pods.
Not everything is as easy-peasy as Arya killing the Night King. Making changes in a running system has always been difficult. We knew we would face some obstacles while integrating with Linkerd2, yet we solved our problems one by one.
Now we're living happily ever after.