𓃰 The Diligent Developer

Nothing changing is a change in itself

13 Dec, 2023

I'm in a video call with 15 other software engineers, managers, and DevOps engineers. Two of them are from my company; the rest are from the client and their systems integrator.

The system has been down for 3 days, and they are not happy. Tens of thousands of dollars are lost every day of downtime.

They pinged us 3 days ago saying there was some issue, and after checking our monitoring we responded that everything looked fine on our side.

Over the following days the issue kept escalating, and their conclusion was that it was our fault.

Their thinking was simple: we haven't changed anything in 1 month, no new code, no new deployments. Therefore, it must be someone else's fault.

I am still pretty sure that our systems are working fine, but it is difficult to get out of the situation. I ask for proof. Have you added some debug messages that show that the request is being sent, but no response is being received? That is, in my mind, the simplest way to see where the program hangs.
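The kind of instrumentation I had in mind is a few minutes of work. A minimal sketch, assuming a hypothetical `callRemote` stand-in for the real gRPC client call:

```typescript
// Sketch: wrap a remote call with entry/exit logs and a timeout, so a hung
// call shows up in the logs as "request sent" with no matching "response
// received". `callRemote` is a hypothetical stand-in for the real client call.
async function withDebugLogging<T>(
  label: string,
  callRemote: () => Promise<T>,
  timeoutMs: number
): Promise<T> {
  console.log(`[${label}] request sent`);
  let timer: ReturnType<typeof setTimeout>;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`[${label}] no response after ${timeoutMs} ms`)),
      timeoutMs
    );
  });
  try {
    const result = await Promise.race([callRemote(), timeout]);
    console.log(`[${label}] response received`);
    return result;
  } finally {
    clearTimeout(timer!); // don't leak the timer when the call does return
  }
}
```

Five minutes of this in their proxy would have shown immediately whether the request was even leaving their service.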

Their answer is that this is not even necessary, because this is a proxy microservice that only calls our service and then returns. Now that, to me, is a red flag. In three days of outage, the least you could do is confirm your suspicion of where the problem is; especially when it takes 5 minutes to confirm, and before pulling 15 people into a meeting.

I still don't believe the problem is ours, so I ask to see the code. My hope is to find out that their proxy microservice is in fact doing something else, and that is what times out.

They screen-share the code. It turns out it actually is very simple: a TypeScript implementation of a gRPC API that forwards all requests to our remote server. Nothing else.

And as the client's VP stresses again and again, they haven't changed the service for 1.5 months, so it can't be their fault.

That finally gets me thinking... 1.5 months? The project started a few months ago, and I bet this is the longest the service has ever run without a restart.

Now, there is a reason why restarting the service might help. This is a microservice that acts as a proxy, and I know gRPC and JavaScript don't play extremely well together; JavaScript is not one of gRPC's best-supported languages. After running for a long time, processing millions of requests (some answered, some timed out, some streams that are never closed), it may happen that some buffer fills up somewhere, or the list of open connections grows too large, and then the service stops answering any request.
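I don't know which resource actually leaked in their proxy, but the failure mode is easy to reproduce in miniature: any in-flight-request table that only shrinks when a response arrives will grow without bound if some calls never complete. A hypothetical sketch:

```typescript
// Hypothetical sketch of the failure mode: a proxy that tracks in-flight
// requests but never evicts the ones whose upstream call hangs. Each hung
// call strands one entry, and months of traffic fill the table.
type Pending = { startedAt: number };

class LeakyProxy {
  private inFlight = new Map<number, Pending>();
  private nextId = 0;

  // Registers a request; the entry is only removed when complete() is called.
  start(): number {
    const id = this.nextId++;
    this.inFlight.set(id, { startedAt: Date.now() });
    return id;
  }

  complete(id: number): void {
    this.inFlight.delete(id);
  }

  get openCount(): number {
    return this.inFlight.size;
  }
}

const proxy = new LeakyProxy();
for (let i = 0; i < 1000; i++) {
  const id = proxy.start();
  if (i % 10 !== 0) proxy.complete(id); // every 10th call never completes
}
// 100 entries are now stranded; in a long-running process this grows forever.
```

A restart clears the table, which is exactly why restarting "fixes" it; the durable fix is a per-call deadline plus cleanup on timeout.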

The bad thing is that it would look like the external service is not answering; the good thing is that a simple restart would fix the issue.

Have you tried to restart the service?

No. They restart the service immediately, and everything is working again.

I cheered, but not everyone was as happy as me.


Now, this is not only a funny story; there are some things we can learn from it.

  1. When you face a bug or an incident, explore the issue by your own means as far as you can. In this case, they could have restarted the service and added debug messages themselves. Bringing other people into the loop always adds delay, so you should only do it when you are blocked, or when proceeding on your own would take a huge amount of time.
  2. Be careful with microservice architectures in large environments. One of the reasons the service wasn't restarted is that the developer didn't have access to a restart button: they needed to contact the DevOps team to get a restart or a new deployment going. This makes sense to some extent, but it can add too much friction to debugging even the simplest of issues.
  3. Be careful with simple and stupid microservices. We need to call an external API? Let's add a simple API proxy microservice. There may be reasons to do this, but in 99% of cases I would suggest using a library instead of a microservice. The code may look so simple that you think it is completely robust, but running a separate microservice adds a bunch of complexity: more network hops, separate logs, and a single bottleneck for every other microservice using the API. You should get a real benefit to your application in exchange for all this added complexity, and a dumb API proxy doesn't provide one.
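The library alternative from point 3 can be as thin as a shared module that each service imports directly. A minimal sketch, where the `ExternalApi` interface, `transport` function, and `getUser` method are all hypothetical names for illustration:

```typescript
// Hypothetical sketch: ship API access as a small shared library instead of
// a proxy microservice. Same simplicity, one fewer network hop, and no single
// long-running process whose slow death takes every caller down with it.
interface ExternalApi {
  getUser(id: string): Promise<{ id: string; name: string }>;
}

// The "library": a factory that wraps a raw transport with a per-call deadline.
function createExternalApiClient(
  transport: (method: string, arg: string) => Promise<unknown>,
  timeoutMs = 5000
): ExternalApi {
  const call = async (method: string, arg: string): Promise<unknown> => {
    let timer: ReturnType<typeof setTimeout>;
    const timeout = new Promise<never>((_, reject) => {
      timer = setTimeout(() => reject(new Error(`${method} timed out`)), timeoutMs);
    });
    try {
      return await Promise.race([transport(method, arg), timeout]);
    } finally {
      clearTimeout(timer!);
    }
  };
  return {
    getUser: (id) => call("getUser", id) as Promise<{ id: string; name: string }>,
  };
}
```

Each consuming service gets its own connections and its own deadlines, so one leaky process can't stall everyone at once.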