The importance of ThreadDump"S"
The Problem
Application that becomes unresponsive is always a pain and find what is going wrong is one of the hardest activity that developers, architects and operators have to deal with.
Recently I had the "plaesure" to see all the VM where an application was deployed crashing for unknown reason: the application did not respond anymore and tomcat was not able to serve any request.
Usually if we look just at the machine, we do not have any information about what is wrong: log stop to be written..we just see that CPU is increasing its usage (and RAM of course).
To really understand what is going wrong, it is important to have a Thread Dump of the JVM to see effectively what each thread is doing at the moment. But there is a problem on that:
if you just take one thread dump you have just a snapshot of the application at a specific time while you are missing the "evolution" of the application
Thread dump overview
For example just looking on the first Thread Dump received, it was not clear what is going wrong. To analyze it, we use one of the best tool which is fastThread.io
The report shows that there are many many TimedWaiting Threads and there are also 2 blocked thread. So what people usually do is to observe the blocked threads and to find what is the problem. This is the principal guilty of the application situation, at the first glance.
Thread Dump mistake
And this is a huge mistake. Thread dumps are ok only if you take more than one and can compare all of them. This is important because you can see the evolution of the application. Blocked doesn't mean that we have a problem, maybe they are using the same resource and in a while the resource will be free to be used, so it's just an instant situation.
Collecting other 2 dumps waiting 2 minutes among them (but you can also have more dumps with less waiting time) you can zip together and upload to the tool and see the comparison:
And if you scroll the comparison and go to CPU spike we see that
So there was 5 threads that were in infinite loop (they are executing the same code, but in different point this is why they are not grouped all together).
The real problem
Infinite loop is not always pure evil, but in this case the loop was related to listing element in a List structure. But at the end of each loop, the list is changed adding one element. The loop was not implemented just using iterator (this prevent list change) but using indexes. So the problem was:
- Condition on list was always true. The JVM need to check the dimension of the list every loop
- An element is added at the end of the loop to the list. So the list is increasing the RAM occupied
- GC is not able to find elements to be discarded because - after million of loops - the list is referring all the elements that can't be removed
Commenti
Posta un commento