I want to talk about an underappreciated gem of the JVM ecosystem - async-profiler. The common perception is that you only need a profiler if you're a perf eng. So here are a few stories of how fitting a profiler into my daily workflow helped me be a better software eng 1/N
Sometimes you don't even need to do any real performance analysis - just looking at the data can yield quick wins. As an engineer working on the code, you usually already have a mental model of it and expectations of where the time should be spent.
For instance, you usually don't expect a solid chunk of CPU time to be spent in String#intern. In this case the win was indeed very quick: the only change needed was flipping an option in the Jackson config. (github.com/FasterXML/jack…)
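The thread doesn't name the exact option, but the Jackson knob that calls String#intern on field names is JsonFactory.Feature.INTERN_FIELD_NAMES - here's a minimal sketch of turning it off (not necessarily the exact change from this story):

```java
// Hedged sketch: by default Jackson interns every parsed field name, which is what
// shows up as String#intern on the CPU flame graph. Disabling the feature avoids it.
import com.fasterxml.jackson.core.JsonFactory;
import com.fasterxml.jackson.databind.ObjectMapper;

public class JacksonConfig {
    public static ObjectMapper newMapper() {
        JsonFactory factory = new JsonFactory();
        factory.disable(JsonFactory.Feature.INTERN_FIELD_NAMES); // skip String#intern per field name
        return new ObjectMapper(factory);
    }
}
```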
You probably don’t expect the classpath to be scanned when a new WebSocket connection is established either. A single flamegraph is worth a thousand words when you open an issue or raise a PR to an OSS project. (github.com/eclipse/jetty.…)
It's usually easy to track down an issue when you release a new version of your own application - git bisect can give you the answer in log N steps. However, when upgrading 3rd-party software, you can't just roll back: you have to figure out what's causing the unexpected behaviour.
After a JDK upgrade, a team noticed that one of their services started using twice as many instances. Comparing the flame graphs showed that the service was now spending a crazy 70% of its CPU time doing GC.
Since async-profiler shows GC CPU usage down to a single method, it helped pinpoint the exact issue. After identifying the change and adjusting the GC options accordingly, the CPU usage went back to normal.
Many of us are on-call for the services that we are working on. You build it, you run it. The usual signals that you’re in this situation include: it’s the middle of the night, your pager is going off, the name of the alert contains the word CPU and the number is close to 100%.
It’s one of the most typical scenarios where having a profiler ready to go on a production system can save the day. The flame graph might not always tell you exactly what's wrong, but it will tell you why the CPU is hitting 100%, which is usually enough to mitigate or fix the issue.
How do you approach a new feature as an eng? Applying engineering rigour generously everywhere doesn't always lead to an optimal result. For instance, you can safely change logic that's called once from the admin page, but you have to be careful with code that's called on every request.
Inspecting flame graphs can help you build a runtime mental model of your service. Combining the metrics with the data from a profiler can guide your decisions about how much performance rigour to apply to the change you're working on.
If profilers are so useful for all software engineers, why doesn't everyone use them as part of their day-to-day job? Maybe it's the high barrier to entry. There is little use in getting flame graphs from your laptop, since it doesn't represent the load profile of your prod env
First of all, make it accessible - getting a flame graph from a running production service should be easy, safe, and secure. async-profiler's close-to-zero overhead is key here. Make it easy to use, and users will come. Give it a try: github.com/jvm-profiling-…
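To make that concrete, here's a minimal sketch (not our actual setup - the port, path and fixed 30s window are made up, and this is async-profiler's Java API as of 2.x) of exposing a profiling endpoint from inside the service itself:

```java
// Hypothetical sketch: a debug-only endpoint that uses async-profiler's Java API
// (one.profiler.AsyncProfiler) to capture ~30s of CPU samples and return collapsed
// stacks, which tools like flamegraph.pl or speedscope can turn into a flame graph.
// Assumes libasyncProfiler is on java.library.path; lock this down before exposing it.
import com.sun.net.httpserver.HttpServer;
import one.profiler.AsyncProfiler;
import one.profiler.Counter;

import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;

public class ProfileEndpoint {
    public static void main(String[] args) throws Exception {
        AsyncProfiler profiler = AsyncProfiler.getInstance();
        HttpServer server = HttpServer.create(new InetSocketAddress(8849), 0);
        server.createContext("/profile", exchange -> {
            byte[] body;
            try {
                profiler.start("cpu", 10_000_000L); // sample CPU every 10ms (interval in ns)
                Thread.sleep(30_000);               // profile for ~30 seconds
                profiler.stop();
                body = profiler.dumpCollapsed(Counter.SAMPLES).getBytes(StandardCharsets.UTF_8);
            } catch (Exception e) {
                body = ("profiling failed: " + e).getBytes(StandardCharsets.UTF_8);
            }
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream out = exchange.getResponseBody()) {
                out.write(body);
            }
        });
        server.start();
    }
}
```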
Btw, want to be part of a team that actively uses async-profiler and various other observability tools and empowers other engineers to use them? Check out careers.canva.com or DM me, I'll be happy to chat more about what we do 🔥
It's amazing to see that ZGC (JEP 377) and Shenandoah (JEP 379) are going to become non-experimental in JDK 15! In fact, we've been successfully using ZGC for our gateway components since JDK 13 was released. 1/n
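For context, a minimal sketch of the flags involved (the jar name is just a placeholder, no project-specific tuning implied):

```
# JDK 13/14 - ZGC is still experimental and must be unlocked explicitly:
java -XX:+UnlockExperimentalVMOptions -XX:+UseZGC -jar gateway.jar

# JDK 15+ (JEP 377) - the unlock flag is no longer needed:
java -XX:+UseZGC -jar gateway.jar
```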
After running an experiment and switching the fleet of our gateway components to ZGC, we saw GC pauses go down from 30-60ms to less than 2ms. Here's some data from one of the instances I just looked up. 2/n
Not only did this reduce latency for the requests affected by GC, it also eliminated some unfortunate side effects, e.g. jumps in the request queue size that would cause spinning up more threads than necessary 3/n