I want to talk about an underappreciated gem of the JVM ecosystem - async-profiler. The common perception is that you only need a profiler if you're a perf eng. So here are a few stories of how fitting a profiler into my daily workflow helped me be a better software eng 1/N
Sometimes you don't even need to do any real performance analysis - just looking at the data can yield quick wins. As an engineer working on the code, you usually already have a mental model of it and expectations of where the time should be spent.
For instance, you usually don't expect a solid chunk of CPU time to be spent in String#intern. In this case the win was indeed very quick: the only change needed was flipping an option in the Jackson config. (github.com/FasterXML/jack…)
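The thread doesn't name the exact option, but the Jackson knob that calls String#intern on field names is JsonFactory.Feature.INTERN_FIELD_NAMES - here's a minimal sketch of turning it off (not necessarily the exact change from this story):

```java
// Hedged sketch: by default Jackson interns every parsed field name, which is what
// shows up as String#intern on the CPU flame graph. Disabling the feature avoids it.
import com.fasterxml.jackson.core.JsonFactory;
import com.fasterxml.jackson.databind.ObjectMapper;

public class JacksonConfig {
    public static ObjectMapper newMapper() {
        JsonFactory factory = new JsonFactory();
        factory.disable(JsonFactory.Feature.INTERN_FIELD_NAMES); // skip String#intern per field name
        return new ObjectMapper(factory);
    }
}
```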
You probably don’t expect the classpath to be scanned when a new WebSocket connection is established either. A single flamegraph is worth a thousand words when you open an issue or raise a PR to an OSS project. (github.com/eclipse/jetty.…)
It's usually easy to track down an issue when you release a new version of your own application - git bisect can give you the answer in log N steps. However, when upgrading 3rd-party software, you can't just roll back: you have to figure out what's causing the unexpected behaviour.
After a JDK upgrade, a team noticed that one of their services started using twice as many instances. Comparing the flame graphs showed that the service was now spending a crazy 70% of its CPU time doing GC.
Since async-profiler shows GC CPU usage down to a single method, it helped pinpoint the exact issue. After identifying the change and adjusting the GC options accordingly, the CPU usage went back to normal.
Many of us are on-call for the services that we are working on. You build it, you run it. The usual signals that you’re in this situation include: it’s the middle of the night, your pager is going off, the name of the alert contains the word CPU and the number is close to 100%.
It’s one of the most typical scenarios where having a profiler ready to go on a production system can save the day. The flame graph might not always tell you exactly what's wrong, but it will tell you why the CPU is hitting 100%, which is usually enough to mitigate or fix the issue.
How do you approach a new feature as an eng? Applying engineering rigour generously everywhere doesn't always lead to an optimal result. For instance, you can safely change logic that's called once from the admin page, but you have to be careful with code that's called on every request.
Inspecting flame graphs can help you build a runtime mental model of your service. Combining the metrics with the data from a profiler can guide your decisions about how much performance rigour to apply to the change you're working on.
If profilers are so useful for all software engineers, why doesn't everyone use them as part of their day-to-day job? Maybe it's the high barrier to entry. There is little use in getting flame graphs from your laptop, since it doesn't represent the load profile of your prod env
First of all, make it accessible - getting a flame graph from a running production service should be easy, safe, and secure. async-profiler's close-to-zero overhead is key here. Make it easy to use, and users will come. Give it a try: github.com/jvm-profiling-…
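To make that concrete, here's a minimal sketch (not our actual setup - the port, path and fixed 30s window are made up, and this is async-profiler's Java API as of 2.x) of exposing a profiling endpoint from inside the service itself:

```java
// Hypothetical sketch: a debug-only endpoint that uses async-profiler's Java API
// (one.profiler.AsyncProfiler) to capture ~30s of CPU samples and return collapsed
// stacks, which tools like flamegraph.pl or speedscope can turn into a flame graph.
// Assumes libasyncProfiler is on java.library.path; lock this down before exposing it.
import com.sun.net.httpserver.HttpServer;
import one.profiler.AsyncProfiler;
import one.profiler.Counter;

import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;

public class ProfileEndpoint {
    public static void main(String[] args) throws Exception {
        AsyncProfiler profiler = AsyncProfiler.getInstance();
        HttpServer server = HttpServer.create(new InetSocketAddress(8849), 0);
        server.createContext("/profile", exchange -> {
            byte[] body;
            try {
                profiler.start("cpu", 10_000_000L); // sample CPU every 10ms (interval in ns)
                Thread.sleep(30_000);               // profile for ~30 seconds
                profiler.stop();
                body = profiler.dumpCollapsed(Counter.SAMPLES).getBytes(StandardCharsets.UTF_8);
            } catch (Exception e) {
                body = ("profiling failed: " + e).getBytes(StandardCharsets.UTF_8);
            }
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream out = exchange.getResponseBody()) {
                out.write(body);
            }
        });
        server.start();
    }
}
```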
Btw, want to be part of a team that actively uses async-profiler and various other observability tools and empowers other engineers to use them? Check out careers.canva.com or DM me, I'll be happy to chat more about what we do 🔥
It's amazing to see that ZGC (JEP 377) and Shenandoah (JEP 379) are going to become non-experimental in JDK 15! In fact, we've been successfully using ZGC for our gateway components since JDK 13 was released. 1/n
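For context, a minimal sketch of the flags involved (the jar name is just a placeholder, no project-specific tuning implied):

```
# JDK 13/14 - ZGC is still experimental and must be unlocked explicitly:
java -XX:+UnlockExperimentalVMOptions -XX:+UseZGC -jar gateway.jar

# JDK 15+ (JEP 377) - the unlock flag is no longer needed:
java -XX:+UseZGC -jar gateway.jar
```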
After running an experiment and switching the fleet of our gateway components to ZGC, we saw GC pauses go down from 30-60ms to less than 2ms. Here's some data from one of the instances I just looked up. 2/n
Not only did this reduce latency for the requests affected by GC, it also eliminated some unfortunate side effects, e.g. jumps in the request queue size that would cause spinning up more threads than necessary 3/n