dreid @dreid
I used to run a static website w/ Twisted on PyPy on a Rackspace cloud server.

I also ran a ZNC bouncer on the same host. Sometimes my bouncer would get restarted for reasons but it automatically restarted so I didn't really notice too much.
Eventually Rackspace rolled out host monitoring and put metrics graphs in the cloud server admin UI. After a few weeks of having metrics for this host I discovered a problem.
A very distinctive sawtooth pattern on the memory graph. Perfectly linear growth over several days and then a steep drop.

Clearly something was "leaking" memory.

What kind of leak though?*

* blog.nelhage.com/post/three-kin…
And what was leaking?

3 Options:
- ZNC (Big pile of C++ including some weird ass 3rd party extensions for doing push notifications, probably a type-1 leak.)
- Twisted (A whole bunch of Python, a type-2 or type-3 leak?)
- PyPy (I can't even begin to describe this, any type?)
Ruling out ZNC ended up being easy. Its memory usage was quite low, while PyPy was using multiple gigabytes. Which is good, because I definitely would not have invested any effort into figuring out why ZNC was leaking.
Ruling out Twisted was less easy but doable. Static file serving hadn't changed much in a decade, and I was familiar enough with the code to be pretty confident that it didn't have any of the things you'd usually suspect in a Python memory leak.
So that left PyPy, the most recent version of course, which had recently had some amazing GC work done to it. Was this a bug in the PyPy GC? Maybe. IDK, the PyPy developers are really smart, but also all software is fallible.
After talking about it with @alex_gaynor and on the PyPy IRC channel, we were able to use some PyPy-specific tooling to determine that PyPy didn't think it had nearly as much memory as htop said it did.
That would make this a Type-3 memory leak ("Free but unused or unusable memory"), and it also pretty conclusively ruled out Twisted as being at fault.

By this time we'd also figured out how to reliably reproduce the memory leak locally: running a static server and curl in a loop.
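
For anyone who wants the shape of that reproducer, here is a minimal sketch. It is not the original setup: the real server was Twisted on PyPy and the client was curl, while this uses Python's stdlib `http.server` and `http.client` as stand-ins so it runs anywhere. The names `start_static_server` and `hammer` are made up for illustration.

```python
import functools
import http.client
import http.server
import threading


class QuietHandler(http.server.SimpleHTTPRequestHandler):
    """Static file handler that does not log every request to stderr."""

    def log_message(self, format, *args):
        pass


def start_static_server(directory="."):
    """Serve `directory` on an ephemeral localhost port in a background thread."""
    handler = functools.partial(QuietHandler, directory=directory)
    server = http.server.ThreadingHTTPServer(("127.0.0.1", 0), handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def hammer(port, requests, path="/"):
    """Open a socket, make one request, close the socket; repeat.

    This mimics `curl` in a loop: every iteration uses a fresh connection,
    so the server sees one opened-then-closed socket per request.
    Returns the number of 200 responses.
    """
    ok = 0
    for _ in range(requests):
        conn = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
        try:
            conn.request("GET", path)
            response = conn.getresponse()
            response.read()
            if response.status == 200:
                ok += 1
        finally:
            conn.close()
    return ok
```

To imitate the thread more closely you would run the server in its own process under the runtime you suspect, point `hammer` (or plain curl) at it, and watch that process's memory in htop.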
How do you find a memory leak like this?

You can't use any of PyPy's tooling because it doesn't even know about the memory.

Generally to find a memory leak you have to be able to identify the kind of thing(s) that are being leaked.
By this time there are a couple of facts that are relevant.

The growth is perfectly linear, and the growth is only a few bytes per request.

Also the PyPy GC uses arenas where objects are grouped by size.
So we started the reproducer up locally. Ran the process up to about 4GB of memory, then did the only thing that we had left to do… we took a core dump.

So now we have ~4GB of leaked objects in a core file.
"Let's open it in a hex editor"
We decide the best course of action is to full screen a terminal window, and run hexdump over the core dump.

To do what? Literally just look for some sort of pattern in the mess.
Eventually we find some. I feel like I really need to stress that we were literally just scrolling through the file as fast as the terminal would let us, trying to see something like a pattern. This was not a disciplined or particularly sophisticated technique; we just stared.
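
To be clear, we did not do the following. It is a sketch of what a slightly more disciplined version of the staring might look like: count how often each aligned machine word shows up in the dump, on the guess that millions of leaked objects of one kind share some header word. That guess, the 64-bit little-endian word size, and the names `read_blocks` and `common_words` are all assumptions for illustration, not anything specific to PyPy's object layout.

```python
import struct
from collections import Counter

# Assumption: the dump comes from a 64-bit little-endian process.
WORD = struct.Struct("<Q")


def read_blocks(path, block_size=1 << 20):
    """Yield a (possibly huge) file in fixed-size blocks instead of loading it whole."""
    with open(path, "rb") as dump:
        for block in iter(lambda: dump.read(block_size), b""):
            yield block


def common_words(blocks, top=10):
    """Return the `top` most frequent non-zero aligned words as (word, count) pairs.

    Zero words are skipped because untouched memory would otherwise dominate.
    Each block is trimmed to a whole number of words before unpacking.
    """
    counts = Counter()
    for block in blocks:
        usable = len(block) - len(block) % WORD.size
        for (word,) in WORD.iter_unpack(block[:usable]):
            if word:
                counts[word] += 1
    return counts.most_common(top)
```

Usage would be something like `common_words(read_blocks("core"))`, with the file name being whatever your core dump is called. On a multi-gigabyte dump this pure-Python loop will be slow and the counter can get large, so treat it as a starting point rather than a tool.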
So we find some repetition of the right size. And thanks to @alex_gaynor's knowledge of PyPy internals we're able to eventually back out these raw bytes to the kind of object PyPy thinks it is. We get an answer so obvious that I'm a little ashamed I didn't think of it earlier.
We were leaking *closed* sockets.

Every curl opened a socket, made a request, closed a socket. And here was an arena full of closed sockets.
PyPy's GC did have a bug, something to do with the process mostly allocating objects of the same small size and not allocating enough other stuff to trigger GC. (Or something, I'm not a GC wizard.)
So why did my absolute nothing of a static website that got literally no traffic demonstrate this problem?

I had configured an HTTP monitoring check. So every 5 minutes a single request would be made and a single socket would be opened then closed and leaked.
I don't really have a specific takeaway here. Read the thread linked from the first tweet and consider the ways that those tweets apply to this story.

But keep in mind…

All software is fallible, sometimes the bug will be in the compiler, runtime, libc, or kernel.
Also I hope you never have to stare at a core dump in `hexdump` and hope that you'll find some kind of pattern.