After complaining that numpy took many hours to solve a 64k x 64k matrix, I broke out cuSolver, Nvidia's GPU linear algebra library. A 32k matrix gets solved (LU decomposition) over 1000x faster than base numpy (with MKL not loving my AMD CPU), but a 64k matrix of floats is too big to solve directly on my 24 GB Titan RTX card. The nice thing about working with a low-level library is that you have to explicitly allocate the temporary working buffers, so when they don't fit on the device, I can put them in pinned host memory or on my other card connected by NVLink. The 64k matrix gets solved in 109 s with NVLink memory, which is still 200x faster. At 32k, the comparison is:
Local mem: 2.2 s
NVLink mem: 21.7 s
Host mem: 80.8 s
Clearly very bandwidth bound! There is probably a super-linear speedup for explicit multi-GPU computation.
The beauty of numpy is that one line of Python equals a couple pages of ugly C code to use cuSolver. PyTorch would go fast, but would still hit the out-of-memory wall before 64k.
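For a sense of what those pages of C look like, here is a minimal sketch (not Carmack's actual code) of the flow described above: create a cuSolver handle, query the workspace size, place the workspace on the local GPU if it fits, fall back to a peer GPU over NVLink or to pinned host memory, and run the LU factorization with cusolverDnSgetrf(). The matrix size, device indices, and fallback order are illustrative assumptions; error handling, matrix setup, and the solve step are omitted.

// Minimal sketch: LU-factor a large single-precision matrix with cuSolver,
// placing the workspace on the local GPU if it fits, otherwise on a peer GPU
// over NVLink, otherwise in pinned host memory.
#include <cuda_runtime.h>
#include <cusolverDn.h>
#include <cstdio>

int main() {
    const int n = 32768;                               // matrix dimension
    const size_t matrixBytes = (size_t)n * n * sizeof(float);

    cudaSetDevice(0);
    cusolverDnHandle_t handle;
    cusolverDnCreate(&handle);

    float *dA = NULL;                                  // the matrix lives on the primary GPU
    cudaMalloc((void **)&dA, matrixBytes);
    // ... upload or generate the matrix in dA here ...

    int *dIpiv = NULL, *dInfo = NULL;                  // pivot indices and status flag
    cudaMalloc((void **)&dIpiv, (size_t)n * sizeof(int));
    cudaMalloc((void **)&dInfo, sizeof(int));

    // Ask cuSolver how big the workspace must be. Note that the count comes
    // back as an int (a count of floats), which is the overflow issue called
    // out below.
    int lwork = 0;
    cusolverDnSgetrf_bufferSize(handle, n, n, dA, n, &lwork);
    const size_t workBytes = (size_t)lwork * sizeof(float);

    float *work = NULL;
    if (cudaMalloc((void **)&work, workBytes) != cudaSuccess) {
        // Doesn't fit locally: try the second GPU, reachable over NVLink once
        // peer access is enabled from the primary device.
        cudaSetDevice(1);
        cudaError_t peerAlloc = cudaMalloc((void **)&work, workBytes);
        cudaSetDevice(0);
        if (peerAlloc == cudaSuccess) {
            cudaDeviceEnablePeerAccess(1, 0);          // let GPU 0 address GPU 1's memory
        } else {
            // Last resort: pinned host memory, which the GPU can address
            // directly under unified virtual addressing (slowest option).
            cudaMallocHost((void **)&work, workBytes);
        }
    }

    // The factorization itself; the workspace pointer can be local, peer, or
    // pinned host memory as long as the GPU can address it.
    cusolverDnSgetrf(handle, n, n, dA, n, work, dIpiv, dInfo);
    cudaDeviceSynchronize();

    int info = 0;
    cudaMemcpy(&info, dInfo, sizeof(int), cudaMemcpyDeviceToHost);
    printf("getrf info = %d\n", info);                 // 0 means success
    return 0;
}

The point of the exercise is that the workspace is just memory the GPU can address, so where it lives (local, peer, or pinned host) is entirely the caller's choice, which is what makes the placement comparison above possible.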
One hiccup in the process: the cusolverDnSgetrf_bufferSize() function fills in an int for the workspace element count, and fails when that count is over 2 billion, so I had to extrapolate the size myself. It should be a size_t! @NVIDIAHPCDev
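The thread doesn't show how the extrapolation was done, so the helper below is only a hypothetical workaround: query the workspace at a smaller size where the int result is still valid, then scale up. The estimateGetrfWorkspaceFloats() name and the assumption that the workspace count grows roughly linearly with the dimension are mine, not anything from the thread or the cuSolver documentation, so pad generously and validate before trusting it.

// Hypothetical workaround for cusolverDnSgetrf_bufferSize() overflowing its
// int result on very large matrices: probe at a smaller size and scale the
// answer up. The linear scaling in n is an assumption, not a documented
// guarantee.
#include <cuda_runtime.h>
#include <cusolverDn.h>
#include <cstddef>

size_t estimateGetrfWorkspaceFloats(cusolverDnHandle_t handle, float *dA, int nFull) {
    const int nProbe = nFull / 4;          // small enough for the int result to be valid
    int lworkProbe = 0;
    cusolverDnSgetrf_bufferSize(handle, nProbe, nProbe, dA, nProbe, &lworkProbe);
    const size_t scale = (size_t)nFull / (size_t)nProbe;
    return (size_t)lworkProbe * scale * 11 / 10;   // scale up and pad ~10% for safety
}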

More from @ID_AA_Carmack

27 Apr
The Imperial College epidemic simulation code that I helped a little on is now public: github.com/mrc-ide/covid-… I am a strong proponent of public code for models that may influence policy, and while this is a "release" rather than a "live" depot, it is a Good Thing.
Before the GitHub team started working on the code it was a single 15k-line C file that had been worked on for a decade, and some of the functions looked like they were machine-translated from Fortran. There are some tropes about academic code that have grains of truth, but it turned out that it fared a lot better going through the gauntlet of code analysis tools I hit it with than a lot of more modern code. There is something to be said for straightforward C code. Bugs were found and fixed, but generally in paths that weren't enabled or hit.
11 Apr
AMD 3990 CPU scaling tests: Because of the Windows group limit of 64 CPUs, just firing up a lot of C++ std::threads didn't work great:

128 t = 67 s
64 t = 63 s
32 t = 84 s
16 t = 160 s
8 t = 312 s

32 to 64 threads wasn't a big boost, and 64 to 128 was slower. However!
Setting the group explicitly let it scale all the way up:

128 t = 38 s
64 t = 48 s
32 t = 84 s
16 t = 160 s
8 t = 312 s

Notably, because each group gets 32 hyperthreaded cores, 64 threads across 2 groups on an unloaded system is much faster, because each thread is alone on a core instead of sharing it two to a core. That means that if you don't want to add the Windows group code, you are better off disabling hyperthreading and having 64 single-thread cores in a single group.

I expected this code to become memory bound sooner; I'm impressed with the scalability!
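For reference, here is a minimal sketch of what the explicit group assignment can look like, assuming MSVC (where std::thread::native_handle() is the Win32 thread HANDLE) and a placeholder work() function standing in for the real per-thread job. It illustrates the Win32 processor-group API, not the actual benchmark code.

// Round-robin worker threads across Windows processor groups so that more
// than 64 logical processors get used. Error checking omitted.
#include <windows.h>
#include <thread>
#include <vector>

static void work(int threadIndex) {
    // ... the actual per-thread computation goes here ...
    (void)threadIndex;
}

int main() {
    const int threadCount = 128;
    const WORD groupCount = GetActiveProcessorGroupCount();

    std::vector<std::thread> threads;
    threads.reserve(threadCount);
    for (int i = 0; i < threadCount; ++i) {
        threads.emplace_back(work, i);

        // By default every std::thread lands in the creating thread's group
        // and is capped at that group's 64 logical processors. Assigning the
        // group explicitly spreads the threads across all groups.
        GROUP_AFFINITY affinity = {};
        affinity.Group = (WORD)(i % groupCount);
        const DWORD cpusInGroup = GetActiveProcessorCount(affinity.Group);
        affinity.Mask = (cpusInGroup >= 64) ? ~(KAFFINITY)0
                                            : (((KAFFINITY)1 << cpusInGroup) - 1);
        SetThreadGroupAffinity(threads[i].native_handle(), &affinity, NULL);
    }
    for (auto &t : threads) t.join();
    return 0;
}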
27 Mar
Just before travel started getting locked down I was at a small gathering of people brought together to talk about the future of computing. One of the after-dinner topics was more general predictions a decade ahead, and the first question was "Person on Mars in 10 years?"
Many thought it would happen, but despite wishing it, I think the odds are <50% in that window. As a clarifying tactic, I said "Let's put money on it. $10k says it doesn't happen by then." It was *striking* how opinions got immediately reevaluated much more critically. This was a room full of world-class engineers, but the gap between abstract belief and careful consideration was large. I try this tactic often, because I think people will argue, sometimes passionately and occasionally belligerently, for positions that, if they really bring all their…
11 Feb
Let's be scientific and run experiments. @boztank said unequivocally that FB does not do this, but it would not surprise me if there were Android apps that did listen surreptitiously and sold data. The claim from someone in that thread was that a group of people could put their phones on the table (presumably turned off), have a conversation, then have ads related to their conversation start appearing in their feeds. Key to making this more convincing would be making sure the topic of conversation was about something that the people had positively never searched for directly. There are a bunch of obscure, high-value keywords that would be catnip to something like this. A third party could pick a keyword, verify that it produces ads after searching for it normally, then verbally bring it into the conversation…
