Pumping myself full of coffee this morning to prepare for a @nanopore workshop on applying CUDA to bioinformatics, or as I explained to my sister "using computer graphics cards to science more fast"
Things are looking up after yesterday's disasters. I've made it to the venue and it's quite fancy.
I'm bad with names and faces, but it's all the more jarring when everybody recognises me from Twitter
And we're off
"Prepare for some unexpected consequences in non-serial programming"
"We are never quelled by the availability of new hardware, we just find harder problems"
It's not about running everything on the GPU; one of the skills is identifying the subproblems that can be efficiently offloaded from the CPU
We are learning to program with CUDA, but there are "drop-in" libraries for a bunch of different applications (deep learning, signal processing, linear algebra etc). Be aware of NCCL - for exchanging data between your GPUs (I think)
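To make "drop-in" concrete, here's roughly what handing the work to a library looks like - a toy cuBLAS saxpy I sketched myself, not something from the course:

```
#include <cstdio>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main() {
    const int N = 1000;
    float *x, *y;
    cudaMallocManaged(&x, N * sizeof(float));   // unified memory - more on this later
    cudaMallocManaged(&y, N * sizeof(float));
    for (int i = 0; i < N; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

    cublasHandle_t handle;
    cublasCreate(&handle);
    const float alpha = 2.0f;
    cublasSaxpy(handle, N, &alpha, x, 1, y, 1); // y = alpha*x + y, on the GPU
    cudaDeviceSynchronize();

    printf("y[0] = %f\n", y[0]);                // expect 4.0
    cublasDestroy(handle);
    cudaFree(x); cudaFree(y);
    return 0;
}
```

(compiles with `nvcc file.cu -lcublas` - the library supplies the kernel, you just hand it data)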
A middle ground is OpenACC: a pragma-based approach that hints to the compiler that something could be parallelised. Best if you want to optimise legacy code.
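My understanding of the OpenACC flavour, as a sketch (not course code) - the pragma tells the compiler the loop iterations are independent and it does the GPU legwork for you:

```
// saxpy again, but the compiler generates the GPU code
void saxpy(int n, float a, const float *x, float *y) {
    #pragma acc parallel loop   // "these iterations are independent"
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}
```

Build with an OpenACC-capable compiler (e.g. NVIDIA's HPC SDK with `-acc`); without one the pragma is ignored and the loop just runs on the CPU, which is exactly the appeal for legacy code.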
Not limited to C if you want to use the GPU. There's Python and Fortran (and others). No love in the room for Fortran, which is a language I actually enjoyed using in a previous life on my industrial year.
If nothing else I want to leave this course with the ability to confidently spell nvidia
After a little registration faff, I have provisioned a machine for some exercises
Success!
Seems legit
YEAH GOT ME SOME KERNELS RUNNING ON MY BLOCKS AND THREADS
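For the uninitiated, the kind of first kernel we're talking about looks like this (my minimal sketch, not the exact exercise):

```
#include <cstdio>

__global__ void hello() {       // __global__ marks a kernel: code that runs on the GPU
    printf("hello from block %d, thread %d\n", blockIdx.x, threadIdx.x);
}

int main() {
    hello<<<2, 4>>>();          // execution configuration: 2 blocks of 4 threads each
    cudaDeviceSynchronize();    // wait for the GPU before the program exits
    return 0;
}
```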
"using trivial sizes here for ease of explanation"
tutorial being very clear that tiny puny numbers like one thousand concurrent tasks are just for demonstration purposes
Made it through the first tutorial, some gotchas (a little recap sketch follows the list):
* Forgetting to call cudaDeviceSynchronize()
* Spelling synchronize with an s
* Typing two chevrons instead of three for the kernel exec configuration
* Not taking responsibility for writing outside an array (boo!)
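Here's a little recap sketch of my own that dodges all four (not the course's code):

```
#include <cstdio>

__global__ void doubleElements(int *a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                          // gotcha 4: guard against writing outside the array
        a[i] *= 2;
}

int main() {
    const int N = 1000;
    int *a;
    cudaMallocManaged(&a, N * sizeof(int));
    for (int i = 0; i < N; ++i) a[i] = i;

    doubleElements<<<4, 256>>>(a, N);   // gotcha 3: three chevrons, not two
    cudaDeviceSynchronize();            // gotchas 1 and 2: call it, and spell it with a z
    printf("a[2] = %d\n", a[2]);        // 4
    cudaFree(a);
    return 0;
}
```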
Onto the hard challenges
I HAVE ACCELERATED 2D MATRIX MULTIPLICATION
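The shape of what I wrote, roughly (a sketch from memory, not the exact exercise): one thread per output cell, indexed by a 2D grid of 2D blocks:

```
#include <cstdio>

// one thread computes one cell of the N x N product C = A * B
__global__ void matrixMul(const float *a, const float *b, float *c, int n) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < n && col < n) {
        float sum = 0.0f;
        for (int k = 0; k < n; ++k)
            sum += a[row * n + k] * b[k * n + col];
        c[row * n + col] = sum;
    }
}

int main() {
    const int N = 64;
    float *a, *b, *c;
    cudaMallocManaged(&a, N * N * sizeof(float));
    cudaMallocManaged(&b, N * N * sizeof(float));
    cudaMallocManaged(&c, N * N * sizeof(float));
    for (int i = 0; i < N * N; ++i) { a[i] = 1.0f; b[i] = 1.0f; }

    dim3 threads(16, 16);                       // 2D block of 256 threads
    dim3 blocks((N + 15) / 16, (N + 15) / 16);  // enough blocks to cover C
    matrixMul<<<blocks, threads>>>(a, b, c, N);
    cudaDeviceSynchronize();

    printf("c[0] = %f\n", c[0]);                // expect N = 64
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```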
things are heating up, my GPU at 54C now
we're back from lunch. i turned down the strange-looking black-bread burger because someone decided to put salmon in it; luckily for me, there was a hearty soup
going more into the specifics of memory management for GPU now
NO MUSIC
The concept of "unified memory" is useful. We can share data between the CPU and GPU with behaviour roughly consistent with malloc(). The first time you try to hit the memory with the GPU, you get a page fault, which triggers migration of the data over to the GPU. Some cost to this...
...The gotcha is that if you go back and hit that memory with the CPU, you'll migrate it again! We can see this with a small example where we initialise an array on the CPU and use it on the GPU.
In the second example, memory initialisation is done on the GPU with initWith. Without page faults we save some precious time inside the addVectorsInto kernel.
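My reading of the two examples, sketched from memory - initWith and addVectorsInto are the tutorial's names, the bodies are my guesses. Initialising on the GPU means the pages are already resident there, so addVectorsInto pays no migration cost:

```
// initialise a managed array on the GPU, so its pages live there already
__global__ void initWith(float num, float *a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[i] = num;
}

// with no page faults to service, this kernel just does the arithmetic
__global__ void addVectorsInto(float *result, const float *a, const float *b, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) result[i] = a[i] + b[i];
}
```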
A WHOPPING 122 MILLISECONDS
THINK OF ALL THE THINGS THAT COULD BE DONE IN THAT TIME NOW
If we know where we want the data to be, we can avoid page faults by forcing a background migration of the data, in larger contiguous chunks with asynchronous prefetching using `cudaMemPrefetchAsync`
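In code, something like this (the array names carry over from my sketch above, not the tutorial's exact code):

```
// migrate managed memory in bulk before it's touched, instead of paying
// a page fault per first-touch
int device;
cudaGetDevice(&device);
cudaMemPrefetchAsync(a, N * sizeof(float), device);          // host -> GPU
addVectorsInto<<<blocks, threads>>>(c, a, b, N);
cudaMemPrefetchAsync(c, N * sizeof(float), cudaCpuDeviceId); // result back to host
cudaDeviceSynchronize();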
LOOK AT ALL THIS TIME WE'RE NOT WASTING
Reached the end of the exercise, I'm hyped, we have a fun challenge:
>> Your end goal is to profile an accurate saxpy kernel, without modifying N, to run in under 50us.
>> Some bugs have been placed in this codebase for your edification.
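For reference, saxpy is just y = a*x + y ("single-precision a times x plus y"); a minimal working kernel is something like this (the course's version has the bugs - that's the point - so this is my clean sketch):

```
__global__ void saxpy(int n, float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}
```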
if you don't care about the answer being right, you can forget the cudaDeviceSynchronize() and just let the program finish in 14ms, before the GPU is even done
we can get to 1ms just by tuning the execution configuration to fit the properties of the GPU device, leveraging that there are a fixed number of "streaming multiprocessors" and that threads are executed in "warps" of 32
although i suspect this might be because i've made it ignore half the matrix
but again, if you want a very bad approximation we can stop here
got it down to ~16us by initialising the memory on-GPU and some tuning of the execution configuration, still a bit unclear how to pick a good number of blocks/threads (beyond blocks being a multiple of available streaming multiprocs and threads being a multiple of warp size)
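For the record, my heuristic looked something like this (a sketch - the multipliers 8 and 32 are my own guesses, not anything official):

```
// query the device, then make threads a multiple of the warp size and
// blocks a multiple of the streaming multiprocessor count
cudaDeviceProp props;
cudaGetDeviceProperties(&props, 0);
int threadsPerBlock = 8 * props.warpSize;                  // e.g. 8 * 32 = 256
int numberOfBlocks  = 32 * props.multiProcessorCount;      // keep every SM busy
saxpy<<<numberOfBlocks, threadsPerBlock>>>(N, 2.0f, x, y); // N, x, y as before
```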
kernels are executed in streams, which run in serial; by default, kernels go on the aptly named default stream. the default stream is special: it blocks. we can hook kernels up to "non-default" streams that can overlap each other - you can see this in the visual profiler tool!
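A minimal sketch of what that looks like (my own example, not the course's):

```
#include <cstdio>

__global__ void work(int id) {
    printf("kernel %d on its own stream\n", id);
}

int main() {
    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);
    work<<<1, 1, 0, s1>>>(1);   // 4th launch argument picks the stream
    work<<<1, 1, 0, s2>>>(2);   // may overlap with the first launch
    cudaDeviceSynchronize();
    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    return 0;
}
```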
We finish the day with an assessment challenge - accelerate an n-body simulator. After applying some of what we've learned I could get the program to calculate 6 billion interactions per second, but the improvements needed to meet the assessment criteria of 30B/s were elusive...
Spotting my questionable dismantling and remantling of the simulator, an instructor suggested the number of blocks was perhaps small. Indeed! We're up to 48 billion interactions per second. Shows even with a good algorithm, you need to tune the execution configuration carefully!
Out and about now. Heading to mystery dinner location. Getting lots of funny looks and people asking if I'm cold.
I'm being drip-fed the wine. Someone has clearly informed the restaurant about my free wine problem.
Only one piece of free bread. Send help. I think @pathogenomenick, @LoiMai and the rest of the EBAME gang can attest to my wine and bread eating skills. Especially in the adversity I faced by not liking seafood at a seafood buffet.
Problem solved. Someone who is driving has donated their wine glass and it is getting magically refilled.