tutorial being very clear that tiny puny numbers like one thousand concurrent tasks are just for demonstration purposes
* Forgetting to call cudaDeviceSynchronize()
* Spelling synchronize with an s
* Typing two chevrons instead of three for the kernel exec configuration
* Not taking responsibility for writing outside an array (boo!)
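
For reference, here is a minimal sketch of what getting those right can look like. The kernel `doubleElements`, the helper `runDoubleCheck`, and the sizes are made up for illustration and are not from any tutorial; it assumes a CUDA-capable GPU and managed memory via `cudaMallocManaged`.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel for illustration: doubles every element of `a`.
__global__ void doubleElements(int *a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // The grid is rounded up past n, so the surplus threads must do nothing.
    // Without this check they would write outside the array.
    if (i < n) a[i] *= 2;
}

// Returns the number of wrong elements, or -1 on a CUDA error.
int runDoubleCheck()
{
    const int n = 1000;  // tiny on purpose, just for demonstration
    int *a = nullptr;
    if (cudaMallocManaged(&a, n * sizeof(int)) != cudaSuccess) return -1;
    for (int i = 0; i < n; ++i) a[i] = i;

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;  // 4 blocks, 1024 threads > n

    // Execution configuration: three chevrons on each side, not two.
    doubleElements<<<blocks, threads>>>(a, n);

    // The launch is asynchronous. Without cudaDeviceSynchronize() the host
    // may read `a` before the kernel has finished.
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess) err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        cudaFree(a);
        return -1;
    }

    int wrong = 0;
    for (int i = 0; i < n; ++i) {
        if (a[i] != 2 * i) ++wrong;
    }
    cudaFree(a);
    return wrong;
}

int main()
{
    int wrong = runDoubleCheck();
    printf("wrong elements: %d\n", wrong);
    return wrong == 0 ? 0 : 1;
}
```

Built with something like `nvcc example.cu -o example`, this should report zero wrong elements. Drop the `if (i < n)` guard and the last 24 threads write past the end of the array.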
I HAVE ACCELERATED 2D MATRIX MULTIPLICATION
>> Your end goal is to profile an accurate saxpy kernel, without modifying N, to run in under 50us.
but again, if you want a very bad approximation we can stop here