Q5K: Quake level viewer in 5K LUTs on a low cost, low power ice40 up5k #fpga! Custom #GPU, @risc_v CPU and SOC, capable of rendering #Quake's levels with lightmaps.
How? Thread 👇
(Written in #Silice, here running on the #mch2022 badge fpga)
The (tiny) GPU is my DMC-1 (Doom-Meets-Comanche) GPU, which also powers the Doomchip-onice demos (remember? Doom with a terrain!!).
It targets the ice40 UP5K, an entry-level fpga with great support from the Open Source toolchain #yosys/#nextpnr. 2/n
There were three main hardware changes to enable Quake level rendering: 1) 32-bit per-column depth, 2) streaming of level data from QPI memory (SPIflash on icebreaker, PSRAM on mch2022 badge), 3) multi-texturing for lightmaps (!!)
3/n
For 1) (32-bit depth), I realized that I still had some BRAM left after other optimizations. This was more than welcome to increase the depth bit width, which was otherwise too limited. 4/n
Side note: Quake level rendering could use the BSP ordering and no depth buffer. But the depth buffer makes it possible to work from the visibility list directly, and paves the way to having entities in there. 5/n
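To illustrate the depth-buffer point, here is a minimal Python sketch (not the DMC-1 hardware). The names `new_column` and `draw_span`, the column height, and the constant depth per span are all assumptions made for illustration. It shows why a per-pixel depth test lets spans arrive in any order:

```python
# Hypothetical sketch of one screen column with a depth value per pixel.
COLUMN_HEIGHT = 240            # assumed screen height, illustration only
FAR = (1 << 32) - 1            # 'infinitely far' value for a 32-bit depth

def new_column():
    """Return (colors, depths) for an empty column."""
    return [0] * COLUMN_HEIGHT, [FAR] * COLUMN_HEIGHT

def draw_span(colors, depths, y_start, y_end, depth, color):
    """Write a vertical span, keeping only pixels closer than what is stored."""
    for y in range(max(0, y_start), min(COLUMN_HEIGHT, y_end)):
        if depth < depths[y]:
            depths[y] = depth
            colors[y] = color

# Spans submitted in arbitrary order (no BSP sorting needed).
colors, depths = new_column()
draw_span(colors, depths, 0, 100, depth=500, color=7)
draw_span(colors, depths, 50, 150, depth=200, color=3)
draw_span(colors, depths, 0, 240, depth=900, color=9)
```

The nearest span wins at every pixel regardless of submission order, which is what makes it possible to iterate over a visibility list without sorting it.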
The second change (streaming) was essential to walk through entire levels. The UP5K has 128KB of SPRAM - already great - but far from enough for entire levels. The data easily fits in QPI memory however. 6/n
There were two difficulties for streaming. First, the QPI memory is already used for textures by the GPU, and thus CPU accesses can only happen when the GPU is done with the previous frame. 7/n
Second, my CPU QPI access was a slow byte by byte access in a for loop, used only for booting (loading code once from QPI memory to fpga SPRAM). I replaced that with a fast hardware burst load, directly from QPI memory to SPRAM, bypassing the CPU. 8/n
The CPU is paused during these transfers: the SPRAM is being written to every cycle, so the CPU cannot fetch instructions. I modified my ice-v dual with a stall signal, pausing both cores. Transfers are now fast! (1 byte per cycle at 25MHz). 9/n
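For a sense of scale, the quoted rate (1 byte per cycle at 25MHz) can be turned into transfer times with a few lines of Python. The chunk sizes below are hypothetical; only the rate comes from the thread:

```python
CLOCK_HZ = 25_000_000      # from the thread: 25 MHz
BYTES_PER_CYCLE = 1        # from the thread: 1 byte per cycle

def burst_time_ms(num_bytes):
    """Time in milliseconds to burst-load num_bytes from QPI memory to SPRAM."""
    return 1000.0 * num_bytes / (BYTES_PER_CYCLE * CLOCK_HZ)

# Hypothetical chunk sizes, for illustration only.
for size in (4 * 1024, 16 * 1024, 128 * 1024):
    print(f"{size // 1024:4d} KB -> {burst_time_ms(size):.3f} ms")
```

Even filling the whole 128KB of SPRAM would stall the CPU for only about 5 ms at this rate.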
Lightmaps were an interesting challenge. First, I had to extract them properly from the game. For this I referred directly to the source code of the Quake lighting tool (in @fabynou's branch 👍: github.com/fabiensanglard… )
10/n
I then packed the lightmaps in textures (with tons of padding, but this overhead is not a concern in the 'large' QPI memory). 11/n
I made sure I could render lightmaps with correct texture coordinates, so that the light patterns are properly localized. 12/n
But then there was the big question. How to blend the lightmaps with the textures? This is a form of multitexturing. At first this seemed a roadblock. But an early design decision saved me here! 13/n
Indeed my per-column buffers store 12 bits per pixel: a palette index byte, and a 4-bit light level. That is because I do not use the Doom/Quake palette trick for lighting, but instead dim the actual RGB values after palette lookup, when sending RGB data to the screen. 14/n
So to blend the lightmaps in, all I needed to do was to write to these 4 bits in a second pass (I now use 8 bits for smoother light levels). This took only minor changes to the GPU, which now receives each rasterized column span twice: first for textures, second for lightmaps. 15/n
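The two-pass idea can be sketched in Python as follows. This is a software illustration under assumptions, not the GPU's actual logic: the gray-ramp `PALETTE`, the function names, and the exact dimming formula are hypothetical, and it uses the original 4-bit light level:

```python
# Hypothetical palette: 256 RGB entries (gray ramp, illustration only).
PALETTE = [(i, i, i) for i in range(256)]
LIGHT_BITS = 4
LIGHT_MAX = (1 << LIGHT_BITS) - 1

def pack(index, light):
    """12-bit pixel: light level in the top 4 bits, palette index in the low 8."""
    return ((light & LIGHT_MAX) << 8) | (index & 0xFF)

def unpack(pixel):
    return pixel & 0xFF, (pixel >> 8) & LIGHT_MAX

def texture_pass(column, y0, y1, texels):
    """First pass: write palette indices (full brightness for now)."""
    for y, texel in zip(range(y0, y1), texels):
        column[y] = pack(texel, LIGHT_MAX)

def lightmap_pass(column, y0, y1, lights):
    """Second pass: overwrite only the light bits, keep the palette index."""
    for y, light in zip(range(y0, y1), lights):
        index, _ = unpack(column[y])
        column[y] = pack(index, light)

def to_rgb(pixel):
    """Dim the RGB value after palette lookup (assumed linear scaling)."""
    index, light = unpack(pixel)
    return tuple(c * light // LIGHT_MAX for c in PALETTE[index])

column = [0] * 8
texture_pass(column, 2, 6, [200, 200, 100, 100])   # same span, textures
lightmap_pass(column, 2, 6, [15, 5, 15, 0])        # same span, lightmaps
```

Because the palette index and the light level live in separate bits, the second pass never needs to read the texture again.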
And voila! I made a (tiny) GPU+SOC that renders Quake levels in 5K LUTs, on a low cost, quite slow #fpga. Rendering is laggy and has issues, and there is plenty of room for CPU-side optimizations. Nevertheless it is already fun to explore the levels of this great classic! 16/n
Here's the #fpga resource usage. Design uses two clocks, and validates a bit below 25/50 MHz. I overclock without trouble at 33/66 MHz. CPU has two cores, but each takes 4 cycles per instruction, so they only effectively run at 8.25 MHz ... sharing 128KB RAM. Not a lot! 17/n
Btw, Q5K runs happily on the icebreaker by @1bitsquared! On SPIflash the texture accesses are faster, but the screen uses an SPI interface instead of parallel, and that slows down the overall rendering. 18/n
The SOC uses my tiny @risc_v dual core processor, the ice-v-dual. A compact, simple to hack CPU I detailed in this video:
19/n
I also made a video on the DMC-1 GPU and the doomchip-onice. The global architecture is still the same, but I'll have to do an update for the latest features! 20/n
Many thanks to @tnt and the #mch2022 badge team for hardware + discussions, as well as @fabynou for great resources on Doom and Quake (e.g. Quake engine code review fabiensanglard.net/quakeSource/in… )
My main reference to decode Quake's bsp files was the Quake unofficial specs 3.4.
21/n
Also wanted to point out another GPU-on-fpga project rendering Quake levels: jbush001.github.io/2015/06/11/qua…
This explores a much more advanced GPGPU architecture and targets larger, more powerful FPGAs. Very interesting, check it out!
22/n
Thanks for reading this far! Source code is coming soon in #Silice's projects, as well as more details (so much more to discuss).
If you are interested, RT, stars, and likes are all welcome encouragements!
1/ @tinytapeout 7 just closed and these two tiles host my design! If it works, it will generate explorable terrain 'voxels' similar to the VoxelSpace Comanche 1992 game engine.
2/ The design is written in #Silice and exported to Verilog, and then synthesized to #ASIC through the amazing #TinyTapeout framework.
The terrain renderer fits on only two tiles, using some tricks 😎. Write-up pending.
3/ See also my original terrain demo running great on #fpga + VGA.
(Warning: the TinyTapeout design won't be fast, around 5 FPS, due to favoring size over perf. + IO frequency limitations).
a5k: Another World on a chip! This is a hardware remake of the Another World VM and renderer (no traditional CPU), that fits on a UP5K #fpga (5K LUTs, 128KB SPRAM).
Thread! (1/n)
(Written in #Silice, running on @1bitsquared icebreaker + VGA PMOD, intro only, no audio)
2/ Another World by @EricChahi was one of my favorite #Amiga500 games. It's a great game with beautiful polygon-based graphics. Its architecture is also fascinating: the whole game runs in a custom VM.
As the game turned 31 I thought a hardware version would make a great present!
3/ @fabynou has several great blog entries detailing the game's inner workings, with links to additional resources including source code. I won't go into much detail here so check it out for an in-depth overview! fabiensanglard.net/another_world_…
How much DooM can fit in a USB port? Quite a bit it turns out! A minuscule #Fomu #fpga board hosts my hardware/software re-implementation of the DooM render loop in the confines of a USB port (uses ~4200 LUTs and < 128 kB of internal RAM). (1/n)
This is a tiny piece of DooM in a 2.1x2.7 mm #fpga. That is pretty small! (can you see it below on the #Fomu board? you might have to zoom ...).
Within it I created a #riscv computer with specialized texturing and column drawing hardware. Designed to render DooM 1994 levels! (2/n)
The OLED screen is connected to the #Fomu through jumper wires soldered on the pads (a trick inspired by @brunolevy01 Fomu vga mod). (3/n)
The DooM-chip! It will run E1M1 till the end of times (or till power runs out, whichever comes first).
Algorithm is burned into wires, LUTs and flip-flops on an #FPGA: no CPU, no opcodes, no instruction counter.
Running on Altera CycloneV + SDRAM. (1/n)
Everything is described in a language I am working on: SDRAM controller, divider, BSP traversal, texture unit, etc.
Main renderer (w/o data) is 666 lines of code (!).
A great test case, made quite a few improvements, fixed some issues, learned a lot on CycloneV + Quartus.
(2/n)
Rendering uses the original BSP tree (of course!) but is modified to better fit a hardware implementation: columns are raycast and drawn immediately front-to-back, stopping as soon as fully filled.
(3/n)
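The front-to-back idea with early stop can be sketched in Python like this. It is a software simplification, not the DooM-chip logic: `render_column` and the per-pixel occupancy test are assumptions made for illustration:

```python
def render_column(spans, height):
    """Fill one screen column from spans given nearest first.

    spans: iterable of (y_start, y_end, color).
    Returns (column, spans_visited) so the early stop is observable.
    """
    column = [None] * height
    remaining = height
    visited = 0
    for y0, y1, color in spans:
        visited += 1
        for y in range(max(0, y0), min(height, y1)):
            if column[y] is None:        # nearer spans already own this pixel
                column[y] = color
                remaining -= 1
        if remaining == 0:               # column fully filled: stop early
            break
    return column, visited

# The third (farthest) span is never visited: the column is already full.
column, visited = render_column([(0, 4, 1), (4, 8, 2), (0, 8, 3)], height=8)
```

Drawing nearest first means no pixel is ever written twice, and everything hidden behind a filled column is skipped entirely.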
Wolfenstein 3D render loop in pure hardware! No CPU, no instruction pointer, no opcodes, only wires and flip-flops. Here it runs on a Mojo V3 board (Xilinx Spartan 6) + SDRAM. Reading @fabynou's black books while learning about #FPGA could only lead to this ;-)
(1/n)
Implemented from scratch using my language, from the SDRAM double-framebuffer to the Wolf3D DDA algorithm (and this is the original one; fixed point, DDA loop with only adds and shifts, tangent table!). 320x200, 256-color palette with 18-bit colors and VGA output -- old school!
(2/n)
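For readers unfamiliar with DDA raycasting, here is a minimal Python sketch of a grid DDA traversal. Note the assumptions: this is the common floating-point textbook variant, not the original fixed-point, tangent-table version mentioned above, and `cast_ray` and the map are hypothetical. What it does share is that, once set up, the inner loop only adds and compares:

```python
import math

def cast_ray(grid, px, py, angle, max_steps=64):
    """Return (cell_x, cell_y, distance) of the first wall cell hit, or None."""
    dx, dy = math.cos(angle), math.sin(angle)
    cx, cy = int(px), int(py)
    step_x = 1 if dx > 0 else -1
    step_y = 1 if dy > 0 else -1
    # Ray length needed to cross one full cell along each axis.
    delta_x = abs(1.0 / dx) if dx != 0 else math.inf
    delta_y = abs(1.0 / dy) if dy != 0 else math.inf
    # Ray length up to the first vertical / horizontal grid line.
    side_x = math.inf if dx == 0 else ((cx + 1 - px) if dx > 0 else (px - cx)) * delta_x
    side_y = math.inf if dy == 0 else ((cy + 1 - py) if dy > 0 else (py - cy)) * delta_y
    for _ in range(max_steps):
        # Advance to whichever grid line is closer: adds only.
        if side_x < side_y:
            dist, side_x, cx = side_x, side_x + delta_x, cx + step_x
        else:
            dist, side_y, cy = side_y, side_y + delta_y, cy + step_y
        if not (0 <= cy < len(grid) and 0 <= cx < len(grid[0])):
            return None
        if grid[cy][cx]:
            return cx, cy, dist
    return None

# Hypothetical 5x5 map: 1 = wall, 0 = empty.
grid = [
    [1, 1, 1, 1, 1],
    [1, 0, 0, 0, 1],
    [1, 0, 0, 0, 1],
    [1, 0, 0, 0, 1],
    [1, 1, 1, 1, 1],
]
hit = cast_ray(grid, 1.5, 2.5, 0.0)   # looking along +x from cell (1, 2)
```

The wall height for a screen column is then derived from the returned distance.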