WARNING! Do not connect the modified lamp to the mains! Use a 5-30V DC source!
Introduction – Inspiration from a Fake News
Last year, several websites reported the impressive news that someone had ported Doom to a pregnancy test.
Sadly, this achievement was fake news.
In fact, as clearly reported in Foone’s tweets, the author was trying to find out what was inside an electronic pregnancy tester. Since the electronics were not reprogrammable, and the LCD could show just a few segments, Foone replaced the display with an OLED display, and used a Teensy 3.2 board (featuring a 72 MHz Cortex M4). Later, Foone also experimented with streaming videos to the OLED display, attached to the Teensy, via the USB port. To do this, Foone used the SDL library to downscale and dither a window containing the video to be streamed, and sent it to the USB port, probably using the virtual CDC USB COM interface.
Among the streamed videos, there was a Doom demo playthrough. This tweet went viral, so later Foone ran Doom on a PC, connected a Bluetooth keyboard to it, and streamed the real-time gameplay video to the OLED display that was put in the pregnancy test. In other words, unlike what the news websites reported, Doom was ported neither to the pregnancy test, nor to the Cortex M4 board, as it was running on a PC. Incidentally, Foone also streamed the Skyrim opening sequence to the same display. Although the author was very clear about what had been done, the “press” was not…
Despite being fake news, this event was very inspiring…
We wanted to challenge ourselves once more, this time by bringing our contribution to the list of unusual things Doom has been ported to (this time for real!)…
The Challenge
The rules were:
- Find an off-the-shelf device, not meant to play Doom or games in general.
- The chosen device should have a microcontroller with reasonably limited computing power and/or memory, with respect to vanilla DOOM’s minimum requirements (DOOM runs at an acceptable frame rate even on a 486@33MHz[1], equipped with 4MB of RAM). As an example, we must exclude modern digital cameras, which have systems on a chip running at several hundred MHz, with some tens of MB of RAM.
- We must use exactly the microcontroller embedded on the chosen device. No replacement is possible. No additional microcontroller can be added. Overclocking (even if limited to just some peripherals or buses) might be possible though, provided we do not need any cooling techniques.
- Additional flash or memory card can be added to store WAD files.
- A color display can be added if the selected off-the-shelf device does not feature one. The resolution should be high enough to play Doom decently. For instance, a 32×16 pixel screen is too small, but 128×64 could be enough. On the other hand, too high a resolution would certainly require a very powerful microcontroller, against rule 2.
- Input device can be anything, so additional electronics can be added for that purpose.
- The power supply can be changed, if needed.
- We wanted the engine to be as close as possible to the original (vanilla) shareware Doom. Being able to play episode 1 map 1 of Shareware Doom (E1M1) is the minimum goal, even if we do not hide that we dreamed of being able to play the full shareware episode, with no restrictions, on all maps.
- There are no requirements on audio, but sound effects would be really a plus. If implemented, there are no restrictions on the audio subsystem.
- No need of multiplayer support.
Needless to say, challenge accepted!
Finding a Candidate Device to Run Doom: the IKEA TRÅDFRI GU10 RGB Lamp
Finding a good candidate device for this challenge was not easy, at least in principle. Nowadays, there are a lot of devices and appliances that have enough memory and processing power to run Doom, with minimal firmware modifications (i.e. limited to low level hardware access functions). However, such powerful devices are excluded by rule 2. Still, we wanted this task to be neither too easy, nor impossible, so a realistic processing power and memory size is required too. By chance we stumbled across the IKEA TRÅDFRI Zigbee lamps, and to our surprise we found they sported a powerful 40-MHz Cortex M4 based RF module, but with only 32 kB of RAM. A 40-MHz Cortex M4 is surely enough, but what about RAM?
To see if 32kB was enough, we searched for existing low-memory ports. Eventually, we came across this forum page, where a user was trying to port Doom to an unspecified microcontroller, featuring only 256kB of RAM but with a lot of flash memory. This page pointed us to Doomhack’s excellent open source Doom port to the Game Boy Advance (GBA), based on PrBoom. This work must not be confused with the official GBA Doom, which is in our opinion inferior to what Doomhack has done.
The GBA, however, has up to 384 kB of RAM (divided in 256kB of main memory, 96kB of framebuffer memory, and 32kB of on-chip fast ram, so called IWRAM), and up to 32MB memory mapped “ROM” on the cartridge, therefore we initially dropped the idea of porting Doom to the Ikea Tradfri lamp.
However, some months later we found (see here and here), again by chance, that the newest RGB GU10 IKEA TRÅDFRI LED1923R5 lamps have a better MCU, featuring 96 + 12 kB of RAM (108 kB in total), 1MB of Flash, and an 80MHz Cortex M33. More precisely, this new IKEA lamp uses a MGM210L RF module from Silicon Labs, which is based on their EFR32MG21 RF microcontroller.
This microcontroller still has less than one third of the GBA RAM but, perhaps with some imprudence, we thought that it might be enough to run at least the first map (E1M1). This assumption was also supported by the documentation in the GBA Doom port github repository. In fact, to our great pleasure, we noted that Doomhack documented the memory usage of each level in a convenient spreadsheet.
From his data, we found that E1M1 needed just short of 90 kB of RAM (87kB), low enough for our device equipped with 108 kB.
Unfortunately, we later realized (but it was too late!) that the figures reported in the spreadsheet likely did not take into account:
- IWRAM usage (32kB, including stack)
- the frame buffer (38kB @8bpp on the GBA, in our case, as shown later, 20kB)
- (not sure about this one, but extremely probable, by looking at the Z_zone.c code) ZMalloc + malloc overhead. The Zone memory implementation in PrBoom calls malloc(). The zone memory data structure overhead is 20 bytes per block, and the malloc() overhead is 8 bytes per block. This means that each block has a 28-byte overhead. Notably, several hundred blocks are allocated, accounting for more than 15kB of overhead in complex levels such as E1M6.
This means that level E1M1 required not just 90kB (87.4 kB actually), but something larger than 160 kB. Still, we thought that we could trade off some computing power of the 80MHz Cortex M33 to save memory, as explained later.
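As a rough check of the figures above, the sum can be written out in a few lines of C. This is only a back-of-the-envelope sketch: the three fixed terms are the numbers quoted in the text for the GBA port, while the number of live heap blocks for E1M1 is not given anywhere, so it is left as a free parameter.

```c
#include <assert.h>

/* Figures quoted in the text for the GBA port, level E1M1, in kB. */
#define E1M1_SPREADSHEET_KB 87.4  /* per-level figure from the spreadsheet */
#define IWRAM_KB            32.0  /* IWRAM usage, including stack */
#define GBA_FRAMEBUFFER_KB  38.0  /* 8 bpp frame buffer on the GBA */

/* Per-block bookkeeping: zone header plus malloc() header. */
#define ZONE_HEADER_BYTES   20
#define MALLOC_HEADER_BYTES 8

/* Heap bookkeeping overhead, in kB, for a given number of live blocks. */
static inline double heap_overhead_kb(int blocks)
{
    assert(blocks >= 0);
    return blocks * (double)(ZONE_HEADER_BYTES + MALLOC_HEADER_BYTES) / 1024.0;
}

/* Rough E1M1 total; `blocks` is a free parameter, not a measured value. */
static inline double e1m1_total_kb(int blocks)
{
    return E1M1_SPREADSHEET_KB + IWRAM_KB + GBA_FRAMEBUFFER_KB
         + heap_overhead_kb(blocks);
}
```

With zero blocks the fixed terms alone already add up to about 157.4 kB, and roughly a hundred live blocks are enough to push the total past 160 kB. About 550 blocks correspond to the 15 kB of overhead mentioned for E1M6.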
Display
As mentioned, we could add a display if the chosen device had none, so we went for a widespread and cheap color 1.8” TFT 160×128 SPI display, featuring the well known ILI9163 or ST7735S controller. These controllers are compatible, and they are both specified to run at 16 MHz. However, we found that they can be clocked much faster: in this project, we overclocked the display to 30 MHz, and found no glitches at all.
External Flash Memory
As said, the final processor has only 1MB of Flash, not enough to store a WAD file (at least 4.1MB for the shareware version). By the way, using Slade, a popular WAD editor, we also found that 1 MB would not have been enough even for E1M1 alone. Therefore, an external memory was required. Memory cards are ruled out, because they are not well suited for random reads, and the filesystem would just complicate and slow down everything. Due to the low pin count of the MGM210L module (and the absence of an external memory bus), the only solution was to use an external SPI memory. We chose the popular W25Q64 flash memory. Due to the current IC shortage, we were only able to find the 8MB version, and not the 16 MB version (W25Q128).
Input Device
The RF module has already… RF capability, so we could have implemented a radio interface, to support a wireless keyboard. However, this would have required some RAM, therefore we decided to simply use a wired keyboard. To play Doom, the bare minimum number of keys is 7 or 8, a number that fits very well with an 8-bit parallel-input serial-output shift register. This allows us to cut down the number of required I/O pins (you will see later that this required only one additional dedicated pin).
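To make the shift register idea concrete, here is a minimal sketch of how 8 keys can be read serially. It runs on a PC against a tiny software model of a parallel-in serial-out register, so `hc165_t` and its helpers are stand-ins for the real GPIO or SPI accesses. The key-to-bit assignment and the polarity (1 = pressed) are assumptions made for illustration, not the wiring used in the project.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Software stand-in for the shift register, so the sketch runs on a PC. */
typedef struct {
    uint8_t inputs;  /* levels on the 8 parallel inputs (bit 7 is shifted out first) */
    uint8_t latch;   /* internal shift register */
} hc165_t;

/* Parallel load: copy the input pins into the shift register. */
static inline void hc165_load(hc165_t *sr) { sr->latch = sr->inputs; }

/* Serial output: the most significant bit of the shift register. */
static inline int hc165_q(const hc165_t *sr) { return (sr->latch >> 7) & 1; }

/* Clock edge: shift the register by one position. */
static inline void hc165_clock(hc165_t *sr) { sr->latch = (uint8_t)(sr->latch << 1); }

/* Hypothetical key assignment, one bit per key (1 = pressed). */
enum {
    KEY_UP = 1 << 0, KEY_DOWN = 1 << 1, KEY_LEFT = 1 << 2, KEY_RIGHT = 1 << 3,
    KEY_FIRE = 1 << 4, KEY_USE = 1 << 5, KEY_ALT = 1 << 6, KEY_WEAPON = 1 << 7
};

/* One load pulse, then 8 reads: a single serial data line carries all keys. */
static inline uint8_t read_keys(hc165_t *sr)
{
    uint8_t keys = 0;
    assert(sr != NULL);
    hc165_load(sr);
    for (int i = 0; i < 8; i++) {
        keys = (uint8_t)((keys << 1) | hc165_q(sr));
        hc165_clock(sr);
    }
    return keys;
}

/* Demo helper: build a register with the given keys held down and read it. */
static inline uint8_t read_keys_demo(uint8_t pressed_mask)
{
    hc165_t sr = { pressed_mask, 0 };
    return read_keys(&sr);
}
```

For example, `read_keys_demo(KEY_FIRE | KEY_UP)` returns the same mask it was given. On real hardware the load and clock steps would be pin toggles, or a one-byte SPI transfer after a load pulse.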
Solving One issue at a Time
We did not try to port the game directly to the EFR32MG21, as too many issues still needed to be solved. In fact, not only did we need to optimize RAM, but we also needed to find a way to quickly access all WAD data (map, graphics, etc). This latter issue is not present on the GBA, as its processor has memory mapped access to the cartridge ROM/Flash, which can be as large as 32MB, i.e. big enough to store the full commercial WAD. However, as mentioned before, our system needs an external SPI flash memory, which means slower access speed, with respect to internal/parallel memory mapped flash.
If we had tried to directly port GBA Doom to the EFR32MG21, we would have been stuck for ages, trying both to optimize RAM usage and to figure out how to quickly read data from SPI. All this, without being able to directly verify that what we were doing was actually working. This would have been very boring, without any immediate self-gratification.
Therefore, we started porting DOOM to a powerful Cortex M7 STM32H743. This, not only has 1MB of RAM, but also features memory mapped QSPI flash read mode. In other words, with this processor, we can use a cheap QSPI flash, and read it seamlessly as if it were internal flash (memory mapped).
This helped us in many ways:
- To get a working starting point, we (almost) only needed to modify low-level hardware-related I/O and graphics functions. This was very quick and easy to achieve, and it was very satisfying, once we saw Doom running.
- We did not have to deal with SPI access functions from the beginning, so no code modification was initially required.
- Once Doom was running, we could focus on RAM optimization, being able to check if our efforts yielded a working solution.
- Once enough RAM optimization was achieved, we started progressively wrapping all memory-mapped WAD data accesses with SPI access functions.
- By changing CPU or external flash speed, we could determine which were the performance-limiting factors.
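The wrapping step mentioned in the list above can be pictured with a small sketch. The function names (`spi_flash_read`, `wad_read_i16`, `wad_read_u32`) are hypothetical and are not taken from the project sources. The external flash is simulated by a small array, so the example runs on a PC.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Stand-in for the external flash contents (hypothetical demo data). */
static const uint8_t sim_flash[8] = {
    0x34, 0x12, 0xFE, 0xFF, 0x78, 0x56, 0x34, 0x12
};

/* Hypothetical low-level read: on the MCU this would issue an SPI read
   command; here it simply copies from the simulated array. */
static inline void spi_flash_read(uint32_t addr, void *dst, size_t len)
{
    assert(addr <= sizeof sim_flash && len <= sizeof sim_flash - addr);
    memcpy(dst, &sim_flash[addr], len);
}

/* Wrapper replacing a direct dereference of a memory-mapped WAD pointer.
   WAD files store integers in little-endian order. */
static inline int16_t wad_read_i16(uint32_t addr)
{
    uint8_t raw[2];
    spi_flash_read(addr, raw, sizeof raw);
    return (int16_t)(uint16_t)(raw[0] | (raw[1] << 8));
}

static inline uint32_t wad_read_u32(uint32_t addr)
{
    uint8_t raw[4];
    spi_flash_read(addr, raw, sizeof raw);
    return (uint32_t)raw[0] | ((uint32_t)raw[1] << 8) |
           ((uint32_t)raw[2] << 16) | ((uint32_t)raw[3] << 24);
}
```

With wrappers of this kind, code that used to read `*(const short *)(wad_base + offset)` calls `wad_read_i16(offset)` instead, and the backend can be switched from memory-mapped reads to SPI commands without touching the callers.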
As we expected, RAM usage was well beyond the 108 kB limit, even for level E1M1. In particular, it was close to 160kB considering stack, global data, the 20-kB frame buffer, dynamic memory allocation (Z_Malloc + malloc) overhead, and the very badly bloated ST HAL library.
RAM Optimization
As we originally stated, our goal was being able to play at least E1M1 on the IKEA lamp, so once we reached the 108kB RAM usage we could have stopped.
However, once we hit that goal, we found that there were still a lot of things we could optimize, allowing other levels to run on the EFR32MG21. Therefore, we continued optimizing until the RAM usage (both dynamic and static, including stack and frame buffer) was below the critical 108 kB threshold for all levels.
Here is a non-exhaustive list of optimizations. Some of them were actually implemented after we started porting to the EFR32MG21.
- The Doom code made extensive use of 32-bit integers, where 16 or 8 bits were enough. Similarly, many 16-bit variables were used, when fewer bits were enough. Bitfields were used as well for those integers requiring fewer values.
- Enums were used very frequently, but these are usually implemented as 32-bit integers. Instead of using enum data types, we trimmed the data size to be just enough to store the largest value.
- We reordered many structure members. Structure reordering saves space by eliminating the padding bytes otherwise required to align 16 or 32 bit members.
- In the GBA port, several tables were copied to the IWRAM region, to increase speed. We kept them in flash.
- We reduced the number of openings (1280), drawsegs (128), and visprites (64). In a small screen this will not make too much difference. We did not find any issue so far.
- Pointers. This would deserve an article on its own, but the key point is: why do we need a 32-bit pointer, if our RAM is less than 128kB?
Furthermore, almost all pointers point to 4-byte aligned structures. This means that a 16-bit pointer is enough (we do not need to store the lower 2 bits, as they are always zero). Therefore, many pointers were replaced by 16-bit short pointers. Setting and getting data will of course require more CPU cycles (to convert a 32-bit pointer to a 16-bit one and vice versa), but with an 80MHz MCU this is not a huge issue. Remember when we said that sometimes we can sacrifice CPU power for RAM and vice versa? Well, this is the perfect example. This is probably the most important memory optimization.
- We reverted PrBoom Z_Zone to an implementation much closer to the original one (removing useless stuff like Zone Id, and using 16 bit pointers, to save several kB when you allocate many buffers). Why? Because PrBoom’s Z_Malloc() uses the stock malloc/free functions, which on their own store memory block data (8 bytes, typically), further wasting memory.
- The object structure (mobj_t) in Doom is huge. In the GBA port it was 140 bytes large, and we reduced it to 92 bytes, already saving many kB on the more complex levels, like E1M6 (463 objects). However, some objects, like bonuses and decorations, are static, and do not need all the information required by enemies, bullets, etc. Therefore, we also created a static mobj type, which more than halves the memory requirements (its size is 44 bytes). In some levels, there are more than 200 of such objects. These optimizations saved more than 30kB of RAM in E1M6.
- We used memory pools (used only for msecnode_t data in the GBA port) for objects (mobj_t and static_mobj_t). This reduces the dynamic allocation overhead to 1 byte/object for pools as small as 16 entries. To achieve this goal, we also had to optimize the memory pool allocation system.
- When possible, 8 or 16-bit array indexes were used, instead of pointers.
- Some information, such as switch textures, can change during gameplay, so it needs to reside in RAM. However, the number of switches in the game is very limited, compared to the actual number of sides in the level. Static walls can reside in internal or even external flash. Therefore, for the switches, we created arrays to store changeable texture information, while the other, constant information is read from flash.
- Some data in the WAD file require additional calculations before they are used, and the results, such as memory pointers, cannot be easily calculated at compile time (for instance, this occurs with some sprite data structures). In the GBA port, such data are kept in RAM, even though they will remain constant as long as you don’t change the WAD file. In our case, instead, we save the results in the internal Flash (our programming algorithm checks whether programming is actually needed or the same data is already present), saving many kB.
- Similarly, some level-dependent data are copied to flash, see later, for more discussion about this.
- The 160×128 pixel display would require a 20kB buffer @ 8 bpp. We cut 5kB, by considering that the 3D view takes only the first 96 lines. Therefore, we first calculate and render the 3D scene (only 160×96 pixels), we send the result to the display, and then we draw the status bar, sending the remaining 160×32 pixels. This saves 5 precious kB of RAM, without affecting performance. To be honest, we also tried to use a 5kB frame buffer (i.e. saving 15kB), however it had two big drawbacks: the 3D scene had to be calculated 3 times (although, with some tricks, it was not 3 times slower), and the tearing effect was very noticeable and annoying. By the way, the display has a 16-bit native pixel format, so 8 to 16 bit conversion is required (using the palette). This is done on the fly, before sending each pixel, as the 80 MHz CM33 is very fast.
- doomhack’s GBA Doom code used 16 kB of IWRAM to implement a software cache, to speed up composite texture rendering. We cannot afford such a huge cache, so we completely removed this feature. This, however, also has a positive side effect. In fact, probably to increase the hit/miss ratio, mip-mapping was used on composite textures, lowering the details even at close distance (as a minimum, columns were sampled at 2x). Since the cache has been removed, there is no point in enabling mip-mapping, as this would not give any performance boost. Conversely, disabling mip-mapping (i.e. increasing detail) does not affect performance.
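The 16-bit pointer idea from the list above can be illustrated with a minimal sketch. It is not the project's actual implementation. The names `short_ptr_t`, `to_short` and `to_long` are made up for this example, the RAM region is simulated by a static array so that the code runs on a PC, and reserving the value 0 for NULL is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the MCU RAM; on the target this would be the start of SRAM.
   An array of 32-bit words guarantees 4-byte alignment of the base. */
static uint32_t ram_pool[1024];
#define RAM_BASE ((uintptr_t)ram_pool)

/* Short pointer: word offset from RAM_BASE; 16 bits address up to 256 kB.
   Assumption: 0 encodes NULL, so the very first word is never referenced. */
typedef uint16_t short_ptr_t;

static inline short_ptr_t to_short(const void *p)
{
    if (p == NULL)
        return 0;
    uintptr_t offset = (uintptr_t)p - RAM_BASE;
    assert(offset != 0);                  /* word 0 is reserved for NULL  */
    assert((offset & 3u) == 0);           /* target must be 4-byte aligned */
    assert((offset >> 2) <= UINT16_MAX);  /* must fit in 16 bits           */
    return (short_ptr_t)(offset >> 2);
}

static inline void *to_long(short_ptr_t sp)
{
    return sp ? (void *)(RAM_BASE + ((uintptr_t)sp << 2)) : NULL;
}

/* Same logical content, before and after shrinking (illustrative types). */
typedef enum { THING_MONSTER, THING_BONUS } thing_kind_t;

typedef struct fat_node {
    struct fat_node *next, *prev;  /* 4 bytes each on a 32-bit MCU */
    int32_t health;
    thing_kind_t kind;             /* enums usually take 4 bytes   */
} fat_node_t;                      /* 16 bytes on a 32-bit MCU     */

typedef struct {
    short_ptr_t next, prev;        /* 2 bytes each                 */
    int16_t health;
    uint8_t kind;
} slim_node_t;                     /* 8 bytes, including 1 byte of padding */
```

Each access costs a shift and an add, which is the CPU-for-RAM trade described in the list. In this toy node, the two links shrink from 8 to 4 bytes and the whole structure from 16 to 8 bytes on a 32-bit target.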
Porting to the EFR32MG21
When we achieved a good level of memory optimization, we underclocked our Cortex M7 to match the Cortex M33 performance. We also disabled the data cache, highlighting a very big issue: the speed is extremely dependent on WAD data access, rather than actual CPU speed.
As mentioned, we also implemented some wrapper functions, to retrieve WAD data using SPI Flash read commands, rather than using the memory mapped mode. This was required because the EFR32MG21 does not have a memory mapped readout mode for SPI flash memories. This problem is exacerbated by two other factors:
- Maximum SPI clock speed is (theoretically) 20MHz
- The EFR32MG21 has no QSPI interface.
This means that we could access our SPI flash with only 2.5 MB/s maximum throughput, and we found that this might lead to a single digit frame rate.
Shall this stop us? Of course not!
1. Overcoming SPI Speed Limitation
The SPI clock speed is limited to 20MHz because the peripheral bus speed, according to the EFR32MG21 datasheet is limited to 50MHz (and since it is tied to the CPU clock by an integer division factor, this means 40MHz). However, we found this figure to be very conservative, and we found no issues by overclocking it to 80MHz, at least at room temperature.
This already gives us a 2x increase in bandwidth: 5MB/s! Can we do better? Yes!
The EFR32MG21 has three USARTs. One of them is used for printing debug messages and downloading the WAD file to the external flash. We have two USARTs left… and the peripheral reflex system! With these, we can synchronize the two USARTs (working in SPI mode) so that they read data at the same time, effectively creating a dual SPI (DSPI). Luckily, the QSPI memory also supports dual read, and in this way, we can reach 10MB/s!
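The three throughput figures quoted so far (2.5, 5 and 10 MB/s) follow from one simple relation: every data line moves one bit per SPI clock cycle. A minimal sketch of that arithmetic, with a hypothetical helper name, is:

```c
#include <assert.h>
#include <stdint.h>

/* Peak payload throughput in bytes per second: one bit per data line per
   SPI clock cycle. Command and address overhead is ignored. */
static inline uint32_t spi_peak_bytes_per_s(uint32_t spi_clock_hz,
                                            unsigned data_lines)
{
    assert(data_lines >= 1 && data_lines <= 4);
    return (uint32_t)((uint64_t)spi_clock_hz * data_lines / 8u);
}
```

Here `spi_peak_bytes_per_s(20000000, 1)` gives the in-spec 2.5 MB/s, `spi_peak_bytes_per_s(40000000, 1)` the overclocked 5 MB/s, and `spi_peak_bytes_per_s(40000000, 2)` the 10 MB/s of the emulated dual SPI. These are peak values for long sequential reads. Random accesses pay extra for the command and address bytes.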
Noticeably, the two SPIs need to be kept synchronized, so we need to read data with precise timings. We tried several strategies, but most of them failed to get a reliable readout. For instance, we verified that the LDMA of EFR32MG21 cannot sustain the required readout speed when we overclock the bus and get a 40MHz SPI (instead, with 40 MHz bus clock, i.e. 20 MHz SPI, it works fine as per specification). The same issue occurs if you try to poll the USART status register, even using optimized assembly code.
The only approach we found to work reliably required a little trick.
The issue is that every time we try to read any USART register (not only data, but also status register), some wait states are introduced, probably due to CPU to peripheral bus clock synchronization. This is very common to many microcontrollers. Noticeably, from our measurements, we found that synchronization is introduced even if the CPU and the peripheral have the same clock speed.
Therefore, if we poll the USART status or interrupt flag register to see if some data has been received, we stall the CPU for several cycles, leaving us too little time to read new data from both USARTs (which, in turn will stall for some other cycles), before new data becomes available. This ends in desynchronization, because the first USART will start sending new data, while the second one is halted, as its RX FIFO is still full.
The trick that helped us to achieve reliable synchronizations is using the WFI instruction.
When we want to read data from the SPI, we mask (disable) interrupts[2], but we still enable interrupt signal generation when RX data becomes available on one USART (we just need to check only one of them, as they are synchronized).
When RX data is available, the corresponding interrupt will be in the pending state, but its handler will not be called. We can exploit this, to achieve precise synchronization, by using the WFI instruction (wait for interrupt). Once this instruction is executed, the CPU will sleep, until an interrupt signal is generated. Therefore, after executing the WFI instruction, when data becomes available the CPU will quickly wake up, so we can read data from both USART RX FIFOs and manually clear the CM33 interrupt pending status register. With this procedure, we do not need to access the peripheral bus just to assess if data is available, but only when we actually read the received data.
Being able to know when we need to read data, without actually accessing the peripheral status register, was the crucial point here.
Noticeably, with this emulated dual SPI approach, the minimum readout granularity is 2-bytes, and we had to interleave data when writing, and addresses when reading.
Interleaving data is not an issue, as it is done only when we download the WAD file and during save games. Interleaving addresses is not a big issue as well, as the Cortex M33 can efficiently handle the interleaving algorithm, without wasting too many cycles. Furthermore, addresses are set quite rarely, if compared to data readout. In fact, address is automatically incremented in the SPI flash, so reading a contiguous block of data requires setting the address only for the first access.
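To show why the data must be interleaved when it is written, here is a small PC-side model. It assumes the usual dual-output bit order of SPI NOR flashes, where on every clock one line (IO1) carries the odd bits 7, 5, 3, 1 and the other (IO0) the even bits 6, 4, 2, 0 of each stored byte, and it assumes one MSB-first receiver per line. The article does not spell out the exact mapping used in the firmware, so the function names and the byte order are illustrative only.

```c
#include <assert.h>
#include <stdint.h>

/* Spread the 4 bits of a nibble over the even bit positions 6, 4, 2, 0. */
static inline uint8_t spread4(uint8_t nibble)
{
    assert(nibble <= 0x0F);
    return (uint8_t)(((nibble & 8u) << 3) | ((nibble & 4u) << 2) |
                     ((nibble & 2u) << 1) | (nibble & 1u));
}

/* Bytes to store in flash so that the receiver on IO1 gets d_io1 and the
   receiver on IO0 gets d_io0. Two data bytes map to two flash bytes. */
static inline uint8_t flash_byte0(uint8_t d_io1, uint8_t d_io0)
{
    return (uint8_t)((spread4(d_io1 >> 4) << 1) | spread4(d_io0 >> 4));
}

static inline uint8_t flash_byte1(uint8_t d_io1, uint8_t d_io0)
{
    return (uint8_t)((spread4(d_io1 & 0x0F) << 1) | spread4(d_io0 & 0x0F));
}

/* Model of the readout: each receiver shifts in one bit per clock, MSB
   first. rx[0] is what the receiver on IO1 sees, rx[1] the one on IO0. */
static inline void dual_read_pair(uint8_t f0, uint8_t f1, uint8_t rx[2])
{
    const uint8_t flash[2] = { f0, f1 };
    rx[0] = rx[1] = 0;
    for (int i = 0; i < 2; i++) {
        for (int bit = 7; bit > 0; bit -= 2) {
            rx[0] = (uint8_t)((rx[0] << 1) | ((flash[i] >> bit) & 1));
            rx[1] = (uint8_t)((rx[1] << 1) | ((flash[i] >> (bit - 1)) & 1));
        }
    }
}

/* Write-side interleaving followed by the modeled readout. */
static inline int roundtrip_ok(uint8_t a, uint8_t b)
{
    uint8_t rx[2];
    dual_read_pair(flash_byte0(a, b), flash_byte1(a, b), rx);
    return rx[0] == a && rx[1] == b;
}
```

Because each receiver collects 8 bits over two flash bytes, the model also shows where the 2-byte minimum readout granularity comes from. The same kind of bit shuffling would apply to the address bits if they are sent over both lines.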
2. Caching to Flash
As mentioned above, we decided to put some constant data to flash. These data might be WAD or level dependent.
WAD-dependent data will remain constant as long as the WAD file is not changed, so they are stored in a region we called the immutable flash region. Although these data are recalculated every time the microcontroller is reset, programming occurs only if the currently stored value is different. This preserves flash lifetime, and saves boot-up time too.
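The "program only if different" rule can be sketched as follows. This is a PC-side illustration with a simulated flash page. The names `store_if_changed` and `sim_flash_program`, as well as the 64-byte page size, are placeholders and not the real flash driver or geometry.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_PAGE_SIZE 64          /* placeholder, not the real page size */

static uint8_t  sim_page[DEMO_PAGE_SIZE];  /* simulated flash page        */
static unsigned erase_count;               /* program-erase cycles so far */

/* Stand-in for erase + program; on the MCU this would call the flash driver. */
static inline void sim_flash_program(const void *data, size_t len)
{
    memset(sim_page, 0xFF, sizeof sim_page);  /* erase */
    memcpy(sim_page, data, len);              /* program */
    erase_count++;
}

/* Compare first, and spend a program-erase cycle only when the stored
   content differs. Returns 1 if the page was reprogrammed, 0 if skipped. */
static inline int store_if_changed(const void *data, size_t len)
{
    assert(len <= DEMO_PAGE_SIZE);
    if (memcmp(sim_page, data, len) == 0)
        return 0;
    sim_flash_program(data, len);
    return 1;
}
```

Storing the same data twice costs a single cycle: the first call programs the page, and the second one only performs the comparison, which is also much faster than erasing and programming.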
Level-dependent data are for instance constant map data, floor/ceilings and wall textures. Some map data are constant, so they can be stored in flash, instead of RAM, saving some kB. Graphics data are cached in flash to increase speed. Each level has its sets of textures, while the amount of internal flash is limited. The floor/ceiling (flats) algorithm would suffer too much if we had to fetch data from external flash, so external flash data fetch for flats has not been implemented. Therefore, we first store all the flats, and then we leave the remaining flash space to cache as many wall textures used by the level as possible. This procedure is done every time a level is started, however, if the same level is to be restarted, programming does not occur.
This means that changing level frequently will have an impact on flash reliability. In fact, flash memory has a limited number of program-erase cycles (10k), after which the manufacturer does not guarantee a 10-year retention. However, some considerations are worth making:
- Code region is almost never rewritten (only when you flash the code, for an upgrade).
- Immutable flash region is erased-reprogrammed only when a new wad is uploaded. It is very unlikely that you will upload 10000 times a new WAD (it takes ages!).
- For level-dependent data, you need to change level 10000 times before going out of spec. If you are a hardcore runner, you might finish each level with an average time below 1 minute (some speedrunners finished some levels in less than 20 seconds, however you must then add the time to load the next map!). 10000 minutes means 166 hours non-stop.
- Exceeding 10000 program-erase cycles does not mean that the flash suddenly stops working. It means that the manufacturer does not guarantee 10-year data retention. However, level flash data is refreshed every time you change level. It is very unlikely that you will keep playing the same level for 10 years!
4. Audio and Support to All the Shareware Levels
We were proud and impressed when we managed to run Doom on the xMG21, not just the first level, but all the shareware levels. With this we already won our challenge, but there were still 2 issues we wanted to fix:
- Audio was missing. We can live without Doom’s music, but what is a game without sound effects?
- E1M6 ran fine at the lowest skill level, but in “ultra violence” mode it ran out of memory after a few minutes.
Problem 2 was also exacerbated when we introduced audio, as it takes 2kB of RAM.
For audio, we cannot use interrupts, because we use the WFI instruction to keep the two SPIs synchronized. Luckily, we can use DMA, so we just need to load data in one buffer, and the DMA will periodically transfer data to the compare register of one timer working in PWM mode. The buffer can be refreshed only once per frame, and the lowest frame rate we want to support without audio glitches determines the minimum number of samples the buffer must contain. Vice versa, the size of the buffer determines the frame rate below which we will start hearing audio glitches.
With a 1024-sample buffer, we need a frame rate of at least 10.8 fps. This frame rate is easily achievable, so 1024 was the chosen buffer length. However, since we are using more than one channel, samples must be 16 bits wide (even if at the end the PWM outputs fewer than 16 bits), therefore the buffer size is 2048 bytes.
The second issue was solved by further improving RAM usage, saving a few more kB. Still, we are very close to the limit in E1M6, with ultra violence mode, even though we did not get any out of memory issue.
We are planning to add some other features that allow us to recover some memory when needed (e.g. by removing enemy corpses if we run out of memory).
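The relation between buffer length and minimum frame rate is a one-line calculation. The sketch below assumes a sample rate of 11025 Hz, the usual rate of Doom sound effects. The article does not state the rate explicitly, but this value reproduces the 10.8 fps figure.

```c
#include <assert.h>
#include <stddef.h>

#define SAMPLE_RATE_HZ 11025.0   /* assumption: standard Doom sound-effect rate */

/* The buffer is refilled once per frame, so a frame must not last longer
   than the time the DMA needs to play the whole buffer. */
static inline double min_fps_without_glitches(unsigned buffer_samples)
{
    assert(buffer_samples > 0);
    return SAMPLE_RATE_HZ / buffer_samples;
}

/* RAM taken by the buffer for a given sample width in bytes. */
static inline size_t audio_buffer_bytes(unsigned buffer_samples,
                                        size_t bytes_per_sample)
{
    return (size_t)buffer_samples * bytes_per_sample;
}
```

For 1024 samples this gives about 10.77 fps, which rounds to the 10.8 fps quoted above, and 1024 samples of 2 bytes each take the 2 kB of RAM mentioned earlier. Halving the buffer to 512 samples would save 1 kB but raise the minimum frame rate to about 21.5 fps.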
The Hardware
Let’s step back from software, and focus on the hardware. The block diagram and circuit level schematics of the prototype are shown below.
In the Ikea lamp, the module MGM210L is soldered on a board, which already carries the 30V to 5V (and 3.3V) voltage regulator. That module could work with a much lower voltage (5V too!), but for this you need to remove resistor R25 (see below). Its function is to turn off the DC-DC buck converter, when the input voltage is too low for the lamp to operate.
The I/O pins of the module are connected to the display, the external flash, a coupling capacitor for audio (adding a low pass filter will improve audio quality), and a widespread parallel input serial output shift register (74HC165). The latter is used to allow up to 8 keys. If you want more keys, you can cascade two of these ICs.
Noticeably, the load signal of the 74HC165 is connected to the D-C pin of the display. When the display is not selected, its D-C signal can change state without affecting display content, and we exploit this to load key states.
A capacitor blocks the DC component of the PWM output. This is fed directly to the amplified speakers, without low pass filter (LPF), for two main reasons:
- There was not enough space on the prototyping board.
- We rely on the input low pass filter (and limited bandwidth) of the amplified speakers. The current PWM frequency is very high.
When we route the PCB, we will also add the LPF, as most of the components, including the big shift register IC, will be SMD.
Finally, you might wonder why we put a diode: this is because, if you fit all the circuitry inside the lamp, you have a 50% chance (and, by Murphy’s law, even more) of feeding the supply with the wrong polarity.
How To Build
WARNING: THE FINAL DEVICE RUNS ONLY AT A 30V MAXIMUM INPUT DC VOLTAGE. DO NOT PLUG THE MODIFIED LAMP IN AN AC MAINS POWERED SOCKET, AS IT WILL INSTANTLY BLOW OFF. FURTHERMORE DO NOT USE THE ORIGINAL HIGH VOLTAGE AC TO DC CONVERTER, IT FEATURES NO ISOLATION!
First of all, you need an IKEA Tradfri GU10 RGB lamp. Actually, you do not necessarily need to buy that lamp, because what you really need is the MGM210L module (or an EFR32MG21AxxxF1024, if you feel brave). You can buy an MGM210L module from Mouser, Farnell, Digikey, etc. If you decide to buy it, you need to find a way to provide 5V and 3.3V, required by the display module, and by the RF module + shift register, respectively.
You can find here the step by step disassembly of the lamp. Once you have the lamp, you can use a cutter to pop out the plastic top part of the lamp. You might also need to cut the glue which secures it in place. After that, remove the two little screws, and remove the LED PCBs with small pliers.
You can then remove the metal heat spreader, to reveal the high voltage AC to DC converter. Using pliers, you can pull it out of the lamp (the board is connected to the GU10 contacts using a 2-terminal header). Here are all the components of our disassembled lamp.
Note: avoid accidentally dropping the lamp on the floor like we did: the outer case is made of glass…
The high voltage AC-DC board also carries the low voltage DCDC and RF board. You can separate them easily using a soldering iron.
The 2-contact connector of the AC-DC board can be reused if you want to fit everything in the lamp case.
Now, as we mentioned previously, you should remove R25, so that the DCDC module will work with input voltages as low as 5V.
Then, we need to get the supply and I-O lines out of that PCB. For this purpose, we used a prototyping board that we shaped to accept the DC-DC board with the RF module.
After the DC-DC board is secured with some tape, solder some small wires from the DC-DC contacts and the module itself, like we did in the board.
The wires are attached to three headers, which allow the module to be plugged into another prototyping board, which will contain everything else for this project.
We admit, the second board is a mess of wires and components, so we strongly suggest you route your own PCB (or wait until we finalize the PCB design and upload it to the github repository!). As you might see, there are also some SMD components soldered on the solder side. On the component side, we placed the 74HC165 and the 8 MB flash IC. (We soldered wires to connect it!)
For the audio part, we did not have enough space to put a jack connector, therefore we just provided a 2-pin header. We also suggest putting a low pass filter after it, to improve audio quality.
The keyboard PCB is very simple, just notice that it contains 330-Ohm series resistors.
The microcontroller module can now be fitted on the carrier board as shown below.
You must now connect the keyboard and the display, to program the device and upload the WAD file. Connect an SWD programmer and a USB to TTL UART device to the leftmost header connector (your SWD programmer might embed the UART functionality too).
Programming and WAD transfer
Follow these steps:
- The device is programmed using any JLink compatible SWD programmer. The project on the github repository can be compiled using Silicon Labs’s Simplicity Studio V5. NOTE: at the end of programming, you might get an error. Ignore it, it will work.
- Note: in the github repository, the already converted shareware DOOM1.wad (mg21DOOM1.wad) is present, so you can skip this step and go directly to the next one! If you have a WAD other than the shareware version, you need to convert it to a particular format, compatible with this port. For this purpose, use the mg21wadutil command-line utility (which is a modified version of doomhack’s GBAwadutil).
To convert, open a DOS prompt and type: mg21wadutil.exe <input file> <output file>
For instance: mg21wadutil.exe doom1.wad mg21doom1.wad
- The converted WAD must be sent to the external flash via the YMODEM protocol (XMODEM is supported too). For this operation you need a USB to TTL UART converter, with RX and TX connected as shown before.
To upload the wad, power up the device, and keep pressed the buttons “use”, “change weapon” and “alt”, to initiate Ymodem reception. Use Teraterm and send the file via YMODEM protocol. The COM settings are the de facto standard 115200 bps, 8 bit, No parity, 1 stop bit.
Be aware that the upload process is very slow, and it will take more than 10 minutes for the shareware WAD. In the meantime go and have some coffee or tea… Furthermore, the first packet will trigger also external flash erase, therefore it might take some tens of seconds to be actually sent. Since each packet also require programming and CRC calculation, you can expect about 6-7 kB/s.
After the download has been complete reset the device, and you should see DOOM running!
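To get a feeling for the upload time, here is a minimal sketch of the arithmetic. It assumes 1 kB = 1000 bytes, 10 bits on the wire per byte (8N1), and the size of the original shareware DOOM1.WAD v1.9 (4,196,020 bytes); the converted mg21DOOM1.wad may have a different size, so treat the result only as an order of magnitude. The initial flash erase is ignored.

```python
# Rough YMODEM upload-time estimate for the figures quoted above.
BAUD = 115_200           # bits per second (from the text)
BITS_PER_BYTE_8N1 = 10   # 1 start bit + 8 data bits + 1 stop bit

# Assumption: size of the original shareware DOOM1.WAD (v1.9).
# The converted file may differ in size.
WAD_BYTES = 4_196_020


def upload_seconds(size_bytes: int, bytes_per_second: float) -> float:
    """Time needed to move size_bytes at a sustained rate (erase time ignored)."""
    return size_bytes / bytes_per_second


line_rate = BAUD / BITS_PER_BYTE_8N1         # raw UART ceiling, in bytes/s
best_case = upload_seconds(WAD_BYTES, line_rate)
slow = upload_seconds(WAD_BYTES, 6_000)      # 6 kB/s, lower figure from the text
fast = upload_seconds(WAD_BYTES, 7_000)      # 7 kB/s, upper figure from the text

print(f"UART ceiling: {line_rate:.0f} B/s, about {best_case / 60:.1f} min")
print(f"At 6-7 kB/s: about {fast / 60:.1f} to {slow / 60:.1f} min")
```

Under these assumptions, even the raw UART ceiling would keep you waiting for several minutes, and the per-packet programming and CRC overhead accounts for the rest.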
Mounting
The connector that connected the high-voltage board to the GU10 contacts is soldered to a third PCB, which allows for an easy connection.
Then fix the two modules to the lamp with some means… we used tape…
Then find a suitable lamp holder. PLEASE NOTE THAT IT MUST BE POWERED WITH A DC VOLTAGE NO HIGHER THAN 30V!
Now it is ready: an IKEA lamp running Doom!
Some Screenshots
Let’s show our work in action!
NOTE! In all the screenshots, the AMMO counter shows the frame rate multiplied by 10 (i.e. 300 means 30 fps).
For those who are interested, all the development was done on a Silicon Labs WSTK devboard, using an MGM210 radio module. Since the pins are routed differently, in the code you might find that we used the WSTK define to determine how the pins should be configured. If you look closer, you’ll notice we used different resistor values for the key filters: 390 Ohm and 15 kOhm, instead of 330 Ohm and 10 kOhm, respectively. These values are of course not critical.
Performance
From an 80 MHz Cortex M33 one would expect a constant 35 fps (the maximum frame cap) on a display as small as 160×128 (the actually rendered 3D view is 160×96, as 32 rows are used for the status bar). In fact, such computing power would be enough for a 320×200 display (assuming unlimited SPI bandwidth). The game is definitely playable: it reaches more than 30 fps in many cases, and it is almost always above 20 fps, even in complex areas. The major issue here is the slow external flash access. In fact, even if we can reach a peak of 10 MB/s, we need to take into account that random access is much slower, and that flash access is done by software and not by hardware.
In particular, the game performance does not depend much on the level complexity, but rather on the amount of graphics data that needs to be drawn but is not present in the internal flash. This means sprites and uncached textures. For instance, when there are many sprites on screen, we experienced frame rates as low as 16 fps.
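To put the peak flash figure in perspective, the following sketch computes how many bytes could be read per frame if the 10 MB/s peak were sustained (assuming 1 MB = 10^6 bytes). This is only an upper bound: as said above, random access and software-driven reads are much slower, so the real budget is smaller.

```python
PEAK_FLASH_BYTES_PER_S = 10_000_000   # peak rate from the text (1 MB = 10**6 bytes assumed)
RENDERED_PIXELS = 160 * 96            # 3D view size from the text


def peak_bytes_per_frame(fps: float) -> float:
    """Upper bound on the bytes readable from external flash in one frame."""
    return PEAK_FLASH_BYTES_PER_S / fps


# Frame rates mentioned in the text: frame cap, typical, complex areas, worst case.
for fps in (35, 30, 20, 16):
    budget = peak_bytes_per_frame(fps)
    print(f"{fps:>2} fps: {1000 / fps:5.1f} ms per frame, "
          f"at most {budget / 1000:6.1f} kB from flash, "
          f"{budget / RENDERED_PIXELS:5.1f} bytes per rendered pixel")
```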
Comparison with GBA Port
On the GBA, Doom inevitably runs slower, because the GBA features a much less powerful CPU: only a 16.7 MHz ARM7TDMI. An ARM7TDMI features about 0.9 DMIPS/MHz, whereas a CM33 about 1.5 DMIPS/MHz. This means that our system is about 8 times faster. What is missing in our system is a large and fast memory-mapped flash memory (and a bigger RAM).
In terms of display size, the GBA has a 240×160 resolution (actual 3D rendering size 240×128 due to the status bar), however it runs in low detail mode, i.e. the number of computed pixels is cut by half. In the end, the comparison is between 120×128 pixels (GBA) and 160×96 (our display): exactly the same number of 3D rendered pixels.
The GBA port is also missing the Z-depth lighting feature (sprites, walls, floors and ceilings getting darker with distance). This was probably done to save some RAM. In fact, the original Doom used about 11 kB of RAM for this feature. We implemented it using less than 3 kB of flash and 8 bytes of RAM!
Composite textures were rendered on the GBA port using a mip-mapping-like effect, probably to increase software cache hits. We do not have RAM for a software cache (on the GBA port it occupied 16 kB of IWRAM), therefore we restored full detail rendering, with no penalties (note! This has been upgraded after creating the video and taking the pictures, so in the pictures you’ll still see low details on composite textures!).
The GBA port does not support demos well: we did not bother to fix this issue, because demo playback also automatically switches levels, reducing flash lifetime.
However, the GBA port supports larger WAD files and levels. We did not try other WADs besides the shareware one, as we had only an 8 MB flash. We will try in the future with a 16 MB flash IC.
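The "about 8 times" figure follows directly from the numbers above. As a quick check (keeping in mind that DMIPS is only a rough indicator of real-world performance):

```python
def dmips(clock_mhz: float, dmips_per_mhz: float) -> float:
    """Approximate Dhrystone MIPS from clock speed and per-MHz efficiency."""
    return clock_mhz * dmips_per_mhz


gba = dmips(16.7, 0.9)    # ARM7TDMI in the GBA
lamp = dmips(80.0, 1.5)   # Cortex M33 in our module
speedup = lamp / gba

print(f"GBA: {gba:.1f} DMIPS, lamp: {lamp:.1f} DMIPS, ratio: {speedup:.2f}x")
```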
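The pixel count comparison can be verified with a few lines. The helper below is just an illustration of the arithmetic (its name and parameters are ours, not taken from either port); low detail mode is modeled as halving the number of computed columns, as described above.

```python
def view_pixels(width: int, height: int, status_bar_rows: int,
                low_detail: bool = False) -> int:
    """Number of 3D pixels actually computed per frame."""
    visible_rows = height - status_bar_rows
    columns = width // 2 if low_detail else width
    return columns * visible_rows


gba_pixels = view_pixels(240, 160, 32, low_detail=True)   # 120 x 128
lamp_pixels = view_pixels(160, 128, 32)                   # 160 x 96

print(gba_pixels, lamp_pixels, gba_pixels == lamp_pixels)
```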
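The 11 kB figure is consistent with the lighting lookup tables of the original engine. The sketch below assumes the table dimensions of the publicly released Doom source as we recall them (LIGHTLEVELS = 16, MAXLIGHTZ = 128, MAXLIGHTSCALE = 48) and 4-byte pointers on a 32-bit target; it only accounts for the original RAM usage, and says nothing about how the tables are replaced in this port.

```python
# Assumed constants, as recalled from the original Doom source (r_main.h).
LIGHTLEVELS = 16      # number of sector light levels
MAXLIGHTZ = 128       # distance steps for floors and ceilings
MAXLIGHTSCALE = 48    # scale steps for walls and sprites
POINTER_BYTES = 4     # each table entry is a pointer to a colormap

zlight_bytes = LIGHTLEVELS * MAXLIGHTZ * POINTER_BYTES
scalelight_bytes = LIGHTLEVELS * MAXLIGHTSCALE * POINTER_BYTES
total_bytes = zlight_bytes + scalelight_bytes

print(f"zlight: {zlight_bytes} B, scalelight: {scalelight_bytes} B, "
      f"total: {total_bytes} B ({total_bytes / 1024:.1f} kB)")
```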
Download Full Project
The full project, including the WAD converter, the converted shareware DOOM1.wad and the schematics, can be found in the GitHub repository by clicking on this link.
Video
Here is a small video, showing how our setup performs on level E1M6.
Lamp or Light Bulb?
Well, technically speaking, it is a lamp, so throughout the article and the video, that’s the term we used. But you can call it a light bulb and no one will blame you for this…
Conclusion
Not only did we succeed in our challenge, but we also completed all the optional points: we are able to play the full shareware levels, we kept all of the Doom engine’s features, we even restored Z-depth lighting, and we got audio too!
We think we still have some room for improvement, both in terms of performance and RAM usage. We also need to improve the status bar. For now, it is just a zoomed-out version of the GBA port status bar: small text is not readable, and the face aspect ratio is wrong. Furthermore, a nice thing might be using a round display… we shall see in the future!
As a last remark, this port can be used as a base for any other microcontroller featuring a limited amount of RAM.
[1] For more information about frame rates in various configurations, see this excellent list: https://www.complang.tuwien.ac.at/misc/doombench.html
[2] Actually, we do not use interrupt handlers at all in this port, so we might keep interrupts disabled forever.