8-bit Tile-Based GPU Emulator
Under construction. Just getting a multithreaded canvas going atm. Have most of the background circuit emulated, but not connected to the canvas yet.
This post continues the description of designing a tile-based graphics circuit. In the last post I described the design, function, and timing of a VGA-like display. The simulated display is 128x128 pixels and supports only 1 bit color. Although limited, this display is sufficient to design and debug the GPU.
The key challenge in designing the tile-based GPU is to stream pixels in step with the display draw timing. Instead of dissecting the complete GPU from the top down, which is quite complex, I will describe how I designed it from the bottom up. My primary goal was to design the digital logic necessary for a simple teletype terminal, and that is where I will begin.
At the outset, it should be noted that tile-based graphics are fundamentally a type of memory compression strategy. Rendering a full display requires (resolution) x (color depth) bits of data. Conceptually, a frame buffer of that size can be used to store that information and stream it to the display, as is done on modern graphics hardware. However, memory was extremely precious when tile-based graphics were first developed, and frame buffers were not a feasible option. By partitioning a display into a regular grid of reusable tiles, the memory needed to encode a full display can be reduced at the cost of some flexibility in what can be displayed. I will call this type of full-display tile-based encoding a background.
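To put rough numbers on the saving, here is a minimal sketch comparing a frame buffer against a tile-based background. The 128x128 display, 8x8 tiles, and 2-byte grid elements come from this post; the count of 64 distinct tiles is purely an assumption chosen for illustration, as are the function names.

```python
def framebuffer_bytes(width, height, bits_per_pixel):
    """Memory needed to store every pixel of the display directly."""
    return width * height * bits_per_pixel // 8


def background_bytes(grid_w, grid_h, bytes_per_element,
                     tile_count, tile_w, tile_h, bits_per_pixel):
    """Memory needed for one grid plus a set of reusable tiles."""
    grid = grid_w * grid_h * bytes_per_element
    tiles = tile_count * tile_w * tile_h * bits_per_pixel // 8
    return grid + tiles


# 128x128 display at 2 bits per pixel.
fb = framebuffer_bytes(128, 128, 2)

# 16x16 grid of 2-byte elements plus 64 (assumed) 8x8 tiles at 2 bits per pixel.
bg = background_bytes(16, 16, 2, 64, 8, 8, 2)

print(fb, bg)
```

Under these assumptions the background needs 1536 bytes against 4096 for the frame buffer, and the gap widens as fewer distinct tiles are used.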
The tile-based GPU is composed of 4 major circuits: a general timing circuit, a background circuit, an object circuit, and a compositing circuit. The general timing circuit orchestrates GPU timing. The background circuit is used to render full display images. The object circuit is used to render individual tiles. Finally, the compositing circuit combines the output from the background and object circuits and generates the resulting pixel data.
The gist of a background circuit is to introduce a level of indirection in the rendering process. A background is composed of 3 concepts: a grid, tiles, and color palettes. The grid is an array of elements that encodes
Tiles are small rectangular arrays that encode the "shape" of graphics. As an added layer of indirection, rather than encoding a color, a tile encodes a color index. The color index is used to look up the appropriate color from a color palette. Thus, a grid element encodes (at minimum) a tile index and a palette index.
The GPU uses 8x8 pixel tiles. Given the operation of the display, only a single row of a tile needs to be drawn at a time. That gives a budget of 8 pixel clock ticks to access the necessary data for a tile before it's time to start on the next tile. As long as all memory accesses for a single tile take 8 or fewer pixel clock ticks, it's ok if the process to draw a tile takes more than 8 ticks to complete. This is because the operation of the GPU is pipelined. The latency to draw a tile is the total number of ticks from the start of the process to when the first pixels of the tile are ready to draw.
In order to draw a tile, 3 different pieces of information need to be collected: the grid, tile, and color data. The grid and tile data take up a substantial amount of memory, and therefore are stored in external VRAM (video RAM). The palette data is much smaller, and is stored in internal CRAM (color RAM). This arrangement allows simultaneous access to VRAM and CRAM. Grid data is composed of 2 bytes, requiring 2 VRAM accesses. A row of tile data is composed of 1 or 2 bytes depending on bit-depth, requiring 1 or 2 VRAM accesses. This means it only takes a maximum of 4 ticks to access all of the data needed for 1 tile. With 4 ticks remaining in the budget, a second background can be implemented, which will be covered in the next post. As a side note, retro video game consoles often had different graphics modes which allocated the pixel clock budget in different ways to realize different numbers of backgrounds with different tile bit-depths and other capabilities (e.g. per-tile scrolling). The famous SNES mode 7 was just one of 8 different modes the console could operate in, albeit a highly specialized one that worked differently than the rest.
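The budget arithmetic can be written down as a small sketch. It assumes one VRAM access per pixel clock tick and one byte of tile data per bit-plane per row, which is how I read the paragraph above; the function names are made up for illustration.

```python
TICKS_PER_TILE = 8  # one tile row is 8 pixels wide, so 8 pixel clock ticks


def vram_accesses(bits_per_pixel):
    """VRAM accesses needed to fetch one tile row for one background."""
    grid_accesses = 2               # a grid element is 2 bytes
    tile_accesses = bits_per_pixel  # assumed: 1 byte per bit-plane per row
    return grid_accesses + tile_accesses


def backgrounds_that_fit(bits_per_pixel):
    """How many backgrounds fit in the per-tile tick budget."""
    return TICKS_PER_TILE // vram_accesses(bits_per_pixel)


for depth in (1, 2):
    print(depth, vram_accesses(depth), backgrounds_that_fit(depth))
```

At either bit-depth the worst case is 4 accesses, which leaves room for exactly two backgrounds in 8 ticks.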
So, now that the pixel clock ticks have been budgeted, how does the GPU decide what grid element, tile, and color to access? Consider the most basic background functionality. Given the display is 128x128 pixels and a tile is 8x8 pixels, a grid must be 16x16 tiles. The GPU tracks what the current hcount and vcount are. Using the hcount and vcount, the needed grid element can be determined. Once the grid element is in hand, the GPU can use the tile index to access the needed tile, and the vcount to access the correct tile row. The grid element also encodes what palette to use when drawing the tile. Combining the palette index with the color index, the corresponding color is then accessed from CRAM. This color is then output to the display.
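As a sketch of the first step, the counters can be split into a grid position and a position within the tile. The function name `locate` is hypothetical; in hardware the division and remainder by 8 would simply be bit slices of the counters rather than arithmetic.

```python
def locate(hcount, vcount):
    """Split display coordinates into a grid index and a position within the tile."""
    grid_col, pixel_col = divmod(hcount, 8)  # same as hcount >> 3, hcount & 7
    grid_row, pixel_row = divmod(vcount, 8)  # same as vcount >> 3, vcount & 7
    grid_index = grid_col + 16 * grid_row    # the grid is 16 tiles wide
    return grid_index, pixel_row, pixel_col


print(locate(9, 17))
```

The grid index picks the grid element, `pixel_row` picks the row of the tile to fetch, and `pixel_col` is the position that the shift registers work through one tick at a time.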
The tile information is loaded into 2 shift registers. With each pixel clock tick, the shift registers shift out the next bits of the tile color index.
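A minimal software model of the two shift registers might look like the following. It assumes the leftmost pixel is stored in the most significant bit and that the second register holds the high bit of the color index; neither detail is stated above, so treat both as assumptions.

```python
class ShiftRegister8:
    """8-bit shift register that shifts out its most significant bit first."""

    def __init__(self):
        self.bits = 0

    def load(self, byte):
        self.bits = byte & 0xFF

    def shift_out(self):
        msb = (self.bits >> 7) & 1
        self.bits = (self.bits << 1) & 0xFF
        return msb


def color_indices(plane0, plane1):
    """Return the 8 color indices of one tile row, one per pixel clock tick."""
    low, high = ShiftRegister8(), ShiftRegister8()
    low.load(plane0)
    high.load(plane1)
    row = []
    for _ in range(8):
        low_bit = low.shift_out()
        high_bit = high.shift_out()
        row.append((high_bit << 1) | low_bit)
    return row


print(color_indices(0b10100000, 0b11000000))  # [3, 2, 1, 0, 0, 0, 0, 0]
```

In 1-bit mode the second register would simply be loaded with 0, so the same path yields color indices 0 and 1.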
Since hcount changes every pixel tick, and it takes a few ticks to load all the data to render a tile, the counters have to be latched in bgx and bgy.

The grid element is found at grid_base_address + (hcount / 8) + 16 * (vcount / 8). A grid element encodes a tile_index and a palette_index:

grid element := 0bTTTTTTTT 0b----PPPP

With a 1-bit color index, a tile is encoded as 8 bytes of the form 0bCCCCCCCC, one byte per tile row. Since vcount selects the tile row, the addresses are:

VRAM address = tile_base_address + tile_index * 8 + (vcount % 8)

CRAM address = palette_index * 2 + color_index

With a 2-bit color index, a tile is encoded as two bit-planes, each made up of 8 bytes of the form 0bCCCCCCCC. The addresses are:

VRAM address = tile_base_address + tile_index * 16 + (vcount % 8) * 2

CRAM address = palette_index * 4 + color_index

color encoding := 0bBBGGGRRR, which is decoded as 0bRRR, 0bGGG, 0bBB0

The bit-planes could instead have been encoded one after the other (actually, I may have done that), which allows the same data to be used in 1-bit or 2-bit mode.
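The address equations and the color decoding can be collected into a short sketch. The function names are hypothetical, and `row` is taken to be vcount % 8 because the prose says vcount selects the tile row. The 2-bit case assumes the two plane bytes of a row sit next to each other, which is what the `* 2` in the equation implies.

```python
def tile_row_address(tile_base, tile_index, row, bits_per_pixel):
    """VRAM address of the first byte of one tile row (row = vcount % 8)."""
    bytes_per_row = bits_per_pixel  # 1 byte per bit-plane
    return tile_base + tile_index * 8 * bytes_per_row + row * bytes_per_row


def cram_address(palette_index, color_index, bits_per_pixel):
    """CRAM address of a color: palettes hold 2 or 4 colors each."""
    return palette_index * (1 << bits_per_pixel) + color_index


def decode_color(byte):
    """Decode 0bBBGGGRRR into 3-bit red, green, and blue (blue is 0bBB0)."""
    red = byte & 0b111
    green = (byte >> 3) & 0b111
    blue = ((byte >> 6) & 0b11) << 1
    return red, green, blue


print(hex(tile_row_address(0x200, 3, 5, 1)))  # 0x21d
print(cram_address(2, 3, 2))                  # 11
print(decode_color(0b10101011))               # (3, 5, 4)
```

Padding blue with a zero bit gives three 3-bit channels, which matches the 9-bit RGB output listed in the specification table.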
This is the beginning of a series of blog posts detailing the design of a tile-based graphics circuit, pictured below:
My goal for designing this circuit was to work out how tile-based graphics hardware functions. The design itself is entirely original, but the capabilities are modeled after various 2D video game consoles I grew up playing, namely the SNES. That was not my original intention! At the outset I simply wanted to work out the digital logic needed to realize a teletype terminal's display. However, the resulting circuit wasn't very complicated, which encouraged me to add features. Things spiraled until I had a design somewhere in capability between an NES and SNES, being closer to the latter. With a design now in hand, I'd like to realize it in software, FPGA, and hardware. The journey so far has been very interesting, and I'd like to share the story with others. I have no idea where else you can find this kind of information!
Before continuing, I should mention that the circuit was designed using Logisim.
| Parameter | Value |
|---|---|
| Display Resolution | 128x128 or 128x112 pixels |
| Color output | 9 bit RGB |
| Backgrounds | 2 |
| Background Grid Size | 16x16 units |
| Background Array Size | 1x1, 1x2, 2x1, or 2x2 grids |
| Background Capabilities | HV scroll, HV flip, 1-bit priority |
| Objects | 64 |
| Object Size | 1x1 units |
| Objects Units Per Scanline | 16 |
| Objects Tiles Per Scanline | 24 |
| Object Capabilities | HV scroll, HV flip, 2-bit priority |
| Tile Resolution | 8x8 pixels |
| Tile Colors | 2 or 4 colors |
| Tile Palettes | 16 |
| Units | 1x1 or 2x2 tiles |
| Compositing Capabilities | windowing, priority, blending, dithering, fade |
| Object RAM (ORAM) | 256 bytes (internal) |
| Color RAM (CRAM) | 64 bytes (internal) |
| Video RAM (VRAM) | 8192 bytes (external) |
When designing a tile-based GPU, the very first thing one must decide upon is what display technology to target. To the uninitiated, it is not obvious that the display's signal timing largely informs the design of a tile-based GPU. Old video game consoles targeted CRTs, which aren't widely available anymore. However, many of the same concepts are applicable to VGA, which some computer monitors still support, either natively or through the DVI port. Therefore, I decided to target VGA-like displays.
Simulating a full VGA display in Logisim would be computationally expensive, so I decided to implement a caricature of a VGA display. This caricature retains all of the features of a VGA display, just with non-standard timing.
The display I designed is the most Logisim-oriented component of this work, and takes some explanation. Logisim provides an LED matrix component that can be used to realize a 1-bit display (which is better than nothing!). An LED matrix can be up to 32x32 pixels. Each row and column in an LED matrix has a corresponding 1-bit input. For compactness, Logisim bundles all row and column inputs together into a row and column bus. In this arrangement, the first bit of the row and column bus corresponds to the bottom row and left column, respectively. To turn an LED matrix pixel on, the corresponding row and column input must BOTH be driven high at the same time. This causes the LED to illuminate for a component-specified number of ticks, which allows an image to persist after it's been drawn.
An LED display with arbitrary resolution can be realized by tiling LED matrices. For this project, I decided to work with 8-bit numbers as much as possible, which can encode up to 256 values. I settled on a 128x128 pixel display (for reasons that will become apparent in a moment), which is composed of a 4x4 array of LED matrices. I've hidden the bus lines behind the LED matrices; mouse over the image to see them.
The LED display is conveniently packaged into its own component.
With a fat LED matrix in hand, we need a circuit to drive it! This is where I take inspiration from VGA. A VGA display is driven by only 5 control lines: 3 analog lines for red, green and blue color information, and 2 digital lines called /hsync and /vsync. In addition, a VGA display contains an internal pixel clock. VGA displays draw pixels one at a time, and the pixel clock is incremented exactly once for each pixel drawn. Image data is streamed to a VGA display as a sequence of pixels organized into a sequence of lines. Pixels are drawn as they arrive from left to right, and lines are drawn from top to bottom. This puts the origin of the image in the top left corner of the display. VGA also includes horizontal and vertical blanking periods when no pixel data is drawn. These periods are referred to as hblank and vblank. In CRTs, hblank and vblank give the electron gun time to reposition itself in preparation to draw the next line or frame, respectively. The synchronization signals /hsync and /vsync are asserted during their respective blanking periods to synchronize GPU and display operation. (As an aside, the slashes preceding /hsync and /vsync indicate that they are active when pulled low.)
VGA specifies different signal timing for different resolution displays. There is no 128x128 pixel VGA display, so I invented my own display timing. Adhering fetishistically to 8-bits, my display timing is framed around 256x256 pixels. In my design, the horizontal position is labeled hcount and the vertical position vcount. Hblank occurs during hcount [0x80, 0xFF] and /hsync is expected to be asserted during hcount [0xC0, 0xDF]. Likewise, vblank occurs during vcount [0x80, 0xFF] and /vsync is expected to be asserted during vcount [0xC0, 0xC7]. Perhaps unexpectedly, the display only spends 1/4 of its time drawing anything.
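The timing ranges above translate directly into a few predicates, sketched here with hypothetical function names. The sync functions return the level on the active-low line, so 0 means asserted.

```python
def hblank(hcount):
    return 0x80 <= hcount <= 0xFF


def vblank(vcount):
    return 0x80 <= vcount <= 0xFF


def hsync_n(hcount):
    """Level of /hsync: 0 (asserted) during hcount [0xC0, 0xDF], else 1."""
    return 0 if 0xC0 <= hcount <= 0xDF else 1


def vsync_n(vcount):
    """Level of /vsync: 0 (asserted) during vcount [0xC0, 0xC7], else 1."""
    return 0 if 0xC0 <= vcount <= 0xC7 else 1


# Count the positions of the 256x256 timing frame that are actually drawn.
visible = sum(
    1
    for v in range(256)
    for h in range(256)
    if not hblank(h) and not vblank(v)
)
print(visible, visible / 65536)  # 16384 0.25
```

Only 128 x 128 = 16384 of the 65536 positions fall outside both blanking periods, which is where the 1/4 figure comes from.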
Before looking at the implementation of the display circuit, a brief description of Logisim counters is needed. Counters are registers with the added ability to be incremented or decremented when clocked. Two control lines control the function of a counter:
| ctrl0 | ctrl1 | Function |
|---|---|---|
| 0 | 0 | disabled |
| 1 | 0 | load data |
| 0 | 1 | increment |
| 1 | 1 | decrement |
The GPU only uses incrementing counters. When a counter overflows it rolls over to 0 and outputs a carry. This allows for multiple counters to be chained together.
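The function table can be mirrored by a small behavioral model. The class name `Counter8` is hypothetical, and reporting the carry on the tick that wraps is a simplification of how the real component presents its carry output.

```python
class Counter8:
    """Behavioral sketch of an 8-bit counter driven by two control lines."""

    def __init__(self, value=0):
        self.value = value & 0xFF

    def clock(self, ctrl0, ctrl1, data=0):
        """Apply one clock tick; return 1 if an increment wrapped to 0."""
        carry = 0
        if ctrl0 and not ctrl1:      # load data
            self.value = data & 0xFF
        elif ctrl1 and not ctrl0:    # increment
            self.value = (self.value + 1) & 0xFF
            carry = 1 if self.value == 0 else 0
        elif ctrl0 and ctrl1:        # decrement
            self.value = (self.value - 1) & 0xFF
        # ctrl0 == ctrl1 == 0: disabled, value is held
        return carry


counter = Counter8(0xFE)
print(counter.clock(0, 1), counter.value)  # 0 255
print(counter.clock(0, 1), counter.value)  # 1 0
```

Feeding the returned carry into the increment line of a second counter is what chains the two together.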
Continuing on, the heart of the display circuit is an internal pixel clock driving a pair of 8-bit pixel counters uncreatively named hcount and vcount. The counters track which pixel the incoming pixel data will be drawn to on the next tick. The pixel clock increments hcount every tick, while vcount only increments when hcount overflows and outputs a carry (that is, at the end of a line). When vcount overflows the display cycle starts anew. In this way all 65536 positions of the timing diagram are visited in order.
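The scan order produced by the chained counters can be sketched as a generator. This models only free-running operation, with no sync signals, and the name `scan` is made up for illustration.

```python
def scan(ticks, hcount=0, vcount=0):
    """Yield (hcount, vcount) for each pixel clock tick of a free-running display."""
    for _ in range(ticks):
        yield hcount, vcount
        hcount = (hcount + 1) & 0xFF
        if hcount == 0:  # hcount overflowed: carry increments vcount
            vcount = (vcount + 1) & 0xFF


positions = list(scan(65536))
print(positions[255], positions[256])  # (255, 0) (0, 1)
```

One full frame visits every position exactly once, starting at (0, 0) and ending at (255, 255).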
As mentioned before, the pixel clock is independent of the master clock used to run the GPU. Operation between the GPU and display is synchronized by the /hsync and /vsync control lines, mimicking VGA operation. When /hsync and /vsync are asserted, the counters are loaded with counts corresponding to the position after their respective synchronization periods. Upon releasing the syncs, the counters are loaded with a known count, and thus an external circuit (i.e. the GPU) can operate in lockstep with the display. An AND gate has been added before vcount to ensure incrementing doesn't cause an erroneous decrement during loading by /vsync (see the counter function table).
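The role of the AND gate can be illustrated with a sketch of the control lines feeding vcount. The wiring here is my reading of the description, not the actual schematic: it assumes ctrl0 (load) is driven by the inverted /vsync line and ctrl1 (increment) by the carry from hcount.

```python
def vcount_controls(hcarry, vsync_n):
    """Return (ctrl0, ctrl1) for vcount given hcount's carry and the /vsync level."""
    load = 0 if vsync_n else 1    # /vsync is active low
    increment = hcarry & vsync_n  # the AND gate: no increment while loading
    return load, increment


def ungated_controls(hcarry, vsync_n):
    """The same wiring without the AND gate, for comparison."""
    load = 0 if vsync_n else 1
    return load, hcarry


# hcount carries out while /vsync is asserted (low):
print(vcount_controls(1, 0))   # (1, 0) -> load data
print(ungated_controls(1, 0))  # (1, 1) -> decrement, per the function table
```

Without the gate, a carry arriving during /vsync would put both control lines high, which the counter interprets as a decrement instead of a load.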
Pixel data is only accepted by the display when it is not blanking. In general, detecting the blanking periods requires testing the counters against prescribed ranges using comparators. However, by judicious choice of timing, detecting the blanking periods in this design is as easy as examining the high bit of the counters. In this way pixel data is gated by the 3-way AND in the middle of the circuit. Leveraging powers of 2 like this is a common theme in my design to simplify implementation.
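A short sketch shows why the high bit is enough. Because both blanking periods are exactly [0x80, 0xFF], bit 7 of each counter gives the same answer a comparator would. The function names are hypothetical.

```python
def blanking(count):
    """Bit 7 of an 8-bit counter: 1 during [0x80, 0xFF], 0 otherwise."""
    return (count >> 7) & 1


def gated_pixel(pixel, hcount, vcount):
    """3-way AND: pixel data passes only when neither counter is blanking."""
    return pixel & (1 - blanking(hcount)) & (1 - blanking(vcount))


# The single-bit test agrees with a range comparison for every count.
matches = all(blanking(c) == int(0x80 <= c <= 0xFF) for c in range(256))
print(matches)  # True
```

A timing with blanking periods that did not start on a power of 2 would need real comparators in place of the single-bit test.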
The output of the circuit is particular to the operation of the Logisim LED matrix. Bits 0-4 of the counters are used to select which of the 32 pixels in a constituent LED matrix row or column is being addressed. These bits are fed into two decoders to implement a one-hot encoding of the pixel being addressed within a single LED matrix. A minor detail is that I've reversed the ordering of the rows so that the first bit corresponds to the top row of an LED matrix. Bits 5-6 of the counters are used to select which of the 4 LED matrices in a row or column is being addressed. These bits are fed into two demultiplexers that route the one-hot encodings to the correct LED matrix. With the correct pixel addressed, the actual output is modulated by using the gated pixel data to enable/disable the decoders. To prevent transient behavior from incorrectly illuminating pixels, registers have been placed before the demultiplexers to latch output data.
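The addressing scheme amounts to slicing the counters into bit fields, as in this sketch. The function name is hypothetical, and the row reversal is modeled on the assumption that bit 0 of a matrix's row bus is its bottom row, so the top row of the image maps to bit 31.

```python
def led_address(hcount, vcount):
    """Map a visible pixel position to an LED matrix and one-hot row/column buses."""
    col_in_matrix = hcount & 0x1F    # bits 0-4: pixel within a 32-wide matrix
    row_in_matrix = vcount & 0x1F
    matrix_col = (hcount >> 5) & 0x3  # bits 5-6: which of the 4 matrices
    matrix_row = (vcount >> 5) & 0x3
    col_bus = 1 << col_in_matrix          # decoder output: bit 0 is the left column
    row_bus = 1 << (31 - row_in_matrix)   # reversed: bit 31 is the top row
    return matrix_row, matrix_col, row_bus, col_bus


print(led_address(33, 64))  # (2, 1, 2147483648, 2)
```

Bit 7 of each counter is not needed here because it only distinguishes the blanking periods, which are handled by the pixel gating.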
Like the LED matrix, the display circuit is packaged into its own component. The hblank and vblank outputs have been added for visual debugging, and serve no functional purpose.
Putting the LED matrix and display circuit together completes the display. A final detail is that the persistence time for the LED matrices has been set to just under the number of ticks needed to render a full frame. This allows a drawn image to persist until the next image is just about to overwrite it, similar to how CRT phosphors glow for some time after being stimulated.
While the display circuit is completely independent of the GPU, its timing provides the scaffolding upon which the GPU design turns. I'll begin describing the basic function of the GPU in the next post.