3-D Rendering Acceleration
Newsgroups: comp.graphics.algorithms,comp.arch.fpga
Subject: Re: FPGA accelerated engines for volume rendering
Date: 22 Mar 1995 05:47:46 GMT
In <BENEDETT.95Mar20152030@caliban.dsi.unimo.it>
benedett-@caliban.dsi.unimo.it (Arrigo Benedetti) writes:
>I'm looking for references to implementations of hardware accelerators for
>volume rendering algorithms (or other computationally intensive graphics
>algorithm) based on FPGA's.
I suspect this is not the volume rendering you mean, but maybe you'll find
it interesting anyway, a kind of software/hardware practice and
experience, if you will.
A while back, I did a design for a Gouraud shaded Z-buffered rendering
accelerator, whose datapath is compiled into a Xilinx XC4003A. Sure, it's
probably the most well understood graphics rendering problem, and my
implementation is simple at best (e.g. no blending, no textures), but I
wanted to see how far one could get, at home, on a hobbyist scale.
The inner loop (one scan line) of this simple polygon rendering algorithm
is:
    // interpolate left to right, in (r,g,b) and z, and update
    // pixels for which z is closer than zbuf[x]:
    ... set up fixed point z, dz, r, dr, g, dg, b, db ...
    for (x = xleft; x < xright; x++) {
        if (z < zbuf[x]) {              // Z-buffer check
            zbuf[x] = z;                // update Z-buffer
            buf[x] = pixel(r,g,b);      // update image
        }
        // advance interpolants
        z += dz; r += dr; g += dg; b += db;
    }
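The loop above can be made compilable with a few choices the post leaves open. The following is a minimal sketch, not the original code: the `Span` struct, the `render_span` name, the 0x00RRGGBB pixel packing, and the use of only the integer part of z for the depth test are all assumptions made for illustration. The fixed-point widths (16.8 for z, 8.8 for colour) are the ones quoted later in the post, and values are assumed to stay non-negative.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical container for the interpolants of one span.
   z is 16.8 fixed point; r, g, b are 8.8 fixed point. */
typedef struct {
    int32_t z, dz;
    int32_t r, dr, g, dg, b, db;
} Span;

/* Assumed pixel format: 8 bits per channel packed as 0x00RRGGBB. */
static uint32_t pixel(int32_t r, int32_t g, int32_t b)
{
    return ((((uint32_t)(r >> 8)) & 0xFFu) << 16) |
           ((((uint32_t)(g >> 8)) & 0xFFu) << 8)  |
            (((uint32_t)(b >> 8)) & 0xFFu);
}

/* Render pixels [xleft, xright) of one scan line.
   Returns the number of pixels that passed the Z-buffer check. */
static int render_span(Span s, int xleft, int xright,
                       uint16_t *zbuf, uint32_t *buf)
{
    int written = 0;
    assert(xleft <= xright);
    for (int x = xleft; x < xright; x++) {
        uint16_t z = (uint16_t)(s.z >> 8);   /* integer part of depth */
        if (z < zbuf[x]) {                   /* Z-buffer check */
            zbuf[x] = z;                     /* update Z-buffer */
            buf[x] = pixel(s.r, s.g, s.b);   /* update image */
            written++;
        }
        /* advance interpolants */
        s.z += s.dz; s.r += s.dr; s.g += s.dg; s.b += s.db;
    }
    return written;
}
```

The deltas would normally be computed per span by dividing the difference between the right and left edge values by the span width, in the same fixed-point format.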
When attached to 32 bits of DRAM or VRAM, and assuming a 16-bit Z-buffer,
this design required three passes, streaming over memory in fast page mode, to
render a span of pixels across one scan line of a polygon. That is, I
implemented the above as three passes:
    bit closer[];
    // Pass 1: (check two Z-values per iteration)
    // initialize z0, z1, dz0, dz1
    for (x = xleft; x < xright; x += 2) {
        closer[x] = (z0 < zbuf[x]);
        closer[x+1] = (z1 < zbuf[x+1]);
        z0 += dz0; z1 += dz1;
    }
    // Pass 2: (update up to two Z-values per iteration)
    // reinitialize z0, z1, dz0, dz1
    for (x = xleft; x < xright; x += 2) {
        if (closer[x]) zbuf[x] = z0;
        if (closer[x+1]) zbuf[x+1] = z1;
        z0 += dz0; z1 += dz1;
    }
    // Pass 3: (update zero or one pixel value per iteration)
    // initialize r, g, b, dr, dg, db
    for (x = xleft; x < xright; x++) {
        if (closer[x]) buf[x] = pixel(r,g,b);
        r += dr; g += dg; b += db;
    }
... in hardware, in each case doing one loop iteration per 50 ns clock.
((I separated passes 1 and 2 because I thought it would be easier to do
separate read and write passes on the Z-buffer memory, pipelined, rather
than one pass with lots of back to back read/modify/write traffic.))
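As a software model of the three passes, the sketch below runs them in sequence over one span. It is an illustration under assumptions, not the hardware design: the function names, the `MAX_SPAN` bound, the 0x00RRGGBB pixel packing, and the guard for odd span widths are all hypothetical. It takes z0 as the depth of the even pixel and z1 as the depth of the odd pixel, so both step by twice the per-pixel delta.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_SPAN 128   /* assumed upper bound on span width for this sketch */

static uint32_t pixel(int32_t r, int32_t g, int32_t b)
{
    return ((((uint32_t)(r >> 8)) & 0xFFu) << 16) |
           ((((uint32_t)(g >> 8)) & 0xFFu) << 8)  |
            (((uint32_t)(b >> 8)) & 0xFFu);
}

/* Pass 1: read-only sweep of the Z-buffer, two depth tests per iteration. */
static void pass1_test(int32_t z, int32_t dz, int n,
                       const uint16_t *zbuf, bool *closer)
{
    int32_t z0 = z, z1 = z + dz, dz2 = 2 * dz;
    for (int i = 0; i < n; i += 2) {
        closer[i] = (uint16_t)(z0 >> 8) < zbuf[i];
        if (i + 1 < n) closer[i + 1] = (uint16_t)(z1 >> 8) < zbuf[i + 1];
        z0 += dz2; z1 += dz2;
    }
}

/* Pass 2: write-only sweep, updating up to two Z-values per iteration. */
static void pass2_update(int32_t z, int32_t dz, int n,
                         const bool *closer, uint16_t *zbuf)
{
    int32_t z0 = z, z1 = z + dz, dz2 = 2 * dz;
    for (int i = 0; i < n; i += 2) {
        if (closer[i]) zbuf[i] = (uint16_t)(z0 >> 8);
        if (i + 1 < n && closer[i + 1]) zbuf[i + 1] = (uint16_t)(z1 >> 8);
        z0 += dz2; z1 += dz2;
    }
}

/* Pass 3: write zero or one pixel per iteration. c = {r,g,b}, dc = deltas. */
static void pass3_colour(const int32_t c[3], const int32_t dc[3], int n,
                         const bool *closer, uint32_t *buf)
{
    int32_t r = c[0], g = c[1], b = c[2];
    for (int i = 0; i < n; i++) {
        if (closer[i]) buf[i] = pixel(r, g, b);
        r += dc[0]; g += dc[1]; b += dc[2];
    }
}

/* zbuf and buf point at the first pixel of the span; n is its width. */
static void render_span_3pass(int32_t z, int32_t dz,
                              const int32_t c[3], const int32_t dc[3],
                              int n, uint16_t *zbuf, uint32_t *buf)
{
    bool closer[MAX_SPAN];
    assert(n >= 0 && n <= MAX_SPAN);
    pass1_test(z, dz, n, zbuf, closer);
    pass2_update(z, dz, n, closer, zbuf);
    pass3_colour(c, dc, n, closer, buf);
}
```

Because pass 1 only reads and pass 2 only writes, the result matches the single-pass loop while keeping each memory sweep in one direction.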
Amortized cost: 100 ns/pixel, several times faster than an R4000 software
approach, even assuming packing several 8.8 bit fixed point interpolants
per 64-bit register.
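The 100 ns figure follows from the pass structure: passes 1 and 2 each handle two pixels per 50 ns clock (25 ns/pixel each), and pass 3 handles one (50 ns/pixel). A small sketch of that arithmetic, with a hypothetical helper name and ignoring setup time and page misses, which the post does not quantify:

```c
#include <assert.h>

/* Amortized nanoseconds per pixel, given the clock period and the number
   of pixels each pass handles per clock. Setup overhead is ignored. */
static double ns_per_pixel(double clock_ns, const int *pixels_per_clock,
                           int passes)
{
    double total = 0.0;
    for (int i = 0; i < passes; i++) {
        assert(pixels_per_clock[i] > 0);
        total += clock_ns / pixels_per_clock[i];
    }
    return total;
}
```

Calling it with a 50 ns clock and per-pass rates of {2, 2, 1} gives 25 + 25 + 50 = 100 ns/pixel.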
Besides address sequencing and DRAM/VRAM control, the hardware to do the
above is only two 24-bit accumulators (for the 16.8 bit fixed point
interpolations of z0 and z1, and reused for 'r' and 'g' interpolation), one
16-bit accumulator (for the 8.8 bit fixed point interpolation of 'b'), and
two 16-bit magnitude comparators (for comparing zbuf[x] and zbuf[x+1] with
z0 and z1), plus a 64-by-2 bit SRAM to buffer closer/farther values (wider
polygons would be divided into abutting narrow ones). All of which fits
nicely in a "3000-gate" XC4003A.
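One way to picture the division into abutting narrow pieces is at the span level: cut each scan-line span into chunks that fit the closer/farther buffer, and skip the interpolants ahead at each boundary. The sketch below is an assumption-laden illustration, not the author's method. `MAX_CHUNK` of 128 reads the 64-by-2-bit SRAM as holding 128 flags, which the post does not state, and `Chunk` and `split_span` are hypothetical names. Only z is shown; the colour interpolants would be advanced the same way.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_CHUNK 128  /* assumed capacity of the closer/farther buffer, in pixels */

/* One narrow piece of a wide span: pixels [xleft, xright), depth z at xleft. */
typedef struct {
    int xleft, xright;
    int32_t z;          /* 16.8 fixed point */
} Chunk;

/* Split [xleft, xright) into abutting chunks of at most MAX_CHUNK pixels.
   Fills at most max_out entries of out and returns how many chunks are needed. */
static int split_span(int xleft, int xright, int32_t z, int32_t dz,
                      Chunk *out, int max_out)
{
    int n = 0;
    assert(xleft <= xright);
    for (int x = xleft; x < xright; x += MAX_CHUNK) {
        int end = (xright - x > MAX_CHUNK) ? x + MAX_CHUNK : xright;
        if (n < max_out) {
            out[n].xleft = x;
            out[n].xright = end;
            out[n].z = z;
        }
        n++;
        z += (int32_t)(end - x) * dz;   /* skip the interpolant ahead */
    }
    return n;
}
```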
((An "accumulator" in Xilinx-speak is an adder whose output is captured in
a register "sum", and whose inputs are sum and another register "delta", so
that "sum += delta" is formed each clock.))
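A software model may make the accumulator description concrete. This is a minimal sketch with hypothetical names (`Accumulator`, `acc_init`, `acc_clock`, `acc_int_part`); it masks to the register width so that a negative delta, stored in two's complement, wraps the way a fixed-width hardware adder would.

```c
#include <assert.h>
#include <stdint.h>

/* Model of a fixed-width accumulator: "sum += delta" on each clock. */
typedef struct {
    uint32_t sum;
    uint32_t delta;
    uint32_t mask;    /* (1 << width) - 1 */
} Accumulator;

static Accumulator acc_init(unsigned width, uint32_t sum, uint32_t delta)
{
    Accumulator a;
    assert(width >= 1 && width < 32);
    a.mask = (1u << width) - 1u;
    a.sum = sum & a.mask;
    a.delta = delta & a.mask;
    return a;
}

/* One clock edge: the register captures the adder output. */
static void acc_clock(Accumulator *a)
{
    a->sum = (a->sum + a->delta) & a->mask;
}

/* Integer part of a value with 8 fraction bits (16.8 or 8.8 formats). */
static uint32_t acc_int_part(const Accumulator *a)
{
    return a->sum >> 8;
}
```

For example, a 24-bit accumulator starting at 10.0 with a delta of 0.5 (0x80) reads 10 after one clock and 11 after two.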
I also considered using 16-bits/pixel (565 RGB) and adding error
distribution "dithering" to propagate the error at each pixel to later
pixels on the same line. This would require another adder at each
accumulator.
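One possible reading of this error-distribution idea, sketched in software: when a channel is cut from 8 bits to 5 or 6, the bits that are dropped are carried to the next pixel on the line and added back before quantizing. The helper names, the `DitherErr` struct, and the clamp at 255 are assumptions for illustration; the post does not specify the exact scheme.

```c
#include <assert.h>
#include <stdint.h>

/* Per-channel error carried from the previous pixel on the same line. */
typedef struct { unsigned r, g, b; } DitherErr;

/* Quantize an 8-bit intensity to `bits` bits, carrying the dropped low
   bits forward in *err. The addition is the "extra adder" per channel. */
static unsigned dither_channel(unsigned value8, unsigned bits, unsigned *err)
{
    unsigned shift, v;
    assert(bits >= 1 && bits <= 8 && value8 <= 255);
    shift = 8 - bits;
    v = value8 + *err;
    if (v > 255) v = 255;                 /* clamp (an assumption) */
    *err = v & ((1u << shift) - 1u);      /* dropped bits become the next error */
    return v >> shift;
}

/* Pack 8.8 fixed-point r, g, b into a 565 pixel with error distribution. */
static uint16_t pixel565_dithered(int32_t r, int32_t g, int32_t b,
                                  DitherErr *e)
{
    unsigned r5 = dither_channel(((unsigned)(r >> 8)) & 0xFFu, 5, &e->r);
    unsigned g6 = dither_channel(((unsigned)(g >> 8)) & 0xFFu, 6, &e->g);
    unsigned b5 = dither_channel(((unsigned)(b >> 8)) & 0xFFu, 5, &e->b);
    return (uint16_t)((r5 << 11) | (g6 << 5) | b5);
}
```

With a constant red intensity of 4 (half of one 5-bit step), plain truncation would give all-zero pixels, while this scheme lights every other pixel so the average along the line is preserved. The error state would be reset to zero at the start of each span.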
In my first couple of nights using ViewLogic, XBLOX, and XACT
1.4-something, I was able to design and compile the datapath of the above.
Unfortunately at that point I got stuck, trying to determine how to
interface an R3081 and then an R4000 to the FPGA, and so never did get the
darn rendering engine built. (The R4000 bus protocols are nontrivial,
especially when trying to interface to an FPGA with its own, nontrivial
input setup/hold times and output delays.) Now, when time permits, I am
designing a 32-bit RISC in the left half of an XC4010, and I hope to use
the right half for a rendering accelerator as described above. Here
"interpolate" (one iteration of one of the above passes) will be a machine
instruction.
Jan Gray
Copyright © 2000, Gray Research LLC. All rights reserved.