|
Life in an FPGA |
|||
|
Home Maximum element >> << Regexps in FPGAs
Usenet Postings |
Subject: Re: [++] Fast Life code (Was:Re: FPGA-based CPUs
(was Re: Minimal ALU instruction set))
Date: 25 May 1998 00:00:00 GMT
Newsgroups: comp.arch,comp.arch.fpga,comp.arch.embedded
Tim Olson wrote
>This algorithm looks like the one described in the Smalltalk "blue book",
>where a version of Life was implemented using BitBlt operations to
>implement the cell counting in parallel.
Another reference to the bitwise parallel approach is "Life Algorithms",
Mark Niemiec, Byte, Jan. 1979, pp 70-79. If I recall correctly, Mark, David
Buckingham, and friends, used Waterloo's Honeywell 66/60's EIS "move
mountain" instructions to animate 64K 36-bit words per iteration.
Inspired by Buckingham and the Blue Book, I wrote a bitblt version that did
800,000 cells in 34? bitblts on a Perq in 1983? and one that did 400,000
cells/s on an 8 MHz (1 M 2-operand 32-bit op/s) 68000 in 1985.
As Messrs. Mathisen and Mogensen describe, Life should run very fast on
modern processors (superscalar and multimedia enhanced and large caches).
64-bits, in 40 insns, in perhaps 15-20 clocks, at 3 ns/clock, e.g. 1 bit/ns.
FPGA Implementation: It is straightforward to run at full memory bandwidth.
For example, given an XCS20 and a 32Kx32 PBSRAM (32-bits in or out per 15 ns
clock) we can approach 32 bits/(2*15) ns, e.g. 1 bit/ns.
Since a given line is read three times (as "below", "current", and "above"),
we buffer 2 lines of cells in RAM in the FPGA. A 1024 x n playfield
requires 2 x 1024 bits = 64 CLBs of single port RAM, and preferably 3 x 1024
bits for 3 banks since each clock you must read from up to two lines and
write to a third.
Detailed design/floor plan. One bit requires approx. 9 CLBs. Assuming the
cell neighbours are (a,b,c,d,e,f,g,h), we need :-
3 CLBs RAM -- 3 32x1 RAMs (3 banks of line buffer)
6 CLBs logic --
1 s0a=a^b^c^d; s0e=e^f^g^h
2 s1a="a+b+c+d == 2 or 3"; s1e="e+f+g+h == 2 or 3"
3 s2a=a&b&c&d; s2e=e&f&g&h
4,5 (s3,s2,s1,s0)=(s2a,s1a,s0a) + (s2e,s1e,s0e) (uses dedicated carry
logic)
6 new = ~s3&~s2&s1&(s0|old)
and so in a 20x20 CLB XCS20, we explicitly place 16 rows of 1x9 CLB tiles in
the left half, another 16 in the right half, leaving plenty of room to spare
for control and address generation.
At the 1997 FPGAs for Custom Computing Machines conference., the paper "The
RAW Benchmark Suite" by Babb et al proposed a set of benchmarks for
comparing reconfigurable computing systems. One of the 12 benchmarks was
Life, for which they reported speedups of several hundred times over a
SparcStation20+software approach, but in fairness, they write "we are not
currently implementing known improvements to the software to take advantage
of the bit-level parallelism available in a microprocessor".
Summary. Hypotheticially...
Fast microprocessor + cache: ~1 bit/ns
Single FPGA + SRAM custom machine: ~1 bit/ns
Jan Gray
Copyright © 2000, Gray Research LLC. All rights reserved. |