fpgacpu.org - Inner Loop Custom Datapaths

Inner Loop Custom Datapaths

Home

Supercomputers >>
<< Multis and fast unis

Usenet Postings
  By Subject
  By Date

FPGA CPUs
  Why FPGA CPUs?
  Homebuilt processors
  Altera, Xilinx Announce
  Soft cores
  Porting lcc
  32-bit RISC CPU
  Superscalar FPGA CPUs
  Java processors
  Forth processors
  Reimplementing Alto
  Transputers
  FPGA CPU Speeds
  Synthesized CPUs
  Register files
  Register files (2)
  Floating point
  Using block RAM
  Flex10K CPUs
  Flex10KE CPUs

Multiprocessors
  Multis and fast unis
  Inner loop datapaths
  Supercomputers

Systems-on-a-Chip
  SoC On-Chip Buses
  On-chip Memory
  VGA controller
  Small footprints

CNets
  CNets and Datapaths
  Generators vs. synthesis

FPGAs vs. Processors
  CPUs vs. FPGAs
  Emulating FPGAs
  FPGAs as coprocessors
  Regexps in FPGAs
  Life in an FPGA
  Maximum element

Miscellaneous
  Floorplanning
  Pushing on a rope
  Virtex speculation
  Rambus for FPGAs
  3-D rendering
  LFSR Design

Subject: Re: FPGA multiprocessors
Date: 07 Oct 1997 00:00:00 GMT
Newsgroups: comp.arch.fpga

Charles Sweeney <CharlesSweeney-@compuserve.com> wrote in article
<3438A7D6.2431@compuserve.com>...
> Jan Gray wrote:
> > Assuming careful floorplanning, it should be possible to place six
32-bit
> > processor tiles, or twelve 16-bit processor tiles, in a single 56x56
> > XC4085XL with space left over for interprocessor interconnect.  Also
the
> > number of processor tiles can be doubled if we eschew the I-cache and
> > simplify the microarchitecture -- though performance would greatly
suffer.
> 
> It's good to see you planning to take advantage of the parallelism
> offered by FPGAs, but why constrain your software to have to run in a
> particular microprocessor architecture? why not go further and compile
> your programs directly into the hardware of the FPGA, Handel-C does
> exactly that, please see our web site below.

Good question.

The trite answer is since designing processor ISAs and microarchitectures
for FPGA implementations is my research interest, that's my hammer
in search of nails.  FPGA multiprocessors are now possible -- but it
remains to be seen if they are actually useful!

The other answer is that I don't preclude a modest custom
datapath per processor (and such datapaths could be designed
from source code by tools such as Handel-C).  So I think an FPGA
multiprocessor is the preferred solution for problems which:
1. are amenable to n-way "outer loop" parallelism and
2. involve too much irregular computation for custom datapath only and
3. involve enough inner loop regular computation that an FPGA
custom datapath is faster/cheaper than a general purpose processor
or multiprocessor built of same.
(Whether such problems exist and are important remains to be seen.)

As for your question "why not go further and compile your
programs directly into the hardware of the FPGA?" :-

There will always be very regular signal processing applications,
regular in computation, regular in operand fetch and result store,
and relatively simple in the computation kernel, for which a custom
datapath compiled to an FPGA is a good solution.

But there are also other computations which are either
too irregular or too large to practically implement in an FPGA
datapath, even in a time-multiplexed (reconfiguration) manner.

The "outer loops" and "outer function calls" of these
computations are best done in a general purpose processor,
even as you move the inner loop(s) to a custom datapath.
Indeed, the inner loops may constitute only a few percent
of the total text of the source code of the computation.

To help these large "dusty deck" applications take advantage
of custom datapaths, it must be extremely convenient to
interface the custom stuff to the general purpose processor.

For some problems where even the irregular computation
is a critical path, especially those involving floating-point,
it probably makes sense to choose a fast, cheap
commercial off-the-shelf microprocessor.
Of course there are penalties here.  Cost of processor.
Less integration.  Board real-estate costs.  "Representation
domain crossing" costs.  Relatively slow communication
between processor and FPGA.  Cost of FPGA resources
spent interfacing to processor.

But for problems where the irregular computation is
not the critical path, the now modest overhead (10-20%)
of an embedded general purpose CPU enables an
interesting integrated "system on chip" hybrid:
embedded processor, on-chip bus, on-chip custom
datapaths and peripherals.

In theory, you could compile your dusty deck C, C++,
Java, FORTRAN, Scheme, etc. and run it immediately
on your FPGA CPU.  Then automatically (profile driven)
or through explicit directives, you can compile the inner
loops to a custom datapath.  This can either be manifest
as an on-chip command oriented coprocessor, or in some
cases as new instructions.  The latter has the potential
advantage of very high custom operation issue rates
(today, 66 MHz) and access to processor register
file, etc.

Given this approach, even if your dusty deck app stores
its data in such advanced data structures (sarcasm)
as a linked list (/sarcasm), it can still potentially take
advantage of a custom datapath.  This is much less
feasible if your registers or operands(s) are microseconds
away on the non-embedded host processor.

For example, the unused logic in
    //www3.sympatico.ca/jsgray/sld021.htm
was reserved for the Gouraud rendering instructions described 
in the last paragraph in:
    //www3.sympatico.ca/jsgray/render.txt

Of course, embedded processor in programmable logic is just
one point on the CPU/custom datapath spectrum.  See also
the BRASS research
  //http.cs.berkeley.edu/Research/Projects/brass
and my old essay on FPGA PC coprocessors
  //www3.sympatico.ca/jsgray/coproc.txt

Jan Gray