Inner Loop Custom Datapaths
Subject: Re: FPGA multiprocessors
Date: 07 Oct 1997 00:00:00 GMT
Newsgroups: comp.arch.fpga
Charles Sweeney <CharlesSweeney-@compuserve.com> wrote in article
<3438A7D6.2431@compuserve.com>...
> Jan Gray wrote:
> > Assuming careful floorplanning, it should be possible to place six 32-bit
> > processor tiles, or twelve 16-bit processor tiles, in a single 56x56
> > XC4085XL with space left over for interprocessor interconnect. Also the
> > number of processor tiles can be doubled if we eschew the I-cache and
> > simplify the microarchitecture -- though performance would greatly suffer.
>
> It's good to see you planning to take advantage of the parallelism
> offered by FPGAs, but why constrain your software to have to run in a
> particular microprocessor architecture? why not go further and compile
> your programs directly into the hardware of the FPGA, Handel-C does
> exactly that, please see our web site below.
Good question.
The trite answer is that designing processor ISAs and microarchitectures
for FPGA implementations is my research interest, so that's my hammer
in search of nails. FPGA multiprocessors are now possible -- but it
remains to be seen if they are actually useful!
The other answer is that I don't preclude a modest custom
datapath per processor (and such datapaths could be designed
from source code by tools such as Handel-C). So I think an FPGA
multiprocessor is the preferred solution for problems which:
1. are amenable to n-way "outer loop" parallelism and
2. involve too much irregular computation for a custom datapath alone and
3. involve enough inner loop regular computation that an FPGA
custom datapath is faster/cheaper than a general purpose processor
or multiprocessor built of same.
(Whether such problems exist and are important remains to be seen.)
As for your question "why not go further and compile your
programs directly into the hardware of the FPGA?" :-
There will always be very regular signal processing applications,
regular in computation, regular in operand fetch and result store,
and relatively simple in the computation kernel, for which a custom
datapath compiled to an FPGA is a good solution.
But there are also other computations which are either
too irregular or too large to practically implement in an FPGA
datapath, even in a time-multiplexed (reconfiguration) manner.
The "outer loops" and "outer function calls" of these
computations are best done in a general purpose processor,
even as you move the inner loop(s) to a custom datapath.
Indeed, the inner loops may constitute only a few percent
of the total text of the source code of the computation.
To help these large "dusty deck" applications take advantage
of custom datapaths, it must be extremely convenient to
interface the custom stuff to the general purpose processor.
For some problems where even the irregular computation
is a critical path, especially those involving floating-point,
it probably makes sense to choose a fast, cheap
commercial off-the-shelf microprocessor.
Of course there are penalties here. Cost of processor.
Less integration. Board real-estate costs. "Representation
domain crossing" costs. Relatively slow communication
between processor and FPGA. Cost of FPGA resources
spent interfacing to processor.
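To make the communication penalty concrete, here is a back-of-envelope model in Python. It is not from the original posting: the function names are made up for the example, and every nanosecond figure is an illustrative assumption rather than a measurement. It assumes each offloaded call pays one fixed round trip between processor and FPGA.

```python
def software_time_ns(calls, software_ns):
    """Total time if the kernel stays in software on the processor."""
    return calls * software_ns


def offload_time_ns(calls, datapath_ns, round_trip_ns):
    """Total time if every call pays a fixed processor<->FPGA round trip."""
    return calls * (datapath_ns + round_trip_ns)


def offload_pays_off(software_ns, datapath_ns, round_trip_ns):
    """Offloading wins only if datapath time plus round trip beats software."""
    return datapath_ns + round_trip_ns < software_ns


if __name__ == "__main__":
    # Assumed figures: 400 ns per call in software, 30 ns in the datapath.
    # With an assumed 1000 ns round trip the offload loses.
    print(offload_pays_off(400, 30, 1000))   # False
    # With an assumed 15 ns round trip the same offload wins.
    print(offload_pays_off(400, 30, 15))     # True
```

Under this simple model a fast datapath is wasted whenever the round trip alone costs more than the software version of the kernel.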
But for problems where the irregular computation is
not the critical path, the now modest overhead (10-20%)
of an embedded general purpose CPU enables an
interesting integrated "system on chip" hybrid:
embedded processor, on-chip bus, on-chip custom
datapaths and peripherals.
In theory, you could compile your dusty deck C, C++,
Java, FORTRAN, Scheme, etc. and run it immediately
on your FPGA CPU. Then automatically (profile-driven)
or through explicit directives, you can compile the inner
loops to a custom datapath. This can be manifest either
as an on-chip command-oriented coprocessor or, in some
cases, as new instructions. The latter has the potential
advantage of very high custom operation issue rates
(today, 66 MHz) and access to processor register
file, etc.
Given this approach, even if your dusty deck app stores
its data in such advanced data structures (sarcasm)
as a linked list (/sarcasm), it can still potentially take
advantage of a custom datapath. This is much less
feasible if your registers or operand(s) are microseconds
away on the non-embedded host processor.
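As a rough illustration of that split (a sketch, not the original design), the Python model below keeps the irregular pointer chasing in ordinary code and routes only the per-element arithmetic through `custom_op`, a hypothetical stand-in for a custom instruction. The 16-bit saturating add is an arbitrary operation chosen for the example, and `Node`, `from_values` and `accumulate` are likewise invented names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    value: int
    next: Optional["Node"] = None


def from_values(values):
    """Build a singly linked list from a Python sequence (helper for the demo)."""
    head = None
    for v in reversed(values):
        head = Node(v, head)
    return head


def custom_op(acc, x):
    """Hypothetical custom instruction, modelled in software:
    a 16-bit saturating add of two non-negative operands."""
    return min(acc + x, 0xFFFF)


def accumulate(head):
    """Walk the list on the general purpose CPU, one custom op per node."""
    acc = 0
    node = head
    while node is not None:                # irregular part: pointer chasing
        acc = custom_op(acc, node.value)   # regular part: custom datapath
        node = node.next
    return acc


if __name__ == "__main__":
    print(accumulate(from_values([1, 2, 3])))             # 6
    print(hex(accumulate(from_values([0xFFF0, 0xF, 1])))) # 0xffff (saturated)
```

With an embedded CPU, each `custom_op` call would correspond to one issued operation whose operands are already in the register file, so the list walk never has to leave the chip.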
For example, the unused logic in
//www3.sympatico.ca/jsgray/sld021.htm
was reserved for the Gouraud rendering instructions described
in the last paragraph in:
//www3.sympatico.ca/jsgray/render.txt
Of course, an embedded processor in programmable logic is just
one point on the CPU/custom datapath spectrum. See also
the BRASS research
//http.cs.berkeley.edu/Research/Projects/brass
and my old essay on FPGA PC coprocessors
//www3.sympatico.ca/jsgray/coproc.txt
Jan Gray
Copyright © 2000, Gray Research LLC. All rights reserved.