style. In April 1994 the Amulet group in the Computer Science
asynchronous circuit and the world's first implementation of a
Work was begun at the end of 1990 and the design despatched
for fabrication in February 1993. The primary intent was to
demonstrate that an asynchronous microprocessor can consume
less power than a synchronous design.
The design incorporates a number of concurrent units which
cooperate to give instruction level compatibility with the
existing synchronous part. These include an Address unit,
which autonomously generates instruction fetch requests and
Execution unit; a Register file which supplies operands,
queues write destinations and handles data dependencies; an
Execution unit which includes a multiplier, a shifter and an
ALU with data-dependent delay; a Data interface which
performs byte extraction and alignment and includes an
to exchange data.
The design demonstrates that all the usual problems of
processor design can be solved in this asynchronous framework:
also demonstrates some unusual behaviour, for instance
(though the instructions which actually get executed are, of
course, deterministic). There are some unusual problems for
compare alternative code sequences is continuous rather than
also be taken into account.
The chip was designed using a mixture of custom datapath and
compiled control logic elements, as was the synchronous ARM.
The fabrication technology is the same as that used for one
version of the synchronous part, reducing the number of
variables when comparing the two parts.
Two silicon implementations have been received and preliminary
measurements have been taken from these. The first uses a 0.7 um
process and has achieved about 28 kDhrystones running the
standard benchmark program. The other is a 1 um
implementation and achieves about 20 kDhrystones. For the
faster of the parts this is equivalent to a synchronous
ARM6 clocked at around 20MHz; in the case of AMULET1 it is likely
that this speed is limited by the memory system cycle time
(just over 50ns) rather than the processor chip itself.
A fair comparison of devices at the same geometries gives the
AMULET1 performance as about 70% of that of an ARM6 running
at 20MHz. Its power consumption is very similar to that of
the ARM6; the AMULET1 therefore delivers about 80 MIPS/W
(compared with around 120 from a 20MHz ARM6). Multiplication
is several times faster on the AMULET1 owing to the inclusion
of a specialised asynchronous multiplier. This performance is
reasonable considering that the AMULET1 is a first generation
part, whereas the synchronous ARM has undergone several design
iterations. AMULET2 (currently under development) is expected
to be three times faster than AMULET1 (120 kDhrystones)
and to use less power.
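
The MIPS/W figures quoted above can be checked with a line of arithmetic. The sketch below is only an illustration: it uses the numbers from the text (about 70% of ARM6 performance, around 120 MIPS/W for the ARM6) and assumes that "very similar" power consumption means equal power. The function names are hypothetical and are not part of any AMULET tooling.

```python
# Figures quoted in the text above; nothing here is newly measured.
ARM6_MIPS_PER_WATT = 120.0    # "around 120 from a 20MHz ARM6"
RELATIVE_PERFORMANCE = 0.70   # AMULET1 at "about 70%" of the ARM6
RELATIVE_POWER = 1.0          # assumption: "very similar" power read as equal


def relative_efficiency(perf_ratio: float, power_ratio: float) -> float:
    """Return work per watt relative to the reference part."""
    return perf_ratio / power_ratio


def amulet1_mips_per_watt() -> float:
    """Scale the ARM6 efficiency by the AMULET1's relative efficiency."""
    return ARM6_MIPS_PER_WATT * relative_efficiency(
        RELATIVE_PERFORMANCE, RELATIVE_POWER
    )


if __name__ == "__main__":
    print(f"Estimated AMULET1 efficiency: {amulet1_mips_per_watt():.0f} MIPS/W")
```

Under that assumption the estimate comes to roughly 84 MIPS/W, in line with the "about 80 MIPS/W" stated above.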
on a 1 micron CMOS process, which is about twice the area of
the synchronous part. Some of the increase can be attributed
to the more sophisticated organisation of the new part: it has
a deeper pipeline than the clocked version and it supports
multiple outstanding memory requests; there is also
specialised circuitry to increase the multiplication speed.
Although there is undoubtedly some overhead attributable to
the asynchronous control logic, this is estimated to be closer
to 20% than to the 100% suggested by the direct comparison.
AMULET1 is code compatible with ARM6 and so is capable of
running existing binaries without modification. The
implementation also includes features such as interrupts and
memory aborts.
The work was part of a broad ESPRIT-funded investigation
there is interest in low-power techniques both for portable
equipment and (in the longer term) to alleviate the problems
of the increasingly high dissipation of high-performance
chips. This initial investigation into the role
asynchronous logic might play has now demonstrated that asynchronous
techniques can be applied to problems of the scale of a
(1994-12-08)