A Case for Asynchronous Computer Architecture

Author: mahami

Score: 43

Comments: 36

Date: 2021-11-28 06:12:06

________________________________________________________________________________

Animats wrote at 2021-12-01 01:40:56:

It's a classic idea. There were some early asynchronous mainframes built from discrete logic. It might come back. It's an idea that comes around when you can't make the clock speed any higher.

It's one of those things from the department of "we can make it a little faster at the cost of much greater complexity, higher cost, and lower reliability". That's appropriate to weapons systems and auto racing.

fivelessminutes wrote at 2021-11-28 07:24:21:

This seems to be from 20 years ago, the most recent citation was from 2000 and it describes a MIPS chip built on a 1998 process.

matja wrote at 2021-11-28 07:35:57:

And not even a mention of AMULET (https://en.wikipedia.org/wiki/AMULET_microprocessor)

Taniwha wrote at 2021-12-01 00:30:07:

Nor this one:

https://authors.library.caltech.edu/43698/1/25YearsAgo.pdf

It was the original paper for this that got me interested in building silicon tools

nickdothutton wrote at 2021-11-28 20:48:22:

Came here to say this.

mikeurbach wrote at 2021-11-28 19:45:15:

We had the pleasure of hosting Dr. Manohar at a CIRCT weekly discussion session earlier this year. He presented much more recent work, if anyone is interested. The talk and discussion were recorded here:

https://sifive.zoom.us/rec/play/Bg99_niHh9OG_8uE_nhaz6otxvA0...

EDIT: talk begins around 7 minutes.

mahami wrote at 2021-11-28 08:40:25:

Yes, but I thought that it could be interesting to look at research on the topic from 20 years ago to compare it with present progress.

dgellow wrote at 2021-11-28 11:25:22:

Could you add the publication year in the title of your submission?

Animats wrote at 2021-12-01 01:33:49:

(2000)

UncleOxidant wrote at 2021-11-30 23:29:05:

Has there been much progress? I remember hearing a lot about asynchronous logic circuits back in the 90s, but don't hear about much in the way of breakthroughs since then.

blagie wrote at 2021-11-28 08:44:51:

Asynchronous would work better, but we're unlikely to get there -- too big a change.

It's like:

* having ECC everywhere

* having a single display standard (as opposed to HDMI/DisplayPort/USB-C/DVI/VGA/...)

* some kind of architecture where a single bad expansion card (USB, PCIe, etc.) can't crash a whole computer

... and so on

On one hand, no brainer. On the other hand, it hasn't happened.

NVidia is breaking ground on the move to SIMD/MIMD-style architectures, as predicted at the same time, and only because it gives a 30x boost in performance. Async will probably net us a 50% performance boost or something.

astrange wrote at 2021-12-01 01:41:47:

> * some kind of architecture where a single bad expansion card (USB, PCIe, etc.) can't crash a whole computer

If you mean IOMMU, we do have that. It doesn't seem completely doable because someone could still plug an etherkiller into the card.

darkstarsys wrote at 2021-11-30 23:31:37:

I tried to do a clockless fully async bus interface in around 1988 in a chip I was designing at Masscomp for a fast data acquisition system. Never got built, but it was fun trying, and it would've been really fast. "Lower design complexity" though: hahaha! Nope.

bob1029 wrote at 2021-11-28 11:51:59:

Having a common clock reference (per core) is essential for reducing latency between components. If you have to poll or await some other component arbitrarily, there will necessarily be extra overhead and delays in these areas. There will also need to be extra logic area dedicated to these activities. Make no mistake, just because there's no central clock doesn't mean you are magically off the hook. You still need to logically serialize the instruction stream(s).

Even for low power applications, you would probably use less battery getting the work done quickly in a clocked CPU and then falling back to a lower power state ASAP. Allowing the pipeline effects to take hold in a modern clocked CPU should quickly offset any relative overhead. Heterogeneous compute architecture is also an excellent and proven approach.

Certainly, there are many things that happen in a CPU that should not necessarily be bound by a synchronous clock domain (e.g. ripple adder). But, for these areas where an async CPU is a clear win, would we actually see any gains in practice using real software? Feels like there are a lot of other strategic factors that wash out any specific wins.

saurik wrote at 2021-11-28 12:09:16:

My understanding--which seems to coincide with this article and which Wikipedia seems to agree with (not that that necessarily means much for this)--is that in an asynchronous circuit latency would be lower, not higher, as the clock is required to wait for the worst-case performance while a clock-less system can proceed immediately once only the required inputs have arrived (or even attempt to speculate on partial inputs, something which would offer no value if you would have to end up waiting for the next tick anyway).

blagie wrote at 2021-11-30 13:56:42:

This is correct. It happens at multiple levels. Oversimplified:

* An async add operation takes variable time based on the number of carries, whereas a sync one is set to the worst-case.

* The clock for an ALU is set for the worst-case even when doing something faster (e.g. an ADD rather than a NAND)

* If you have multiple logic stages handled in one clock cycle, the problem is compounded. The clock is set by the slowest stage for all components in the system.

* If your system is doing nothing, you're still clocking it. Clocks are adjusted, but not at a nanosecond-by-nanosecond level.

All-in-all async gives a nice power boost and a nice performance boost (not enough of a boost to displace an entrenched ecosystem, mind you, but a nice boost nonetheless).
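The worst-case argument in the list above can be put into a toy model. This is only a sketch: the stage names and picosecond figures are made up for illustration, and the helper names `clocked_latency_ps` and `clockless_latency_ps` are hypothetical. It compares a clocked pipeline, where every stage costs one period set by the slowest worst case, with a clockless one, where each stage costs only what the current input needs.

```python
# Hypothetical per-stage delays in picoseconds: (best case, worst case).
# These numbers are invented for illustration, not measured from any chip.
STAGE_DELAYS_PS = {
    "fetch": (400, 500),
    "decode": (300, 400),
    "execute": (500, 1000),
}


def clocked_latency_ps(stages):
    """Clocked design: every stage takes one clock period, and the period
    must cover the slowest stage's worst case."""
    period = max(worst for _best, worst in stages.values())
    return period * len(stages)


def clockless_latency_ps(stages, case):
    """Clockless design: each stage takes only as long as it needs.
    `case` selects the "best" or "worst" delay for every stage."""
    index = {"best": 0, "worst": 1}[case]
    return sum(delays[index] for delays in stages.values())


if __name__ == "__main__":
    print("clocked:          ", clocked_latency_ps(STAGE_DELAYS_PS), "ps")
    print("clockless (best): ", clockless_latency_ps(STAGE_DELAYS_PS, "best"), "ps")
    print("clockless (worst):", clockless_latency_ps(STAGE_DELAYS_PS, "worst"), "ps")
```

The model deliberately ignores handshake overhead between stages, which is a real cost in clockless designs, so it shows the theoretical upper bound on the gain rather than what silicon would deliver.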

FullyFunctional wrote at 2021-12-01 00:15:06:

Yeah that's the theory, but reality is different and probably why we don't see any in production. (The last company that would admit to a tiny bit of clockless logic, Wave, folded).

The reality is that doing clockless logic introduces a lot of overhead at every state, both area and timing. There are different styles and the issues differ between them, but the bottom line is that nobody has been able to realize the theoretical wins in production (note1). And that's not even addressing the lack of tooling.

note1: the closest IMO is Ivan Sutherland's group, which has some very impressive claims, but still nothing you can run out and buy.

baybal2 wrote at 2021-11-28 12:17:35:

Clock distribution eats a lot of power at gigahertz frequencies, and a lot of gates.

> If you have to poll or await some other component arbitrarily, there will necessarily be extra overhead and delays in these areas.

You don't poll. You have a lot of small input-clocked domains which work at a speed with which data comes.

baybal2 wrote at 2021-11-28 11:25:46:

I will raise an important distinction: asynchronous logic != dynamic logic.

There can be dynamic synchronous logic, and vice versa.

Dynamic vs. static determines whether the circuit as such needs to be driven by a constant pacing input, whether an embedded clock or an external clock, or can arrive at a settled state (latch) without one.

If you are to speak strictly, asynchronous vs. synchronous determines whether that pacing input is external, or recovered from input.

SavantIdiot wrote at 2021-11-30 23:52:58:

Do you mean domino logic?

FullyFunctional wrote at 2021-12-01 00:18:06:

Domino is _one_ version of asynchronous, but that's using a different notion of asynchronous than the article. Because of the ambiguity, we talk today of clock-less logic, which comes in variants, most notably delay-insensitive and quasi-delay-insensitive. The latter is faster, but less immune to noise (and has terrible timing analysis issues).

CalChris wrote at 2021-11-30 23:54:24:

Mini-MIPS isn't _that_ different from a conventional out-of-order superscalar microarchitecture. The article even says:

However, the MiniMIPS pipeline structure can execute instructions out-of-order with respect to each other because instructions that take different times to execute are not artificially synchronized by a clock signal.

IshKebab wrote at 2021-11-28 22:58:26:

Does this mean that the chip isn't clocked? Doesn't that give you a complete metastability nightmare? How does it work?

blagie wrote at 2021-11-30 13:51:25:

No metastability nightmare.

One way to do this is to have each component have an output clock, which rises when its output is known to be stable. If an adder has no carries, that takes 1ns. If it has every possible carry, it takes 2ns. You have a second clock propagating backwards to signal when the next stage is ready for its next input.

You still have timing. It's just set to when a component is ready with output, or ready to receive input.

Everything goes faster and uses less power.
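A rough way to see how the forward "output stable" signal and the backward "ready" signal interact is to compute event times for a small pipeline. The sketch below is a timing model under stated assumptions, not a circuit: each stage holds one item, the source always has input available at time 0, the sink is always ready, and the handshake itself costs nothing. The 1ns and 2ns adder delays come from the comment above; the fixed 1ns second stage and the function names are hypothetical.

```python
def simulate_handshake(delays_ns):
    """delays_ns[i][k] is the time stage i needs to process item k.

    Returns the times at which the last stage's output becomes stable
    for each item."""
    n_stages = len(delays_ns)
    n_items = len(delays_ns[0])
    accept = [[0] * n_items for _ in range(n_stages)]  # stage i latches item k
    valid = [[0] * n_items for _ in range(n_stages)]   # stage i output stable

    for k in range(n_items):
        for i in range(n_stages):
            # Forward signal: upstream output for item k is stable.
            upstream_valid = valid[i - 1][k] if i > 0 else 0
            # Backward signal: this stage has handed item k-1 downstream.
            if k == 0:
                stage_free = 0
            elif i + 1 < n_stages:
                stage_free = accept[i + 1][k - 1]
            else:
                stage_free = valid[i][k - 1]  # sink takes results immediately
            accept[i][k] = max(upstream_valid, stage_free)
            valid[i][k] = accept[i][k] + delays_ns[i][k]
    return valid[-1]


def clocked_completion_ns(period_ns, n_stages, n_items):
    """Same pipeline with a global clock: item k leaves after k + n_stages ticks."""
    return [(k + n_stages) * period_ns for k in range(n_items)]


if __name__ == "__main__":
    adder = [1, 2, 1]      # no carries, all carries, no carries
    next_stage = [1, 1, 1]
    print(simulate_handshake([adder, next_stage]))   # handshake-paced
    print(clocked_completion_ns(2, 2, 3))            # clock set to the 2ns worst case
```

With these inputs the handshake-paced pipeline finishes the three items at 2, 4 and 5 ns, while the clocked version with a 2 ns period finishes them at 4, 6 and 8 ns. Real implementations pay for the handshake logic, which this model leaves out.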

boibombeiro wrote at 2021-11-28 13:56:49:

Memory cells are the thing that uses the vast majority of power in a CPU. And they are used everywhere, cache, uOP cache, BTB, etc.

Async CPUs solve a problem whose solution would bring only marginal benefit in a metric we care about.

Also, I imagine, they would need to be implemented assuming the worst timing delay from the processes. They can't be binned like modern CPUs.

IshKebab wrote at 2021-11-28 22:04:46:

That doesn't sound right? Dynamic power is consumed by toggling wires, and memory cells are going to be one of the places where toggling is rare because you can't access all memory all the time.

Am I missing something?

hypertele-Xii wrote at 2021-11-30 15:59:40:

Volatile memory consumes constant power to remember its value. Processing circuits only consume power when activated. And it's difficult to get the memory bandwidth saturated in a way that keeps all circuits busy. Computers do work in bursts; then they wait for data. And practically all classical computer science data structures thrash the cache, like linked lists and OOP in general.

Taniwha wrote at 2021-12-01 00:33:15:

You're confusing DRAM and CPUs - CPUs almost exclusively use SRAM cells internally, which don't require refresh.

123pie123 wrote at 2021-12-01 00:24:58:

I would have thought an asynchronous finite state machine type of system could be used to create a computer?

Const-me wrote at 2021-11-28 10:02:15:

Modern clocked processors don't account for worst-case timings. Instead, instructions take a variable number of clock cycles to complete.

In some sense they're already asynchronous, despite being clocked.

nynx wrote at 2021-11-28 13:20:27:

Certainly, modern CPUs are pipelined, but each clock cycle is still the worst-case time for all steps in the pipeline.

Const-me wrote at 2021-11-28 14:27:06:

> each clock cycle is still the worst-case time for all steps in the pipeline

The pipeline takes a variable number of clocks to complete an instruction. The number depends on the instruction, the input data of the instruction, and quite a few other things. In some exotic cases it even depends on power state, e.g. some Intel CPUs took ~20k cycles to power on their AVX pieces; during that window AVX instructions are much slower.

If for any reason the pipeline is unable to deliver the result by the end of the clock, CPUs don’t delay the clock, they continue running the clock. You simply get the result on some later clock cycle.

nynx wrote at 2021-11-28 15:14:42:

That's exactly what I mean.
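The point the two commenters converge on can be sketched numerically: a clocked unit may take a variable number of cycles, but its result only becomes visible on a clock edge, so the latency is rounded up to a whole number of periods. The 250 ps period and the latencies below are hypothetical values chosen for illustration, and the function names are made up for this sketch.

```python
def cycles_to_result(latency_ps, period_ps):
    """Clock edges that must pass before a result that settles after
    latency_ps can be observed (ceiling division on integers)."""
    return -(-latency_ps // period_ps)


def quantization_loss_ps(latency_ps, period_ps):
    """Time spent waiting for the next clock edge after the result is ready."""
    return cycles_to_result(latency_ps, period_ps) * period_ps - latency_ps


if __name__ == "__main__":
    period_ps = 250  # hypothetical 4 GHz clock
    for latency_ps in (2250, 2300, 4750):
        print(
            latency_ps,
            cycles_to_result(latency_ps, period_ps),
            quantization_loss_ps(latency_ps, period_ps),
        )
```

In this model the variable cycle count recovers most of the data-dependent timing, and what remains is the wait for the next edge, which is less than one period per operation.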

twoodfin wrote at 2021-11-28 13:01:59:

Can you write more about this or provide some examples? Of course, memory access has had variable timing “forever”, but the idea that other functional units can vary their timings for instructions is new to me.

Taniwha wrote at 2021-12-01 00:40:13:

Well, for example, imagine you have a 64-bit adder and you add two numbers together. Let's assume that one of the inputs is '1'. How long does it take until the output is stable? It depends on the second value and, more importantly, on how long it takes for all the carries to propagate to the MSB. For a naive circuit and an input of 0 the output will stabilise very quickly; for an input of 0xffff_ffff_ffff_ffff it will take 64 adder delays. An async circuit can have simple additions run faster than the worst-case ones (which will still work), while a synchronous circuit would have a clock that could only go as fast as the slowest case (or pipeline things so that the output appears multiple clocks later).
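The carry-propagation argument can be checked with a small bit-level model of a naive ripple-carry adder. This is a sketch, assuming one "adder delay" per bit position a carry passes through; `longest_carry_chain` is a hypothetical helper name, and real adders use faster carry schemes than this.

```python
def longest_carry_chain(a, b, width=64):
    """Length of the longest run of bit positions a single carry ripples
    through when adding a and b on a naive ripple-carry adder."""
    longest = 0
    run = 0
    carry = 0
    for i in range(width):
        x = (a >> i) & 1
        y = (b >> i) & 1
        generate = x & y    # carry out is 1 regardless of carry in
        propagate = x ^ y   # carry out equals carry in
        if generate:
            run = 1         # a fresh carry starts at this bit
        elif propagate and carry:
            run += 1        # the incoming carry ripples one bit further
        else:
            run = 0         # no carry leaves this bit
        carry = generate | (propagate & carry)
        longest = max(longest, run)
    return longest


if __name__ == "__main__":
    print(longest_carry_chain(1, 0))
    print(longest_carry_chain(1, 1))
    print(longest_carry_chain(1, 0xFFFF_FFFF_FFFF_FFFF))
```

Adding 1 to 0 produces no carries at all, adding 1 to 1 produces a chain of length 1, and adding 1 to 0xffff_ffff_ffff_ffff produces a chain of length 64, matching the 64 adder delays in the comment above.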

Const-me wrote at 2021-11-28 14:04:01:

A good source of that info is

https://www.uops.info/

For instance, on my CPU which is AMD Zen 3, the idiv instruction (it computes integer division and modulo) takes between 9 and 19 cycles for 64-bit version:

https://www.uops.info/html-instr/IDIV_R64.html#ZEN3

That’s for the operand already in a register i.e. no RAM access involved.

Whether it takes 9 cycles, 19 cycles, or something in between, depends on the arguments of the instruction, i.e. on the numbers being divided.

Same applies to quite a few other instructions: floating point divisions (divps, divpd), floating point square root (sqrtps, sqrtpd), even 64-bit integer multiplication (imul).

It’s not just the math. Jumps, branches and function calls take very different numbers of cycles depending mostly on two things: whether they were predicted, and the state of the micro-op cache at the target address. These effects are very hard to measure reliably, though, because they depend too much on the surrounding code; that is probably why uops.info doesn’t have latency figures for jmp/call/etc.