Mistakes in silicon chips to help boost computer power

By Mark Ward Technology correspondent, BBC News

Making sure chips do not make mistakes has a financial and power cost

Silicon chips that are allowed to make mistakes could help ensure computers
continue to get more powerful, say US researchers.

As components shrink, chip makers struggle to get more performance out of them
while meeting power needs.

Research suggests relaxing the rules governing how they work and when they work
correctly could mean they use less power but get a performance boost.

Special software is also needed to cope with the error-laden chips.

Call costs

The silicon industry is defined by Moore's Law, which predicts that the number
of transistors that can fit on a given area of silicon, for a given price, will
double every 18-24 months.

This is usually accomplished by shrinking transistors and typically means that
processing steadily gets more powerful.

Transistors are tiny switches that are used as the fundamental building blocks
of silicon chips.

However, many experts point out that the relentless march of Moore's Law could
stumble when components get so small they become unreliable.

The unreliability - or "statistical variability" - of chips is a problem that
many researchers are trying to deal with, said Professor Asen Asenov from the
Department of Electronics and Electrical Engineering at the University of
Glasgow.

Variability increases as components shrink, said Professor Asenov, who has been
using large scale simulations on grid computers to study how the behaviour of
transistors changes as they get smaller.

For Professor Rakesh Kumar at the University of Illinois, the demise of Moore's
Law is being hastened by an insistence on making silicon chips operate
flawlessly.

Professor Kumar said variations in manufacturing, environment, and workload can
conspire to make a chip suffer errors. Manufacturers try to ensure that
whatever happens, he said, the chip works correctly.

"It's a case of 'if the software asks the chip to do something it does it at
any cost,'" he said.

Professor Kumar's research suggests that the pursuit of perfection forces
manufacturers to make some poor choices.

"To ensure correct operation you are purposefully running the chips at higher
power than you need to," he said.

Error condition

That insistence on perfection also pushes up manufacturing costs because many
chips have to be discarded if they fall short.

Professor Kumar said that it would become harder and harder for chip makers to
ensure instructions are executed flawlessly as components shrink.

The tiny components in chips are already starting to give rise to errors.

Instead of trying to eliminate this, he said, it should be embraced to produce
so-called "stochastic processors" that are subject to random errors.

"The hardware is already stochastic so why continue pretending it's flawless?"
he asked. "Why put in more and more money to make it look flawless?"

Through research part-funded by Intel, Professor Kumar and his colleagues are
designing processors that forgo flawlessness. Instead they attempt to manage
the number and type of errors so they can be coped with efficiently.

An example error, said Professor Kumar, is when a chip fails to complete a
cycle of instructions within a given time. The workings of most chips are
governed by a clock and the data processing they do advances with each tick of
that time-keeper.
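
A rough way to picture this kind of timing error is to treat the time a chip's
logic takes to settle as a random quantity and count how often it overruns the
clock period. The Python sketch below does that. The normal distribution, the
nanosecond figures and the function name are illustrative assumptions, not
details from Professor Kumar's research.

```python
import random


def timing_error_rate(clock_period_ns, mean_delay_ns, jitter_ns,
                      cycles=100_000, seed=0):
    """Return the fraction of simulated cycles that miss the clock tick.

    Hypothetical model: the logic delay in each cycle is drawn from a normal
    distribution. A cycle counts as an error when the delay is longer than
    the clock period, so the result is not ready in time.
    """
    rng = random.Random(seed)  # fixed seed keeps the run repeatable
    missed = 0
    for _ in range(cycles):
        delay_ns = rng.gauss(mean_delay_ns, jitter_ns)
        if delay_ns > clock_period_ns:
            missed += 1
    return missed / cycles


# Generous timing margin: the clock period is far longer than the typical delay.
conservative = timing_error_rate(clock_period_ns=1.3, mean_delay_ns=1.0,
                                 jitter_ns=0.05)

# Tighter margin: the same logic, but less slack before the next tick.
aggressive = timing_error_rate(clock_period_ns=1.1, mean_delay_ns=1.0,
                               jitter_ns=0.05)

print(f"conservative margin: {conservative:.4%} of cycles missed")
print(f"aggressive margin:   {aggressive:.4%} of cycles missed")
```

In this toy model, trimming the timing margin turns a practically error-free
design into one that misses a small fraction of clock ticks, which is the sort
of error a stochastic processor would be expected to tolerate.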

The upside of using chips that can make mistakes is much reduced power
consumption.

Depending on how many errors a designer is prepared to tolerate, power
consumption can be cut by up to 30%, he said. With an error rate of only 1%,
power can be cut by 23%.

In many cases the errors will not have a significant impact on the workings of
a computer. In other cases, he said, they could cause a system to crash.

To cope with this, Professor Kumar and colleagues are researching ways to make
applications more tolerant of mistakes.

The "robustification" of software, as he calls it, involves re-writing it so an
error simply causes the execution of instructions to take longer.
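
One way to see how an error might cost time rather than correctness is an
iterative calculation that checks its own answer. The Python sketch below
estimates a square root with Newton's method on simulated unreliable
arithmetic. The fault model, the function names and the assumption that the
final check itself runs reliably are all illustrative choices, not Professor
Kumar's actual technique.

```python
import random


def flaky(value, rng, error_rate):
    """Return value, occasionally scaled by a random factor.

    Hypothetical fault model standing in for an unreliable arithmetic unit.
    """
    if rng.random() < error_rate:
        return value * rng.uniform(0.5, 1.5)
    return value


def robust_sqrt(a, error_rate=0.01, tol=1e-9, max_iters=10_000, seed=0):
    """Estimate sqrt(a) for a > 0; return (estimate, iterations used).

    Each Newton step starts from the current estimate, so a corrupted step
    only pushes the estimate away from the answer and later steps pull it
    back. The convergence check is assumed to run on reliable hardware.
    """
    if a <= 0:
        raise ValueError("a must be positive")
    rng = random.Random(seed)  # fixed seed keeps the run repeatable
    x = max(a, 1.0)
    for iterations in range(1, max_iters + 1):
        # One Newton step, computed on the simulated unreliable hardware.
        x = flaky(0.5 * (x + a / x), rng, error_rate)
        # Accept the estimate only once its square is close enough to a.
        if abs(x * x - a) <= tol * a:
            return x, iterations
    raise RuntimeError("did not converge within max_iters")


clean = robust_sqrt(2.0, error_rate=0.0)  # no injected errors
noisy = robust_sqrt(2.0, error_rate=0.2)  # one step in five may be corrupted

print("without errors:", clean)
print("with errors:   ", noisy)
```

Both runs stop only when the estimate passes the same check, so injected errors
can add iterations but should not change the accepted answer beyond the chosen
tolerance.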

In another approach, the more robust software logs a user's actions. As the
software is used, this log can be consulted to spot when something unexpected
occurs.

The work on applications and programs may be more immediately useful, said
Professor Kumar, as it can be applied to existing applications. This should
make them cope with bugs that are showing up now and prepare them for use with
future processors.