By Mark Ward, Technology correspondent, BBC News
Silicon chips that are allowed to make mistakes could help ensure computers
continue to get more powerful, say US researchers.
As components shrink, chip makers struggle to get more performance out of them
while meeting power needs.
Research suggests that relaxing the rules governing how chips work, including the demand that they always work correctly, could cut their power use while also giving them a performance boost.
Special software is also needed to cope with the error-laden chips.
Call costs
The silicon industry is defined by Moore's Law, which predicts that the number
of transistors that can fit on a given area of silicon, for a given price, will
double every 18 to 24 months.
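To make that doubling concrete, here is a minimal sketch of the arithmetic. The function name and the starting figure of one billion transistors are illustrative assumptions, not figures from the article; the only input taken from the text is the 18 to 24 month doubling period.

```python
def projected_transistors(start_count: float, months: float, doubling_months: float = 24.0) -> float:
    """Project a transistor count forward, assuming one doubling every `doubling_months`."""
    return start_count * 2 ** (months / doubling_months)


if __name__ == "__main__":
    # Illustrative only: the starting count of one billion is an assumed figure.
    for period in (18, 24):
        after_six_years = projected_transistors(1e9, 72, period)
        print(f"doubling every {period} months -> {after_six_years:.2e} transistors after 6 years")
```

Over six years, an 18-month doubling period gives four doublings (16 times as many transistors) while a 24-month period gives three (8 times), which is why the exact period matters so much to chip makers' roadmaps.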
This is usually achieved by shrinking transistors, and typically means that
processing power steadily increases.
Transistors are tiny switches that are used as the fundamental building blocks
of silicon chips.
However, many experts point out that the relentless march of Moore's Law could
stumble when components get so small they become unreliable.
The unreliability - or "statistical variability" - of chips is a problem that
many researchers were trying to deal with, said Professor Asen Asenov from the
Department of Electronics and Electrical Engineering at the University of
Glasgow.
Variability increases as components shrink, said Professor Asenov, who has been
using large-scale simulations on grid computers to study how the behaviour of
transistors changes as they get smaller.
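One well-known reason variability grows as devices shrink can be shown with a toy model. The sketch below assumes, as an illustration rather than as a description of Professor Asenov's simulations, that the number of dopant atoms in a transistor's channel follows Poisson statistics, so its standard deviation is the square root of its mean. The dopant counts used are hypothetical.

```python
import math


def relative_dopant_spread(mean_dopants: float) -> float:
    """Relative spread (std / mean) of a Poisson-distributed dopant count.

    Assumption: the number of dopant atoms in a transistor channel is modelled
    as Poisson, so its standard deviation is sqrt(mean).
    """
    return math.sqrt(mean_dopants) / mean_dopants


if __name__ == "__main__":
    # Hypothetical mean dopant counts for progressively smaller transistors.
    for n in (10_000, 1_000, 100, 10):
        print(f"{n:>6} dopants -> +/- {relative_dopant_spread(n):.1%}")
```

Under this assumed model, a channel holding about 10,000 dopants varies by roughly 1% from device to device, while one holding about 10 varies by more than 30%, so otherwise identical transistors behave increasingly differently.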
For Professor Rakesh Kumar at the University of Illinois, the demise of Moore's
Law is being hastened by an insistence on making silicon chips operate flawlessly.
Professor Kumar said variations in manufacturing, environment, and workload can
conspire to make a chip suffer errors. Manufacturers try to ensure that
whatever happens, he said, the chip works correctly.
"It's a case of 'if the software asks the chip to do something it does it at
any cost,'" he said.
Professor Kumar's research suggests that the pursuit of perfection forces
manufacturers to make some poor choices.
"To ensure correct operation you are purposefully running the chips at higher
power than you need to," he said.
Error condition
That insistence on perfection also pushes up manufacturing costs because many
chips have to be discarded if they fall short.
Professor Kumar said that it would become harder and harder for chip makers to
ensure instructions are executed flawlessly as components shrink.
The tiny components in chips are already starting to give rise to errors.
Instead of trying to eliminate them, he said, chip makers should embrace them to
produce so-called "stochastic processors" that are subject to random errors.
"The hardware is already stochastic so why continue pretending it's flawless?"
he asked. "Why put in more and more money to make it look flawless?"
Through research part-funded by Intel, Professor Kumar and his colleagues are
designing processors that forgo flawlessness. Instead they attempt to manage
the number and type of errors so they can be coped with efficiently.
An example error, said Professor Kumar, is when a chip fails to complete a
cycle of instructions within a given time. The workings of most chips are
governed by a clock and the data processing they do advances with each tick of
that time-keeper.
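The kind of timing error Professor Kumar describes can be sketched in a few lines. In this illustrative model, which is an assumption rather than a description of his designs, each cycle's work needs some time to settle; if it needs longer than one tick of the clock, the result is captured before it is ready and counts as an error. The function name and the delay figures are hypothetical.

```python
def timing_errors(path_delays_ns: list[float], clock_period_ns: float) -> list[int]:
    """Return the cycles whose logic delay exceeds the clock period.

    Assumption (illustrative): each entry is the time one cycle's work needs
    to settle. If it needs longer than one clock tick, the result is latched
    before it is ready, which counts as a timing error.
    """
    return [cycle for cycle, delay in enumerate(path_delays_ns) if delay > clock_period_ns]


if __name__ == "__main__":
    # Hypothetical delays for five cycles on a 1 GHz clock (1 ns period).
    delays = [0.80, 0.95, 1.05, 0.70, 1.20]
    print(timing_errors(delays, clock_period_ns=1.0))  # -> [2, 4]
```

A conventional design would slow the clock or raise the power until every cycle fits; a stochastic processor instead tolerates the occasional late cycle.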
The upside of using chips that can make mistakes is much reduced power
consumption.
Depending on how many errors a designer is prepared to tolerate, power
consumption can be cut by up to 30%, he said. With an error rate of only 1%,
power can be cut by 23%.
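The percentages above translate directly into watts, and a standard textbook model hints at where such savings could come from. The sketch below assumes a hypothetical 100 W chip and the usual dynamic-power relation P = a*C*V^2*f with everything except supply voltage held fixed. The article does not say how the savings were obtained, so the voltage figures are only an illustration of that assumed model.

```python
import math


def power_after_cut(baseline_watts: float, cut_percent: float) -> float:
    """Power remaining after a percentage reduction."""
    return baseline_watts * (1 - cut_percent / 100)


def voltage_scale_for_cut(cut_percent: float) -> float:
    """Supply-voltage scale factor giving `cut_percent` lower dynamic power.

    Assumption: dynamic power follows P = a*C*V^2*f with activity, capacitance
    and frequency held fixed, so power scales with V squared.
    """
    return math.sqrt(1 - cut_percent / 100)


if __name__ == "__main__":
    baseline = 100.0  # hypothetical chip drawing 100 W
    for cut in (23, 30):
        print(f"{cut}% cut: {power_after_cut(baseline, cut):.0f} W, "
              f"voltage x{voltage_scale_for_cut(cut):.3f}")
```

Under that assumed model, the 23% saving corresponds to running the supply roughly 12% lower, and the 30% saving to roughly 16% lower, because power falls with the square of the voltage.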
In many cases the errors will not have a significant impact on the workings of
a computer. In other cases, he said, they could cause a system to crash.
To cope with this, Professor Kumar and colleagues are researching ways to make
applications more tolerant of mistakes.
The "robustification" of software, as he calls it, involves rewriting it so an
error simply causes the execution of instructions to take longer.
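The article does not explain how robustification works, so the sketch below swaps in one simple, generic technique to show the idea of an error costing time rather than correctness: re-run an operation until two consecutive results agree. Both functions, the "noisy adder" and its error model, are hypothetical stand-ins for unreliable hardware, not Professor Kumar's method.

```python
import random


def noisy_add(a: int, b: int, error_rate: float, rng: random.Random) -> int:
    """Hypothetical stochastic adder: with probability `error_rate` it flips
    one of the eight lowest bits of the correct sum."""
    result = a + b
    if rng.random() < error_rate:
        result ^= 1 << rng.randrange(8)
    return result


def robust_add(a: int, b: int, error_rate: float, rng: random.Random,
               max_tries: int = 100) -> tuple[int, int]:
    """Re-run the addition until two consecutive results agree.

    Returns (result, attempts). An error usually just costs extra attempts,
    i.e. extra time. Two identical wrong answers in a row can still slip
    through, so this lowers the error rate rather than eliminating it.
    """
    previous = None
    for attempt in range(1, max_tries + 1):
        current = noisy_add(a, b, error_rate, rng)
        if current == previous:
            return current, attempt
        previous = current
    raise RuntimeError("results never agreed; giving up")


if __name__ == "__main__":
    rng = random.Random(42)
    trials = [robust_add(20, 22, error_rate=0.2, rng=rng) for _ in range(1000)]
    correct = sum(result == 42 for result, _ in trials)
    mean_attempts = sum(attempts for _, attempts in trials) / len(trials)
    print(f"{correct}/1000 correct, {mean_attempts:.2f} attempts on average")
```

Even with one operation in five going wrong, nearly every answer comes out right; the price is a few extra attempts on average, which is the trade of time for reliability the article describes.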
In another approach, the more robust software logs a user's actions. As the
software is used, this log can be consulted to spot when something unexpected
occurs.
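The logging approach can also be sketched, though the article gives no details. This minimal version assumes that "something unexpected" means the same action with the same inputs producing a different result from the one logged earlier, which would hint that the hardware made a mistake. The ActionLog class and its methods are hypothetical.

```python
from collections.abc import Hashable
from dataclasses import dataclass, field
from typing import Any


@dataclass
class ActionLog:
    """Hypothetical log of user actions and the results the software produced."""
    entries: list = field(default_factory=list)
    first_seen: dict = field(default_factory=dict)

    def record(self, action: str, inputs: Hashable, result: Any) -> bool:
        """Log one action; return True if it contradicts an earlier logged result."""
        key = (action, inputs)
        unexpected = key in self.first_seen and self.first_seen[key] != result
        self.first_seen.setdefault(key, result)
        self.entries.append((action, inputs, result, unexpected))
        return unexpected

    def anomalies(self) -> list:
        """All logged entries that were flagged as unexpected."""
        return [entry for entry in self.entries if entry[3]]


if __name__ == "__main__":
    log = ActionLog()
    log.record("total", (2, 3), 5)
    log.record("total", (2, 3), 5)
    log.record("total", (2, 3), 7)  # same request, different answer: flagged
    print(log.anomalies())  # -> [('total', (2, 3), 7, True)]
```

A real system would need a richer notion of "expected" behaviour, but the principle is the same: the log gives the software a record to check itself against.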
The work on applications and programs may be more immediately useful, said
Professor Kumar, as it can be applied to existing software. This should help
existing programs cope with bugs that are showing up now and prepare them for
use with future processors.