gemini - kennedy.gemi.dev

💾 Archived View for gemini.theuse.net › textfiles.com › rpg › fairdice.txt captured on 2022-01-08 at 19:48:49.
-=-=-=-=-=-=-
Newsgroups: rec.games.frp.archives
From: barnett@agsm.unsw.oz.au (Glen Barnett)
Subject: PAPER: Testing Dice for bias
Message-ID: <al#23np@rpi.edu>
Organization: The Australian Graduate School of Management
Date: Wed, 2 Dec 1992 04:41:08 GMT
Approved: goldm@rpi.edu
Lines: 749

The following information is intended for distribution over Internet,
and outside of that may be copied for personal use only. 
(c) Glen L. Barnett, 1992. All rights reserved.

I recently responded to a thread on rec.games.frp.dnd about testing dice,
in order to fix up a few misconceptions and make some suggestions. On the
suggestion of Coyt Watters I'm putting this on r.g.f.archives, which I think
is a good idea. What follows has some major additions, however.

In this post, I will begin by discussing various posts on the subject
of testing dice, then show how to do the test under discussion in the
thread (the chi-squared goodness-of-fit test) properly, and then talk 
about more appropriate tests.

---------------------------------------------------------------------

In article <BxHJLy.Jq0@watserv1.uwaterloo.ca>,
alongley@cape.UWaterloo.ca (Allan Longley) writes:
 [..stuff deleted..]
>testing dice for bias.  Well, here is a test to use.  I haven't actually 
>tested this yet, but it should -- in theory -- work.  And no, this is not
>a copy out of the old Dragon magazine, but it is the same test -- its a
>pretty standard test.  I will use simple terms -- so all you math/stat
>people out there, don't correct the fine points, I know.
 [description of chi-squared goodness-of-fit test deleted]

Since Allan asks for no correction of fine points, I will attempt to
limit myself to major problems. This is not intended as a flame on
Allan, but this is fairly important stuff, and should be explained 
correctly. If at any stage I get less than pleasant, please accept 
my apology in advance.

While the calculations that Allan describes give the correct value
of a chi-square goodness-of-fit statistic (which he calls "Indicator"),
you should be *very* wary of interpreting the results in the way 
he describes, as I will explain:

Let us assume you have 40 dice that (unknown to you) are all perfectly 
fair, and you wish to test all of them, to see if any are "biased".

The way Allan has set his test up, you'd expect 2 of them to give
results below "Probably Fair", which he says indicates the die is 
probably unbiased. That is, you have 40 fair dice, and you will 
expect to regard only *two* of them as probably O.K.! Similarly,
you will expect to consider two of your "purely fair" dice as
probably unfair. Of the remaining 36, you will expect 18 scores
between "Probably Fair" and "Maybe" and 18 more between "Maybe"
and "Probably biased". For these 36, you have to do the test again,
under Allan's scheme. If you get both results below "Maybe" (you expect
9 of these) you say "Probably Fair". Similarly you expect 9 above
"Maybe" on both trials.


So we have (after repeating the test for 90% of the dice):

Number of Fair Dice: 40
Expected number "probably fair": 11
Expected number "probably biased": 11
Expected number which we don't know about: 18.

So over a quarter of perfectly fair dice will be called "probably biased".
If we continue testing those remaining 18 we are still undecided about,
the problem gets worse.


Other problems:

Allan says:

"The column titled "Maybe" are the Indicator values where there is
a 50% chance that the die is fair and a 50% chance that the die is 
biased." (A) 

This is just plain wrong.

The column he refers to is the value that a test on a *fair* die will 
exceed 50% of the time. This is very different, and probably explains 
why Allan misunderstands the whole interpretation of the results. (B)

If any of you can't see why what I said (B), and what Allan said (A) are 
totally different, don't despair. This stuff is not always obvious from 
the start. If you can follow the rules of an average RPG, you are smart 
enough to understand a few non-trivial statistical ideas. I'm quite happy 
to provide further clarification to the net if the demand is there.

> >From Table 1, it appears that the d4 tested may be "Fair" but another test
> should be done.

I'd say not. A reasonable interpretation of the result is "There is
no reason to doubt that the die is O.K.".

Incidentally, the test statistic of 2.00 obtained in the example is
only 2/3 of what you'd expect with a fair die. The value of 2.00 will 
be exceeded almost 60% of the time by a test on a fair die.


---------------------------------------------------------------------

In article <Bxu26A.3F6@watserv1.uwaterloo.ca>,
alongley@cape.UWaterloo.ca (Allan Longley), in response
to Michael Wright, says:

|In article <wright.721879981@latcs1.lat.oz.au|wright@latcs2.lat.oz.au
| (Michael G. Wright) writes:
|>dks@acpub.duke.edu writes:
|>
|>>	This is called a chi-square test, and an article with the
|>>procedure and numbers for it appeared way back in Dragon issue 
|>>#74...  Thank you, Mr. Longley, for reposting it (or did you come
|>>up with it in isolation? =) ) for the benefit of those who don't
|>>have the issue (probably most readers).
|
|Yes, I seen the issue.  The chi-square test is a standard statistical test
|for determinig if a data set matches a particular distribution -- so, no, I
|did not come up with the test in isolation, its been around for a lot longer 
|than D&D.  I don't actually have the issue, so I didn't copy it for the net.
|
|I've been playing with the chi-square test and you know what I found out --
|ALL DICE ARE BIASED!!  Well, that's not true -- all except d4's and d6's are
|biased.  Of course, this really shouldn't be a surprise.  So, I've been 
|looking at modifiying the chi-square test for "real world" dice -- more on
|this in a later post.

Allan is correct that the test has been around a lot longer than D&D.
The test, due to Karl Pearson, is nearly 100 years old. 

Its no surprise that Allan finds that all real dice are biased:
 i) Its impossible to make a truly fair die (obviously). Its just
    that most are close enough that we don't care too much. The
    chance of getting "close to fair" will decrease with the number
    of sides.

ii) Allan's testing method will call more than a quarter of fair
    dice (assuming they existed) biased. Even the fairest dice you
    could buy have a good chance of being called biased.


|>Actually, I use a program I made to roll dice for stats. Unfortunately, nobody
|>in the party wants to use it, because dice rolls invariably end up better. I
|>think this must be because of the pseudo-randomness of the program. 
|>Anyone out there that knows better?
|
|I wouldn't want to use a computer generated die-value while playing AD&D.
|THe thrill of the "rolling die" is part of the game.  Also, with reference
|to the above, most players will have a favourite die/dice due to the
|inherent bias found in real dice.  The trick is to find the dice that are
|biased beyond reasonable playability.

In response to Michael G Wright:

Michael's discovery that hand-rolled dice often come out better
could occur for a couple of reasons: 

 i) Players tend to hang on to "favourite" dice that "roll well"
    (i.e. come up with good results). So biased dice have some 
    chance of concentrating into the hands of players. As long as
    this isn't too extreme, it probably doesn't matter too much.
    (Allan quite correctly identifies this reason).

ii) Players don't really roll randomly. I had a fairly long email 
    discussion with Sea Wasp on this topic just recently. Even 
    unconciously, you can pick up the dice in a "non-random" fashion,
    so that a good roll will tend to be followed by another good
    roll if you don't roll tooo vigorously. You may notice that
    after a bad roll players tend to throw harder. This may be more
    of a problem, as some players are *much* better at it than others.
    If it becomes too noticeable, you may wish to invest in a dice
    cup, or mock up a craps-table affair.

Allan's comment that "The trick is to find the dice that are biased
beyond reasonable playability" is spot on. Exactly correct. 
Remember it, because I'll come back to it later.

    
I think the above discussion also answers Paul Kinsler's questions.

[in article <BxvuzM.CsI@bunyip.cc.uq.oz.au>, kinsler@physics.uq.oz.au 
 (Paul Kinsler) asked for clarification of Allan's comment that all
 dice are biased].

---------------------------------------------------------------------

In article <1992Nov12.194015.1602@Princeton.EDU>,
 dagolden@phoenix.Princeton.EDU (David Alexandre Golden) writes:

[stuff deleted]
>I once did something along those lines to test whether my DM's die was
>fair.  (It would roll 20's a lot.  Often on command.)
>
>To test the die, I made a histogram (I believe is the term for it) like this:
>
>x  x  x        x        x
>x  x  x  x  x  x  x  x  x  x  x  x  x  x  x  x  x  x  x  x
>x  x  x  x  x  x  x  x  x  x  x  x  x  x  x  x  x  x  x  x
>1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
>
>With an "x" being each time that number came up.  A totally fair die would
>have a straight line across, assuming it was rolled enough times.  The
>question is, what is enough times?  I rolled a d20 about 380 times (not bad
>if one person rolls and the other makes an x in the column) and while it
>gave a fairly good idea that the die was biased, the biased numbers had only
>occured a couple times more than the average.  (If I remember correctly).
>So I wrote a computer program to do the same thing until the deviation between
>the highest and lowest number of occurances was less than about 15% of the
>average number of occurances.  (i.e a reasonably smooth profile).  The 
>computer required SEVERAL THOUSAND ROLLS to do this. 

Your idea of making a histogram is a good one. In fact all the
chi-squared test does is the same as drawing a line across the 
histogram where your expected "straight line across" would go, 
looking at the deviations from that, squaring (to get all positives)
and adding the squared deviations up and dividing by that expected 
number. This gives a single overall measure of deviation from uniformity. 
The advantage of looking at the histogram is you see where
the the differences are, but you can't tell how big they "ought" to be
for a fair die. The actual number in the cell will be approximately normally
distributed with mean equal to the expected number in each cell and standard
deviation approximately the square root of the mean. In the above example,
we'd expect Dave to get 19 in each cell, so the standard deviation is about
4.35. That is, we'd expect to get about 2/3 of the cells with counts in
the range 15 to 23, and about a 2/3 chance of all but 1 or 2 of the values
inside the range 11 to 27.

   [ some of my own discussion deleted - see the section "Tests based
     on the histogram" below]

>                           ...         Still, the point is that I'm skeptical
>that the "fairness" of a die can be determined in only a hundred or so rolls.
>(d4 maybe... d20 no way!)

Well, in fact Allan's suggestion was to use 20 rolls per cell, so he'd use
400 rolls for a d20 and 80 rolls for a d4. But in any case you can never
decide that a die is actually fair. If you do a test and get a result
close to what you expected if the die was fair, you have a lack of evidence
against the hypothesis of fairness (which is the default assumption for
a statistical test of biasedness in a die - the "null hypothesis").

What you get is either a higher degree of evidence against the hypothesis 
of fairness (by getting a result that is very unlikely with a fair die),
or a low degree of evidence against fairness. It's like in a court case,
(a criminal case), where the defendant is assumed innocent until proven
guilty (innocence is the null hypothesis), but evidence against the
defendant is presented by the prosecution. The jury then decides either
"guilty" if there is strong enough evidence, or "Not guilty" if there
is not. They don't declare innocence.

So we can't determine "fairness" anyway. The question we need to ask
is: If the die is biased, will a hundred rolls (or whatever number)
be enough for us to have a good chance to pick up that difference,
while at the same time, not "convicting the innocent" too often? 
Whether it is enough depends on how big a difference you think it is 
important to pick up.

-----------------------------------------------------------------------


In article <1992Nov12.231019.1204@mcs.kent.edu>,
 adray@mcs.kent.edu (Adam Dray) writes (in response to Dave Golden):

>In other words, a histogram shows very little.  Random doesn't mean
>necessarily that you'll get an even distribution.  It just mean the
>probability that you won't get an even distribution is proportional to
>the number of sides on the die, and the number of times you roll it.

I disgree with the first sentence. The final sentence above is wrong.
The more you roll it, the more even the distribution will be, as long
as the die is fair. It doesn't really depend on the number of sides.
(except as far as the negative dependence between cell counts is
 reduced for more sides).

>Notes about the fairness of dice:
>
>Sharp-edged dice are better than smooth-edged dice.  They're also more
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 Not always, but this may be true more often than not.

>expensive, however.  Rounded dice are often inked by coating the
>entire die with ink, then tossing the die in a "tumbler" (similar to
>tumblers for smoothing rocks) until all the die on the outside is
>gone.  Thus, the ink is left in the crevices where the numbers are.
>
>Theoretically, the grooves for the numbers can make one side a more
>likely outcome.  Official casino dice don't have inset pips.

If you do it right any effect will be swamped by other manufacturing 
defects anyway.

[some stuff deleted]
>
>GameScience did tests on other manufacturers' dice.  They found
>certain numbers to be more likely.  I've heard that the real 100-sided
>die tends to roll certain numbers more often.

It is impossible (both effectively and theoretically) to get a fair 100-
sided die. The practical problems are more important than the theoretical
ones.

>Filing corners off your dice can make certain outcomes more probable. 
>Natural wear can do the same thing.
>
>For most people, none of this matters one damn bit.  =)

In general, no, it doesn't matter. It's encouraging to see that so
many people (just about all posters on the topic) realise this.

---------------------------------------------------------------------------

Interpreting the test statistic (Allan's "Indicator")

Carry out the calculations as described by Allan*, but use any number of
throws per cell (possible outcome) you like (I'd suggest 10 as a minimum,
because otherwise the tabled distribution is out a bit). The more rolls
you do, the better chance you have of picking up a difference of a given
size. The value of 20 that Allan suggested may well be a reasonable choice
in most circumstances. Allan gives the calculations for two different
numbers of throws (20 and 10 per cell, but in different posts), so you
ought to be able to generalise.


	The calculations given by Allan may no longer be available to you, so

   an indication of how to do the calculations is given here:

   Roll the die many times, say 20 times per face. Record each result
   (I suggest you make up a tally sheet). Calculate the difference between
   the number of times each face came up and the expected number (20 in this
   case). Square these values and add them. Divide by the expected number 
   per face. This is your chi-squared statistic.
   E.g.    d4:   Roll 20 times per face = 80 rolls  
           Face:    1   2   3   4 
         No times  23  18  15  24
         expected  20  20  20  20
        difference  3   2   5   4
        diff^2      9   4  25  16  Sum = 54, chi-squared value = 54/20 = 2.7


If the result is less than the final column of Allan's table 2 (which 
are the tabulated values for a 5% significance level), you shouldn't 
worry too much, there is not very strong evidence of bias - in fact 1 
in 20 tests on a fair die will score worse than this. If the result is 
much bigger than the value you have some cause for concern. A result
bigger than the 1% column below is quite unusual if the die is fair 
(a result at least this big only occuring 1% of the time), so it gives 
us good reason to suspect bias.

A small table of the chi-squared distribution:

	5%      1%     df
 d4    7.81    11.34    3
 d6   11.07    15.09    5
 d8   14.07    18.48    7 
d10   16.92    21.67    9
d12   19.68    24.72   11
d20   30.14    36.19   19

(these results came from a computer approximation to the chi-squared
 distribution. They should be accurate to the figures given.)

If you want a more "cookbook" approach; if the result exceeds the 1%
value, its probably biased. If its between the 1% and 5% values, there
is a moderate degree of evidence that its biased, but it still might be
OK. If its less than the 5% value, you don't have any reason to think
its biased on the basis of the test.


You will find more extensive tables in most elementary statistics books.
  (references for the chi-squared and Kolmogorov-Smirnov tests are
     at the end of this article).
You look up the df (degress-of-freedom) that are one less than the
number of faces on the die (e.g. d4 -> 3 df).

A note on pronunciation: The Greek letter chi (the capital looks like
an X, and the lower-case has one of the two crossed lines a bit curly)
is pronounced with a hard "ch" like Charisma, and the word rhymes with
pie. Note that mathematical symbols come from *ancient* Greek, so no
arguments from any modern Greeks please.

This will provide a reasonable all-round test for bias in a die.


-------------------------------------------------------------------

Why you probably don't want to do the chi-squared test:
(at least for d8 and above)

The chi-squared test will pick up any kind of deviation from a purely even
distribution. However, we are much more worried about some kind of deviations
than others. For example, I'd be more interested in knowing that "20" came
up too often on a d20 than knowing "10" came up too often. The first could
affect play substantially, the second probably only a little. We should use a
test with a better chance to pick up the kind of deviations from fairness
that are most important to us (which will trade off with less chance of
picking up deviations we are less concerned with).

Let us consider a more complete example:

Imagine we have two d20's we'd like to test, and that in fact 
(but unknown to us) they have the following (percentage) probabilities:

(the rows are: Face number
               % prob 1st die
               % prob 2nd die )


  1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20
  5  4.5  6  3.5  7  2.5  8  1.5  7   5   5   7  1.5  8  2.5  7  3.5  6  4.5  5
 1.5 1.5 2.5 2.5 3.5 3.5 4.5 4.5  5   5   5   5   6   6   7   7   7   7   8   8

A fair die would have 5% right across, of course. These 2 dice can be obtained
from each other by relabelling the faces. The first die will be reasonable in
play, because, of course, we don't try to roll 'exactly 8' or 'exactly 9',
but 'less than 8' or 'greater than 11'. The first die is never out by
more than one twentieth of the required probablility (e.g. probability of
a 2 or less is 9.5% instead of 10%) in either direction. It has the correct
average, and almost the correct standard deviation (the difference is tiny).

The second die would be very unbalancing in play: it has about a 2/3 chance
(66%) of rolling 11 or higher, and a 20 is more than 6 times as likely as 
a 1. The mean is almost 13. The standard deviation is also out, but that's
relatively unimportant.

The chi-squared will rate them as equally bad!  

So a good test should be likely to identify the second die, but we might
be prepared to sacrifice some of our ability to pick up the first, since
it will make little practical difference in play. (I said I'd come back to
this point!)

Note that almost any deviation on a d4 will be important (there are only 4
different values), and to a lesser extent a d6. I'd stick with the chi-
squared test on those.

There are many tests that will do what we want. I will present only one
such test*. (This is not to say that a properly applied chi-squared is
not good, just that a test more closely tailored to our specific 
question of interest will be even better.)


	two tests if you count "Tests based on histograms", below.


The Kolmogorov-Smirnov test:

Collect data as for the chi-squared test, up to the point where you
start doing calculations.

That is, lay out like this (you could run down rather than across):


Roll:        1   2   3   4   5  ......
Count:      17  19  22  27  24  ......
Expected:   20  20  20  20  20  ......

Now add up your counts and expected counts, writing the partial 
totals as you go:

Roll:        1   2   3   4   5  ......
Count:      17  19  22  27  24  ......
Expected:   20  20  20  20  20  ......
Sum Count:  17  36  58  85  109 ......
Sum Exp:    20  40  60  80  100 ......

Now find the differences (without sign):

Sum Count:  17  36  58  85  109 ......
Sum Exp:    20  40  60  80  100 ......
Difference:  3   4   2   5    9 ......

The last difference will be zero, so you don't have to work out the
final column (I still would as a check).

Divide the largest difference (9 is the largest difference above, for
the calculations you can see) by the number of rolls you made altogether.

This is your test statistic. Let's call the value D.

You can look it up in most books on nonparametrics, which will
have tables. However, you would be better to use the table below, for 
reasons I'll discuss in a second.

You multiply D by the square root of the number of rolls
(equivalently, divide the largest difference by the square root of 
 the number of rolls), and compare with:

  5%   1%    d#
 1.08 1.35   4
 1.10 1.37   6     These values apply pretty well irrespective of
 1.11 1.38   8     the total number of rolls, but I would use at 
 1.12 1.39  10     least 10 rolls per face.
 1.12 1.40  12     Note also that these values come from simulation,
 1.14 1.42  20     and are hence not exact. This doesn't really matter.

and interpret as I suggested for the chi-square test.

You may find the following values in tables:

   5%    1%
  1.36  1.63  (irrespective of the number of sides on the die)

the reason these are larger is that they are based on the assumption that
the distribution the data are from is continuous (effectively, a *very*
large number of faces on the die would give these values). If you use the
textbook values, the test will be conservative (a fair die will reject 
slightly less often than the supposed 5% and 1% for the above table), due 
to the distribution of values being discrete (d20 generates only integers,
not anything between). 

So, for our above example, assume there are no larger differences
than 9, and that we made 400 rolls on a d20 (hence the expected number
in each cell is 20, as above). Then D is 9/400 = .0225, which if you
can get tables you'd look up. We made 400 rolls, so we could use the
table above: the square root of 400 is 20, so D x 20  (= 9/20) = .45.
This is much less than the 5% value, so there is little evidence that
the die is unfair.

There are tests which are probably even more appropriate, but these
two (chi-squared and K-S) will be enough for you to get a good idea
of any suspect dice.

Note:    If you suspect a die, and decide to test it, don't use the
         rolls that made you suspect it in the test. Generate a new
         set. e.g. if you are all recording your rolls as you play,
         and one players' results look funny, don't then test those
         recorded values - you have to generate a new set.

-------------------------------------------------------------------------

Testing a die based on the histogram of rolls:

The histogram approach can be turned into a test of sorts as follows:
After drawing the histogram, 2 lines can be drawn either side of the
expected (mean) result. If all histogram bars lie within the inner
lines, there is no strong evidence of bias. If any of the bars go outside
the outer lines, there is fairly clear evidence of bias. If one of the
bars lies between the inner and outer lines, then there is some (mild)
evidence of bias, but its is not really clear. You may wish to then
perform a further test on the probability for that individual side,
as described in the next section (Testing an individual face). If 
several of the bars lie between the inner and outer lines, we have a 
stronger indication of bias. 

Where do we draw the lines? 

I have worked out values for rolling 20 times for each face (as in the
other examples, 80 times for a d4, 400 times for a d20). The bars on the
histogram must actually go past these values. You could think of these
values as giving "Acceptable Ranges" (literally, 95% and 99% acceptance 
regions) for the histogram. A fair die will give histograms with one or 
more bars outside these ranges 5% and 1% of the time respectively 
(actually just under).


    Table for 20 
   throws per face          Approximate formula:

   d#    5%    1%           Let N be the total number of rolls.
    4   8-34  6-37          Let c be the number of faces on the die.
    6   9-33  7-36          Let e be the expected number of times
    8   9-33  8-35            each face will come up ( e = N/c ).
   10  10-32  8-35          Then the lines go at e +/- A x sqrt[e (c-1)/c],
   20  11-30  9-32          and the value for 'A' comes from the table below. 
                            All fractions should be rounded up. 
                            e.g. N=160, c=8, e=20 give: 
                                 5%: 20 +/- 2.73 sqrt (20 x 7/8) or 8-32
                                 1%: 20 +/- 3.22 sqrt (20 x 7/8) or 7-34

                              Table to go with 
                             approximate formula:

                                d#   5%  1%
                                 4 2.49 3.02       Due to being in the
                                 6 2.63 3.14       extreme tails of the
                                 8 2.73 3.22       distribution, combined
                                10 2.80 3.29       with slight asymmetry,
                                12 2.86 3.34       the ranges we get are
                                20 3.02 3.48       sometimes out a bit.
                                                   This is not a big deal.


Example: We throw a d20 400 times, and record the results 
 and from the table above, we draw the inner lines at 11 & 30,
 and the outer lines at 9 & 32, as well as a reference line at 20. 
 (in the histogram below, "." =1 count, ":" =2 counts. The horizontal
  lines aren't quite in the correct positions; they ought to be about 
  a quarter to half a character position lower.)

  Counts                         (34)                                
    35 +                        >   <                                
       |__________________________:__________________________________ (32)
       |__________________________:__________________________________ (30)
       |                          :                                  
       |                          :                                  
    25 +                          :                       .          
       |           .              :                       :  :       
       |__:__._____:_____:________:____________________:__:__:_______ (20)
       |  :  :     :  .  :  :     :     .              :  :  :       
       |  :  :  :  :  :  :  :     :  .  :     :        :  :  :  :  : 
    15 +  :  :  :  :  :  :  :  :  :  :  :  .  :  .     :  :  :  :  : 
       |  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  : 
       |--:--:--:--:--:--:--:--:--:--:--:--:--:--:--:--:--:--:--:--:- (11)
       |--:--:--:--:--:--:--:--:--:--:--:--:--:--:--:--:--:--:--:--:- ( 9)
       |  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  : 
     5 +  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  : 
       |  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  : 
       |  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  : 
       `--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+- 
face:     1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 

counts:  22 21 18 23 19 22 20 16 34 17 19 15 18 15 14 22 25 24 18 18
                                 ^^

One value (34) goes outside the 1% values (technically, you could say
goes outside a 99% acceptance region), so it seems our dice is biased.
More particularly, it rolls too many 9's. Whether this will affect a
game very much is another point. (If it had been "1" or "20", however,
perhaps this would result in a large effect on the game).



Testing an individual face:

When you suspect a particular face is coming up with the wrong 
frequency, or only wish to test a particular face (e.g. 20 on a
d20), throw as before, but you can compare with a narrower range,
as below. However, you can't use the data that made you suspect
the face in this test, you must generate a new set of rolls. 
E.g. If you a histogram and find the results for 7 look odd,
you can't use those numbers in this test. In that case the ranges
given above (for testing an entire histogram) are appropriate.

For the tables below, as for those above, you need to exceed the
range given to call the die 'probably biased'.

When you don't know the direction already (two-tailed test):
(If you are in doubt, use this table rather than the next one)

    Table for 20 
   throws per face          Approximate formula (reasonably accurate):

   d#    5%    1%           Let N be the total number of rolls.
    4  13-28 11-30          Let c be the number of faces on the die.
    6  12-28 10-31          Let e be the expected number of times
    8  12-29 10-31            each face will come up ( e = N/c ).
   10  12-29 10-32          Then the lines go at e +/- 1.96 x sqrt[e (c-1)/c],
   12  12-29 10-32            (5%) and for 1% at e +/- 2.58 x sqrt[e (c-1)/c].
   20  12-29 10-32          All fractions should be rounded up.
                            e.g. N=160, c=8, e=20 give:
                           5%: 20 +/- 1.96 sqrt(20 x 7/8) or 12-29 (rounded up)
                           1%: 20 +/- 2.58 sqrt(20 x 7/8) or 10-31 (rounded up)



When you think a particular face is coming up too often, or if you
think a particular face isn't coming up enough (one-tailed test):

    Table for 20
   throws per face          Approximate formula (reasonably accurate):

   d#    5%    1%           Let N be the total number of rolls.
    4  14/26 11/30          Let c be the number of faces on the die.
    6  13/27 11/30          Let e be the expected number of times
    8  13/27 11/30            each face will come up ( e = N/c ).
   10  13/27 11/30          Then the lines go at e +/- 1.65 x sqrt[e (c-1)/c],
   12  13/27 11/30            (5%) and for 1% at e +/- 2.33 x sqrt[e (c-1)/c].
   20  13/27 11/31          All fractions should be rounded up.
                            e.g. N=160, c=8, e=20 give:
                           5%: 20 +/- 1.65 sqrt(20 x 7/8) or 14-27 (rounded up)
                           1%: 20 +/- 2.33 sqrt(20 x 7/8) or 11-30 (rounded up)

These values are given as either/or i.e. since you have already specified
a particular direction, you will compare with only the higher values or the 
lower values, not both.


Example 1: You decide to test you new d20 to see if it rolls the
           correct number of 20's, but you don't believe it to
           be biased in a particular direction. You roll 400
           times, and 35 times you get a "20". From the "two-tailed"
           table above, you can see that's outside the outer (1%) range.
           It seems your d20 rolls too many 20's.

Example 2: Another player seems to be rolling a lot of 1's on her d4.
           You decide to test it whether 1 comes up too often. 
           You roll it 80 times, and get the following:

               1   2   3   4
              13  26  25  16

	   Since you decided to test if there were too many 1's, you
           can only see if the number of 1's exceeds 26, which it does
           not. You can't say, after generating the data "Oh, actually,
           perhaps it rolls too few ones", or "perhaps it rolls too many
           2's" without generating a new set of data for the new hypothesis.
           You must never base what you are testing for on what you spy
           in the set of data you use in the test. Our only conclusion
           on this test: the d4 doesn't roll too many 1's. 

	   You may like to then generate a new set of rolls to see if it
           rolls too few 1's.

---------------------------------------------------------------------
As further examples, here are the chi-squared test and Kolmogorov-Smirnov (KS)
tests performed on the same data.

chi squared test:

counts:  22 21 18 23 19 22 20 16 34 17 19 15 18 15 14 22 25 24 18 18
diff 
from 20:  2  1  2  3  1  2  0  4 14  3  1  5  2  5  6  2  5  6  2  2
diff^2:   4  1  4  9  1  4  0 16 196 9  1 25  4 25 36  4 25 36  4  4
sum diff^2: 408
chi-squared statistic: 408/20 = 20.4 
(far less than the 5% value of 30.14)


Kolmogorov-Smirnov test:

(Calculations have been run down the page because I can't fit 20 3 digit 
 numbers, with spaces and labels across an 80-column screen).

 counts sum expected diff
   22    22    20     2
   21    43    40     3
   18    61    60     1
   23    84    80     4
   19   103   100     3
   22   125   120     5
   20   145   140     5
   16   161   160     1
   34   195   180    15 <-- max diff, D, is 15. Well short of significance
   17   212   200    12     at the 5% level. e.g. calc D/sqrt(n) = 15/20
   19   231   220    11     or .75; where the 5% value from the simulations
   15   246   240     6     is 1.14
   18   264   260     4
   15   279   280     1
   14   293   300     7
   22   315   320     5
   25   340   340     0
   24   364   360     4
   18   382   380     2
   18   400   400     0

----------------------------------------------------------------------------


Conover, W.J. (1980): Practical nonparametric statistics,
2nd Ed., Wiley, New York.

Neave, H.R. and Worthington, P.L.B. (1988): Distribution-free tests,
Unwin Hyman, London.