💾 Archived View for gemini.spam.works › mirrors › textfiles › programming › crenshawtut.txt captured on 2020-10-31 at 14:48:55.

View Raw

More Information

-=-=-=-=-=-=-

TUTOR.ZIP

This file contains all of the installments of Jack Crenshaw's
tutorial on compiler construction, including the new Installment 15. 
The intended audience is those folks who are not computer scientists,
but who enjoy computing and have always wanted to know how compilers
work. A lot of compiler theory has been left out, but the practical
issues are covered. By the time you have completed the series, you
should be able to design and build your own working compiler. It will
not be the world's best, nor will it put out incredibly tight code.
Your product will probably never put Borland or MicroSoft out of
business.  But it will work, and it will be yours.

A word about the file format: The files were originally created using
Borland's DOS editor, Sprint.  Sprint could write to a text file only
if you formatted the file to go to the selected printer.  I used the
most common printer I could think of, the Epson MX-80, but even then
the files ended up with printer control sequences at the beginning
and end of each page.

To bring the files up to date and get myself positioned to continue
the series, I recently (1994) converted all the files to work with
Microsoft Word for Windows.  Unlike Sprint, Word allows you to write
the file as a DOS text file.  Unfortunately, this gave me a new
problem, because when Word is writing to a text file, it doesn't
write hard page breaks or page numbers.  In other words, in six years
we've gone from a file with page breaks and page numbers, but
embedded escape sequences, to files with no embedded escape sequences
but no page breaks or page numbers.  Isn't progress wonderful?

Of course, it's possible for me to insert the page numbers as
straight text, rather than asking the editor to do it for me.  But
since Word won't allow me to write page breaks to the file, we would
end up with files with page numbers that may or may not fall at the
ends of the pages, depending on your editor and your printer.  It
seems to me that almost every file I've ever downloaded from
CompuServe or BBS's that had such page numbering was incompatible
with my printer, and gave me pages that were one line short or one
line long, with the page numbers consequently walking up the page.  

So perhaps this new format is, after all, the safest one for general
distribution.  The files as they exist will look just fine if read
into any text editor capable of reading DOS text files.  Since most
editors these days include rather sophisticated word processing
capabilities, you should be able to get your editor to paginate for
you, prior to printing.

I hope you like the tutorials.  Much thought went into them.


									Jack W. Crenshaw

								CompuServe 72325,1327



























                            LET'S BUILD A COMPILER!

                                       By

                            Jack W. Crenshaw, Ph.D.

                                  24 July 1988


                              Part I: INTRODUCTION





INTRODUCTION


This series of articles is a tutorial on the theory  and practice
of  developing language parsers and compilers.    Before  we  are
finished,  we  will  have  covered  every   aspect   of  compiler
construction, designed a new programming  language,  and  built a
working compiler.

Though I am not a computer scientist by education (my Ph.D. is in
a different  field, Physics), I have been interested in compilers
for many years.  I have  bought  and tried to digest the contents
of virtually every  book  on  the  subject ever written.  I don't
mind  telling you that it was slow going.    Compiler  texts  are
written for Computer  Science  majors, and are tough sledding for
the rest of us.  But over the years a bit of it began to seep in.
What really caused it to jell was when I began  to  branch off on
my own and begin to try things on my own computer.  Now I plan to
share with you what I have  learned.    At the end of this series
you will by no means be  a  computer scientist, nor will you know
all the esoterics of  compiler  theory.    I intend to completely
ignore the more theoretical  aspects  of  the  subject.  What you
_WILL_ know is all  the  practical aspects that one needs to know
to build a working system.

This is a "learn-by-doing" series.  In the course of the series I
will be performing  experiments  on  a  computer.    You  will be
expected to follow along,  repeating  the  experiments that I do,
and  performing  some  on your own.  I will be using Turbo Pascal
4.0 on a PC  clone.   I will periodically insert examples written
in TP.  These will be executable code, which you will be expected
to copy into your own computer and run.  If you don't have a copy
of  Turbo,  you  will be severely limited in how well you will be
able to follow what's going on.  If you don't have a copy, I urge
you to get one.  After  all,  it's an excellent product, good for
many other uses!

Some articles on compilers show you examples, or show you  (as in
the case of Small-C) a finished product, which you can  then copy
and  use without a whole lot of understanding of how it works.  I
hope to do much more  than  that.    I  hope to teach you HOW the
things get done,  so that you can go off on your own and not only
reproduce what I have done, but improve on it.
                              
This is admittedly an ambitious undertaking, and it won't be done
in  one page.  I expect to do it in the course  of  a  number  of
articles.    Each  article will cover a single aspect of compiler
theory,  and  will  pretty  much  stand  alone.   If  all  you're
interested in at a given time is one  aspect,  then  you  need to
look only at that one article.  Each article will be  uploaded as
it  is complete, so you will have to wait for the last one before
you can consider yourself finished.  Please be patient.



The average text on  compiler  theory covers a lot of ground that
we won't be covering here.  The typical sequence is:

 o An introductory chapter describing what a compiler is.

 o A chapter or two on syntax equations, using Backus-Naur Form
   (BNF).

 o A chapter or two on lexical scanning, with emphasis on
   deterministic and non-deterministic finite automata.

 o Several chapters on parsing theory, beginning with top-down
   recursive descent, and ending with LALR parsers.

 o A chapter on intermediate languages, with emphasis on P-code
   and similar reverse polish representations.

 o Many chapters on alternative ways to handle subroutines and
   parameter passing, type declarations, and such.

 o A chapter toward the end on code generation, usually for some
   imaginary CPU with a simple instruction set.  Most readers
   (and in fact, most college classes) never make it this far.

 o A final chapter or two on optimization. This chapter often
   goes unread, too.


I'll  be taking a much different approach in  this  series.    To
begin  with,  I  won't dwell long on options.  I'll be giving you
_A_ way that works.  If you want  to  explore  options,  well and
good ...  I  encourage  you  to do so ... but I'll be sticking to
what I know.   I also will skip over most of the theory that puts
people  to  sleep.  Don't get me  wrong:  I  don't  belittle  the
theory, and it's vitally important  when it comes to dealing with
the more tricky  parts  of  a  given  language.  But I believe in
putting first things first.    Here we'll be dealing with the 95%
of compiler techniques that don't need a lot of theory to handle.

I  also  will  discuss only one approach  to  parsing:  top-down,
recursive descent parsing, which is the  _ONLY_  technique that's
at  all   amenable  to  hand-crafting  a  compiler.    The  other
approaches are only useful if you have a tool like YACC, and also
don't care how much memory space the final product uses.
                              
I  also take a page from the work of Ron Cain, the author of  the
original Small C.  Whereas almost all other compiler authors have
historically  used  an  intermediate  language  like  P-code  and
divided  the  compiler  into two parts (a front end that produces
P-code,  and   a  back  end  that  processes  P-code  to  produce
executable   object  code),  Ron  showed  us   that   it   is   a
straightforward  matter  to  make  a  compiler  directly  produce
executable  object  code,  in  the  form  of  assembler  language
statements.  The code will _NOT_ be the world's tightest code ...
producing optimized code is  a  much  more  difficult job. But it
will work, and work reasonably well.  Just so that I  don't leave
you with the impression that our end product will be worthless, I
_DO_ intend to show you how  to  "soup up" the compiler with some
optimization.



Finally, I'll be  using  some  tricks  that I've found to be most
helpful in letting  me  understand what's going on without wading
through a lot of boiler plate.  Chief among these  is  the use of
single-character tokens, with no embedded spaces,  for  the early
design work.  I figure that  if  I  can get a parser to recognize
and deal with I-T-L, I can  get  it  to do the same with IF-THEN-
ELSE.  And I can.  In the second "lesson,"   I'll  show  you just
how easy it  is  to  extend  a  simple parser to handle tokens of
arbitrary length.  As another  trick,  I  completely  ignore file
I/O, figuring that  if  I  can  read source from the keyboard and
output object to the screen, I can also do it from/to disk files.
Experience  has  proven  that  once  a   translator   is  working
correctly, it's a  straightforward  matter to redirect the I/O to
files.    The last trick is that I make no attempt  to  do  error
correction/recovery.   The   programs   we'll  be  building  will
RECOGNIZE errors, and will not CRASH, but they  will  simply stop
on the first error ... just like good ol' Turbo does.  There will
be  other tricks that you'll see as you go. Most of them can't be
found in any compiler textbook, but they work.

A word about style and efficiency.    As  you will see, I tend to
write programs in  _VERY_  small, easily understood pieces.  None
of the procedures we'll  be  working with will be more than about
15-20 lines long.  I'm a fervent devotee  of  the  KISS  (Keep It
Simple, Sidney) school of software development.  I  try  to never
do something tricky or  complex,  when  something simple will do.
Inefficient?  Perhaps, but you'll like the  results.    As  Brian
Kernighan has said,  FIRST  make  it  run, THEN make it run fast.
If, later on,  you want to go back and tighten up the code in one
of  our products, you'll be able to do so, since the code will be
quite understandable. If you  do  so, however, I urge you to wait
until the program is doing everything you want it to.

I  also  have  a  tendency  to  delay  building  a module until I
discover that I need  it.    Trying  to anticipate every possible
future contingency can  drive  you  crazy,  and  you'll generally
guess wrong anyway.    In  this  modern day of screen editors and
fast compilers, I don't hesitate to change a module when I feel I
need a more powerful one.  Until then,  I'll  write  only  what I
need.

One final caveat: One of the principles we'll be sticking to here
is that we don't  fool  around with P-code or imaginary CPUs, but
that we will start out on day one  producing  working, executable
object code, at least in the form of  assembler  language source.
However, you may not  like  my  choice  of assembler language ...
it's 68000 code, which is what works on my system (under SK*DOS).
I  think  you'll  find, though, that the translation to any other
CPU such as the 80x86 will  be  quite obvious, though, so I don't
see  a problem here.  In fact, I hope someone out there who knows
the '86 language better than I do will offer  us  the  equivalent
object code fragments as we need them.


THE CRADLE

Every program needs some boiler  plate  ...  I/O  routines, error
message routines, etc.   The  programs we develop here will be no
exceptions.    I've  tried to hold  this  stuff  to  an  absolute
minimum, however, so that we  can  concentrate  on  the important
stuff without losing it  among  the  trees.  The code given below
represents about the minimum that we need to  get  anything done.
It consists of some I/O routines, an error-handling routine and a
skeleton, null main program.   I  call  it  our  cradle.    As we
develop other routines, we'll add them to the cradle, and add the
calls to them as we  need to.  Make a copy of the cradle and save
it, because we'll be using it more than once.

There are many different ways to organize the scanning activities
of  a  parser.   In Unix systems, authors tend to  use  getc  and
ungetc.  I've had very good luck with the  approach  shown  here,
which is to use  a  single, global, lookahead character.  Part of
the initialization procedure  (the  only part, so far!) serves to
"prime  the  pump"  by reading the first character from the input
stream.  No other special  techniques are required with Turbo 4.0
... each successive call to  GetChar will read the next character
in the stream.


{--------------------------------------------------------------}
program Cradle;

{--------------------------------------------------------------}
{ Constant Declarations }

const TAB = ^I;

{--------------------------------------------------------------}
{ Variable Declarations }

var Look: char;              { Lookahead Character }
                              
{--------------------------------------------------------------}
{ Read New Character From Input Stream }

procedure GetChar;
begin
   Read(Look);
end;

{--------------------------------------------------------------}
{ Report an Error }

procedure Error(s: string);
begin
   WriteLn;
   WriteLn(^G, 'Error: ', s, '.');
end;


{--------------------------------------------------------------}
{ Report Error and Halt }

procedure Abort(s: string);
begin
   Error(s);
   Halt;
end;


{--------------------------------------------------------------}
{ Report What Was Expected }

procedure Expected(s: string);
begin
   Abort(s + ' Expected');
end;

{--------------------------------------------------------------}
{ Match a Specific Input Character }

procedure Match(x: char);
begin
   if Look = x then GetChar
   else Expected('''' + x + '''');
end;


{--------------------------------------------------------------}
{ Recognize an Alpha Character }

function IsAlpha(c: char): boolean;
begin
   IsAlpha := upcase(c) in ['A'..'Z'];
end;
                              

{--------------------------------------------------------------}

{ Recognize a Decimal Digit }

function IsDigit(c: char): boolean;
begin
   IsDigit := c in ['0'..'9'];
end;


{--------------------------------------------------------------}
{ Get an Identifier }

function GetName: char;
begin
   if not IsAlpha(Look) then Expected('Name');
   GetName := UpCase(Look);
   GetChar;
end;


{--------------------------------------------------------------}
{ Get a Number }

function GetNum: char;
begin
   if not IsDigit(Look) then Expected('Integer');
   GetNum := Look;
   GetChar;
end;


{--------------------------------------------------------------}
{ Output a String with Tab }

procedure Emit(s: string);
begin
   Write(TAB, s);
end;




{--------------------------------------------------------------}
{ Output a String with Tab and CRLF }

procedure EmitLn(s: string);
begin
   Emit(s);
   WriteLn;
end;

{--------------------------------------------------------------}
{ Initialize }

procedure Init;
begin
   GetChar;
end;


{--------------------------------------------------------------}
{ Main Program }

begin
   Init;
end.
{--------------------------------------------------------------}


That's it for this introduction.  Copy the code above into TP and
compile it.  Make sure that it compiles and runs  correctly. Then
proceed to the first lesson, which is on expression parsing.




































                     LET'S BUILD A COMPILER!

                                By

                     Jack W. Crenshaw, Ph.D.

                           24 July 1988


                   Part II: EXPRESSION PARSING





GETTING STARTED

If you've read the introduction document to this series, you will
already know what  we're  about.    You will also have copied the
cradle software  into your Turbo Pascal system, and have compiled
it.  So you should be ready to go.


The purpose of this article is for us to learn  how  to parse and
translate mathematical expressions.  What we would like to see as
output is a series of assembler-language statements  that perform
the desired actions.    For purposes of definition, an expression
is the right-hand side of an equation, as in

               x = 2*y + 3/(4*z)

In the early going, I'll be taking things in _VERY_  small steps.
That's  so  that  the beginners among you won't get totally lost.
There are also  some  very  good  lessons to be learned early on,
that will serve us well later.  For the more experienced readers:
bear with me.  We'll get rolling soon enough.

SINGLE DIGITS

In keeping with the whole theme of this series (KISS, remember?),
let's start with the absolutely most simple case we can think of.
That, to me, is an expression consisting of a single digit.

Before starting to code, make sure you have a  baseline  copy  of
the  "cradle" that I gave last time.  We'll be using it again for
other experiments.  Then add this code:


{---------------------------------------------------------------}
{ Parse and Translate a Math Expression }

procedure Expression;
begin
   EmitLn('MOVE #' + GetNum + ',D0')
end;
{---------------------------------------------------------------}


And add the  line  "Expression;"  to  the main program so that it
reads:
                              

{---------------------------------------------------------------}
begin
   Init;
   Expression;
end.
{---------------------------------------------------------------}


Now  run  the  program. Try any single-digit number as input. You
should get a single line of assembler-language output.    Now try
any  other character as input, and you'll  see  that  the  parser
properly reports an error.


CONGRATULATIONS! You have just written a working translator!

OK, I grant you that it's pretty limited. But don't brush  it off
too  lightly.  This little "compiler" does,  on  a  very  limited
scale,  exactly  what  any larger compiler does:    it  correctly
recognizes legal  statements in the input "language" that we have
defined for it, and  it  produces  correct,  executable assembler
code,  suitable  for  assembling  into  object  format.  Just  as
importantly,  it correctly  recognizes  statements  that  are NOT
legal, and gives a  meaningful  error message.  Who could ask for
more?  As we expand our  parser,  we'd better make sure those two
characteristics always hold true.

There  are  some  other  features  of  this  tiny  program  worth
mentioning.    First,  you  can  see that we don't separate  code
generation from parsing ...  as  soon as the parser knows what we
want  done, it generates the object code directly.    In  a  real
compiler, of course, the reads in GetChar would be  from  a  disk
file, and the writes to another  disk  file, but this way is much
easier to deal with while we're experimenting.

Also note that an expression must leave a result somewhere.  I've
chosen the  68000  register  DO.    I  could have made some other
choices, but this one makes sense.


BINARY EXPRESSIONS

Now that we have that under our belt,  let's  branch  out  a bit.
Admittedly, an "expression" consisting of only  one  character is
not going to meet our needs for long, so let's see what we can do
to extend it. Suppose we want to handle expressions of the form:

                         1+2
     or                  4-3
     or, in general, <term> +/- <term>

(That's a bit of Backus-Naur Form, or BNF.)
                              
To do this we need a procedure that recognizes a term  and leaves
its   result   somewhere,  and  another   that   recognizes   and
distinguishes  between   a  '+'  and  a  '-'  and  generates  the
appropriate code.  But if Expression is going to leave its result
in DO, where should Term leave its result?    Answer:    the same
place.  We're  going  to  have  to  save the first result of Term
somewhere before we get the next one.

OK, basically what we want to  do  is have procedure Term do what
Expression was doing before.  So just RENAME procedure Expression
as Term, and enter the following new version of Expression:




{---------------------------------------------------------------}
{ Parse and Translate an Expression }

procedure Expression;
begin
   Term;
   EmitLn('MOVE D0,D1');
   case Look of
    '+': Add;
    '-': Subtract;
   else Expected('Addop');
   end;
end;
{--------------------------------------------------------------}


Next, just above Expression enter these two procedures:


{--------------------------------------------------------------}
{ Recognize and Translate an Add }

procedure Add;
begin
   Match('+');
   Term;
   EmitLn('ADD D1,D0');
end;


{-------------------------------------------------------------}
{ Recognize and Translate a Subtract }

procedure Subtract;
begin
   Match('-');
   Term;
   EmitLn('SUB D1,D0');
end;
{-------------------------------------------------------------}
                              

When you're finished with that,  the order of the routines should
be:

 o Term (The OLD Expression)
 o Add
 o Subtract
 o Expression

Now run the program.  Try any combination you can think of of two
single digits,  separated  by  a  '+' or a '-'.  You should get a
series of four assembler-language instructions out  of  each run.
Now  try  some  expressions with deliberate errors in them.  Does
the parser catch the errors?

Take  a  look  at the object  code  generated.    There  are  two
observations we can make.  First, the code generated is  NOT what
we would write ourselves.  The sequence

        MOVE #n,D0
        MOVE D0,D1

is inefficient.  If we were  writing  this code by hand, we would
probably just load the data directly to D1.

There is a  message  here:  code  generated by our parser is less
efficient  than the code we would write by hand.  Get used to it.
That's going to be true throughout this series.  It's true of all
compilers to some extent.  Computer scientists have devoted whole
lifetimes to the issue of code optimization, and there are indeed
things that can be done to improve the quality  of  code  output.
Some compilers do quite well, but  there  is a heavy price to pay
in complexity, and it's  a  losing  battle  anyway ... there will
probably never come a time when  a  good  assembler-language pro-
grammer can't out-program a compiler.    Before  this  session is
over, I'll briefly mention some ways that we can do a  little op-
timization,  just  to  show you that we can indeed improve things
without too much trouble.  But remember, we're here to learn, not
to see how tight we can make  the  object  code.    For  now, and
really throughout  this  series  of  articles,  we'll  studiously
ignore optimization and  concentrate  on  getting  out  code that
works.

Speaking of which: ours DOESN'T!  The code is _WRONG_!  As things
are working  now, the subtraction process subtracts D1 (which has
the FIRST argument in it) from D0 (which has the second).  That's
the wrong way, so we end up with the wrong  sign  for the result.
So let's fix up procedure Subtract with a  sign-changer,  so that
it reads


{-------------------------------------------------------------}
{ Recognize and Translate a Subtract }

procedure Subtract;
begin
   Match('-');
   Term;
   EmitLn('SUB D1,D0');
   EmitLn('NEG D0');
end;
{-------------------------------------------------------------}


Now  our  code  is even less efficient, but at least it gives the
right answer!  Unfortunately, the  rules that give the meaning of
math expressions require that the terms in an expression come out
in an inconvenient  order  for  us.    Again, this is just one of
those facts of life you learn to live with.   This  one will come
back to haunt us when we get to division.

OK,  at this point we have a parser that can recognize the sum or
difference of two digits.    Earlier,  we  could only recognize a
single digit.  But  real  expressions can have either form (or an
infinity of others).  For kicks, go back and run the program with
the single input line '1'.

Didn't work, did it?   And  why  should  it?    We  just finished
telling  our  parser  that the only kinds of expressions that are
legal are those  with  two  terms.    We  must  rewrite procedure
Expression to be a lot more broadminded, and this is where things
start to take the shape of a real parser.




GENERAL EXPRESSIONS

In the  REAL  world,  an  expression  can  consist of one or more
terms, separated  by  "addops"  ('+'  or  '-').   In BNF, this is
written

          <expression> ::= <term> [<addop> <term>]*


We  can  accomodate  this definition of an  expression  with  the
addition of a simple loop to procedure Expression:


{---------------------------------------------------------------}
{ Parse and Translate an Expression }

procedure Expression;
begin
   Term;
   while Look in ['+', '-'] do begin
      EmitLn('MOVE D0,D1');
      case Look of
       '+': Add;
       '-': Subtract;
      else Expected('Addop');
      end;
   end;
end;
{--------------------------------------------------------------}


NOW we're getting somewhere!   This version handles any number of
terms, and it only cost us two extra lines of code.  As we go on,
you'll discover that this is characteristic  of  top-down parsers
... it only takes a few lines of code to accomodate extensions to
the  language.    That's  what  makes  our  incremental  approach
possible.  Notice, too, how well the code of procedure Expression
matches the BNF definition.   That, too, is characteristic of the
method.  As you get proficient in the approach, you'll  find that
you can turn BNF into parser code just about as  fast  as you can
type!

OK, compile the new version of our parser, and give it a try.  As
usual,  verify  that  the  "compiler"   can   handle   any  legal
expression,  and  will  give a meaningful error  message  for  an
illegal one.  Neat, eh?  You might note that in our test version,
any error message comes  out  sort of buried in whatever code had
already been  generated. But remember, that's just because we are
using  the  CRT  as  our  "output  file"  for   this   series  of
experiments.  In a production version, the two  outputs  would be
separated ... one to the output file, and one to the screen.


USING THE STACK

At  this  point  I'm going to  violate  my  rule  that  we  don't
introduce any complexity until  it's  absolutely  necessary, long
enough to point out a problem with the code we're generating.  As
things stand now, the parser  uses D0 for the "primary" register,
and D1 as  a place to store the partial sum.  That works fine for
now,  because  as  long as we deal with only the "addops" '+' and
'-', any new term can be added in as soon as it is found.  But in
general that isn't true.  Consider, for example, the expression

               1+(2-(3+(4-5)))
                              
If we put the '1' in D1, where  do  we  put  the  '2'?    Since a
general expression can have any degree of complexity, we're going
to run out of registers fast!

Fortunately,  there's  a  simple  solution.    Like  every modern
microprocessor, the 68000 has a stack, which is the perfect place
to save a variable number of items. So instead of moving the term
in D0 to  D1, let's just push it onto the stack.  For the benefit
of  those unfamiliar with 68000 assembler  language,  a  push  is
written

               -(SP)

and a pop,     (SP)+ .


So let's change the EmitLn in Expression to read:

               EmitLn('MOVE D0,-(SP)');

and the two lines in Add and Subtract to

               EmitLn('ADD (SP)+,D0')

and            EmitLn('SUB (SP)+,D0'),

respectively.  Now try the parser again and make sure  we haven't
broken it.

Once again, the generated code is less efficient than before, but
it's a necessary step, as you'll see.


MULTIPLICATION AND DIVISION

Now let's get down to some REALLY serious business.  As  you  all
know,  there  are  other  math   operators   than   "addops"  ...
expressions can also have  multiply  and  divide operations.  You
also  know  that  there  is  an implied operator  PRECEDENCE,  or
hierarchy, associated with expressions, so that in  an expression
like

                    2 + 3 * 4,

we know that we're supposed to multiply FIRST, then  add.    (See
why we needed the stack?)

In the early days of compiler technology, people used some rather
complex techniques to insure that the  operator  precedence rules
were  obeyed.    It turns out,  though,  that  none  of  this  is
necessary ... the rules can be accommodated quite  nicely  by our
top-down  parsing technique.  Up till now,  the  only  form  that
we've considered for a term is that of a  single  decimal  digit.

More generally, we  can  define  a  term as a PRODUCT of FACTORS;
i.e.,

          <term> ::= <factor>  [ <mulop> <factor ]*

What  is  a factor?  For now, it's what a term used to be  ...  a
single digit.

Notice the symmetry: a  term  has the same form as an expression.
As a matter of fact, we can  add  to  our  parser  with  a little
judicious  copying and renaming.  But  to  avoid  confusion,  the
listing below is the complete set of parsing routines.  (Note the
way we handle the reversal of operands in Divide.)


{---------------------------------------------------------------}
{ Parse and Translate a Math Factor }

procedure Factor;
begin
   EmitLn('MOVE #' + GetNum + ',D0')
end;


{--------------------------------------------------------------}
{ Recognize and Translate a Multiply }

procedure Multiply;
begin
   Match('*');
   Factor;
   EmitLn('MULS (SP)+,D0');
end;


{-------------------------------------------------------------}
{ Recognize and Translate a Divide }

procedure Divide;
begin
   Match('/');
   Factor;
   EmitLn('MOVE (SP)+,D1');
   EmitLn('DIVS D1,D0');
end;


{---------------------------------------------------------------}
{ Parse and Translate a Math Term }

procedure Term;
begin
   Factor;
   while Look in ['*', '/'] do begin
      EmitLn('MOVE D0,-(SP)');
      case Look of
       '*': Multiply;
       '/': Divide;
      else Expected('Mulop');
      end;
   end;
end;




{--------------------------------------------------------------}
{ Recognize and Translate an Add }

procedure Add;
begin
   Match('+');
   Term;
   EmitLn('ADD (SP)+,D0');
end;


{-------------------------------------------------------------}
{ Recognize and Translate a Subtract }

procedure Subtract;
begin
   Match('-');
   Term;
   EmitLn('SUB (SP)+,D0');
   EmitLn('NEG D0');
end;


{---------------------------------------------------------------}
{ Parse and Translate an Expression }

procedure Expression;
begin
   Term;
   while Look in ['+', '-'] do begin
      EmitLn('MOVE D0,-(SP)');
      case Look of
       '+': Add;
       '-': Subtract;
      else Expected('Addop');
      end;
   end;
end;
{--------------------------------------------------------------}


Hot dog!  A NEARLY functional parser/translator, in only 55 lines
of Pascal!  The output is starting to look really useful,  if you
continue to overlook the inefficiency,  which  I  hope  you will.
Remember, we're not trying to produce tight code here.


PARENTHESES

We  can  wrap  up this part of the parser with  the  addition  of
parentheses with  math expressions.  As you know, parentheses are
a  mechanism to force a desired operator  precedence.    So,  for
example, in the expression

               2*(3+4) ,

the parentheses force the addition  before  the  multiply.   Much
more importantly, though, parentheses  give  us  a  mechanism for
defining expressions of any degree of complexity, as in

               (1+2)/((3+4)+(5-6))

The  key  to  incorporating  parentheses  into our parser  is  to
realize that  no matter how complicated an expression enclosed by
parentheses may be,  to  the  rest  of  the world it looks like a
simple factor.  That is, one of the forms for a factor is:

          <factor> ::= (<expression>)

This is where the recursion comes in. An expression can contain a
factor which contains another expression which contains a factor,
etc., ad infinitum.

Complicated or not, we can take care of this by adding just a few
lines of Pascal to procedure Factor:
                             

{---------------------------------------------------------------}
{ Parse and Translate a Math Factor }

procedure Expression; Forward;

procedure Factor;
begin
   if Look = '(' then begin
      Match('(');
      Expression;
      Match(')');
      end
   else
      EmitLn('MOVE #' + GetNum + ',D0');
end;
{--------------------------------------------------------------}


Note again how easily we can extend the parser, and how  well the
Pascal code matches the BNF syntax.

As usual, compile the new version and make sure that it correctly
parses  legal sentences, and flags illegal  ones  with  an  error
message.


UNARY MINUS

At  this  point,  we have a parser that can handle just about any
expression, right?  OK, try this input sentence:

                         -1

WOOPS!  It doesn't work, does it?   Procedure  Expression expects
everything to start with an integer, so it coughs up  the leading
minus  sign.  You'll find that +3 won't  work  either,  nor  will
something like

                    -(3-2) .

There  are  a  couple of ways to fix the problem.    The  easiest
(although not necessarily the best)  way is to stick an imaginary
leading zero in  front  of  expressions  of this type, so that -3
becomes 0-3.  We can easily patch this into our  existing version
of Expression:



{---------------------------------------------------------------}
{ Parse and Translate an Expression }

procedure Expression;
begin
   if IsAddop(Look) then
      EmitLn('CLR D0')
   else
      Term;
   while IsAddop(Look) do begin
      EmitLn('MOVE D0,-(SP)');
      case Look of
       '+': Add;
       '-': Subtract;
      else Expected('Addop');
      end;
   end;
end;
{--------------------------------------------------------------}
 

I TOLD you that making changes  was  easy!   This time it cost us
only  three  new lines of Pascal.   Note  the  new  reference  to
function IsAddop.  Since the test for an addop appeared  twice, I
chose  to  embed  it in the new function.  The  form  of  IsAddop
should be apparent from that for IsAlpha.  Here it is:


{--------------------------------------------------------------}
{ Recognize an Addop }

function IsAddop(c: char): boolean;
begin
   IsAddop := c in ['+', '-'];
end;
{--------------------------------------------------------------}


OK, make these changes to the program and recompile.   You should
also include IsAddop in your baseline copy of the cradle.   We'll
be needing  it  again  later.   Now try the input -1 again.  Wow!
The efficiency of the code is  pretty  poor ... six lines of code
just for loading a simple constant ... but at least it's correct.
Remember, we're not trying to replace Turbo Pascal here.

At this point we're just about finished with the structure of our
expression parser.   This version of the program should correctly
parse and compile just about any expression you care to  throw at
it.    It's still limited in that  we  can  only  handle  factors
involving single decimal digits.    But I hope that by now you're
starting  to  get  the  message  that we can  accomodate  further
extensions  with  just  some  minor  changes to the parser.   You
probably won't be  surprised  to  hear  that a variable or even a
function call is just another kind of a factor.
                             
In  the next session, I'll show you just how easy it is to extend
our parser to take care of  these  things too, and I'll also show
you just  how easily we can accomodate multicharacter numbers and
variable names.  So you see,  we're  not  far at all from a truly
useful parser.




A WORD ABOUT OPTIMIZATION

Earlier in this session, I promised to give you some hints  as to
how we can improve the quality of the generated code.  As I said,
the  production of tight code is not the  main  purpose  of  this
series of articles.  But you need to at least know that we aren't
just  wasting our time here ... that we  can  indeed  modify  the
parser further to  make  it produce better code, without throwing
away everything we've done to date.  As usual, it turns  out that
SOME optimization is not that difficult to do ... it simply takes
some extra code in the parser.

There are two basic approaches we can take:

  o Try to fix up the code after it's generated

    This is  the concept of "peephole" optimization.  The general
    idea it that we  know  what  combinations of instructions the
    compiler  is  going  to generate, and we also know which ones
    are pretty bad (such as the code for -1, above).    So all we
    do  is  to   scan   the  produced  code,  looking  for  those
    combinations, and replacing  them  by better ones.  It's sort
    of   a   macro   expansion,   in   reverse,   and   a  fairly
    straightforward  exercise  in   pattern-matching.   The  only
    complication,  really, is that there may be  a  LOT  of  such
    combinations to look for.  It's called  peephole optimization
    simply because it only looks at a small group of instructions
    at a time.  Peephole  optimization can have a dramatic effect
    on  the  quality  of the code,  with  little  change  to  the
    structure of the compiler  itself.   There is a price to pay,
    though,  in  both  the  speed,   size, and complexity of  the
    compiler.  Looking for all those combinations calls for a lot
    of IF tests, each one of which is a source of error.  And, of
    course, it takes time.

     In  the  classical  implementation  of a peephole optimizer,
    it's done as a second pass to the compiler.  The  output code
    is  written  to  disk,  and  then  the  optimizer  reads  and
    processes the disk file again.  As a matter of fact,  you can
    see that the optimizer could  even be a separate PROGRAM from
    the compiler proper.  Since the optimizer only  looks  at the
    code through a  small  "window"  of  instructions  (hence the
    name), a better implementation would be to simply buffer up a
    few lines of output, and scan the buffer after each EmitLn.

  o Try to generate better code in the first place
                             
    This approach calls for us to look for  special  cases BEFORE
    we Emit them.  As a trivial example,  we  should  be  able to
    identify a constant zero,  and  Emit a CLR instead of a load,
    or even do nothing at all, as in an add of zero, for example.
    Closer to home, if we had chosen to recognize the unary minus
    in Factor  instead of in Expression, we could treat constants
    like -1 as ordinary constants,  rather  then  generating them
    from  positive  ones.   None of these things are difficult to
    deal with ... they only add extra tests in the code, which is
    why  I  haven't  included them in our program.  The way I see
    it, once we get to the point that we have a working compiler,
    generating useful code  that  executes, we can always go back
    and tweak the thing to tighten up the code produced.   That's
    why there are Release 2.0's in the world.

There IS one more type  of  optimization  worth  mentioning, that
seems to promise pretty tight code without too much hassle.  It's
my "invention" in the  sense  that I haven't seen it suggested in
print anywhere, though I have  no  illusions  that  it's original
with me.

This  is to avoid such a heavy use of the stack, by making better
use of the CPU registers.  Remember back when we were  doing only
addition  and  subtraction,  that we used registers  D0  and  D1,
rather than the stack?  It worked, because with  only  those  two
operations, the "stack" never needs more than two entries.

Well,  the 68000 has eight data registers.  Why not use them as a
privately managed stack?  The key is to recognize  that,  at  any
point in its processing,  the  parser KNOWS how many items are on
the  stack, so it can indeed manage it properly.  We can define a
private "stack pointer" that keeps  track  of  which  stack level
we're at, and addresses the  corresponding  register.   Procedure
Factor,  for  example,  would  not  cause data to be loaded  into
register  D0,  but   into  whatever  the  current  "top-of-stack"
register happened to be.

What we're doing in effect is to replace the CPU's RAM stack with
a  locally  managed  stack  made  up  of  registers.    For  most
expressions, the stack level  will  never  exceed eight, so we'll
get pretty good code out.  Of course, we also  have  to deal with
those  odd cases where the stack level  DOES  exceed  eight,  but
that's no problem  either.    We  simply let the stack spill over
into the CPU  stack.    For  levels  beyond eight, the code is no
worse  than  what  we're generating now, and for levels less than
eight, it's considerably better.

For the record, I  have  implemented  this  concept, just to make
sure  it  works  before  I  mentioned  it to you.  It does.    In
practice, it turns out that you can't really use all eight levels
... you need at least one register free to  reverse  the  operand
order for division  (sure  wish  the  68000 had an XTHL, like the
8080!).  For expressions  that  include  function calls, we would
also need a register reserved for them. Still, there  is  a  nice
improvement in code size for most expressions.

So, you see, getting  better  code  isn't  that difficult, but it
does add complexity to the our translator ...  complexity  we can
do without at this point.  For that reason,  I  STRONGLY  suggest
that we continue to ignore efficiency issues for the rest of this
series,  secure  in  the knowledge that we can indeed improve the
code quality without throwing away what we've done.

Next lesson, I'll show you how to deal with variables factors and
function calls.  I'll also show you just how easy it is to handle
multicharacter tokens and embedded white space.


 































                     LET'S BUILD A COMPILER!

                                By

                     Jack W. Crenshaw, Ph.D.

                            4 Aug 1988


                    Part III: MORE EXPRESSIONS





INTRODUCTION

In the last installment, we examined the techniques used to parse
and  translate a general math expression.  We  ended  up  with  a
simple parser that  could handle arbitrarily complex expressions,
with two restrictions:

  o No variables were allowed, only numeric factors

  o The numeric factors were limited to single digits

In this installment, we'll get  rid of those restrictions.  We'll
also extend what  we've  done  to  include  assignment statements
function  calls  and.    Remember,   though,   that   the  second
restriction was  mainly self-imposed  ... a choice of convenience
on our part, to make life easier and to let us concentrate on the
fundamental concepts.    As  you'll  see  in  a bit, it's an easy
restriction to get rid of, so don't get  too  hung  up  about it.
We'll use the trick when it serves us to do so, confident that we
can discard it when we're ready to.


VARIABLES

Most expressions  that we see in practice involve variables, such
as

               b * b + 4 * a * c

No  parser is much good without being able  to  deal  with  them.
Fortunately, it's also quite easy to do.

Remember that in our parser as it currently stands, there are two
kinds of  factors  allowed:  integer  constants  and  expressions
within parentheses.  In BNF notation,

     <factor> ::= <number> | (<expression>)

The '|' stands for "or", meaning of course that either form  is a
legal form for a factor.   Remember,  too, that we had no trouble
knowing which was which  ...  the  lookahead  character is a left
paren '(' in one case, and a digit in the other.
                              
It probably won't come as too much of a surprise that  a variable
is just another kind of factor.    So  we extend the BNF above to
read:


     <factor> ::= <number> | (<expression>) | <variable>


Again, there is no  ambiguity:  if  the  lookahead character is a
letter,  we  have  a variable; if a digit, we have a number. Back
when we translated the number, we just issued code  to  load  the
number,  as immediate data, into D0.  Now we do the same, only we
load a variable.

A minor complication in the  code generation arises from the fact
that most  68000 operating systems, including the SK*DOS that I'm
using, require the code to be  written  in "position-independent"
form, which  basically means that everything is PC-relative.  The
format for a load in this language is

               MOVE X(PC),D0

where X is, of course, the variable name.  Armed with that, let's
modify the current version of Factor to read:


{---------------------------------------------------------------}
{ Parse and Translate a Math Factor }

procedure Expression; Forward;

procedure Factor;
begin
   if Look = '(' then begin
      Match('(');
      Expression;
      Match(')');
      end
   else if IsAlpha(Look) then
      EmitLn('MOVE ' + GetName + '(PC),D0')
   else
      EmitLn('MOVE #' + GetNum + ',D0');
end;
{--------------------------------------------------------------}


I've  remarked before how easy it is to  add  extensions  to  the
parser, because of  the  way  it's  structured.  You can see that
this  still  holds true here.  This time it cost us  all  of  two
extra lines of code.  Notice, too, how the if-else-else structure
exactly parallels the BNF syntax equation.

OK, compile and test this new version of the parser.  That didn't
hurt too badly, did it?
                              

FUNCTIONS

There is only one  other  common kind of factor supported by most
languages: the function call.  It's really too early  for  us  to
deal with functions well,  because  we  haven't yet addressed the
issue of parameter passing.  What's more, a "real" language would
include a mechanism to  support  more than one type, one of which
should be a function type.  We haven't gotten there  yet, either.
But I'd still like to deal with functions  now  for  a  couple of
reasons.    First,  it  lets  us  finally  wrap  up the parser in
something very close to its final form, and second, it  brings up
a new issue which is very much worth talking about.

Up  till  now,  we've  been  able  to  write  what  is  called  a
"predictive parser."  That  means  that at any point, we can know
by looking at the current  lookahead character exactly what to do
next.  That isn't the case when we add functions.  Every language
has some naming rules  for  what  constitutes a legal identifier.
For the present, ours is simply that it  is  one  of  the letters
'a'..'z'.  The  problem  is  that  a variable name and a function
name obey  the  same  rules.   So how can we tell which is which?
One way is to require that they each be declared before  they are
used.    Pascal  takes that approach.  The other is that we might
require a function to be followed by a (possibly empty) parameter
list.  That's the rule used in C.

Since  we  don't  yet have a mechanism for declaring types, let's
use the C  rule for now.  Since we also don't have a mechanism to
deal  with parameters, we can only handle  empty  lists,  so  our
function calls will have the form

                    x()  .

Since  we're  not  dealing  with  parameter lists yet,  there  is
nothing  to do but to call the function, so we need only to issue
a BSR (call) instead of a MOVE.

Now that there are two  possibilities for the "If IsAlpha" branch
of the test in Factor, let's treat them in a  separate procedure.
Modify Factor to read:


{---------------------------------------------------------------}
{ Parse and Translate a Math Factor }

procedure Expression; Forward;

procedure Factor;
begin
   if Look = '(' then begin
      Match('(');
      Expression;
      Match(')');
      end
   else if IsAlpha(Look) then
      Ident
   else
      EmitLn('MOVE #' + GetNum + ',D0');
end;
{--------------------------------------------------------------}


and insert before it the new procedure


{---------------------------------------------------------------}
{ Parse and Translate an Identifier }

procedure Ident;
var Name: char;
begin
   Name := GetName;
   if Look = '(' then begin
      Match('(');
      Match(')');
      EmitLn('BSR ' + Name);
      end
   else
      EmitLn('MOVE ' + Name + '(PC),D0')
end;
{---------------------------------------------------------------}


OK, compile and  test  this  version.  Does  it  parse  all legal
expressions?  Does it correctly flag badly formed ones?

The important thing to notice is that even though  we  no  longer
have  a predictive parser, there is  little  or  no  complication
added with the recursive descent approach that we're  using.   At
the point where  Factor  finds an identifier (letter), it doesn't
know whether it's a variable name or a function name, nor does it
really care.  It simply passes it on to Ident and leaves it up to
that procedure to figure it out.  Ident, in  turn,  simply  tucks
away the identifier and then reads one more  character  to decide
which kind of identifier it's dealing with.

Keep this approach in mind.  It's a very powerful concept, and it
should be used  whenever  you  encounter  an  ambiguous situation
requiring further lookahead.   Even  if  you  had to look several
tokens ahead, the principle would still work.


MORE ON ERROR HANDLING

As long as we're talking  philosophy,  there's  another important
issue to point out:  error  handling.    Notice that although the
parser correctly rejects (almost)  every malformed  expression we
can  throw at it, with a meaningful  error  message,  we  haven't
really had to  do much work to make that happen.  In fact, in the
whole parser per se (from  Ident  through  Expression)  there are
only two calls to the error routine, Expected.  Even those aren't
necessary ... if you'll look again in Term and Expression, you'll
see that those statements can't be reached.  I put them  in early
on as a  bit  of  insurance,  but  they're no longer needed.  Why
don't you delete them now?

So how did we get this nice error handling  virtually  for  free?
It's simply  that  I've  carefully  avoided  reading  a character
directly  using  GetChar.  Instead,  I've  relied  on  the  error
handling in GetName,  GetNum,  and  Match  to  do  all  the error
checking for me.    Astute  readers  will notice that some of the
calls to Match (for example, the ones in Add  and  Subtract)  are
also unnecessary ... we already know what the character is by the
time  we get there ... but it maintains  a  certain  symmetry  to
leave them in, and  the  general rule to always use Match instead
of GetChar is a good one.

I mentioned an "almost" above.   There  is a case where our error
handling  leaves a bit to be desired.  So far we haven't told our
parser what and  end-of-line  looks  like,  or  what  to  do with
embedded  white  space.  So  a  space  character  (or  any  other
character not part of the recognized character set) simply causes
the parser to terminate, ignoring the unrecognized characters.

It  could  be  argued  that  this is reasonable behavior at  this
point.  In a "real"  compiler, there is usually another statement
following the one we're working on, so any characters not treated
as part of our expression will either be used for or  rejected as
part of the next one.

But  it's  also a very easy thing to fix up, even  if  it's  only
temporary.   All  we  have  to  do  is assert that the expression
should end with an end-of-line , i.e., a carriage return.

To see what I'm talking about, try the input line

               1+2 <space> 3+4

See  how the space was treated as a terminator?  Now, to make the
compiler properly flag this, add the line

               if Look <> CR then Expected('Newline');

in the main  program,  just  after  the call to Expression.  That
catches anything left over in the input stream.  Don't  forget to
define CR in the const statement:

               CR = ^M;

As usual, recompile the program and verify that it does what it's
supposed to.


ASSIGNMENT STATEMENTS

OK,  at  this  point we have a parser that works very nicely. I'd
like to  point  out  that  we  got  it  using  only  88  lines of
executable code, not  counting  what  was  in  the  cradle.   The
compiled  object  file  is  a  whopping  4752  bytes.   Not  bad,
considering we weren't trying very  hard  to  save  either source
code or object size.  We just stuck to the KISS principle.

Of course, parsing an expression  is not much good without having
something to do with it afterwards.  Expressions USUALLY (but not
always) appear in assignment statements, in the form

          <Ident> = <Expression>

We're only a breath  away  from being able to parse an assignment
statement, so let's take that  last  step.  Just  after procedure
Expression, add the following new procedure:


{--------------------------------------------------------------}
{ Parse and Translate an Assignment Statement }

procedure Assignment;
var Name: char;
begin
   Name := GetName;
   Match('=');
   Expression;
   EmitLn('LEA ' + Name + '(PC),A0');
   EmitLn('MOVE D0,(A0)')
end;
{--------------------------------------------------------------}


Note again that the  code  exactly parallels the BNF.  And notice
further that  the error checking was painless, handled by GetName
and Match.

The reason for the two  lines  of  assembler  has  to  do  with a
peculiarity in the  68000,  which requires this kind of construct
for PC-relative code.

Now change the call to Expression, in the main program, to one to
Assignment.  That's all there is to it.

Son of a gun!  We are actually  compiling  assignment statements.
If those were the only kind of statements in a language, all we'd
have to  do  is  put  this in a loop and we'd have a full-fledged
compiler!

Well, of course they're not the only kind.  There are also little
items  like  control  statements  (IFs  and  loops),  procedures,
declarations, etc.  But cheer  up.    The  arithmetic expressions
that we've been dealing with are among the most challenging  in a
language.      Compared  to  what  we've  already  done,  control
statements  will be easy.  I'll be covering  them  in  the  fifth
installment.  And the other statements will all fall in  line, as
long as we remember to KISS.


MULTI-CHARACTER TOKENS

Throughout  this   series,   I've   been   carefully  restricting
everything  we  do  to  single-character  tokens,  all  the while
assuring  you  that  it wouldn't be difficult to extend to multi-
character ones.    I  don't  know if you believed me or not ... I
wouldn't  really blame you if you were a  bit  skeptical.    I'll
continue  to use  that approach in  the  sessions  which  follow,
because it helps keep complexity away.    But I'd like to back up
those  assurances, and wrap up this portion  of  the  parser,  by
showing you  just  how  easy  that  extension  really is.  In the
process, we'll also provide for embedded white space.  Before you
make  the  next  few changes, though, save the current version of
the parser away under another name.  I have some more uses for it
in  the  next  installment, and we'll be working with the single-
character version.

Most compilers separate out the handling of the input stream into
a separate module called  the  lexical scanner.  The idea is that
the scanner deals with all the character-by-character  input, and
returns the separate units  (tokens)  of  the  stream.  There may
come a time when we'll want  to  do something like that, too, but
for  now  there  is  no  need. We can handle the  multi-character
tokens that we need by very slight and  very  local modifications
to GetName and GetNum.

The usual definition of an identifier is that the first character
must be a letter, but the rest can be  alphanumeric  (letters  or
numbers).  To  deal  with  this,  we  need  one  other recognizer
function


{--------------------------------------------------------------}
{ Recognize an Alphanumeric }

function IsAlNum(c: char): boolean;
begin
   IsAlNum := IsAlpha(c) or IsDigit(c);
end;
{--------------------------------------------------------------}


Add this function to your parser.  I put mine just after IsDigit.
While you're  at  it,  might  as  well  include it as a permanent
member of Cradle, too.
                              
Now, we need  to  modify  function  GetName  to  return  a string
instead of a character:


{--------------------------------------------------------------}
{ Get an Identifier }

function GetName: string;
var Token: string;
begin
   Token := '';
   if not IsAlpha(Look) then Expected('Name');
   while IsAlNum(Look) do begin
      Token := Token + UpCase(Look);
      GetChar;
   end;
   GetName := Token;
end;
{--------------------------------------------------------------}


Similarly, modify GetNum to read:


{--------------------------------------------------------------}
{ Get a Number }

function GetNum: string;
var Value: string;
begin
   Value := '';
   if not IsDigit(Look) then Expected('Integer');
   while IsDigit(Look) do begin
      Value := Value + Look;
      GetChar;
   end;
   GetNum := Value;
end;
{--------------------------------------------------------------}


Amazingly enough, that  is  virtually all the changes required to
the  parser!  The local variable Name  in  procedures  Ident  and
Assignment was originally declared as  "char",  and  must  now be
declared string[8].  (Clearly,  we  could  make the string length
longer if we chose, but most assemblers limit the length anyhow.)
Make  this  change,  and  then  recompile and test. _NOW_ do  you
believe that it's a simple change?


WHITE SPACE

Before we leave this parser for awhile, let's  address  the issue
of  white  space.   As it stands now, the parser  will  barf  (or
simply terminate) on a single space  character  embedded anywhere
in  the input stream.  That's pretty  unfriendly  behavior.    So
let's "productionize" the thing  a  bit  by eliminating this last
restriction.

The  key  to easy handling of white space is to come  up  with  a
simple rule for how the parser should treat the input stream, and
to  enforce that rule everywhere.  Up  till  now,  because  white
space wasn't permitted, we've been able to assume that after each
parsing action, the lookahead character  Look  contains  the next
meaningful  character,  so  we could test it  immediately.    Our
design was based upon this principle.

It still sounds like a good rule to me, so  that's  the one we'll
use.    This  means  that  every routine that advances the  input
stream must skip over white space, and leave  the  next non-white
character in Look.   Fortunately,  because  we've been careful to
use GetName, GetNum, and Match  for most of our input processing,
it is  only  those  three  routines  (plus  Init) that we need to
modify.

Not  surprisingly,  we  start  with  yet  another  new recognizer
routine:


{--------------------------------------------------------------}
{ Recognize White Space }

function IsWhite(c: char): boolean;
begin
   IsWhite := c in [' ', TAB];
end;
{--------------------------------------------------------------}


We  also need a routine that  will  eat  white-space  characters,
until it finds a non-white one:


{--------------------------------------------------------------}
{ Skip Over Leading White Space }

procedure SkipWhite;
begin
   while IsWhite(Look) do
      GetChar;
end;
{--------------------------------------------------------------}


Now,  add calls to SkipWhite to Match,  GetName,  and  GetNum  as
shown below:


{--------------------------------------------------------------}
{ Match a Specific Input Character }

procedure Match(x: char);
begin
   if Look <> x then Expected('''' + x + '''')
   else begin
      GetChar;
      SkipWhite;
   end;
end;


{--------------------------------------------------------------}
{ Get an Identifier }

function GetName: string;
var Token: string;
begin
   Token := '';
   if not IsAlpha(Look) then Expected('Name');
   while IsAlNum(Look) do begin
      Token := Token + UpCase(Look);
      GetChar;
   end;
   GetName := Token;
   SkipWhite;
end;


{--------------------------------------------------------------}
{ Get a Number }

function GetNum: string;
var Value: string;
begin
   Value := '';
   if not IsDigit(Look) then Expected('Integer');
   while IsDigit(Look) do begin
      Value := Value + Look;
      GetChar;
   end;
   GetNum := Value;
   SkipWhite;
end;
{--------------------------------------------------------------}

(Note  that  I  rearranged  Match  a  bit,  without changing  the
functionality.)

Finally, we need to skip over leading blanks where we  "prime the
pump" in Init:
                             
{--------------------------------------------------------------}
{ Initialize }

procedure Init;
begin
   GetChar;
   SkipWhite;
end;
{--------------------------------------------------------------}


Make these changes and recompile the program.  You will find that
you will have to move Match below SkipWhite, to  avoid  an  error
message from the Pascal compiler.  Test the program as  always to
make sure it works properly.

Since we've made quite  a  few  changes  during this session, I'm
reproducing the entire parser below:


{--------------------------------------------------------------}
program parse;

{--------------------------------------------------------------}
{ Constant Declarations }

const TAB = ^I;
       CR = ^M;

{--------------------------------------------------------------}
{ Variable Declarations }

var Look: char;              { Lookahead Character }

{--------------------------------------------------------------}
{ Read New Character From Input Stream }

procedure GetChar;
begin
   Read(Look);
end;

{--------------------------------------------------------------}
{ Report an Error }

procedure Error(s: string);
begin
   WriteLn;
   WriteLn(^G, 'Error: ', s, '.');
end;


{--------------------------------------------------------------}
{ Report Error and Halt }
                             
procedure Abort(s: string);
begin
   Error(s);
   Halt;
end;


{--------------------------------------------------------------}
{ Report What Was Expected }

procedure Expected(s: string);
begin
   Abort(s + ' Expected');
end;


{--------------------------------------------------------------}
{ Recognize an Alpha Character }

function IsAlpha(c: char): boolean;
begin
   IsAlpha := UpCase(c) in ['A'..'Z'];
end;


{--------------------------------------------------------------}
{ Recognize a Decimal Digit }

function IsDigit(c: char): boolean;
begin
   IsDigit := c in ['0'..'9'];
end;


{--------------------------------------------------------------}
{ Recognize an Alphanumeric }

function IsAlNum(c: char): boolean;
begin
   IsAlNum := IsAlpha(c) or IsDigit(c);
end;


{--------------------------------------------------------------}
{ Recognize an Addop }

function IsAddop(c: char): boolean;
begin
   IsAddop := c in ['+', '-'];
end;


{--------------------------------------------------------------}
{ Recognize White Space }
                             
function IsWhite(c: char): boolean;
begin
   IsWhite := c in [' ', TAB];
end;


{--------------------------------------------------------------}
{ Skip Over Leading White Space }

procedure SkipWhite;
begin
   while IsWhite(Look) do
      GetChar;
end;


{--------------------------------------------------------------}
{ Match a Specific Input Character }

procedure Match(x: char);
begin
   if Look <> x then Expected('''' + x + '''')
   else begin
      GetChar;
      SkipWhite;
   end;
end;


{--------------------------------------------------------------}
{ Get an Identifier }

function GetName: string;
var Token: string;
begin
   Token := '';
   if not IsAlpha(Look) then Expected('Name');
   while IsAlNum(Look) do begin
      Token := Token + UpCase(Look);
      GetChar;
   end;
   GetName := Token;
   SkipWhite;
end;


{--------------------------------------------------------------}
{ Get a Number }

function GetNum: string;
var Value: string;
begin
   Value := '';
   if not IsDigit(Look) then Expected('Integer');
   while IsDigit(Look) do begin
      Value := Value + Look;
      GetChar;
   end;
   GetNum := Value;
   SkipWhite;
end;


{--------------------------------------------------------------}
{ Output a String with Tab }

procedure Emit(s: string);
begin
   Write(TAB, s);
end;


{--------------------------------------------------------------}
{ Output a String with Tab and CRLF }

procedure EmitLn(s: string);
begin
   Emit(s);
   WriteLn;
end;


{---------------------------------------------------------------}
{ Parse and Translate a Identifier }

procedure Ident;
var Name: string[8];
begin
   Name:= GetName;
   if Look = '(' then begin
      Match('(');
      Match(')');
      EmitLn('BSR ' + Name);
      end
   else
      EmitLn('MOVE ' + Name + '(PC),D0');
end;


{---------------------------------------------------------------}
{ Parse and Translate a Math Factor }

procedure Expression; Forward;

procedure Factor;
begin
   if Look = '(' then begin
      Match('(');
      Expression;
      Match(')');
      end
   else if IsAlpha(Look) then
      Ident
   else
      EmitLn('MOVE #' + GetNum + ',D0');
end;


{--------------------------------------------------------------}
{ Recognize and Translate a Multiply }

procedure Multiply;
begin
   Match('*');
   Factor;
   EmitLn('MULS (SP)+,D0');
end;


{-------------------------------------------------------------}
{ Recognize and Translate a Divide }

procedure Divide;
begin
   Match('/');
   Factor;
   EmitLn('MOVE (SP)+,D1');
   EmitLn('EXS.L D0');
   EmitLn('DIVS D1,D0');
end;


{---------------------------------------------------------------}
{ Parse and Translate a Math Term }

procedure Term;
begin
   Factor;
   while Look in ['*', '/'] do begin
      EmitLn('MOVE D0,-(SP)');
      case Look of
       '*': Multiply;
       '/': Divide;
      end;
   end;
end;


{--------------------------------------------------------------}
{ Recognize and Translate an Add }

procedure Add;
begin
   Match('+');
   Term;
   EmitLn('ADD (SP)+,D0');
end;


{-------------------------------------------------------------}
{ Recognize and Translate a Subtract }

procedure Subtract;
begin
   Match('-');
   Term;
   EmitLn('SUB (SP)+,D0');
   EmitLn('NEG D0');
end;


{---------------------------------------------------------------}
{ Parse and Translate an Expression }

procedure Expression;
begin
   if IsAddop(Look) then
      EmitLn('CLR D0')
   else
      Term;
   while IsAddop(Look) do begin
      EmitLn('MOVE D0,-(SP)');
      case Look of
       '+': Add;
       '-': Subtract;
      end;
   end;
end;


{--------------------------------------------------------------}
{ Parse and Translate an Assignment Statement }

procedure Assignment;
var Name: string[8];
begin
   Name := GetName;
   Match('=');
   Expression;
   EmitLn('LEA ' + Name + '(PC),A0');
   EmitLn('MOVE D0,(A0)')
end;


{--------------------------------------------------------------}
{ Initialize }
                             
procedure Init;
begin
   GetChar;
   SkipWhite;
end;


{--------------------------------------------------------------}
{ Main Program }

begin
   Init;
   Assignment;
   If Look <> CR then Expected('NewLine');
end.
{--------------------------------------------------------------}


Now the parser is complete.  It's got every feature we can put in
a  one-line "compiler."  Tuck it away in a safe place.  Next time
we'll move on to a new subject, but we'll still be  talking about
expressions for quite awhile.  Next installment, I plan to talk a
bit about interpreters as opposed  to compilers, and show you how
the structure of the parser changes a bit as we change  what sort
of action has to be taken.  The information we pick up there will
serve  us in good stead later on, even if you have no interest in
interpreters.  See you next time.



































                     LET'S BUILD A COMPILER!

                                By

                     Jack W. Crenshaw, Ph.D.

                           24 July 1988


                      Part IV: INTERPRETERS





INTRODUCTION

In the first three installments of this series,  we've  looked at
parsing and  compiling math expressions, and worked our way grad-
ually and methodically from dealing  with  very  simple one-term,
one-character "expressions" up through more general ones, finally
arriving at a very complete parser that could parse and translate
complete  assignment  statements,  with  multi-character  tokens,
embedded white space, and function calls.  This  time,  I'm going
to walk you through the process one more time, only with the goal
of interpreting rather than compiling object code.

Since this is a series on compilers, why should  we  bother  with
interpreters?  Simply because I want you to see how the nature of
the  parser changes as we change the goals.  I also want to unify
the concepts of the two types of translators, so that you can see
not only the differences, but also the similarities.

Consider the assignment statement

               x = 2 * y + 3

In a compiler, we want the target CPU to execute  this assignment
at EXECUTION time.  The translator itself doesn't  do  any arith-
metic ... it only issues the object code that will cause  the CPU
to do it when the code is executed.  For  the  example above, the
compiler would issue code to compute the expression and store the
results in variable x.

For an interpreter,  on  the  other  hand, no object code is gen-
erated.   Instead, the arithmetic is computed immediately, as the
parsing is going on.  For the example, by the time parsing of the
statement is complete, x will have a new value.

The approach we've been  taking  in  this  whole series is called
"syntax-driven translation."  As you are aware by now, the struc-
ture of the  parser  is  very  closely  tied to the syntax of the
productions we parse.  We  have built Pascal procedures that rec-
ognize every language  construct.   Associated with each of these
constructs (and procedures) is  a  corresponding  "action," which
does  whatever  makes  sense to do  once  a  construct  has  been
recognized.    In  our  compiler  so far, every  action  involves
emitting object code, to be executed later at execution time.  In
an interpreter, every action  involves  something  to be done im-
mediately.

What I'd like you to see here is that the  layout  ... the struc-
ture ... of  the  parser  doesn't  change.  It's only the actions
that change.   So  if  you  can  write an interpreter for a given
language, you can also write a compiler, and vice versa.  Yet, as
you  will  see,  there  ARE  differences,  and  significant ones.
Because the actions are different,  the  procedures  that  do the
recognizing end up being written differently.    Specifically, in
the interpreter  the recognizing procedures end up being coded as
FUNCTIONS that return numeric values to their callers.    None of
the parsing routines for our compiler did that.

Our compiler, in fact,  is  what we might call a "pure" compiler.
Each time a construct is recognized, the object  code  is emitted
IMMEDIATELY.  (That's one reason the code is not very efficient.)
The interpreter we'll be building  here is a pure interpreter, in
the sense that there is  no  translation,  such  as "tokenizing,"
performed on the source code.  These represent  the  two extremes
of translation.  In  the  real  world,  translators are rarely so
pure, but tend to have bits of each technique.

I can think of  several  examples.    I've already mentioned one:
most interpreters, such as Microsoft BASIC,  for  example, trans-
late the source code (tokenize it) into an  intermediate  form so
that it'll be easier to parse real time.

Another example is an assembler.  The purpose of an assembler, of
course, is to produce object code, and it normally does that on a
one-to-one basis: one object instruction per line of source code.
But almost every assembler also permits expressions as arguments.
In this case, the expressions  are  always  constant expressions,
and  so the assembler isn't supposed to  issue  object  code  for
them.  Rather,  it  "interprets" the expressions and computes the
corresponding constant result, which is what it actually emits as
object code.

As a matter of fact, we  could  use  a bit of that ourselves. The
translator we built in the  previous  installment  will dutifully
spit out object code  for  complicated  expressions,  even though
every term in  the  expression  is  a  constant.  In that case it
would be far better if the translator behaved a bit more  like an
interpreter, and just computed the equivalent constant result.

There is  a concept in compiler theory called "lazy" translation.
The  idea is that you typically don't just  emit  code  at  every
action.  In fact, at the extreme you don't emit anything  at all,
until  you  absolutely  have to.  To accomplish this, the actions
associated with the parsing routines  typically  don't  just emit
code.  Sometimes  they  do,  but  often  they  simply  return in-
formation back to the caller.  Armed with  such  information, the
caller can then make a better choice of what to do.

For example, given the statement

               x = x + 3 - 2 - (5 - 4)  ,

our compiler will dutifully spit  out a stream of 18 instructions
to load each parameter into  registers,  perform  the arithmetic,
and store the result.  A lazier evaluation  would  recognize that
the arithmetic involving constants can  be  evaluated  at compile
time, and would reduce the expression to

               x = x + 0  .

An  even  lazier  evaluation would then be smart enough to figure
out that this is equivalent to

               x = x  ,

which  calls  for  no  action  at  all.   We could reduce 18  in-
structions to zero!

Note that there is no chance of optimizing this way in our trans-
lator as it stands, because every action takes place immediately.

Lazy  expression  evaluation  can  produce  significantly  better
object code than  we  have  been  able  to  so  far.  I warn you,
though: it complicates the parser code considerably, because each
routine now has to make decisions as to whether  to  emit  object
code or not.  Lazy evaluation is certainly not named that because
it's easier on the compiler writer!

Since we're operating mainly on  the KISS principle here, I won't
go  into much more depth on this subject.  I just want you to  be
aware  that  you  can get some code optimization by combining the
techniques of compiling and  interpreting.    In  particular, you
should know that the parsing  routines  in  a  smarter translator
will generally  return  things  to  their  caller,  and sometimes
expect things as  well.    That's  the main reason for going over
interpretation in this installment.


THE INTERPRETER

OK, now that you know WHY we're going into all this, let's do it.
Just to give you practice, we're going to start over with  a bare
cradle and build up the translator all over again.  This time, of
course, we can go a bit faster.

Since we're now going  to  do arithmetic, the first thing we need
to do is to change function GetNum, which up till now  has always
returned a character  (or  string).    Now, it's better for it to
return an integer.    MAKE  A  COPY of the cradle (for goodness's
sake, don't change the version  in  Cradle  itself!!)  and modify
GetNum as follows:


{--------------------------------------------------------------}
{ Get a Number }

function GetNum: integer;
begin
   if not IsDigit(Look) then Expected('Integer');
   GetNum := Ord(Look) - Ord('0');
   GetChar;
end;
{--------------------------------------------------------------}


Now, write the following version of Expression:


{---------------------------------------------------------------}
{ Parse and Translate an Expression }

function Expression: integer;
begin
   Expression := GetNum;
end;
{--------------------------------------------------------------}


Finally, insert the statement


   Writeln(Expression);


at the end of the main program.  Now compile and test.

All this program  does  is  to  "parse"  and  translate  a single
integer  "expression."    As always, you should make sure that it
does that with the digits 0..9, and gives an  error  message  for
anything else.  Shouldn't take you very long!

OK, now let's extend this to include addops.    Change Expression
to read:


{---------------------------------------------------------------}
{ Parse and Translate an Expression }

function Expression: integer;
var Value: integer;
begin
   if IsAddop(Look) then
      Value := 0
   else
      Value := GetNum;
   while IsAddop(Look) do begin
      case Look of
       '+': begin
               Match('+');
               Value := Value + GetNum;
            end;
       '-': begin
               Match('-');
               Value := Value - GetNum;
            end;
      end;
   end;
   Expression := Value;
end;
{--------------------------------------------------------------}


The structure of Expression, of  course,  parallels  what  we did
before,  so  we  shouldn't have too much  trouble  debugging  it.
There's  been  a  SIGNIFICANT  development, though, hasn't there?
Procedures Add and Subtract went away!  The reason  is  that  the
action to be taken  requires  BOTH arguments of the operation.  I
could have chosen to retain the procedures and pass into them the
value of the expression to date,  which  is Value.  But it seemed
cleaner to me to  keep  Value as strictly a local variable, which
meant that the code for Add and Subtract had to be moved in line.
This result suggests  that,  while the structure we had developed
was nice and  clean  for our simple-minded translation scheme, it
probably  wouldn't do for use with lazy  evaluation.    That's  a
little tidbit we'll probably want to keep in mind for later.

OK,  did the translator work?  Then let's  take  the  next  step.
It's not hard to  figure  out what procedure Term should now look
like.  Change every call to GetNum in function  Expression  to  a
call to Term, and then enter the following form for Term:




{---------------------------------------------------------------}
{ Parse and Translate a Math Term }

function Term: integer;
var Value: integer;
begin
   Value := GetNum;
   while Look in ['*', '/'] do begin
      case Look of
       '*': begin
               Match('*');
               Value := Value * GetNum;
            end;
       '/': begin
               Match('/');
               Value := Value div GetNum;
            end;
      end;
   end;
   Term := Value;
end;
{--------------------------------------------------------------}

Now, try it out.    Don't forget two things: first, we're dealing
with integer division, so, for example, 1/3 should come out zero.
Second, even  though we can output multi-digit results, our input
is still restricted to single digits.

That seems like a silly restriction at this point, since  we have
already  seen how easily function GetNum can  be  extended.    So
let's go ahead and fix it right now.  The new version is


{--------------------------------------------------------------}
{ Get a Number }

function GetNum: integer;
var Value: integer;
begin
   Value := 0;
   if not IsDigit(Look) then Expected('Integer');
   while IsDigit(Look) do begin
      Value := 10 * Value + Ord(Look) - Ord('0');
      GetChar;
   end;
   GetNum := Value;
end;
{--------------------------------------------------------------}


If you've compiled and  tested  this  version of the interpreter,
the  next  step  is to install function Factor, complete with pa-
renthesized  expressions.  We'll hold off a  bit  longer  on  the
variable  names.    First, change the references  to  GetNum,  in
function Term, so that they call Factor instead.   Now  code  the
following version of Factor:




{---------------------------------------------------------------}
{ Parse and Translate a Math Factor }

function Expression: integer; Forward;

function Factor: integer;
begin
   if Look = '(' then begin
      Match('(');
      Factor := Expression;
      Match(')');
      end
   else
       Factor := GetNum;
end;
{---------------------------------------------------------------}

That was pretty easy, huh?  We're rapidly closing in on  a useful
interpreter.


A LITTLE PHILOSOPHY

Before going any further, there's something I'd like  to  call to
your attention.  It's a concept that we've been making use  of in
all these sessions, but I haven't explicitly mentioned it up till
now.  I think it's time, because it's a concept so useful, and so
powerful,  that  it  makes all the difference  between  a  parser
that's trivially easy, and one that's too complex to deal with.

In the early days of compiler technology, people  had  a terrible
time  figuring  out  how to deal with things like operator prece-
dence  ...  the  way  that  multiply  and  divide operators  take
precedence over add and subtract, etc.  I remember a colleague of
some  thirty years ago, and how excited he was to find out how to
do it.  The technique used involved building two  stacks,    upon
which you pushed each operator  or operand.  Associated with each
operator was a precedence level,  and the rules required that you
only actually performed an operation  ("reducing"  the  stack) if
the precedence level showing on top of the stack was correct.  To
make life more interesting,  an  operator  like ')' had different
precedence levels, depending  upon  whether or not it was already
on the stack.  You  had to give it one value before you put it on
the stack, and another to decide when to take it  off.   Just for
the experience, I worked all of  this  out for myself a few years
ago, and I can tell you that it's very tricky.

We haven't  had  to  do  anything like that.  In fact, by now the
parsing of an arithmetic statement should seem like child's play.
How did we get so lucky?  And where did the precedence stacks go?

A similar thing is going on  in  our interpreter above.  You just
KNOW that in  order  for  it  to do the computation of arithmetic
statements (as opposed to the parsing of them), there have  to be
numbers pushed onto a stack somewhere.  But where is the stack?

Finally,  in compiler textbooks, there are  a  number  of  places
where  stacks  and  other structures are discussed.  In the other
leading parsing method (LR), an explicit stack is used.  In fact,
the technique is very  much  like the old way of doing arithmetic
expressions.  Another concept  is  that of a parse tree.  Authors
like to draw diagrams  of  the  tokens  in a statement, connected
into a tree with  operators  at the internal nodes.  Again, where
are the trees and stacks in our technique?  We haven't seen any.
The answer in all cases is that the structures are  implicit, not
explicit.    In  any computer language, there is a stack involved
every  time  you  call  a  subroutine.  Whenever a subroutine  is
called, the return address is pushed onto the CPU stack.   At the
end of the subroutine, the address is popped back off and control
is  transferred  there.   In a recursive language such as Pascal,
there can also be local data pushed onto the stack, and  it, too,
returns when it's needed.

For example,  function  Expression  contains  a  local  parameter
called  Value, which it fills by a call to Term.  Suppose, in its
next call to  Term  for  the  second  argument,  that  Term calls
Factor, which recursively  calls  Expression  again.    That "in-
stance" of Expression gets another value for its  copy  of Value.
What happens  to  the  first  Value?    Answer: it's still on the
stack, and  will  be  there  again  when  we return from our call
sequence.

In other words, the reason things look so simple  is  that  we've
been making maximum use of the resources of the  language.    The
hierarchy levels  and  the  parse trees are there, all right, but
they're hidden within the  structure  of  the parser, and they're
taken care of by the order with which the various  procedures are
called.  Now that you've seen how we do it, it's probably hard to
imagine doing it  any other way.  But I can tell you that it took
a lot of years for compiler writers to get that smart.  The early
compilers were too complex  too  imagine.    Funny how things get
easier with a little practice.

The reason  I've  brought  all  this up is as both a lesson and a
warning.  The lesson: things can be easy when you do  them right.
The warning: take a look at what you're doing.  If, as you branch
out on  your  own,  you  begin to find a real need for a separate
stack or tree structure, it may be time to ask yourself if you're
looking at things the right way.  Maybe you just aren't using the
facilities of the language as well as you could be.


The next step is to add variable names.  Now,  though,  we have a
slight problem.  For  the  compiler, we had no problem in dealing
with variable names ... we just issued the names to the assembler
and let the rest  of  the program take care of allocating storage
for  them.  Here, on the other hand, we need to be able to  fetch
the values of the variables and return them as the  return values
of Factor.  We need a storage mechanism for these variables.

Back in the early days of personal computing,  Tiny  BASIC lived.
It had  a  grand  total  of  26  possible variables: one for each
letter of the  alphabet.    This  fits nicely with our concept of
single-character tokens, so we'll  try  the  same  trick.  In the
beginning of your  interpreter,  just  after  the  declaration of
variable Look, insert the line:

               Table: Array['A'..'Z'] of integer;

We also need to initialize the array, so add this procedure:




{---------------------------------------------------------------}
{ Initialize the Variable Area }

procedure InitTable;
var i: char;
begin
   for i := 'A' to 'Z' do
      Table[i] := 0;
end;
{---------------------------------------------------------------}


You must also insert a call to InitTable, in procedure Init.
DON'T FORGET to do that, or the results may surprise you!

Now that we have an array  of  variables, we can modify Factor to
use it.  Since we don't have a way (so far) to set the variables,
Factor  will always return zero values for  them,  but  let's  go
ahead and extend it anyway.  Here's the new version:


{---------------------------------------------------------------}
{ Parse and Translate a Math Factor }

function Expression: integer; Forward;

function Factor: integer;
begin
   if Look = '(' then begin
      Match('(');
      Factor := Expression;
      Match(')');
      end
   else if IsAlpha(Look) then
      Factor := Table[GetName]
   else
       Factor := GetNum;
end;
{---------------------------------------------------------------}


As always, compile and test this version of the  program.    Even
though all the variables are now zeros, at least we can correctly
parse the complete expressions, as well as catch any badly formed
expressions.

I suppose you realize the next step: we need to do  an assignment
statement so we can  put  something INTO the variables.  For now,
let's  stick  to  one-liners,  though  we will soon  be  handling
multiple statements.

The assignment statement parallels what we did before:


{--------------------------------------------------------------}
{ Parse and Translate an Assignment Statement }
                             


procedure Assignment;
var Name: char;
begin
   Name := GetName;
   Match('=');
   Table[Name] := Expression;
end;
{--------------------------------------------------------------}


To test this,  I  added  a  temporary write statement in the main
program,  to  print out the value of A.  Then I  tested  it  with
various assignments to it.

Of course, an interpretive language that can only accept a single
line of program  is not of much value.  So we're going to want to
handle multiple statements.  This  merely  means  putting  a loop
around  the  call  to Assignment.  So let's do that now. But what
should be the loop exit criterion?  Glad you  asked,  because  it
brings up a point we've been able to ignore up till now.

One of the most tricky things  to  handle in any translator is to
determine when to bail out of  a  given construct and go look for
something else.  This hasn't been a problem for us so far because
we've only allowed for  a  single kind of construct ... either an
expression  or an assignment statement.   When  we  start  adding
loops and different kinds of statements, you'll find that we have
to be very careful that things terminate properly.  If we put our
interpreter in a loop, we need a way to quit.    Terminating on a
newline is no good, because that's what sends us back for another
line.  We could always let an unrecognized character take us out,
but that would cause every run to end in an error  message, which
certainly seems uncool.

What we need  is  a  termination  character.  I vote for Pascal's
ending period ('.').   A  minor  complication  is that Turbo ends
every normal line  with  TWO characters, the carriage return (CR)
and line feed (LF).   At  the  end  of  each line, we need to eat
these characters before processing the next one.   A  natural way
to do this would  be  with  procedure  Match, except that Match's
error  message  prints  the character, which of course for the CR
and/or  LF won't look so great.  What we need is a special proce-
dure for this, which we'll no doubt be using over and over.  Here
it is:


{--------------------------------------------------------------}
{ Recognize and Skip Over a Newline }

procedure NewLine;
begin
   if Look = CR then begin
      GetChar;
      if Look = LF then
         GetChar;
   end;
end;
{--------------------------------------------------------------}


Insert this procedure at any convenient spot ... I put  mine just
after Match.  Now, rewrite the main program to look like this:


{--------------------------------------------------------------}
{ Main Program }

begin
   Init;
   repeat
      Assignment;
      NewLine;
   until Look = '.';
end.
{--------------------------------------------------------------}


Note that the  test for a CR is now gone, and that there are also
no  error tests within NewLine itself.   That's  OK,  though  ...
whatever is left over in terms of bogus characters will be caught
at the beginning of the next assignment statement.

Well, we now have a functioning interpreter.  It doesn't do  us a
lot of  good,  however,  since  we have no way to read data in or
write it out.  Sure would help to have some I/O!

Let's wrap this session  up,  then,  by  adding the I/O routines.
Since we're  sticking to single-character tokens, I'll use '?' to
stand for a read statement, and  '!'  for a write, with the char-
acter  immediately  following  them  to  be used as  a  one-token
"parameter list."  Here are the routines:

{--------------------------------------------------------------}
{ Input Routine }

procedure Input;
begin
   Match('?');
   Read(Table[GetName]);
end;


{--------------------------------------------------------------}
{ Output Routine }

procedure Output;
begin
   Match('!');
   WriteLn(Table[GetName]);
end;
{--------------------------------------------------------------}

They aren't very fancy, I admit ... no prompt character on input,
for example ... but they get the job done.

The corresponding changes in  the  main  program are shown below.
Note that we use the usual  trick  of a case statement based upon
the current lookahead character, to decide what to do.


{--------------------------------------------------------------}
{ Main Program }

begin
   Init;
   repeat
      case Look of
       '?': Input;
       '!': Output;
       else Assignment;
      end;
      NewLine;
   until Look = '.';
end.
{--------------------------------------------------------------}


You have now completed a  real, working interpreter.  It's pretty
sparse, but it works just like the "big boys."  It includes three
kinds of program statements  (and  can  tell the difference!), 26
variables,  and  I/O  statements.  The only things that it lacks,
really, are control statements,  subroutines,    and some kind of
program editing function.  The program editing part, I'm going to
pass on.  After all, we're  not  here  to build a product, but to
learn  things.    The control statements, we'll cover in the next
installment, and the subroutines soon  after.  I'm anxious to get
on with that, so we'll leave the interpreter as it stands.

I hope that by  now  you're convinced that the limitation of sin-
gle-character names  and the processing of white space are easily
taken  care  of, as we did in the last session.   This  time,  if
you'd like to play around with these extensions, be my  guest ...
they're  "left as an exercise for the student."    See  you  next
time.



 1 --






























                     LET'S BUILD A COMPILER!

                                By

                     Jack W. Crenshaw, Ph.D.

                          19 August 1988


                    Part V: CONTROL CONSTRUCTS





INTRODUCTION

In  the  first  four  installments  of  this  series, we've  been
concentrating on the parsing of math  expressions  and assignment
statements.  In  this  installment,  we'll  take off on a new and
exciting  tangent:  that   of  parsing  and  translating  control
constructs such as IF statements.

This subject is dear to my heart, because it represents a turning
point  for  me.    I  had  been  playing  with  the   parsing  of
expressions, just as  we  have  done  in this series, but I still
felt that I was a LONG way from being able  to  handle a complete
language.  After all, REAL  languages have branches and loops and
subroutines and all that.  Perhaps you've shared some of the same
thoughts.    Awhile  back,  though,  I  had  to  produce  control
constructs for a structured assembler preprocessor I was writing.
Imagine my surprise to  discover  that it was far easier than the
expression  parsing  I  had  already  been through.   I  remember
thinking, "Hey! This is EASY!" After we've finished this session,
I'll bet you'll be thinking so, too.


THE PLAN

In what follows, we'll be starting over again with a bare cradle,
and as we've done twice before now, we'll build things up  one at
a time.  We'll also  be retaining the concept of single-character
tokens that has served us so well to date.   This  means that the
"code" will look a little funny, with 'i' for IF, 'w'  for WHILE,
etc.  But it helps us  get  the concepts down pat without fussing
over  lexical  scanning.    Fear  not  ...  eventually we'll  see
something looking like "real" code.

I also don't  want  to  have  us  get bogged down in dealing with
statements other than branches, such as the assignment statements
we've  been  working  on.  We've already demonstrated that we can
handle them, so there's no point carrying them  around  as excess
baggage during this exercise.  So what I'll do instead is  to use
an  anonymous  statement,  "other", to take the place of the non-
control statements and serve as a place-holder for them.  We have
to generate some kind of object code for them  (we're  back  into
compiling, not interpretation), so for want of anything else I'll
just echo the character input.

OK, then, starting with  yet  another  copy  of the cradle, let's
define the procedure:


{--------------------------------------------------------------}
{ Recognize and Translate an "Other" }

procedure Other;
begin
   EmitLn(GetName);
end;
{--------------------------------------------------------------}


Now include a call to it in the main program, thus:


{--------------------------------------------------------------}
{ Main Program }

begin
   Init;
   Other;
end.
{--------------------------------------------------------------}


Run  the program and see what you get.  Not very exciting, is it?
But hang in there, it's a start, and things will get better.

The first thing we need is the ability to deal with more than one
statement, since a single-line branch  is pretty limited.  We did
that in the last session on interpreting, but this time let's get
a little more formal.  Consider the following BNF:

          <program> ::= <block> END

          <block> ::= [ <statement> ]*

This says that, for our purposes here, a program is defined  as a
block, followed by an END statement.  A block, in  turn, consists
of zero or more statements.  We only have one kind  of statement,
so far.

What signals the end of a block?  It's  simply any construct that
isn't an "other"  statement.    For  now, that means only the END
statement.

Armed with these ideas, we can proceed to build  up  our  parser.
The code for a program (we  have  to call it DoProgram, or Pascal
will complain, is:

{--------------------------------------------------------------}
{ Parse and Translate a Program }

procedure DoProgram;
begin
   Block;
   if Look <> 'e' then Expected('End');
   EmitLn('END')
end;
{--------------------------------------------------------------}


Notice  that  I've  arranged to emit  an  "END"  command  to  the
assembler, which sort of  punctuates  the  output code, and makes
sense considering that we're parsing a complete program here.

The code for Block is:


{--------------------------------------------------------------}
{ Recognize and Translate a Statement Block }

procedure Block;
begin
   while not(Look in ['e']) do begin
      Other;
   end;
end;
{--------------------------------------------------------------}


(From the form of the procedure, you just KNOW we're going  to be
adding to it in a bit!)

OK, enter these routines into your program.  Replace the  call to
Block in the main program, by  a  call  to DoProgram.  Now try it
and  see  how  it works.  Well, it's still not  much,  but  we're
getting closer.


SOME GROUNDWORK

Before we begin to define the various control constructs, we need
to  lay a bit more groundwork.  First, a word of warning: I won't
be using the same syntax  for these constructs as you're familiar
with  from Pascal or C.  For example, the Pascal syntax for an IF
is:


     IF <condition> THEN <statement>


(where the statement, of course, may be compound).

The C version is similar:


     IF ( <condition> ) <statement>


Instead, I'll be using something that looks more like Ada:


     IF <condition> <block> ENDIF


In  other  words,  the IF construct has  a  specific  termination
symbol.  This avoids  the  dangling-else of Pascal and C and also
precludes the need for the brackets {} or begin-end.   The syntax
I'm showing you here, in fact, is that of the language  KISS that
I'll be detailing in  later  installments.   The other constructs
will also be  slightly  different.    That  shouldn't  be  a real
problem for you.  Once you see how it's done, you'll realize that
it  really  doesn't  matter  so  much  which  specific syntax  is
involved.  Once the syntax is defined, turning it  into  code  is
straightforward.

Now, all of the  constructs  we'll  be  dealing with here involve
transfer of control, which at the assembler-language  level means
conditional  and/or  unconditional branches.   For  example,  the
simple IF statement


          IF <condition> A ENDIF B ....

must get translated into

          Branch if NOT condition to L
          A
     L:   B
          ...


It's clear, then, that we're going to need  some  more procedures
to  help  us  deal with these branches.  I've defined two of them
below.  Procedure NewLabel generates unique labels.  This is done
via the simple expedient of calling every label  'Lnn',  where nn
is a label number starting from zero.   Procedure  PostLabel just
outputs the labels at the proper place.

Here are the two routines:


{--------------------------------------------------------------}
{ Generate a Unique Label }

function NewLabel: string;
var S: string;
begin
   Str(LCount, S);
   NewLabel := 'L' + S;
   Inc(LCount);
end;


{--------------------------------------------------------------}
{ Post a Label To Output }

procedure PostLabel(L: string);
begin
   WriteLn(L, ':');
end;
{--------------------------------------------------------------}


Notice that we've added  a  new  global  variable, LCount, so you
need to change the VAR declarations at the top of the  program to
look like this:


var Look  : char;              { Lookahead Character }
    Lcount: integer;           { Label Counter }


Also, add the following extra initialization to Init:


   LCount := 0;

(DON'T forget that, or your labels can look really strange!)


At this point I'd also like to show you a  new  kind of notation.
If  you  compare  the form of the IF statement above with the as-
sembler code that must be produced, you can see  that  there  are
certain  actions  associated  with each of the  keywords  in  the
statement:


     IF:  First, get the condition and issue the code for it.
          Then, create a unique label and emit a branch if false.

     ENDIF: Emit the label.


These actions can be shown very concisely if we write  the syntax
this way:
                              

     IF
     <condition>    { Condition;
                      L = NewLabel;
                      Emit(Branch False to L); }
     <block>
     ENDIF          { PostLabel(L) }


This is an example  of  syntax-directed  translation.  We've been
doing it all along ... we've just never written it down  this way
before.  The stuff in curly brackets represents the ACTIONS to be
taken.  The nice part about this representation is  that  it  not
only shows what  we  have  to  recognize, but also the actions we
have to perform, and in which  order.   Once we have this syntax,
the code almost writes itself.

About  the  only thing left to do is to be a  bit  more  specific
about what we mean by "Branch if false."

I'm assuming that there will  be  code  executed  for <condition>
that  will  perform  Boolean algebra and compute some result.  It
should also set the condition flags corresponding to that result.
Now, the usual convention  for  a Boolean variable is to let 0000
represent "false," and  anything  else (some use FFFF, some 0001)
represent "true."

On the 68000  the  condition  flags  are set whenever any data is
moved or calculated.  If the  data  is a 0000 (corresponding to a
false condition, remember), the zero flag will be set.   The code
for "Branch on zero" is BEQ.  So for our purposes here,


               BEQ  <=> Branch if false
               BNE  <=> Branch if true


It's the nature of the beast that most  of  the  branches  we see
will  be  BEQ's  ...  we'll  be branching AROUND the code  that's
supposed to be executed when the condition is true.


THE IF STATEMENT

With that bit of explanation out of the way, we're  finally ready
to begin coding the IF-statement parser.  In  fact,  we've almost
already  done  it!   As usual, I'll be using our single-character
approach, with the character 'i' for IF, and 'e'  for  ENDIF  (as
well  as END ... that dual nature causes  no  confusion).    I'll
also, for now, skip completely  the character for the branch con-
dition, which we still have to define.

The code for DoIf is:

{--------------------------------------------------------------}
{ Recognize and Translate an IF Construct }

procedure Block; Forward;


procedure DoIf;
var L: string;
begin
   Match('i');
   L := NewLabel;
   Condition;
   EmitLn('BEQ ' + L);
   Block;
   Match('e');
   PostLabel(L);
end;
{--------------------------------------------------------------}


Add this routine to your program, and change  Block  to reference
it as follows:


{--------------------------------------------------------------}
{ Recognize and Translate a Statement Block }

procedure Block;
begin
   while not(Look in ['e']) do begin
      case Look of
       'i': DoIf;
       'o': Other;
      end;
   end;
end;
{--------------------------------------------------------------}


Notice the reference to procedure Condition.    Eventually, we'll
write a routine that  can  parse  and  translate any Boolean con-
dition we care to give it.  But  that's  a  whole  installment by
itself (the next one, in fact).    For  now, let's just make it a
dummy that emits some text.  Write the following routine:


{--------------------------------------------------------------}
{ Parse and Translate a Boolean Condition }
{ This version is a dummy }

Procedure Condition;
begin
   EmitLn('<condition>');
end;
{--------------------------------------------------------------}


Insert this procedure in your program just before DoIf.   Now run
the program.  Try a string like

     aibece

As you can see,  the  parser seems to recognize the construct and
inserts the object code at the  right  places.   Now try a set of
nested IF's, like

     aibicedefe

It's starting to look real, eh?

Now that we  have  the  general  idea  (and the tools such as the
notation and the procedures NewLabel and PostLabel), it's a piece
of cake to extend the parser to include other  constructs.    The
first (and also one of the  trickiest)  is to add the ELSE clause
to IF.  The BNF is


     IF <condition> <block> [ ELSE <block>] ENDIF


The tricky part arises simply  because there is an optional part,
which doesn't occur in the other constructs.

The corresponding output code should be


          <condition>
          BEQ L1
          <block>
          BRA L2
     L1:  <block>
     L2:  ...


This leads us to the following syntax-directed translation:


     IF
     <condition>    { L1 = NewLabel;
                      L2 = NewLabel;
                      Emit(BEQ L1) }
     <block>
     ELSE           { Emit(BRA L2);
                      PostLabel(L1) }
     <block>
     ENDIF          { PostLabel(L2) }


Comparing this with the case for an ELSE-less IF gives us  a clue
as to how to handle both situations.   The  code  below  does it.
(Note that I  use  an  'l'  for  the ELSE, since 'e' is otherwise
occupied):


{--------------------------------------------------------------}
{ Recognize and Translate an IF Construct }

procedure DoIf;
var L1, L2: string;
begin
   Match('i');
   Condition;
   L1 := NewLabel;
   L2 := L1;
   EmitLn('BEQ ' + L1);
   Block;
   if Look = 'l' then begin
      Match('l');
      L2 := NewLabel;
      EmitLn('BRA ' + L2);
      PostLabel(L1);
      Block;
   end;
   Match('e');
   PostLabel(L2);
end;
{--------------------------------------------------------------}


There you have it.  A complete IF parser/translator, in  19 lines
of code.

Give it a try now.  Try something like

   aiblcede

Did it work?  Now, just  to  be  sure we haven't broken the ELSE-
less case, try

   aibece

Now try some nested IF's.  Try anything you like,  including some
badly formed statements.   Just  remember that 'e' is not a legal
"other" statement.


THE WHILE STATEMENT

The next type of statement should be easy, since we  already have
the process  down  pat.    The  syntax  I've chosen for the WHILE
statement is


          WHILE <condition> <block> ENDWHILE


I know,  I  know,  we  don't  REALLY  need separate kinds of ter-
minators for each construct ... you can see that by the fact that
in our one-character version, 'e' is used for all of them.  But I
also remember  MANY debugging sessions in Pascal, trying to track
down a wayward END that the compiler obviously thought I meant to
put  somewhere  else.   It's been my experience that specific and
unique  keywords,  although  they add to the  vocabulary  of  the
language,  give  a  bit of error-checking that is worth the extra
work for the compiler writer.

Now,  consider  what  the  WHILE  should be translated into.   It
should be:


     L1:  <condition>
          BEQ L2
          <block>
          BRA L1
     L2:




As before, comparing the two representations gives us the actions
needed at each point.


     WHILE          { L1 = NewLabel;
                      PostLabel(L1) }
     <condition>    { Emit(BEQ L2) }
     <block>
     ENDWHILE       { Emit(BRA L1);
                      PostLabel(L2) }


The code follows immediately from the syntax:


{--------------------------------------------------------------}
{ Parse and Translate a WHILE Statement }

procedure DoWhile;
var L1, L2: string;
begin
   Match('w');
   L1 := NewLabel;
   L2 := NewLabel;
   PostLabel(L1);
   Condition;
   EmitLn('BEQ ' + L2);
   Block;
   Match('e');
   EmitLn('BRA ' + L1);
   PostLabel(L2);
end;
{--------------------------------------------------------------}


Since  we've  got a new statement, we have to add a  call  to  it
within procedure Block:


{--------------------------------------------------------------}
{ Recognize and Translate a Statement Block }

procedure Block;
begin
   while not(Look in ['e', 'l']) do begin
      case Look of
       'i': DoIf;
       'w': DoWhile;
       else Other;
      end;
   end;
end;
{--------------------------------------------------------------}


No other changes are necessary.

OK, try the new program.  Note that this  time,  the  <condition>
code is INSIDE the upper label, which is just where we wanted it.
Try some nested loops.  Try some loops within IF's, and some IF's
within loops.  If you get  a  bit  confused as to what you should
type, don't be discouraged:  you  write  bugs in other languages,
too, don't you?  It'll look a lot  more  meaningful  when  we get
full keywords.

I hope by now that you're beginning to  get  the  idea  that this
really  IS easy.  All we have to do to accomodate a new construct
is to work out  the  syntax-directed translation of it.  The code
almost falls out  from  there,  and  it doesn't affect any of the
other routines.  Once you've gotten the feel of the thing, you'll
see that you  can  add  new  constructs  about as fast as you can
dream them up.


THE LOOP STATEMENT

We could stop right here, and  have  a language that works.  It's
been  shown  many  times that a high-order language with only two
constructs, the IF and the WHILE, is sufficient  to  write struc-
tured  code.   But we're on a roll now, so let's  richen  up  the
repertoire a bit.

This construct is even easier, since it has no condition  test at
all  ... it's an infinite loop.  What's the point of such a loop?
Not much, by  itself,  but  later  on  we're going to add a BREAK
command,  that  will  give us a way out.  This makes the language
considerably richer than Pascal, which  has  no  break,  and also
avoids the funny  WHILE(1) or WHILE TRUE of C and Pascal.

The syntax is simply

     LOOP <block> ENDLOOP

and the syntax-directed translation is:


     LOOP           { L = NewLabel;
                      PostLabel(L) }
     <block>
     ENDLOOP        { Emit(BRA L }


The corresponding code is shown below.  Since  I've  already used
'l'  for  the  ELSE, I've used  the  last  letter,  'p',  as  the
"keyword" this time.


{--------------------------------------------------------------}
{ Parse and Translate a LOOP Statement }

procedure DoLoop;
var L: string;
begin
   Match('p');
   L := NewLabel;
   PostLabel(L);
   Block;
   Match('e');
   EmitLn('BRA ' + L);
end;
{--------------------------------------------------------------}
                             

When you insert this routine, don't forget to add a line in Block
to call it.




REPEAT-UNTIL

Here's one construct that I lifted right from Pascal.  The syntax
is


     REPEAT <block> UNTIL <condition>  ,


and the syntax-directed translation is:


     REPEAT         { L = NewLabel;
                      PostLabel(L) }
     <block>
     UNTIL
     <condition>    { Emit(BEQ L) }


As usual, the code falls out pretty easily:


{--------------------------------------------------------------}
{ Parse and Translate a REPEAT Statement }

procedure DoRepeat;
var L: string;
begin
   Match('r');
   L := NewLabel;
   PostLabel(L);
   Block;
   Match('u');
   Condition;
   EmitLn('BEQ ' + L);
end;
{--------------------------------------------------------------}


As  before, we have to add the call  to  DoRepeat  within  Block.
This time, there's a difference, though.  I decided  to  use  'r'
for REPEAT (naturally), but I also decided to use 'u'  for UNTIL.
This means that the 'u' must be added to the set of characters in
the while-test.  These  are  the  characters  that signal an exit
from the current  block  ... the "follow" characters, in compiler
jargon.


{--------------------------------------------------------------}
{ Recognize and Translate a Statement Block }

procedure Block;
begin
   while not(Look in ['e', 'l', 'u']) do begin
      case Look of
       'i': DoIf;
       'w': DoWhile;
       'p': DoLoop;
       'r': DoRepeat;
       else Other;
      end;
   end;
end;
{--------------------------------------------------------------}


THE FOR LOOP

The FOR loop  is a very handy one to have around, but it's a bear
to translate.  That's not so much because the construct itself is
hard ... it's only a loop  after  all ... but simply because it's
hard to implement  in  assembler  language.    Once  the  code is
figured out, the translation is straightforward enough.

C fans love  the  FOR-loop  of  that language (and, in fact, it's
easier to code), but I've chosen instead a syntax very  much like
the one from good ol' BASIC:


     FOR <ident> = <expr1> TO <expr2> <block> ENDFOR


The translation of a FOR loop  can  be just about as difficult as
you choose  to  make  it,  depending  upon  the way you decide to
define  the rules as to how to handle the limits.  Does expr2 get
evaluated  every time through the loop, for  example,  or  is  it
treated as a constant limit?   Do  you always go through the loop
at least once,  as  in  FORTRAN,  or  not? It gets simpler if you
adopt the point of view that the construct is equivalent to:


     <ident> = <expr1>
     TEMP = <expr2>
     WHILE <ident> <= TEMP
     <block>
     ENDWHILE


Notice that with this definition of the loop, <block> will not be
executed at all if <expr1> is initially larger than <expr2>.
                             
The 68000 code needed to do this is trickier than  anything we've
done so far.  I had a couple  of  tries  at  it, putting both the
counter  and  the    upper limit on the stack, both in registers,
etc.  I  finally  arrived  at  a hybrid arrangement, in which the
loop counter is in memory (so that it can be accessed  within the
loop), and the upper limit is on the stack.  The  translated code
came out like this:


          <ident>             get name of loop counter
          <expr1>             get initial value
          LEA <ident>(PC),A0  address the loop counter
          SUBQ #1,D0          predecrement it
          MOVE D0,(A0)        save it
          <expr1>             get upper limit
          MOVE D0,-(SP)       save it on stack

     L1:  LEA <ident>(PC),A0  address loop counter
          MOVE (A0),D0        fetch it to D0
          ADDQ #1,D0          bump the counter
          MOVE D0,(A0)        save new value
          CMP (SP),D0         check for range
          BLE L2              skip out if D0 > (SP)
          <block>
          BRA L1              loop for next pass
     L2:  ADDQ #2,SP          clean up the stack


Wow!    That  seems like a lot of code ...  the  line  containing
<block> seems to almost get lost.  But that's the best I could do
with it.   I guess it helps to keep in mind that it's really only
sixteen  words,  after  all.  If  anyone else can  optimize  this
better, please let me know.

Still, the parser  routine  is  pretty  easy now that we have the
code:


{--------------------------------------------------------------}
{ Parse and Translate a FOR Statement }

procedure DoFor;
var L1, L2: string;
    Name: char;
begin
   Match('f');
   L1 := NewLabel;
   L2 := NewLabel;
   Name := GetName;
   Match('=');
   Expression;
   EmitLn('SUBQ #1,D0');
   EmitLn('LEA ' + Name + '(PC),A0');
   EmitLn('MOVE D0,(A0)');
   Expression;
   EmitLn('MOVE D0,-(SP)');
   PostLabel(L1);
   EmitLn('LEA ' + Name + '(PC),A0');
   EmitLn('MOVE (A0),D0');
   EmitLn('ADDQ #1,D0');
   EmitLn('MOVE D0,(A0)');
   EmitLn('CMP (SP),D0');
   EmitLn('BGT ' + L2);
   Block;
   Match('e');
   EmitLn('BRA ' + L1);
   PostLabel(L2);
   EmitLn('ADDQ #2,SP');
end;
{--------------------------------------------------------------}


Since we don't have  expressions  in this parser, I used the same
trick as for Condition, and wrote the routine


{--------------------------------------------------------------}
{ Parse and Translate an Expression }
{ This version is a dummy }

Procedure Expression;
begin
   EmitLn('<expr>');
end;
{--------------------------------------------------------------}


Give it a try.  Once again,  don't  forget  to  add  the  call in
Block.    Since  we don't have any input for the dummy version of
Expression, a typical input line would look something like

     afi=bece

Well, it DOES generate a lot of code, doesn't it?    But at least
it's the RIGHT code.


THE DO STATEMENT

All this made me wish for a simpler version of the FOR loop.  The
reason for all the code  above  is  the  need  to  have  the loop
counter accessible as a variable within the loop.  If all we need
is a counting loop to make us go through  something  a  specified
number of times, but  don't  need  access  to the counter itself,
there is a much easier solution.  The 68000 has a  "decrement and
branch nonzero" instruction built in which is ideal for counting.
For good measure, let's add this construct, too.   This  will  be
the last of our loop structures.
                             
The syntax and its translation is:


     DO
     <expr>         { Emit(SUBQ #1,D0);
                      L = NewLabel;
                      PostLabel(L);
                      Emit(MOVE D0,-(SP) }
     <block>
     ENDDO          { Emit(MOVE (SP)+,D0;
                      Emit(DBRA D0,L) }


That's quite a bit simpler!  The loop will execute  <expr> times.
Here's the code:


{--------------------------------------------------------------}
{ Parse and Translate a DO Statement }

procedure Dodo;
var L: string;
begin
   Match('d');
   L := NewLabel;
   Expression;
   EmitLn('SUBQ #1,D0');
   PostLabel(L);
   EmitLn('MOVE D0,-(SP)');
   Block;
   EmitLn('MOVE (SP)+,D0');
   EmitLn('DBRA D0,' + L);
end;
{--------------------------------------------------------------}


I think you'll have to agree, that's a whole lot simpler than the
classical FOR.  Still, each construct has its place.


THE BREAK STATEMENT

Earlier I promised you a BREAK statement to accompany LOOP.  This
is  one  I'm sort of proud of.  On the face of it a  BREAK  seems
really  tricky.  My first approach was to just use it as an extra
terminator to Block, and split all the loops into two parts, just
as  I did with the ELSE half of an IF.  That  turns  out  not  to
work, though, because the BREAK statement is almost certainly not
going to show  up at the same level as the loop itself.  The most
likely place for a BREAK is right after an IF, which  would cause
it to exit to the IF  construct,  not the enclosing loop.  WRONG.
The  BREAK  has  to exit the inner LOOP, even if it's nested down
into several levels of IFs.
                             
My next thought was that I would just store away, in  some global
variable, the ending label of the innermost loop.    That doesn't
work  either, because there may be a break  from  an  inner  loop
followed by a break from an outer one.  Storing the label for the
inner loop would clobber the label for the  outer  one.    So the
global variable turned into a stack.  Things were starting to get
messy.

Then  I  decided  to take my own advice.  Remember  in  the  last
session when  I  pointed  out  how  well  the implicit stack of a
recursive descent parser was  serving  our needs?  I said that if
you begin to  see  the  need  for  an external stack you might be
doing  something  wrong.   Well, I was.  It is indeed possible to
let the recursion built into  our parser take care of everything,
and the solution is so simple that it's surprising.

The secret is  to  note  that  every BREAK statement has to occur
within a block ... there's no place else for it to be.  So all we
have  to  do  is to pass into  Block  the  exit  address  of  the
innermost loop.  Then it can pass the address to the routine that
translates the  break instruction.  Since an IF statement doesn't
change the loop level, procedure DoIf doesn't need to do anything
except  pass the label into ITS blocks (both  of  them).    Since
loops DO change the level,  each  loop  construct  simply ignores
whatever label is above it and passes its own exit label along.

All  this  is easier to show you than it is to  describe.    I'll
demonstrate with the easiest loop, which is LOOP:


{--------------------------------------------------------------}
{ Parse and Translate a LOOP Statement }

procedure DoLoop;
var L1, L2: string;
begin
   Match('p');
   L1 := NewLabel;
   L2 := NewLabel;
   PostLabel(L1);
   Block(L2);
   Match('e');
   EmitLn('BRA ' + L1);
   PostLabel(L2);
end;
{--------------------------------------------------------------}


Notice that DoLoop now has TWO labels, not just one.   The second
is to give the BREAK instruction a target to jump  to.   If there
is no BREAK within  the  loop, we've wasted a label and cluttered
up things a bit, but there's no harm done.

Note also that Block now has a parameter, which  for  loops  will
always be the exit address.  The new version of Block is:


{--------------------------------------------------------------}
{ Recognize and Translate a Statement Block }

procedure Block(L: string);
begin
   while not(Look in ['e', 'l', 'u']) do begin
      case Look of
       'i': DoIf(L);
       'w': DoWhile;
       'p': DoLoop;
       'r': DoRepeat;
       'f': DoFor;
       'd': DoDo;
       'b': DoBreak(L);
       else Other;
      end;
   end;
end;
{--------------------------------------------------------------}


Again,  notice  that  all Block does with the label is to pass it
into DoIf and  DoBreak.    The  loop  constructs  don't  need it,
because they are going to pass their own label anyway.

The new version of DoIf is:

{--------------------------------------------------------------}
{ Recognize and Translate an IF Construct }

procedure Block(L: string); Forward;


procedure DoIf(L: string);
var L1, L2: string;
begin
   Match('i');
   Condition;
   L1 := NewLabel;
   L2 := L1;
   EmitLn('BEQ ' + L1);
   Block(L);
   if Look = 'l' then begin
      Match('l');
      L2 := NewLabel;
      EmitLn('BRA ' + L2);
      PostLabel(L1);
      Block(L);
   end;
   Match('e');
   PostLabel(L2);
end;
{--------------------------------------------------------------}


Here,  the  only  thing  that  changes  is  the addition  of  the
parameter to procedure Block.  An IF statement doesn't change the
loop  nesting level, so DoIf just passes the  label  along.    No
matter how many levels of IF nesting we have, the same label will
be used.

Now, remember that DoProgram also calls Block, so it now needs to
pass it a label.  An  attempt  to  exit the outermost block is an
error, so DoProgram  passes  a  null  label  which  is  caught by
DoBreak:


{--------------------------------------------------------------}
{ Recognize and Translate a BREAK }

procedure DoBreak(L: string);
begin
   Match('b');
   if L <> '' then
      EmitLn('BRA ' + L)
   else Abort('No loop to break from');
end;


{--------------------------------------------------------------}

{ Parse and Translate a Program }

procedure DoProgram;
begin
   Block('');
   if Look <> 'e' then Expected('End');
   EmitLn('END')
end;
{--------------------------------------------------------------}


That  ALMOST takes care of everything.  Give it a try, see if you
can "break" it <pun>.  Careful, though.  By this time  we've used
so many letters, it's hard to think of characters that aren't now
representing  reserved  words.    Remember:  before  you  try the
program, you're going to have to edit every occurence of Block in
the other loop constructs to include the new parameter.    Do  it
just like I did for LOOP.

I  said ALMOST above.  There is one slight problem: if you take a
hard  look  at  the code generated for DO, you'll see that if you
break  out  of  this loop, the value of the loop counter is still
left on the stack.  We're going to have to fix that!  A shame ...
that was one  of  our  smaller  routines, but it can't be helped.
Here's a version that doesn't have the problem:




{--------------------------------------------------------------}
{ Parse and Translate a DO Statement }

procedure Dodo;
var L1, L2: string;
begin
   Match('d');
   L1 := NewLabel;
   L2 := NewLabel;
   Expression;
   EmitLn('SUBQ #1,D0');
   PostLabel(L1);
   EmitLn('MOVE D0,-(SP)');
   Block(L2);
   EmitLn('MOVE (SP)+,D0');
   EmitLn('DBRA D0,' + L1);
   EmitLn('SUBQ #2,SP');
   PostLabel(L2);
   EmitLn('ADDQ #2,SP');
end;
{--------------------------------------------------------------}


The  two  extra  instructions,  the  SUBQ and ADDQ, take care  of
leaving the stack in the right shape.
                             

CONCLUSION

At this point we have created a number of control  constructs ...
a richer set, really, than that provided by almost any other pro-
gramming language.  And,  except  for the FOR loop, it was pretty
easy to do.  Even that one was tricky only because it's tricky in
assembler language.

I'll conclude this session here.  To wrap the thing up with a red
ribbon, we really  should  have  a  go  at  having  real keywords
instead of these mickey-mouse  single-character  things.   You've
already seen that  the  extension to multi-character words is not
difficult, but in this case it will make a big difference  in the
appearance of our input code.  I'll save that little bit  for the
next installment.  In that installment we'll also address Boolean
expressions, so we can get rid of the dummy version  of Condition
that we've used here.  See you then.

For reference purposes, here is  the  completed  parser  for this
session:




{--------------------------------------------------------------}
program Branch;

{--------------------------------------------------------------}
{ Constant Declarations }

const TAB = ^I;
      CR  = ^M;


{--------------------------------------------------------------}
{ Variable Declarations }

var Look  : char;              { Lookahead Character }
    Lcount: integer;           { Label Counter }


{--------------------------------------------------------------}
{ Read New Character From Input Stream }

procedure GetChar;
begin
   Read(Look);
end;


{--------------------------------------------------------------}
{ Report an Error }

procedure Error(s: string);
begin
   WriteLn;
   WriteLn(^G, 'Error: ', s, '.');
end;


{--------------------------------------------------------------}
{ Report Error and Halt }

procedure Abort(s: string);
begin
   Error(s);
   Halt;
end;


{--------------------------------------------------------------}
{ Report What Was Expected }

procedure Expected(s: string);
begin
   Abort(s + ' Expected');
end;

{--------------------------------------------------------------}
{ Match a Specific Input Character }

procedure Match(x: char);
begin
   if Look = x then GetChar
   else Expected('''' + x + '''');
end;


{--------------------------------------------------------------}
{ Recognize an Alpha Character }

function IsAlpha(c: char): boolean;
begin
   IsAlpha := UpCase(c) in ['A'..'Z'];
end;


{--------------------------------------------------------------}
{ Recognize a Decimal Digit }

function IsDigit(c: char): boolean;
begin
   IsDigit := c in ['0'..'9'];
end;
                             

{--------------------------------------------------------------}
{ Recognize an Addop }

function IsAddop(c: char): boolean;
begin
   IsAddop := c in ['+', '-'];
end;


{--------------------------------------------------------------}
{ Recognize White Space }

function IsWhite(c: char): boolean;
begin
   IsWhite := c in [' ', TAB];
end;


{--------------------------------------------------------------}
{ Skip Over Leading White Space }

procedure SkipWhite;
begin
   while IsWhite(Look) do
      GetChar;
end;


{--------------------------------------------------------------}
{ Get an Identifier }

function GetName: char;
begin
   if not IsAlpha(Look) then Expected('Name');
   GetName := UpCase(Look);
   GetChar;
end;




{--------------------------------------------------------------}
{ Get a Number }

function GetNum: char;
begin
   if not IsDigit(Look) then Expected('Integer');
   GetNum := Look;
   GetChar;
end;


{--------------------------------------------------------------}
{ Generate a Unique Label }

function NewLabel: string;
var S: string;
begin
   Str(LCount, S);
   NewLabel := 'L' + S;
   Inc(LCount);
end;


{--------------------------------------------------------------}
{ Post a Label To Output }

procedure PostLabel(L: string);
begin
   WriteLn(L, ':');
end;


{--------------------------------------------------------------}
{ Output a String with Tab }

procedure Emit(s: string);
begin
   Write(TAB, s);
end;


{--------------------------------------------------------------}

{ Output a String with Tab and CRLF }

procedure EmitLn(s: string);
begin
   Emit(s);
   WriteLn;
end;


{--------------------------------------------------------------}
{ Parse and Translate a Boolean Condition }

procedure Condition;
begin
   EmitLn('<condition>');
end;

                             


{--------------------------------------------------------------}
{ Parse and Translate a Math Expression }

procedure Expression;
begin
   EmitLn('<expr>');
end;


{--------------------------------------------------------------}
{ Recognize and Translate an IF Construct }

procedure Block(L: string); Forward;


procedure DoIf(L: string);
var L1, L2: string;
begin
   Match('i');
   Condition;
   L1 := NewLabel;
   L2 := L1;
   EmitLn('BEQ ' + L1);
   Block(L);
   if Look = 'l' then begin
      Match('l');
      L2 := NewLabel;
      EmitLn('BRA ' + L2);
      PostLabel(L1);
      Block(L);
   end;
   Match('e');
   PostLabel(L2);
end;


{--------------------------------------------------------------}
{ Parse and Translate a WHILE Statement }

procedure DoWhile;
var L1, L2: string;
begin
   Match('w');
   L1 := NewLabel;
   L2 := NewLabel;
   PostLabel(L1);
   Condition;
   EmitLn('BEQ ' + L2);
   Block(L2);
   Match('e');
   EmitLn('BRA ' + L1);
   PostLabel(L2);
end;
                             

{--------------------------------------------------------------}
{ Parse and Translate a LOOP Statement }

procedure DoLoop;
var L1, L2: string;
begin
   Match('p');
   L1 := NewLabel;
   L2 := NewLabel;
   PostLabel(L1);
   Block(L2);
   Match('e');
   EmitLn('BRA ' + L1);
   PostLabel(L2);
end;


{--------------------------------------------------------------}
{ Parse and Translate a REPEAT Statement }

procedure DoRepeat;
var L1, L2: string;
begin
   Match('r');
   L1 := NewLabel;
   L2 := NewLabel;
   PostLabel(L1);
   Block(L2);
   Match('u');
   Condition;
   EmitLn('BEQ ' + L1);
   PostLabel(L2);
end;


{--------------------------------------------------------------}
{ Parse and Translate a FOR Statement }

procedure DoFor;
var L1, L2: string;
    Name: char;
begin
   Match('f');
   L1 := NewLabel;
   L2 := NewLabel;
   Name := GetName;
   Match('=');
   Expression;
   EmitLn('SUBQ #1,D0');
   EmitLn('LEA ' + Name + '(PC),A0');
   EmitLn('MOVE D0,(A0)');
   Expression;
   EmitLn('MOVE D0,-(SP)');
   PostLabel(L1);
   EmitLn('LEA ' + Name + '(PC),A0');
   EmitLn('MOVE (A0),D0');
   EmitLn('ADDQ #1,D0');
   EmitLn('MOVE D0,(A0)');
   EmitLn('CMP (SP),D0');
   EmitLn('BGT ' + L2);
   Block(L2);
   Match('e');
   EmitLn('BRA ' + L1);
   PostLabel(L2);
   EmitLn('ADDQ #2,SP');
end;




{--------------------------------------------------------------}
{ Parse and Translate a DO Statement }

procedure Dodo;
var L1, L2: string;
begin
   Match('d');
   L1 := NewLabel;
   L2 := NewLabel;
   Expression;
   EmitLn('SUBQ #1,D0');
   PostLabel(L1);
   EmitLn('MOVE D0,-(SP)');
   Block(L2);
   EmitLn('MOVE (SP)+,D0');
   EmitLn('DBRA D0,' + L1);
   EmitLn('SUBQ #2,SP');
   PostLabel(L2);
   EmitLn('ADDQ #2,SP');
end;


{--------------------------------------------------------------}
{ Recognize and Translate a BREAK }

procedure DoBreak(L: string);
begin
   Match('b');
   EmitLn('BRA ' + L);
end;


{--------------------------------------------------------------}
{ Recognize and Translate an "Other" }

procedure Other;
begin
   EmitLn(GetName);
end;


{--------------------------------------------------------------}
{ Recognize and Translate a Statement Block }

procedure Block(L: string);
begin
   while not(Look in ['e', 'l', 'u']) do begin
      case Look of
       'i': DoIf(L);
       'w': DoWhile;
       'p': DoLoop;
       'r': DoRepeat;
       'f': DoFor;
       'd': DoDo;
       'b': DoBreak(L);
       else Other;
      end;
   end;
end;




{--------------------------------------------------------------}

{ Parse and Translate a Program }

procedure DoProgram;
begin
   Block('');
   if Look <> 'e' then Expected('End');
   EmitLn('END')
end;


{--------------------------------------------------------------}

{ Initialize }

procedure Init;
begin
   LCount := 0;
   GetChar;
end;


{--------------------------------------------------------------}
{ Main Program }

begin
   Init;
   DoProgram;
end.
{--------------------------------------------------------------}



































                     LET'S BUILD A COMPILER!

                                By

                     Jack W. Crenshaw, Ph.D.

                          31 August 1988


                   Part VI: BOOLEAN EXPRESSIONS





INTRODUCTION

In Part V of this series,  we  took a look at control constructs,
and developed parsing  routines  to  translate  them  into object
code.    We  ended  up  with  a  nice,  relatively  rich  set  of
constructs.

As we left  the  parser,  though,  there  was one big hole in our
capabilities:  we  did  not  address  the  issue  of  the  branch
condition.  To fill the void,  I  introduced to you a dummy parse
routine called Condition, which only served as a place-keeper for
the real thing.

One of the things we'll do in this session is  to  plug that hole
by expanding Condition into a true parser/translator.


THE PLAN

We're going to  approach  this installment a bit differently than
any of the others.    In those other installments, we started out
immediately with experiments  using the Pascal compiler, building
up the parsers from  very  rudimentary  beginnings to their final
forms, without spending much time in planning  beforehand. That's
called coding without specs, and it's usually frowned  upon.   We
could get away with it before because the rules of arithmetic are
pretty well established ...  we  know what a '+' sign is supposed
to mean without having to discuss it at length.  The same is true
for branches and  loops.    But  the  ways  in  which programming
languages  implement  logic  vary quite a bit  from  language  to
language.  So before we begin serious coding,  we'd  better first
make up our minds what it is we want.  And the way to do  that is
at the level of the BNF syntax rules (the GRAMMAR).


THE GRAMMAR

For some time  now,  we've been implementing BNF syntax equations
for arithmetic expressions, without  ever  actually  writing them
down all in one place.  It's time that we did so.  They are:


     <expression> ::= <unary op> <term> [<addop> <term>]*
     <term>       ::= <factor> [<mulop> factor]*
     <factor>     ::= <integer> | <variable> | ( <expression> )

(Remember, the nice thing about  this grammar is that it enforces
the operator precedence hierarchy  that  we  normally  expect for
algebra.)

Actually,  while we're on the subject, I'd  like  to  amend  this
grammar a bit right now.   The  way we've handled the unary minus
is  a  bit  awkward.  I've found that it's better  to  write  the
grammar this way:


  <expression>    ::= <term> [<addop> <term>]*
  <term>          ::= <signed factor> [<mulop> factor]*
  <signed factor> ::= [<addop>] <factor>
  <factor>        ::= <integer> | <variable> | (<expression>)


This puts the job of handling the unary minus onto  Factor, which
is where it really belongs.

This  doesn't  mean  that  you  have  to  go  back and recode the
programs you've already written, although you're free to do so if
you like.  But I will be using the new syntax from now on.

Now, it probably won't come as  a  shock  to you to learn that we
can define an analogous grammar for Boolean algebra.    A typical
set or rules is:


 <b-expression>::= <b-term> [<orop> <b-term>]*
 <b-term>      ::= <not-factor> [AND <not-factor>]*
 <not-factor>  ::= [NOT] <b-factor>
 <b-factor>    ::= <b-literal> | <b-variable> | (<b-expression>)


Notice that in this  grammar,  the  operator  AND is analogous to
'*',  and  OR  (and exclusive OR) to '+'.  The  NOT  operator  is
analogous to a unary  minus.    This  hierarchy is not absolutely
standard ...  some  languages,  notably  Ada,  treat  all logical
operators  as  having  the same precedence level ... but it seems
natural.

Notice also the slight difference between the way the NOT and the
unary  minus  are  handled.    In  algebra,  the unary  minus  is
considered to go with the whole term, and so  never  appears  but
once in a given term. So an expression like

                    a * -b

or worse yet,
                    a - -b

is not allowed.  In Boolean algebra, though, the expression

                    a AND NOT b

makes perfect sense, and the syntax shown allows for that.


RELOPS

OK, assuming that you're willing to accept the grammar I've shown
here,  we  now  have syntax rules for both arithmetic and Boolean
algebra.    The  sticky part comes in when we have to combine the
two.  Why do we have to do that?  Well, the whole subject came up
because of the  need  to  process  the  "predicates" (conditions)
associated with control statements such as the IF.  The predicate
is required to have a Boolean value; that is, it must evaluate to
either TRUE or FALSE.  The branch is  then  taken  or  not taken,
depending  on  that  value.  What we expect to see  going  on  in
procedure  Condition,  then,  is  the  evaluation  of  a  Boolean
expression.

But there's more to it than that.  A pure Boolean  expression can
indeed be the predicate of a control statement ... things like


          IF a AND NOT b THEN ....


But more often, we see Boolean algebra show up in such things as


     IF (x >= 0) and (x <= 100) THEN ...


Here,  the  two  terms in parens are Boolean expressions, but the
individual terms being compared:  x,  0, and 100,  are NUMERIC in
nature.  The RELATIONAL OPERATORS >= and <= are the  catalysts by
which the  Boolean  and  the  arithmetic  ingredients  get merged
together.

Now,  in the example above, the terms  being  compared  are  just
that:  terms.    However,  in  general  each  side  can be a math
expression.  So we can define a RELATION to be:


     <relation> ::= <expression> <relop> <expression>  ,


where  the  expressions  we're  talking  about here are  the  old
numeric type, and the relops are any of the usual symbols


               =, <> (or !=), <, >, <=, and >=


If you think about it a  bit,  you'll agree that, since this kind
of predicate has a single Boolean value, TRUE or  FALSE,  as  its
result, it is  really  just  another  kind  of factor.  So we can
expand the definition of a Boolean factor above to read:


    <b-factor> ::=    <b-literal>
                    | <b-variable>
                    | (<b-expression>)
                    | <relation>


THAT's the connection!  The relops and the  relation  they define
serve to wed the two kinds of algebra.  It  is  worth noting that
this implies a hierarchy  where  the  arithmetic expression has a
HIGHER precedence that  a  Boolean factor, and therefore than all
the  Boolean operators.    If you write out the precedence levels
for all the operators, you arrive at the following list:


          Level   Syntax Element     Operator

          0       factor             literal, variable
          1       signed factor      unary minus
          2       term               *, /
          3       expression         +, -
          4       b-factor           literal, variable, relop
          5       not-factor         NOT
          6       b-term             AND
          7       b-expression       OR, XOR


If  we're willing to accept that  many  precedence  levels,  this


grammar seems reasonable.  Unfortunately,  it  won't  work!   The
grammar may be great in theory,  but  it's  no good at all in the
practice of a top-down parser.  To see the problem,  consider the
code fragment:


     IF ((((((A + B + C) < 0 ) AND ....


When the parser is parsing this code, it knows after it  sees the
IF token that a Boolean expression is supposed to be next.  So it
can set up to begin evaluating such an expression.  But the first
expression in the example is an ARITHMETIC expression, A + B + C.
What's worse, at the point that the parser has read this  much of
the input line:


     IF ((((((A   ,


it  still has no way of knowing which  kind  of  expression  it's
dealing  with.  That won't do, because  we  must  have  different
recognizers  for the two cases.  The  situation  can  be  handled
without  changing  any  of  our  definitions, but only  if  we're
willing to accept an arbitrary amount of backtracking to work our
way out of bad guesses.  No compiler  writer  in  his  right mind
would agree to that.

What's going  on  here  is  that  the  beauty and elegance of BNF
grammar  has  met  face  to  face with the realities of  compiler
technology.

To  deal  with  this situation, compiler writers have had to make
compromises  so  that  a  single  parser can handle  the  grammar
without backtracking.


FIXING THE GRAMMAR

The  problem  that  we've  encountered  comes   up   because  our
definitions of both arithmetic and Boolean factors permit the use
of   parenthesized  expressions.    Since  the  definitions   are
recursive,  we  can  end  up  with  any  number   of   levels  of
parentheses, and the  parser  can't know which kind of expression
it's dealing with.

The  solution is simple, although it  ends  up  causing  profound
changes to our  grammar.    We  can only allow parentheses in one
kind  of factor.  The way to do  that  varies  considerably  from
language  to  language.  This is one  place  where  there  is  NO
agreement or convention to help us.

When Niklaus Wirth designed Pascal, the desire was  to  limit the
number of levels of precedence (fewer parse routines, after all).
So the OR  and  exclusive  OR  operators are treated just like an
Addop  and  processed   at   the  level  of  a  math  expression.
Similarly, the AND is  treated  like  a  Mulop and processed with
Term.  The precedence levels are


          Level   Syntax Element     Operator

          0       factor             literal, variable
          1       signed factor      unary minus, NOT
          2       term               *, /, AND
          3       expression         +, -, OR


Notice that there is only ONE set of syntax  rules,  applying  to
both  kinds  of  operators.    According to this  grammar,  then,
expressions like

     x + (y AND NOT z) DIV 3

are perfectly legal.  And, in  fact,  they  ARE ... as far as the
parser  is  concerned.    Pascal  doesn't  allow  the  mixing  of
arithmetic and Boolean variables, and things like this are caught
at the SEMANTIC level, when it comes time to  generate  code  for
them, rather than at the syntax level.

The authors of C took  a  diametrically  opposite  approach: they
treat the operators as  different,  and  have something much more
akin  to our seven levels of precedence.  In fact, in C there are
no fewer than 17 levels!  That's because C also has the operators
'=', '+=' and its kin, '<<', '>>', '++', '--', etc.   Ironically,
although in C the  arithmetic  and  Boolean operators are treated
separately, the variables are  NOT  ...  there  are no Boolean or
logical variables in  C,  so  a  Boolean  test can be made on any
integer value.

We'll do something that's  sort  of  in-between.   I'm tempted to
stick  mostly  with  the Pascal approach, since  that  seems  the
simplest from an implementation point  of view, but it results in
some funnies that I never liked very much, such as the fact that,
in the expression

     IF (c >= 'A') and (c <= 'Z') then ...

the  parens  above  are REQUIRED.  I never understood why before,
and  neither my compiler nor any human  ever  explained  it  very
well, either.  But now, we  can  all see that the 'and' operator,
having the precedence of a multiply, has a higher  one  than  the
relational operators, so without  the  parens  the  expression is
equivalent to

     IF c >= ('A' and c) <= 'Z' then

which doesn't make sense.

In  any  case,  I've  elected  to  separate  the  operators  into
different levels, although not as many as in C.


 <b-expression> ::= <b-term> [<orop> <b-term>]*
 <b-term>       ::= <not-factor> [AND <not-factor>]*
 <not-factor>   ::= [NOT] <b-factor>
 <b-factor>     ::= <b-literal> | <b-variable> | <relation>
 <relation>     ::= | <expression> [<relop> <expression]
 <expression>   ::= <term> [<addop> <term>]*
 <term>         ::= <signed factor> [<mulop> factor]*
 <signed factor>::= [<addop>] <factor>
 <factor>       ::= <integer> | <variable> | (<b-expression>)


This grammar  results  in  the  same  set  of seven levels that I
showed earlier.  Really, it's almost the same grammar ...  I just
removed the option of parenthesized b-expressions  as  a possible
b-factor, and added the relation as a legal form of b-factor.

There is one subtle but crucial difference, which  is  what makes
the  whole  thing  work.    Notice  the  square brackets  in  the
definition  of a relation.  This means that  the  relop  and  the
second expression are OPTIONAL.

A strange consequence of this grammar (and one shared  by  C)  is
that EVERY expression  is  potentially a Boolean expression.  The
parser will always be looking  for a Boolean expression, but will
"settle" for an arithmetic one.  To be honest,  that's  going  to
slow down the parser, because it has to wade through  more layers
of procedure calls.  That's  one reason why Pascal compilers tend
to compile faster than C compilers.  If it's raw speed  you want,
stick with the Pascal syntax.


THE PARSER

Now that we've gotten through the decision-making process, we can
press on with development of a parser.  You've done this  with me
several times now, so you know  the  drill: we begin with a fresh
copy of the cradle, and begin  adding  procedures one by one.  So
let's do it.

We begin, as we did in the arithmetic case, by dealing  only with
Boolean literals rather than variables.  This gives us a new kind
of input token, so we're also going to need a new recognizer, and
a  new procedure to read instances of that  token  type.    Let's
start by defining the two new procedures:


{--------------------------------------------------------------}
{ Recognize a Boolean Literal }

function IsBoolean(c: char): Boolean;
begin
   IsBoolean := UpCase(c) in ['T', 'F'];
end;


{--------------------------------------------------------------}
{ Get a Boolean Literal }

function GetBoolean: Boolean;
var c: char;
begin
   if not IsBoolean(Look) then Expected('Boolean Literal');
   GetBoolean := UpCase(Look) = 'T';
   GetChar;
end;
{--------------------------------------------------------------}


Type  these routines into your program.  You  can  test  them  by
adding into the main program the print statement


   WriteLn(GetBoolean);




OK, compile the program and test it.   As  usual,  it's  not very
impressive so far, but it soon will be.

Now, when we were dealing with numeric data we had to  arrange to
generate code to load the values into D0.  We need to do the same
for Boolean data.   The  usual way to encode Boolean variables is
to let 0 stand for FALSE,  and  some  other value for TRUE.  Many
languages, such as C, use an  integer  1  to represent it.  But I
prefer FFFF hex  (or  -1),  because  a bitwise NOT also becomes a
Boolean  NOT.  So now we need to emit the right assembler code to
load  those  values.    The  first cut at the Boolean  expression
parser (BoolExpression, of course) is:


{---------------------------------------------------------------}
{ Parse and Translate a Boolean Expression }

procedure BoolExpression;
begin
   if not IsBoolean(Look) then Expected('Boolean Literal');
   if GetBoolean then
      EmitLn('MOVE #-1,D0')
   else
      EmitLn('CLR D0');
end;
{---------------------------------------------------------------}


Add  this procedure to your parser, and call  it  from  the  main
program (replacing the  print  statement you had just put there).
As you  can  see,  we  still don't have much of a parser, but the
output code is starting to look more realistic.

Next, of course, we have to expand the definition  of  a  Boolean
expression.  We already have the BNF rule:


 <b-expression> ::= <b-term> [<orop> <b-term>]*


I prefer the Pascal versions of the "orops",  OR  and  XOR.   But
since we are keeping to single-character tokens here, I'll encode
those with '|' and  '~'.  The  next  version of BoolExpression is
almost a direct copy of the arithmetic procedure Expression:


{--------------------------------------------------------------}
{ Recognize and Translate a Boolean OR }

procedure BoolOr;
begin
   Match('|');
   BoolTerm;
   EmitLn('OR (SP)+,D0');
end;




{--------------------------------------------------------------}
{ Recognize and Translate an Exclusive Or }

procedure BoolXor;
begin
   Match('~');
   BoolTerm;
   EmitLn('EOR (SP)+,D0');
end;


{---------------------------------------------------------------}
{ Parse and Translate a Boolean Expression }

procedure BoolExpression;
begin
   BoolTerm;
   while IsOrOp(Look) do begin
      EmitLn('MOVE D0,-(SP)');
      case Look of
       '|': BoolOr;
       '~': BoolXor;
      end;
   end;
end;
{---------------------------------------------------------------}


Note the new recognizer  IsOrOp,  which is also a copy, this time
of IsAddOp:


{--------------------------------------------------------------}
{ Recognize a Boolean Orop }

function IsOrop(c: char): Boolean;
begin
   IsOrop := c in ['|', '~'];
end;
{--------------------------------------------------------------}

OK, rename the old  version  of  BoolExpression to BoolTerm, then
enter  the  code  above.  Compile and test this version.  At this
point, the  output  code  is  starting  to  look pretty good.  Of
course, it doesn't make much sense to do a lot of Boolean algebra
on  constant values, but we'll soon be  expanding  the  types  of
Booleans we deal with.

You've  probably  already  guessed  what  the next step  is:  The
Boolean version of Term.

Rename the current procedure BoolTerm to NotFactor, and enter the
following new version of BoolTerm.  Note that is is  much simpler
than  the  numeric  version,  since  there  is  no equivalent  of
division.




{---------------------------------------------------------------}
{ Parse and Translate a Boolean Term }

procedure BoolTerm;
begin
   NotFactor;
   while Look = '&' do begin
      EmitLn('MOVE D0,-(SP)');
      Match('&');
      NotFactor;
      EmitLn('AND (SP)+,D0');
   end;
end;
{--------------------------------------------------------------}


Now,  we're  almost  home.  We are  translating  complex  Boolean
expressions, although only for constant values.  The next step is
to allow for the NOT.  Write the following procedure:


{--------------------------------------------------------------}
{ Parse and Translate a Boolean Factor with NOT }

procedure NotFactor;
begin
   if Look = '!' then begin
      Match('!');
      BoolFactor;
      EmitLn('EOR #-1,D0');
      end
   else
      BoolFactor;
end;
{--------------------------------------------------------------}

And  rename  the  earlier procedure to BoolFactor.  Now try that.
At this point  the  parser  should  be able to handle any Boolean
expression you care to throw at it.  Does it?  Does it trap badly
formed expressions?

If you've  been  following  what  we  did  in the parser for math
expressions, you know  that  what  we  did next was to expand the
definition of a factor to include variables and parens.  We don't
have  to do that for the Boolean  factor,  because  those  little
items get taken care of by the next step.  It  takes  just  a one
line addition to BoolFactor to take care of relations:


{--------------------------------------------------------------}
{ Parse and Translate a Boolean Factor }

procedure BoolFactor;
begin
   if IsBoolean(Look) then
      if GetBoolean then
         EmitLn('MOVE #-1,D0')
      else
         EmitLn('CLR D0')
      else Relation;
end;
{--------------------------------------------------------------}


You  might be wondering when I'm going  to  provide  for  Boolean
variables and parenthesized Boolean expressions.  The  answer is,
I'm NOT!   Remember,  we  took  those out of the grammar earlier.
Right now all I'm  doing  is  encoding  the grammar we've already
agreed  upon.    The compiler itself can't  tell  the  difference
between a Boolean variable  or  expression  and an arithmetic one
... all of those will be handled by Relation, either way.


Of course, it would help to have some code for Relation.  I don't
feel comfortable, though,  adding  any  more  code  without first
checking out what we already have.  So for now let's just write a
dummy  version  of  Relation  that  does nothing except  eat  the
current character, and write a little message:


{---------------------------------------------------------------}
{ Parse and Translate a Relation }

procedure Relation;
begin
   WriteLn('<Relation>');
   GetChar;
end;
{--------------------------------------------------------------}

OK, key  in  this  code  and  give  it a try.  All the old things
should still work ... you should be able to generate the code for
ANDs, ORs, and  NOTs.    In  addition, if you type any alphabetic
character you should get a little <Relation>  place-holder, where
a  Boolean factor should be.  Did you get that?  Fine, then let's
move on to the full-blown version of Relation.

To  get  that,  though, there is a bit of groundwork that we must
lay first.  Recall that a relation has the form


 <relation>     ::= | <expression> [<relop> <expression]


Since  we have a new kind of operator, we're also going to need a
new Boolean function to  recognize  it.    That function is shown
below.  Because of the single-character limitation,  I'm sticking
to the four operators  that  can be encoded with such a character
(the "not equals" is encoded by '#').


{--------------------------------------------------------------}
{ Recognize a Relop }

function IsRelop(c: char): Boolean;
begin
   IsRelop := c in ['=', '#', '<', '>'];
end;
{--------------------------------------------------------------}


Now, recall  that  we're  using  a zero or a -1 in register D0 to
represent  a Boolean value, and also  that  the  loop  constructs
expect the flags to be set to correspond.   In  implementing  all
this on the 68000, things get a a little bit tricky.

Since the loop constructs operate only on the flags, it  would be
nice (and also quite  efficient)  just to set up those flags, and


not load  anything  into  D0  at all.  This would be fine for the
loops  and  branches,  but remember that the relation can be used
ANYWHERE a Boolean factor could be  used.   We may be storing its
result to a Boolean variable.  Since we can't know at  this point
how the result is going to be used, we must allow for BOTH cases.

Comparing numeric data  is  easy  enough  ...  the  68000  has an
operation  for  that ... but it sets  the  flags,  not  a  value.
What's more,  the  flags  will  always  be  set the same (zero if
equal, etc.), while we need the zero flag set differently for the
each of the different relops.

The solution is found in the 68000 instruction Scc, which  sets a
byte value to 0000 or FFFF (funny how that works!) depending upon
the  result  of  the  specified   condition.    If  we  make  the
destination byte to be D0, we get the Boolean value needed.

Unfortunately,  there's one  final  complication:  unlike  almost
every other instruction in the 68000 set, Scc does NOT  reset the
condition flags to match the data being stored.  So we have to do
one last step, which is to test D0 and set the flags to match it.
It must seem to be a trip around the moon to get what we want: we
first perform the test, then test the flags to set data  into D0,
then test D0 to set the flags again.  It  is  sort of roundabout,
but it's the most straightforward way to get the flags right, and
after all it's only a couple of instructions.

I  might  mention  here that this area is, in my opinion, the one
that represents the biggest difference between the  efficiency of
hand-coded assembler language and  compiler-generated  code.   We
have  seen  already  that  we  lose   efficiency   in  arithmetic
operations, although later I plan to show you how to improve that
a  bit.    We've also seen that the control constructs themselves
can be done quite efficiently  ... it's usually very difficult to
improve  on  the  code generated for an  IF  or  a  WHILE.    But
virtually every compiler I've ever seen generates  terrible code,
compared to assembler, for the computation of a Boolean function,
and particularly for relations.    The  reason  is just what I've
hinted at above.  When I'm writing code in assembler, I  go ahead
and perform the test the most convenient way I can, and  then set
up the branch so that it goes the way it should.    In  effect, I
"tailor"  every  branch  to the situation.  The compiler can't do
that (practically), and it also can't know that we don't  want to
store the result of the test as a Boolean variable.    So it must
generate  the  code  in a very strict order, and it often ends up
loading  the  result  as  a  Boolean  that  never gets  used  for
anything.

In  any  case,  we're now ready to look at the code for Relation.
It's shown below with its companion procedures:


{---------------------------------------------------------------}
{ Recognize and Translate a Relational "Equals" }

procedure Equals;
begin
   Match('=');
   Expression;
   EmitLn('CMP (SP)+,D0');
   EmitLn('SEQ D0');
end;




{---------------------------------------------------------------}
{ Recognize and Translate a Relational "Not Equals" }

procedure NotEquals;
begin
   Match('#');
   Expression;
   EmitLn('CMP (SP)+,D0');
   EmitLn('SNE D0');
end;


{---------------------------------------------------------------}
{ Recognize and Translate a Relational "Less Than" }

procedure Less;
begin
   Match('<');
   Expression;
   EmitLn('CMP (SP)+,D0');
   EmitLn('SGE D0');
end;


{---------------------------------------------------------------}
{ Recognize and Translate a Relational "Greater Than" }

procedure Greater;
begin
   Match('>');
   Expression;
   EmitLn('CMP (SP)+,D0');
   EmitLn('SLE D0');
end;


{---------------------------------------------------------------}
{ Parse and Translate a Relation }

procedure Relation;
begin
   Expression;
   if IsRelop(Look) then begin
      EmitLn('MOVE D0,-(SP)');
      case Look of
       '=': Equals;
       '#': NotEquals;
       '<': Less;
       '>': Greater;
      end;
   EmitLn('TST D0');
   end;
end;
{---------------------------------------------------------------}

Now, that call to  Expression  looks familiar!  Here is where the
editor of your system comes in handy.  We have  already generated
code  for  Expression  and its buddies in previous sessions.  You
can  copy  them  into your file now.  Remember to use the single-
character  versions.  Just to be  certain,  I've  duplicated  the
arithmetic procedures below.  If  you're  observant,  you'll also
see that I've changed them a little to make  them  correspond  to
the latest version of the syntax.  This change is  NOT necessary,
so  you  may  prefer  to  hold  off  on  that  until you're  sure


everything is working.


{---------------------------------------------------------------}
{ Parse and Translate an Identifier }

procedure Ident;
var Name: char;
begin
   Name:= GetName;
   if Look = '(' then begin
      Match('(');
      Match(')');
      EmitLn('BSR ' + Name);
      end
   else
      EmitLn('MOVE ' + Name + '(PC),D0');
end;


{---------------------------------------------------------------}
{ Parse and Translate a Math Factor }

procedure Expression; Forward;

procedure Factor;
begin
   if Look = '(' then begin
      Match('(');
      Expression;
      Match(')');
      end
   else if IsAlpha(Look) then
      Ident
   else
      EmitLn('MOVE #' + GetNum + ',D0');
end;


{---------------------------------------------------------------}
{ Parse and Translate the First Math Factor }


procedure SignedFactor;
begin
   if Look = '+' then
      GetChar;
   if Look = '-' then begin
      GetChar;
      if IsDigit(Look) then
         EmitLn('MOVE #-' + GetNum + ',D0')
      else begin
         Factor;
         EmitLn('NEG D0');
      end;
   end
   else Factor;
end;




{--------------------------------------------------------------}
{ Recognize and Translate a Multiply }

procedure Multiply;
begin
   Match('*');
   Factor;
   EmitLn('MULS (SP)+,D0');
end;


{-------------------------------------------------------------}
{ Recognize and Translate a Divide }

procedure Divide;
begin
   Match('/');
   Factor;
   EmitLn('MOVE (SP)+,D1');
   EmitLn('EXS.L D0');
   EmitLn('DIVS D1,D0');
end;


{---------------------------------------------------------------}
{ Parse and Translate a Math Term }

procedure Term;
begin
   SignedFactor;
   while Look in ['*', '/'] do begin
      EmitLn('MOVE D0,-(SP)');
      case Look of
       '*': Multiply;
       '/': Divide;
      end;
   end;
end;


{---------------------------------------------------------------}
{ Recognize and Translate an Add }

procedure Add;
begin
   Match('+');
   Term;
   EmitLn('ADD (SP)+,D0');
end;


{---------------------------------------------------------------}
{ Recognize and Translate a Subtract }

procedure Subtract;
begin
   Match('-');
   Term;
   EmitLn('SUB (SP)+,D0');
   EmitLn('NEG D0');
end;




{---------------------------------------------------------------}
{ Parse and Translate an Expression }

procedure Expression;
begin
   Term;
   while IsAddop(Look) do begin
      EmitLn('MOVE D0,-(SP)');
      case Look of
       '+': Add;
       '-': Subtract;
      end;
   end;
end;
{---------------------------------------------------------------}


There you have it ... a parser that can  handle  both  arithmetic
AND Boolean algebra, and things  that combine the two through the
use of relops.   I suggest you file away a copy of this parser in
a safe place for future reference, because in our next step we're
going to be chopping it up.


MERGING WITH CONTROL CONSTRUCTS

At this point, let's go back to the file we had  previously built
that parses control  constructs.    Remember  those  little dummy
procedures called Condition and  Expression?    Now you know what
goes in their places!

I  warn you, you're going to have to  do  some  creative  editing
here, so take your time and get it right.  What you need to do is
to copy all of  the  procedures from the logic parser, from Ident
through  BoolExpression, into the parser for control  constructs.
Insert  them  at  the current location of Condition.  Then delete
that  procedure,  as  well as the dummy Expression.  Next, change
every call  to  Condition  to  refer  to  BoolExpression instead.
Finally, copy the procedures IsMulop, IsOrOp, IsRelop, IsBoolean,
and GetBoolean into place.  That should do it.

Compile  the  resulting program and give it  a  try.    Since  we
haven't  used  this  program in awhile, don't forget that we used
single-character tokens for IF,  WHILE,  etc.   Also don't forget
that any letter not a keyword just gets echoed as a block.

Try

     ia=bxlye

which stands for "IF a=b X ELSE Y ENDIF".

What do you think?  Did it work?  Try some others.


ADDING ASSIGNMENTS

As long as we're this far,  and  we already have the routines for
expressions in place, we might  as well replace the "blocks" with
real assignment statements.    We've already done that before, so
it won't be too hard.   Before  taking that step, though, we need
to fix something else.



We're soon going to find  that the one-line "programs" that we're
having to write here will really cramp our style.  At  the moment
we  have  no  cure for that, because our parser doesn't recognize
the end-of-line characters, the carriage return (CR) and the line
feed (LF).  So before going any further let's plug that hole.

There are  a  couple  of  ways to deal with the CR/LFs.  One (the
C/Unix approach) is just to  treat them as additional white space
characters  and  ignore  them.    That's actually not such a  bad
approach,  but  it  does  sort  of produce funny results for  our
parser as  it  stands  now.   If it were reading its input from a
source file as  any  self-respecting  REAL  compiler  does, there
would be no problem.  But we're reading input from  the keyboard,
and we're sort of conditioned  to expect something to happen when
we hit the return key.  It won't, if we just skip over the CR and
LF  (try it).  So I'm going to use a different method here, which
is NOT necessarily the  best  approach in the long run.  Consider
it a temporary kludge until we're further along.

Instead of skipping the CR/LF,  We'll let the parser go ahead and
catch them, then  introduce  a  special  procedure,  analogous to
SkipWhite, that skips them only in specified "legal" spots.

Here's the procedure:


{--------------------------------------------------------------}
{ Skip a CRLF }

procedure Fin;
begin
   if Look = CR then GetChar;
   if Look = LF then GetChar;
end;

{--------------------------------------------------------------}


Now, add two calls to Fin in procedure Block, like this:


{--------------------------------------------------------------}
{ Recognize and Translate a Statement Block }

procedure Block(L: string);
begin
   while not(Look in ['e', 'l', 'u']) do begin
      Fin;
      case Look of
       'i': DoIf(L);
       'w': DoWhile;
       'p': DoLoop;
       'r': DoRepeat;
       'f': DoFor;
       'd': DoDo;
       'b': DoBreak(L);
       else Other;
      end;
      Fin;
 end;
end;
{--------------------------------------------------------------}



Now, you'll find that you  can use multiple-line "programs."  The
only restriction is that you can't separate an IF or  WHILE token
from its predicate.

Now we're ready to include  the  assignment  statements.   Simply
change  that  call  to  Other  in  procedure  Block  to a call to
Assignment, and add  the  following procedure, copied from one of
our  earlier  programs.     Note   that   Assignment   now  calls
BoolExpression, so that we can assign Boolean variables.


{--------------------------------------------------------------}
{ Parse and Translate an Assignment Statement }

procedure Assignment;
var Name: char;
begin
   Name := GetName;
   Match('=');
   BoolExpression;
   EmitLn('LEA ' + Name + '(PC),A0');
   EmitLn('MOVE D0,(A0)');
end;
{--------------------------------------------------------------}


With  that change, you should now be  able  to  write  reasonably
realistic-looking  programs,  subject  only  to our limitation on
single-character tokens.  My original intention was to get rid of
that limitation for you, too.  However, that's going to require a
fairly major change to what we've  done  so  far.  We need a true
lexical scanner, and that requires some structural changes.  They
are not BIG changes that require us to  throw  away  all  of what
we've done so far ... with care, it can be done with very minimal
changes, in fact.  But it does require that care.

This installment  has already gotten pretty long, and it contains
some pretty heavy stuff, so I've decided to leave that step until
next  time, when you've had a little more  time  to  digest  what
we've done and are ready to start fresh.

In the next installment, then,  we'll build a lexical scanner and
eliminate the single-character  barrier  once and for all.  We'll
also write our first complete  compiler, based on what we've done
in this session.  See you then.

































                     LET'S BUILD A COMPILER!

                                By

                     Jack W. Crenshaw, Ph.D.

                         7 November 1988


                    Part VII: LEXICAL SCANNING





INTRODUCTION

In the last installment, I left you with a  compiler  that  would
ALMOST  work,  except  that  we  were  still  limited to  single-
character tokens.  The purpose of  this  session is to get rid of
that restriction, once and for all.  This means that we must deal
with the concept of the lexical scanner.

Maybe I should mention why we  need  a lexical scanner at all ...
after all, we've been able to manage all right  without  one,  up
till now, even when we provided for multi-character tokens.

The ONLY reason, really, has to do with keywords.  It's a fact of
computer life that the syntax for a keyword has the same  form as
that  for  any  other identifier.  We can't tell until we get the
complete word whether or not it  IS  a keyword.  For example, the
variable IFILE and the keyword IF look just alike, until  you get
to the third character.  In the examples to date, we  were always
able to make  a  decision  based  upon the first character of the
token, but that's  no  longer possible when keywords are present.
We  need to know that a given string is a keyword BEFORE we begin
to process it.  And that's why we need a scanner.

In the last session, I also promised that  we  would  be  able to
provide for normal tokens  without  making  wholesale  changes to
what we have  already done.  I didn't lie ... we can, as you will
see later.  But every time I set out to install these elements of
the software into  the  parser  we  have already built, I had bad
feelings about it.  The whole thing felt entirely too much like a
band-aid.  I finally figured out what was causing the  problem: I
was installing lexical scanning software without first explaining
to you what scanning is all about, and what the alternatives are.
Up  till  now, I have studiously avoided  giving  you  a  lot  of
theory,  and  certainly  not  alternatives.    I  generally don't
respond well to the textbooks that give you twenty-five different
ways  to do something, but no clue as to which way best fits your
needs.  I've tried to avoid that pitfall by just showing  you ONE
method, that WORKS.

But  this is an important area.  While  the  lexical  scanner  is
hardly the most  exciting  part  of  a compiler, it often has the
most  profound  effect  on  the  general  "look  & feel"  of  the
language, since after all it's the  part  closest to the user.  I
have a particular structure in mind for the scanner  to  be  used
with  KISS.    It fits the look &  feel  that  I  want  for  that
language.  But it may not work at  all  for  the  language YOU'RE
cooking  up,  so  in this one case I feel that it's important for
you to know your options.

So I'm going to depart, again, from my  usual  format.    In this
session we'll be getting  much  deeper  than usual into the basic
theory of languages and  grammars.    I'll  also be talking about
areas OTHER than compilers in  which  lexical  scanning  plays an
important role.  Finally, I will show you  some  alternatives for
the structure of the lexical scanner.  Then, and only  then, will
we get back to our parser  from  the last installment.  Bear with
me ... I think you'll find it's worth the wait.    In fact, since
scanners have many applications  outside  of  compilers,  you may
well find this to be the most useful session for you.


LEXICAL SCANNING

Lexical scanning is the process of scanning the  stream  of input
characters and separating it  into  strings  called tokens.  Most
compiler  texts  start  here,  and  devote  several  chapters  to
discussing various ways to build scanners.  This approach has its
place, but as you have already  seen,  there  is a lot you can do
without ever even addressing the issue, and in  fact  the scanner
we'll  end  up with here won't look  much  like  what  the  texts
describe.  The reason?    Compiler  theory and, consequently, the
programs resulting from it, must  deal with the most general kind
of parsing rules.  We don't.  In the real  world,  it is possible
to specify the language syntax in such a way that a pretty simple
scanner will suffice.  And as always, KISS is our motto.

Typically, lexical scanning is  done  in  a  separate part of the
compiler, so that the parser per  se  sees only a stream of input
tokens.  Now, theoretically it  is not necessary to separate this
function from the rest of the parser.  There is  only  one set of
syntax equations that define the  whole language, so in theory we
could write the whole parser in one module.

Why  the  separation?      The  answer  has  both  practical  and
theoretical bases.

In  1956,  Noam  Chomsky  defined  the  "Chomsky   Hierarchy"  of
grammars.  They are:

     o Type 0:  Unrestricted (e.g., English)

     o Type 1:  Context-Sensitive

     o Type 2:  Context-Free

     o Type 3:  Regular

A few features of the typical programming  language (particularly
the older ones, such as FORTRAN) are Type  1,  but  for  the most
part  all  modern  languages can be described using only the last
two types, and those are all we'll be dealing with here.

The  neat  part about these two types  is  that  there  are  very
specific ways to parse them.  It has been shown that  any regular
grammar can be parsed using a particular form of abstract machine
called the state machine (finite  automaton).    We  have already
implemented state machines in some of our recognizers.

Similarly, Type 2 (context-free) grammars  can  always  be parsed
using  a  push-down  automaton (a state machine  augmented  by  a
stack).  We have  also  implemented  these  machines.  Instead of
implementing  a literal stack, we have  relied  on  the  built-in
stack associated with recursive coding to do the job, and that in
fact is the preferred approach for top-down parsing.

Now, it happens that in  real, practical grammars, the parts that
qualify as  regular expressions tend to be the lower-level parts,
such as the definition of an identifier:

     <ident> ::= <letter> [ <letter> | <digit> ]*

Since it takes a different kind of abstract machine to  parse the
two  types  of  grammars, it makes sense to separate these lower-
level functions into  a  separate  module,  the  lexical scanner,
which is built around the idea of a state machine. The idea is to
use the simplest parsing technique needed for the job.

There is another, more practical  reason  for  separating scanner
from  parser.   We like to think of the input source  file  as  a
stream  of characters, which we process  right  to  left  without
backtracking.  In practice that  isn't  possible.    Almost every
language has certain keywords such as  IF,  WHILE, and END.  As I
mentioned  earlier,    we  can't  really  know  whether  a  given
character string is a keyword, until we've reached the end of it,
as defined by a space or other delimiter.  So  in  that sense, we
MUST  save  the  string long enough to find out whether we have a
keyword or not.  That's a limited form of backtracking.

So the structure of a conventional compiler involves splitting up
the functions of  the  lower-level and higher-level parsing.  The
lexical  scanner  deals  with  things  at  the  character  level,
collecting characters into strings, etc., and passing  them along
to the parser proper as indivisible tokens.  It's also considered
normal to let the scanner have the job of identifying keywords.


STATE MACHINES AND ALTERNATIVES

I  mentioned  that  the regular expressions can be parsed using a
state machine.   In  most  compiler  texts,  and  indeed  in most
compilers as well, you will find this taken literally.   There is
typically  a  real  implementation  of  the  state  machine, with
integers used to define the current state, and a table of actions
to  take   for  each  combination  of  current  state  and  input
character.  If you  write  a compiler front end using the popular
Unix tools LEX and YACC, that's  what  you'll get.  The output of
LEX is a state machine implemented in C, plus a table  of actions
corresponding to the input grammar given to LEX.  The YACC output
is  similar  ...  a canned table-driven parser,  plus  the  table
corresponding to the language syntax.

That  is  not  the  only  choice,  though.     In   our  previous
installments, you have seen over and over that it is  possible to
implement  parsers  without  dealing  specifically  with  tables,
stacks, or state variables.    In fact, in Installment V I warned
you that if you  find  yourself needing these things you might be
doing something wrong, and not taking advantage of  the  power of
Pascal.  There are basically two ways to define a state machine's
state: explicitly, with  a  state number or code, and implicitly,
simply by virtue of the fact that I'm at a  certain  place in the
code  (if  it's  Tuesday,  this  must be Belgium).  We've  relied
heavily on the implicit approaches  before,  and  I  think you'll
find that they work well here, too.

In practice, it may not even be necessary to HAVE  a well-defined
lexical scanner.  This isn't our first experience at dealing with
multi-character tokens.   In  Installment  III,  we  extended our
parser to provide  for  them,  and  we didn't even NEED a lexical
scanner.    That  was  because  in that narrow context, we  could
always tell, just  by  looking at the single lookahead character,
whether  we  were  dealing  with  a  number,  a variable,  or  an
operator.  In effect, we  built  a  distributed  lexical scanner,
using procedures GetName and GetNum.

With keywords present,  we  can't know anymore what we're dealing
with, until the entire token is  read.    This leads us to a more
localized  scanner; although,  as you will see,  the  idea  of  a
distributed scanner still has its merits.


SOME EXPERIMENTS IN SCANNING

Before  getting  back  to our compiler,  it  will  be  useful  to
experiment a bit with the general concepts.

Let's  begin with the two definitions most  often  seen  in  real
programming languages:

     <ident> ::= <letter> [ <letter> | <digit> ]*
     <number ::= [<digit>]+

(Remember, the '*' indicates zero or more occurences of the terms
in brackets, and the '+', one or more.)

We  have already dealt with similar  items  in  Installment  III.
Let's begin (as usual) with a bare cradle.  Not  surprisingly, we
are going to need a new recognizer:
                              

{--------------------------------------------------------------}
{ Recognize an Alphanumeric Character }

function IsAlNum(c: char): boolean;
begin
   IsAlNum := IsAlpha(c) or IsDigit(c);
end;
{--------------------------------------------------------------}


Using this let's write the following two routines, which are very
similar to those we've used before:


{--------------------------------------------------------------}
{ Get an Identifier }

function GetName: string;
var x: string[8];
begin
   x := '';
   if not IsAlpha(Look) then Expected('Name');
   while IsAlNum(Look) do begin
     x := x + UpCase(Look);
     GetChar;
   end;
   GetName := x;
end;


{--------------------------------------------------------------}
{ Get a Number }

function GetNum: string;
var x: string[16];
begin
   x := '';
   if not IsDigit(Look) then Expected('Integer');
   while IsDigit(Look) do begin
     x := x + Look;
     GetChar;
   end;
   GetNum := x;
end;
{--------------------------------------------------------------}


(Notice  that this version of GetNum returns  a  string,  not  an
integer as before.)

You  can  easily  verify that these routines work by calling them
from the main program, as in

     WriteLn(GetName);

This  program  will  print any legal name typed in (maximum eight
characters, since that's what we told GetName).   It  will reject
anything else.

Test the other routine similarly.


WHITE SPACE

We  also  have  dealt with embedded white space before, using the
two  routines  IsWhite  and  SkipWhite.    Make  sure that  these
routines are in your  current  version of the cradle, and add the
the line

     SkipWhite;

at the end of both GetName and GetNum.

Now, let's define the new procedure:


{--------------------------------------------------------------}
{ Lexical Scanner }

Function Scan: string;
begin
   if IsAlpha(Look) then
      Scan := GetName
   else if IsDigit(Look) then
      Scan := GetNum
   else begin
      Scan := Look;
      GetChar;
   end;
   SkipWhite;
end;
{--------------------------------------------------------------}


We can call this from the new main program:


{--------------------------------------------------------------}
{ Main Program }


begin
   Init;
   repeat
      Token := Scan;
      writeln(Token);
   until Token = CR;
end.
{--------------------------------------------------------------}


(You will have to add the declaration of the string Token  at the
beginning of the program.  Make it any convenient length,  say 16
characters.)

Now,  run the program.  Note how the  input  string  is,  indeed,
separated into distinct tokens.


STATE MACHINES

For  the  record,  a  parse  routine  like  GetName  does  indeed
implement a state machine.  The state is implicit in  the current
position in the code.  A very useful trick for visualizing what's
going on is  the  syntax  diagram,  or  "railroad-track" diagram.
It's a little difficult to draw  one  in this medium, so I'll use
them very sparingly, but  the  figure  below  should give you the
idea:


           |-----> Other---------------------------> Error
           |
   Start -------> Letter ---------------> Other -----> Finish
           ^                        V
           |                        |
           |<----- Letter <---------|
           |                        |
           |<----- Digit  <----------


As  you  can  see,  this  diagram  shows  how  the logic flows as
characters  are  read.    Things  begin, of course, in the  start
state, and end when  a  character  other  than an alphanumeric is
found.  If  the  first  character  is not alpha, an error occurs.
Otherwise the machine will continue looping until the terminating
delimiter is found.

Note  that at any point in the flow,  our  position  is  entirely
dependent on the past  history  of the input characters.  At that
point, the action to be taken depends only on the  current state,
plus the current input character.  That's what make this  a state
machine.

Because of the difficulty of drawing  railroad-track  diagrams in
this medium, I'll continue to  stick to syntax equations from now
on.  But I highly recommend the diagrams to you for  anything you
do that involves parsing.  After a little practice you  can begin
to  see  how  to  write  a  parser  directly from  the  diagrams.
Parallel paths get coded into guarded actions (guarded by IF's or
CASE statements),  serial  paths  into  sequential  calls.   It's
almost like working from a schematic.

We didn't even discuss SkipWhite, which  was  introduced earlier,
but it also is a simple state machine, as is GetNum.  So is their
parent procedure, Scan.  Little machines make big machines.

The neat thing that I'd like  you  to note is how painlessly this
implicit approach creates these  state  machines.    I personally
prefer it a lot over the table-driven approach.  It  also results
is a small, tight, and fast scanner.


NEWLINES

Moving right along, let's modify  our scanner to handle more than
one line.  As I mentioned last time, the most straightforward way
to  do  this  is to simply treat the newline characters, carriage
return  and line feed, as white space.  This is, in fact, the way
the  C  standard  library  routine,  iswhite, works.   We  didn't
actually try this  before.  I'd like to do it now, so you can get
a feel for the results.

To do this, simply modify the single executable  line  of IsWhite
to read:


   IsWhite := c in [' ', TAB, CR, LF];


We need to give the main  program  a new stop condition, since it
will never see a CR.  Let's just use:


   until Token = '.';


OK, compile this  program  and  run  it.   Try a couple of lines,
terminated by the period.  I used:


     now is the time
     for all good men.

Hey,  what  happened?   When I tried it, I didn't  get  the  last
token, the period.  The program didn't halt.  What's more, when I
pressed the  'enter'  key  a  few  times,  I still didn't get the
period.

If you're still stuck in your program, you'll find that  typing a
period on a new line will terminate it.

What's going on here?  The answer is  that  we're  hanging  up in
SkipWhite.  A quick look at  that  routine will show that as long
as we're typing null lines, we're going to just continue to loop.
After SkipWhite encounters an LF,  it tries to execute a GetChar.
But since the input buffer is now empty, GetChar's read statement
insists  on  having  another  line.    Procedure  Scan  gets  the
terminating period, all right,  but  it  calls SkipWhite to clean
up, and SkipWhite won't return until it gets a non-null line.

This kind of behavior is not quite as bad as it seems.  In a real
compiler,  we'd  be  reading  from  an input file instead of  the
console, and as long  as  we have some procedure for dealing with
end-of-files, everything will come out  OK.  But for reading data
from the console, the behavior is just too bizarre.  The  fact of
the matter is that the C/Unix convention is  just  not compatible
with the structure of  our  parser,  which  calls for a lookahead
character.    The  code that the Bell  wizards  have  implemented
doesn't use that convention, which is why they need 'ungetc'.

OK, let's fix the problem.  To do that, we need to go back to the
old definition of IsWhite (delete the CR and  LF  characters) and
make  use  of  the procedure Fin that I introduced last time.  If
it's not in your current version of the cradle, put it there now.

Also, modify the main program to read:


{--------------------------------------------------------------}
{ Main Program }


begin
   Init;
   repeat
      Token := Scan;
      writeln(Token);
      if Token = CR then Fin;
   until Token = '.';
end.
{--------------------------------------------------------------}


Note the "guard"  test  preceding  the  call to Fin.  That's what
makes the whole thing work, and ensures that we don't try to read
a line ahead.
                             
Try the code now. I think you'll like it better.

If you refer to the code  we  did in the last installment, you'll
find that I quietly sprinkled calls to Fin  throughout  the code,
wherever  a line break was appropriate.  This  is  one  of  those
areas that really affects the look  &  feel that I mentioned.  At
this  point  I  would  urge  you  to  experiment  with  different
arrangements  and  see  how  you  like  them.    If you want your
language  to  be  truly  free-field,  then  newlines   should  be
transparent.   In  this  case,  the  best  approach is to put the
following lines at the BEGINNING of Scan:


          while Look = CR do
             Fin;


If, on the other  hand,  you  want  a line-oriented language like
Assembler, BASIC, or FORTRAN  (or  even  Ada...  note that it has
comments terminated by newlines),  then  you'll  need for Scan to
return CR's as tokens.  It  must  also  eat the trailing LF.  The
best way to do that is to use this line,  again  at the beginning
of Scan:

          if Look = LF then Fin;


For other conventions, you'll  have  to  use  other arrangements.
In my example  of  the  last  session, I allowed newlines only at
specific places, so I was somewhere in the middle ground.  In the
rest of these sessions, I'll be picking ways  to  handle newlines
that I happen to like, but I want you to know how to choose other
ways for yourselves.


OPERATORS

We  could  stop now and have a  pretty  useful  scanner  for  our
purposes.  In the fragments of KISS that we've built so  far, the
only tokens that have multiple characters are the identifiers and
numbers.    All  operators  were  single  characters.   The  only
exception I can think of is the relops <=, >=,  and  <>, but they
could be dealt with as special cases.

Still, other languages have  multi-character  operators,  such as
the ':=' of  Pascal or the '++' and '>>' of C.  So  while  we may
not need multi-character operators, it's  nice to know how to get
them if necessary.

Needless to say, we  can  handle operators very much the same way
as the other tokens.  Let's start with a recognizer:
                             

{--------------------------------------------------------------}
{ Recognize Any Operator }

function IsOp(c: char): boolean;
begin
   IsOp := c in ['+', '-', '*', '/', '<', '>', ':', '='];
end;
{--------------------------------------------------------------}


It's important to  note  that  we  DON'T  have  to  include every
possible  operator in this list.   For  example,  the  paretheses
aren't  included, nor is the terminating  period.    The  current
version of Scan handles single-character operators  just  fine as
it is.  The list above includes only those  characters  that  can
appear in multi-character operators.  (For specific languages, of
course, the list can always be edited.)

Now, let's modify Scan to read:


{--------------------------------------------------------------}
{ Lexical Scanner }

Function Scan: string;
begin
   while Look = CR do
      Fin;
   if IsAlpha(Look) then
      Scan := GetName
   else if IsDigit(Look) then
      Scan := GetNum
   else if IsOp(Look) then
      Scan := GetOp
   else begin
      Scan := Look;
      GetChar;
   end;
   SkipWhite;
end;
{--------------------------------------------------------------}


Try the program now.  You  will  find that any code fragments you
care  to throw at it will be neatly  broken  up  into  individual
tokens.


LISTS, COMMAS AND COMMAND LINES

Before getting back to the main thrust of our study, I'd  like to
get on my soapbox for a moment.
                             
How many times have you worked with a program or operating system
that had rigid rules about how you must separate items in a list?
(Try,  the  last  time  you  used MSDOS!)  Some programs  require
spaces as delimiters, and  some  require  commas.   Worst of all,
some  require  both,  in  different  places.    Most  are  pretty
unforgiving about violations of their rules.

I think this is inexcusable.  It's too  easy  to  write  a parser
that will handle  both  spaces  and  commas  in  a  flexible way.
Consider the following procedure:


{--------------------------------------------------------------}
{ Skip Over a Comma }

procedure SkipComma;
begin
   SkipWhite;
   if Look = ',' then begin
      GetChar;
      SkipWhite;
   end;
end;
{--------------------------------------------------------------}


This eight-line procedure will skip over  a  delimiter consisting
of any number (including zero)  of spaces, with zero or one comma
embedded in the string.

TEMPORARILY, change the call to SkipWhite in Scan to  a  call  to
SkipComma,  and  try  inputting some lists.   Works  nicely,  eh?
Don't you wish more software authors knew about SkipComma?

For the record, I found that adding the  equivalent  of SkipComma
to my Z80 assembler-language programs took all of  6  (six) extra
bytes of  code.    Even  in a 64K machine, that's not a very high
price to pay for user-friendliness!

I  think  you can see where I'm going here.  Even  if  you  never
write a line of a compiler code in your life, there are places in
every program where  you  can  use  the concepts of parsing.  Any
program that processes a command line needs them.   In  fact,  if
you  think  about  it for a bit, you'll have to conclude that any
time  you  write  a program that processes  user  inputs,  you're
defining a  language.  People communicate with languages, and the
syntax implicit in your program  defines that language.  The real
question  is:  are  you  going  to  define  it  deliberately  and
explicitly, or just let it turn out to be  whatever  the  program
ends up parsing?

I claim that you'll have  a better, more user-friendly program if
you'll take the time to define the syntax explicitly.  Write down
the syntax equations or  draw  the  railroad-track  diagrams, and
code the parser using the techniques I've shown you here.  You'll
end  up with a better program, and it will be easier to write, to
boot.


GETTING FANCY

OK, at this point we have a pretty nice lexical scanner that will
break  an  input stream up into tokens.  We could use  it  as  it
stands and have a servicable compiler.  But there are  some other
aspects of lexical scanning that we need to cover.

The main consideration is <shudder> efficiency.  Remember when we
were dealing  with  single-character  tokens,  every  test  was a
comparison of a single character, Look, with a byte constant.  We
also used the Case statement heavily.

With the multi-character tokens being returned by Scan, all those
tests now become string comparisons.  Much slower.  And  not only
slower, but more awkward, since  there is no string equivalent of
the  Case  statement  in Pascal.  It seems especially wasteful to
test for what used to be single characters ... the '=',  '+', and
other operators ... using string comparisons.

Using string comparison is not  impossible ... Ron Cain used just
that approach in writing Small C.  Since we're  sticking  to  the
KISS principle here, we would  be truly justified in settling for
this  approach.    But then I would have failed to tell you about
one of the key approaches used in "real" compilers.

You have to remember: the lexical scanner is going to be called a
_LOT_!   Once for every token in the  whole  source  program,  in
fact.   Experiments  have  indicated  that  the  average compiler
spends  anywhere  from 20% to 40% of  its  time  in  the  scanner
routines.  If there were ever a place  where  efficiency deserves
real consideration, this is it.

For this reason, most compiler writers ask the lexical scanner to
do  a  little  more work, by "tokenizing"  the input stream.  The
idea  is  to  match every token  against  a  list  of  acceptable
keywords  and operators, and return unique  codes  for  each  one
recognized.  In the case of ordinary variable  names  or numbers,
we  just return a code that says what kind of token they are, and
save the actual string somewhere else.

One  of the first things we're going to need is a way to identify
keywords.  We can always do  it  with successive IF tests, but it
surely would be nice  if  we  had  a general-purpose routine that
could compare a given string with  a  table of keywords.  (By the
way, we're also going  to  need such a routine later, for dealing
with symbol tables.)  This  usually presents a problem in Pascal,
because standard Pascal  doesn't  allow  for  arrays  of variable
lengths.   It's  a  real  bother  to  have to declare a different
search routine for every table.    Standard  Pascal  also doesn't
allow for initializing arrays, so you tend to see code like

     Table[1] := 'IF';
     Table[2] := 'ELSE';
     .
     .
     Table[n] := 'END';

which can get pretty old if there are many keywords.

Fortunately, Turbo Pascal 4.0 has extensions that  eliminate both
of  these  problems.   Constant arrays can be declared using TP's
"typed constant" facility, and  the  variable  dimensions  can be
handled with its C-like extensions for pointers.

First, modify your declarations like this:


{--------------------------------------------------------------}
{ Type Declarations  }

type Symbol = string[8];

     SymTab = array[1..1000] of Symbol;

     TabPtr = ^SymTab;
{--------------------------------------------------------------}


(The dimension  used  in  SymTab  is  not  real ... no storage is
allocated by the declaration itself,  and the number need only be
"big enough.")

Now, just beneath those declarations, add the following:


{--------------------------------------------------------------}
{ Definition of Keywords and Token Types }

const KWlist: array [1..4] of Symbol =
              ('IF', 'ELSE', 'ENDIF', 'END');

{--------------------------------------------------------------}


Next, insert the following new function:


{--------------------------------------------------------------}
{ Table Lookup }

{ If the input string matches a table entry, return the entry
  index.  If not, return a zero.  }
                             
function Lookup(T: TabPtr; s: string; n: integer): integer;
var i: integer;
    found: boolean;
begin
   found := false;
   i := n;
   while (i > 0) and not found do
      if s = T^[i] then
         found := true
      else
         dec(i);
   Lookup := i;
end;
{--------------------------------------------------------------}


To test it,  you  can  temporarily  change  the  main  program as
follows:


{--------------------------------------------------------------}
{ Main Program }


begin
   ReadLn(Token);
   WriteLn(Lookup(Addr(KWList), Token, 4));
end.
{--------------------------------------------------------------}


Notice how Lookup is called: The Addr function sets up  a pointer
to KWList, which gets passed to Lookup.

OK, give this  a  try.    Since we're bypassing Scan here, you'll
have to type the keywords in upper case to get any matches.

Now that we can recognize keywords, the next thing is  to arrange
to return codes for them.

So what kind of code should we return?  There are really only two
reasonable choices.  This seems like an ideal application for the
Pascal enumerated type.   For  example,  you can define something
like

     SymType = (IfSym, ElseSym, EndifSym, EndSym, Ident, Number,
                    Operator);

and arrange to return a variable of this type.   Let's  give it a
try.  Insert the line above into your type definitions.

Now, add the two variable declarations:
                             

    Token: Symtype;          { Current Token  }
    Value: String[16];       { String Token of Look }


Modify the scanner to read:


{--------------------------------------------------------------}
{ Lexical Scanner }

procedure Scan;
var k: integer;
begin
   while Look = CR do
      Fin;
   if IsAlpha(Look) then begin
      Value := GetName;
      k := Lookup(Addr(KWlist), Value, 4);
      if k = 0 then
         Token := Ident
      else
         Token := SymType(k - 1);
      end
   else if IsDigit(Look) then begin
      Value := GetNum;
      Token := Number;
      end
   else if IsOp(Look) then begin
      Value := GetOp;
      Token := Operator;
      end
   else begin
      Value := Look;
      Token := Operator;
      GetChar;
   end;
   SkipWhite;
end;
{--------------------------------------------------------------}


(Notice that Scan is now a procedure, not a function.)


Finally, modify the main program to read:


{--------------------------------------------------------------}
{ Main Program }

begin
   Init;
   repeat
      Scan;
      case Token of
        Ident: write('Ident ');
        Number: Write('Number ');
        Operator: Write('Operator ');
        IfSym, ElseSym, EndifSym, EndSym: Write('Keyword ');
      end;
      Writeln(Value);
   until Token = EndSym;
end.
{--------------------------------------------------------------}


What we've done here is to replace the string Token  used earlier
with an enumerated type. Scan returns the type in variable Token,
and returns the string itself in the new variable Value.

OK, compile this and give it a whirl.  If everything  goes right,
you should see that we are now recognizing keywords.

What  we  have  now is working right, and it was easy to generate
from what  we  had  earlier.    However,  it still seems a little
"busy" to me.  We can  simplify  things a bit by letting GetName,
GetNum, GetOp, and Scan be  procedures  working  with  the global
variables Token and Value, thereby eliminating the  local copies.
It  also seems a little cleaner to move  the  table  lookup  into
GetName.  The new form for the four procedures is, then:


{--------------------------------------------------------------}
{ Get an Identifier }

procedure GetName;
var k: integer;
begin
   Value := '';
   if not IsAlpha(Look) then Expected('Name');
   while IsAlNum(Look) do begin
     Value := Value + UpCase(Look);
     GetChar;
   end;
   k := Lookup(Addr(KWlist), Value, 4);
   if k = 0 then
      Token := Ident
   else
      Token := SymType(k-1);
end;
                             
{--------------------------------------------------------------}
{ Get a Number }

procedure GetNum;
begin
   Value := '';
   if not IsDigit(Look) then Expected('Integer');
   while IsDigit(Look) do begin
     Value := Value + Look;
     GetChar;
   end;
   Token := Number;
end;


{--------------------------------------------------------------}
{ Get an Operator }

procedure GetOp;
begin
   Value := '';
   if not IsOp(Look) then Expected('Operator');
   while IsOp(Look) do begin
     Value := Value + Look;
     GetChar;
   end;
   Token := Operator;
end;


{--------------------------------------------------------------}
{ Lexical Scanner }

procedure Scan;
var k: integer;
begin
   while Look = CR do
      Fin;
   if IsAlpha(Look) then
      GetName
   else if IsDigit(Look) then
      GetNum
   else if IsOp(Look) then
      GetOp
   else begin
      Value := Look;
      Token := Operator;
      GetChar;
   end;
   SkipWhite;
end;
{--------------------------------------------------------------}
                             

RETURNING A CHARACTER

Essentially  every scanner I've ever seen  that  was  written  in
Pascal  used  the  mechanism of an enumerated type that I've just
described.  It is certainly  a workable mechanism, but it doesn't
seem the simplest approach to me.

For one thing, the  list  of possible symbol types can get pretty
long. Here, I've used just one symbol, "Operator,"  to  stand for
all of the operators, but I've seen other  designs  that actually
return different codes for each one.

There is, of course, another simple type that can be  returned as
a  code: the character.  Instead  of  returning  the  enumeration
value 'Operator' for a '+' sign, what's wrong with just returning
the character itself?  A character is just as good a variable for
encoding the different  token  types,  it  can  be  used  in case
statements  easily, and it's sure a lot easier  to  type.    What
could be simpler?

Besides, we've already  had  experience with the idea of encoding
keywords as single characters.  Our previous programs are already
written  that  way,  so  using  this approach will  minimize  the
changes to what we've already done.

Some of you may feel that this idea of returning  character codes
is too mickey-mouse.  I must  admit  it gets a little awkward for
multi-character operators like '<='.   If you choose to stay with
the  enumerated  type,  fine.  For the rest, I'd like to show you
how to change what we've done above to support that approach.

First, you can delete the SymType declaration now ... we won't be
needing that.  And you can change the type of Token to char.

Next, to replace SymType, add the following constant string:


   const KWcode: string[5] = 'xilee';


(I'll be encoding all idents with the single character 'x'.)


Lastly, modify Scan and its relatives as follows:


{--------------------------------------------------------------}
{ Get an Identifier }

procedure GetName;
begin
   Value := '';
   if not IsAlpha(Look) then Expected('Name');
   while IsAlNum(Look) do begin
     Value := Value + UpCase(Look);
     GetChar;
   end;
   Token := KWcode[Lookup(Addr(KWlist), Value, 4) + 1];
end;


{--------------------------------------------------------------}
{ Get a Number }

procedure GetNum;
begin
   Value := '';
   if not IsDigit(Look) then Expected('Integer');
   while IsDigit(Look) do begin
     Value := Value + Look;
     GetChar;
   end;
   Token := '#';
end;


{--------------------------------------------------------------}
{ Get an Operator }

procedure GetOp;
begin
   Value := '';
   if not IsOp(Look) then Expected('Operator');
   while IsOp(Look) do begin
     Value := Value + Look;
     GetChar;
   end;
   if Length(Value) = 1 then
      Token := Value[1]
   else
      Token := '?';
end;


{--------------------------------------------------------------}
{ Lexical Scanner }

procedure Scan;
var k: integer;
begin
   while Look = CR do
      Fin;
   if IsAlpha(Look) then
      GetName
   else if IsDigit(Look) then
      GetNum
   else if IsOp(Look) then begin
      GetOp
   else begin
      Value := Look;
      Token := '?';
      GetChar;
   end;
   SkipWhite;
end;


{--------------------------------------------------------------}
{ Main Program }


begin
   Init;
   repeat
      Scan;
      case Token of
        'x': write('Ident ');
        '#': Write('Number ');
        'i', 'l', 'e': Write('Keyword ');
        else Write('Operator ');
      end;
      Writeln(Value);
   until Value = 'END';
end.
{--------------------------------------------------------------}


This program should  work  the  same  as the previous version.  A
minor  difference  in  structure,  maybe,  but   it   seems  more
straightforward to me.


DISTRIBUTED vs CENTRALIZED SCANNERS

The structure for the lexical scanner that I've just shown you is
very conventional, and  about  99% of all compilers use something
very  close  to it.  This is  not,  however,  the  only  possible
structure, or even always the best one.
                             
The problem with the  conventional  approach  is that the scanner
has no knowledge of context.  For example,  it  can't distinguish
between the assignment operator '=' and  the  relational operator
'=' (perhaps that's why both C and Pascal  use  different strings
for the  two).    All  the scanner can do is to pass the operator
along  to  the  parser, which can hopefully tell from the context
which operator is meant.    Similarly, a keyword like 'IF' has no
place in the middle of a  math  expression, but if one happens to
appear there, the scanner  will  see no problem with it, and will
return it to the parser, properly encoded as an 'IF'.

With this  kind  of  approach,  we  are  not really using all the
information at our disposal.  In the middle of an expression, for
example, the parser  "knows"  that  there  is no need to look for
keywords,  but it has no way of telling the scanner that.  So the
scanner  continues to do so.  This, of  course,  slows  down  the
compilation.

In real-world compilers, the  designers  often  arrange  for more
information  to be passed between parser  and  scanner,  just  to
avoid  this  kind of problem.  But  that  can  get  awkward,  and
certainly destroys a lot of the modularity of the structure.

The  alternative  is  to seek some  way  to  use  the  contextual
information that comes from knowing where we are  in  the parser.
This leads us  back  to  the  notion of a distributed scanner, in
which various portions  of  the scanner are called depending upon
the context.

In KISS, as  in  most  languages,  keywords  ONLY  appear  at the
beginning of a statement.  In places like  expressions,  they are
not allowed.  Also, with one minor exception (the multi-character
relops)  that  is  easily  handled,  all  operators   are  single
characters, which means that we don't need GetOp at all.

So it turns out  that  even  with  multi-character tokens, we can
still always tell from the  current  lookahead  character exactly
what kind of token is coming,  except  at the very beginning of a
statement.

Even at that point, the ONLY  kind  of  token we can accept is an
identifier.  We need only to determine if that  identifier  is  a
keyword or the target of an assignment statement.

We end up, then, still needing only GetName and GetNum, which are
used very much as we've used them in earlier installments.

It may seem  at first to you that this is a step backwards, and a
rather  primitive  approach.   In fact, it is an improvement over
the classical scanner, since we're  using  the  scanning routines
only where they're really needed.  In places  where  keywords are
not allowed, we don't slow things down by looking for them.


MERGING SCANNER AND PARSER

Now that we've covered  all  of the theory and general aspects of
lexical scanning that we'll be needing, I'm FINALLY ready to back
up my claim that  we  can  accomodate multi-character tokens with
minimal change to our previous work.  To keep  things  short  and
simple I will restrict myself here to a subset of what we've done
before; I'm allowing only one control construct (the  IF)  and no
Boolean expressions.  That's enough to demonstrate the parsing of
both keywords and expressions.  The extension to the full  set of
constructs should be  pretty  apparent  from  what  we've already
done.

All  the  elements  of  the  program to parse this subset,  using
single-character tokens, exist  already in our previous programs.
I built it by judicious copying of these files,  but  I  wouldn't
dare try to lead you through that process.  Instead, to avoid any
confusion, the whole program is shown below:


{--------------------------------------------------------------}
program KISS;

{--------------------------------------------------------------}
{ Constant Declarations }

const TAB = ^I;
      CR  = ^M;
      LF  = ^J;

{--------------------------------------------------------------}
{ Type Declarations  }

type Symbol = string[8];

     SymTab = array[1..1000] of Symbol;

     TabPtr = ^SymTab;


{--------------------------------------------------------------}
{ Variable Declarations }

var Look  : char;              { Lookahead Character }
    Lcount: integer;           { Label Counter       }


{--------------------------------------------------------------}
{ Read New Character From Input Stream }

procedure GetChar;
begin
   Read(Look);
end;
                             

{--------------------------------------------------------------}
{ Report an Error }

procedure Error(s: string);
begin
   WriteLn;
   WriteLn(^G, 'Error: ', s, '.');
end;


{--------------------------------------------------------------}
{ Report Error and Halt }

procedure Abort(s: string);
begin
   Error(s);
   Halt;
end;


{--------------------------------------------------------------}
{ Report What Was Expected }

procedure Expected(s: string);
begin
   Abort(s + ' Expected');
end;

{--------------------------------------------------------------}
{ Recognize an Alpha Character }

function IsAlpha(c: char): boolean;
begin
   IsAlpha := UpCase(c) in ['A'..'Z'];
end;


{--------------------------------------------------------------}
{ Recognize a Decimal Digit }

function IsDigit(c: char): boolean;
begin
   IsDigit := c in ['0'..'9'];
end;


{--------------------------------------------------------------}
{ Recognize an AlphaNumeric Character }

function IsAlNum(c: char): boolean;
begin
   IsAlNum := IsAlpha(c) or IsDigit(c);
end;
                             
{--------------------------------------------------------------}
{ Recognize an Addop }

function IsAddop(c: char): boolean;
begin
   IsAddop := c in ['+', '-'];
end;


{--------------------------------------------------------------}
{ Recognize a Mulop }

function IsMulop(c: char): boolean;
begin
   IsMulop := c in ['*', '/'];
end;


{--------------------------------------------------------------}
{ Recognize White Space }

function IsWhite(c: char): boolean;
begin
   IsWhite := c in [' ', TAB];
end;


{--------------------------------------------------------------}
{ Skip Over Leading White Space }

procedure SkipWhite;
begin
   while IsWhite(Look) do
      GetChar;
end;


{--------------------------------------------------------------}
{ Match a Specific Input Character }

procedure Match(x: char);
begin
   if Look <> x then Expected('''' + x + '''');
   GetChar;
   SkipWhite;
end;


{--------------------------------------------------------------}
{ Skip a CRLF }

procedure Fin;
begin
   if Look = CR then GetChar;
   if Look = LF then GetChar;
   SkipWhite;
end;


{--------------------------------------------------------------}
{ Get an Identifier }

function GetName: char;
begin
   while Look = CR do
      Fin;
   if not IsAlpha(Look) then Expected('Name');
   Getname := UpCase(Look);
   GetChar;
   SkipWhite;
end;


{--------------------------------------------------------------}
{ Get a Number }

function GetNum: char;
begin
   if not IsDigit(Look) then Expected('Integer');
   GetNum := Look;
   GetChar;
   SkipWhite;
end;


{--------------------------------------------------------------}
{ Generate a Unique Label }

function NewLabel: string;
var S: string;
begin
   Str(LCount, S);
   NewLabel := 'L' + S;
   Inc(LCount);
end;


{--------------------------------------------------------------}
{ Post a Label To Output }

procedure PostLabel(L: string);
begin
   WriteLn(L, ':');
end;


{--------------------------------------------------------------}
{ Output a String with Tab }
                             
procedure Emit(s: string);
begin
   Write(TAB, s);
end;


{--------------------------------------------------------------}

{ Output a String with Tab and CRLF }

procedure EmitLn(s: string);
begin
   Emit(s);
   WriteLn;
end;


{---------------------------------------------------------------}
{ Parse and Translate an Identifier }

procedure Ident;
var Name: char;
begin
   Name := GetName;
   if Look = '(' then begin
      Match('(');
      Match(')');
      EmitLn('BSR ' + Name);
      end
   else
      EmitLn('MOVE ' + Name + '(PC),D0');
end;


{---------------------------------------------------------------}
{ Parse and Translate a Math Factor }

procedure Expression; Forward;

procedure Factor;
begin
   if Look = '(' then begin
      Match('(');
      Expression;
      Match(')');
      end
   else if IsAlpha(Look) then
      Ident
   else
      EmitLn('MOVE #' + GetNum + ',D0');
end;


{---------------------------------------------------------------}
{ Parse and Translate the First Math Factor }


procedure SignedFactor;
var s: boolean;
begin
   s := Look = '-';
   if IsAddop(Look) then begin
      GetChar;
      SkipWhite;
   end;
   Factor;
   if s then
      EmitLn('NEG D0');
end;


{--------------------------------------------------------------}
{ Recognize and Translate a Multiply }

procedure Multiply;
begin
   Match('*');
   Factor;
   EmitLn('MULS (SP)+,D0');
end;


{-------------------------------------------------------------}
{ Recognize and Translate a Divide }

procedure Divide;
begin
   Match('/');
   Factor;
   EmitLn('MOVE (SP)+,D1');
   EmitLn('EXS.L D0');
   EmitLn('DIVS D1,D0');
end;


{---------------------------------------------------------------}
{ Completion of Term Processing  (called by Term and FirstTerm }

procedure Term1;
begin
   while IsMulop(Look) do begin
      EmitLn('MOVE D0,-(SP)');
      case Look of
       '*': Multiply;
       '/': Divide;
      end;
   end;
end;
                             

{---------------------------------------------------------------}
{ Parse and Translate a Math Term }

procedure Term;
begin
   Factor;
   Term1;
end;


{---------------------------------------------------------------}
{ Parse and Translate a Math Term with Possible Leading Sign }

procedure FirstTerm;
begin
   SignedFactor;
   Term1;
end;


{---------------------------------------------------------------}
{ Recognize and Translate an Add }

procedure Add;
begin
   Match('+');
   Term;
   EmitLn('ADD (SP)+,D0');
end;


{---------------------------------------------------------------}
{ Recognize and Translate a Subtract }

procedure Subtract;
begin
   Match('-');
   Term;
   EmitLn('SUB (SP)+,D0');
   EmitLn('NEG D0');
end;


{---------------------------------------------------------------}
{ Parse and Translate an Expression }

procedure Expression;
begin
   FirstTerm;
   while IsAddop(Look) do begin
      EmitLn('MOVE D0,-(SP)');
      case Look of
       '+': Add;
       '-': Subtract;
      end;
   end;
end;


{---------------------------------------------------------------}
{ Parse and Translate a Boolean Condition }
{ This version is a dummy }

Procedure Condition;
begin
   EmitLn('Condition');
end;


{---------------------------------------------------------------}
{ Recognize and Translate an IF Construct }

procedure Block;
 Forward;

procedure DoIf;
var L1, L2: string;
begin
   Match('i');
   Condition;
   L1 := NewLabel;
   L2 := L1;
   EmitLn('BEQ ' + L1);
   Block;
   if Look = 'l' then begin
      Match('l');
      L2 := NewLabel;
      EmitLn('BRA ' + L2);
      PostLabel(L1);
      Block;
   end;
   PostLabel(L2);
   Match('e');
end;


{--------------------------------------------------------------}
{ Parse and Translate an Assignment Statement }

procedure Assignment;
var Name: char;
begin
   Name := GetName;
   Match('=');
   Expression;
   EmitLn('LEA ' + Name + '(PC),A0');
   EmitLn('MOVE D0,(A0)');
end;
                             

{--------------------------------------------------------------}
{ Recognize and Translate a Statement Block }

procedure Block;
begin
   while not(Look in ['e', 'l']) do begin
      case Look of
       'i': DoIf;
       CR: while Look = CR do
              Fin;
       else Assignment;
      end;
   end;
end;


{--------------------------------------------------------------}
{ Parse and Translate a Program }

procedure DoProgram;
begin
   Block;
   if Look <> 'e' then Expected('END');
   EmitLn('END')
end;


{--------------------------------------------------------------}

{ Initialize }

procedure Init;
begin
   LCount := 0;
   GetChar;
end;


{--------------------------------------------------------------}
{ Main Program }

begin
   Init;
   DoProgram;
end.
{--------------------------------------------------------------}


A couple of comments:

 (1) The form for the expression parser,  using  FirstTerm, etc.,
     is  a  little  different from what you've seen before.  It's
     yet another variation on the same theme.  Don't let it throw
     you ... the change is not required for what follows.

 (2) Note that, as usual, I had to add calls to Fin  at strategic
     spots to allow for multiple lines.

Before we proceed to adding the scanner, first copy this file and
verify that it does indeed  parse things correctly.  Don't forget
the "codes": 'i' for IF, 'l' for ELSE, and 'e' for END or ENDIF.

If the program works, then let's press on.  In adding the scanner
modules to the program, it helps  to  have a systematic plan.  In
all  the  parsers  we've  written  to  date,  we've  stuck  to  a
convention that the current lookahead character should  always be
a non-blank character.  We  preload  the  lookahead  character in
Init, and keep the "pump primed"  after  that.  To keep the thing
working right at newlines, we had to modify this a bit  and treat
the newline as a legal token.

In the  multi-character version, the rule is similar: The current
lookahead character should always be left at the BEGINNING of the
next token, or at a newline.

The multi-character version is shown next.  To get it,  I've made
the following changes:


 o Added the variables Token  and Value, and the type definitions
   needed by Lookup.

 o Added the definitions of KWList and KWcode.

 o Added Lookup.

 o Replaced GetName and GetNum by their multi-character versions.
   (Note that the call  to  Lookup has been moved out of GetName,
   so  that  it  will  not   be  executed  for  calls  within  an
   expression.)

 o Created a new,  vestigial  Scan that calls GetName, then scans
   for keywords.

 o Created  a  new  procedure,  MatchString,  that  looks  for  a
   specific keyword.  Note that, unlike  Match,  MatchString does
   NOT read the next keyword.

 o Modified Block to call Scan.

 o Changed the calls  to  Fin  a  bit.   Fin is now called within
   GetName.

Here is the program in its entirety:


{--------------------------------------------------------------}
program KISS;
                             
{--------------------------------------------------------------}
{ Constant Declarations }

const TAB = ^I;
      CR  = ^M;
      LF  = ^J;

{--------------------------------------------------------------}
{ Type Declarations  }

type Symbol = string[8];

     SymTab = array[1..1000] of Symbol;

     TabPtr = ^SymTab;


{--------------------------------------------------------------}
{ Variable Declarations }

var Look  : char;              { Lookahead Character }
    Token : char;              { Encoded Token       }
    Value : string[16];        { Unencoded Token     }
    Lcount: integer;           { Label Counter       }


{--------------------------------------------------------------}
{ Definition of Keywords and Token Types }

const KWlist: array [1..4] of Symbol =
              ('IF', 'ELSE', 'ENDIF', 'END');

const KWcode: string[5] = 'xilee';


{--------------------------------------------------------------}
{ Read New Character From Input Stream }

procedure GetChar;
begin
   Read(Look);
end;

{--------------------------------------------------------------}
{ Report an Error }

procedure Error(s: string);
begin
   WriteLn;
   WriteLn(^G, 'Error: ', s, '.');
end;


{--------------------------------------------------------------}
{ Report Error and Halt }

procedure Abort(s: string);
begin
   Error(s);
   Halt;
end;


{--------------------------------------------------------------}
{ Report What Was Expected }

procedure Expected(s: string);
begin
   Abort(s + ' Expected');
end;

{--------------------------------------------------------------}
{ Recognize an Alpha Character }

function IsAlpha(c: char): boolean;
begin
   IsAlpha := UpCase(c) in ['A'..'Z'];
end;


{--------------------------------------------------------------}
{ Recognize a Decimal Digit }

function IsDigit(c: char): boolean;
begin
   IsDigit := c in ['0'..'9'];
end;


{--------------------------------------------------------------}
{ Recognize an AlphaNumeric Character }

function IsAlNum(c: char): boolean;
begin
   IsAlNum := IsAlpha(c) or IsDigit(c);
end;


{--------------------------------------------------------------}
{ Recognize an Addop }

function IsAddop(c: char): boolean;
begin
   IsAddop := c in ['+', '-'];
end;


{--------------------------------------------------------------}
{ Recognize a Mulop }

function IsMulop(c: char): boolean;
begin
   IsMulop := c in ['*', '/'];
end;


{--------------------------------------------------------------}
{ Recognize White Space }

function IsWhite(c: char): boolean;
begin
   IsWhite := c in [' ', TAB];
end;


{--------------------------------------------------------------}
{ Skip Over Leading White Space }

procedure SkipWhite;
begin
   while IsWhite(Look) do
      GetChar;
end;


{--------------------------------------------------------------}
{ Match a Specific Input Character }

procedure Match(x: char);
begin
   if Look <> x then Expected('''' + x + '''');
   GetChar;
   SkipWhite;
end;


{--------------------------------------------------------------}
{ Skip a CRLF }

procedure Fin;
begin
   if Look = CR then GetChar;
   if Look = LF then GetChar;
   SkipWhite;
end;


{--------------------------------------------------------------}
{ Table Lookup }

function Lookup(T: TabPtr; s: string; n: integer): integer;
var i: integer;
    found: boolean;
begin
   found := false;
   i := n;
   while (i > 0) and not found do
      if s = T^[i] then
         found := true
      else
         dec(i);
   Lookup := i;
end;


{--------------------------------------------------------------}
{ Get an Identifier }

procedure GetName;
begin
   while Look = CR do
      Fin;
   if not IsAlpha(Look) then Expected('Name');
   Value := '';
   while IsAlNum(Look) do begin
     Value := Value + UpCase(Look);
     GetChar;
   end;
   SkipWhite;
end;


{--------------------------------------------------------------}
{ Get a Number }

procedure GetNum;
begin
   if not IsDigit(Look) then Expected('Integer');
   Value := '';
   while IsDigit(Look) do begin
     Value := Value + Look;
     GetChar;
   end;
   Token := '#';
   SkipWhite;
end;


{--------------------------------------------------------------}
{ Get an Identifier and Scan it for Keywords }

procedure Scan;
begin
   GetName;
   Token := KWcode[Lookup(Addr(KWlist), Value, 4) + 1];
end;
                             

{--------------------------------------------------------------}
{ Match a Specific Input String }

procedure MatchString(x: string);
begin
   if Value <> x then Expected('''' + x + '''');
end;


{--------------------------------------------------------------}
{ Generate a Unique Label }

function NewLabel: string;
var S: string;
begin
   Str(LCount, S);
   NewLabel := 'L' + S;
   Inc(LCount);
end;


{--------------------------------------------------------------}
{ Post a Label To Output }

procedure PostLabel(L: string);
begin
   WriteLn(L, ':');
end;


{--------------------------------------------------------------}
{ Output a String with Tab }

procedure Emit(s: string);
begin
   Write(TAB, s);
end;


{--------------------------------------------------------------}
{ Output a String with Tab and CRLF }

procedure EmitLn(s: string);
begin
   Emit(s);
   WriteLn;
end;


{---------------------------------------------------------------}
{ Parse and Translate an Identifier }

procedure Ident;
begin
   GetName;
   if Look = '(' then begin
      Match('(');
      Match(')');
      EmitLn('BSR ' + Value);
      end
   else
      EmitLn('MOVE ' + Value + '(PC),D0');
end;


{---------------------------------------------------------------}
{ Parse and Translate a Math Factor }

procedure Expression; Forward;

procedure Factor;
begin
   if Look = '(' then begin
      Match('(');
      Expression;
      Match(')');
      end
   else if IsAlpha(Look) then
      Ident
   else begin
      GetNum;
      EmitLn('MOVE #' + Value + ',D0');
   end;
end;


{---------------------------------------------------------------}
{ Parse and Translate the First Math Factor }

procedure SignedFactor;
var s: boolean;
begin
   s := Look = '-';
   if IsAddop(Look) then begin
      GetChar;
      SkipWhite;
   end;
   Factor;
   if s then
      EmitLn('NEG D0');
end;


{--------------------------------------------------------------}
{ Recognize and Translate a Multiply }

procedure Multiply;
begin
   Match('*');
   Factor;
   EmitLn('MULS (SP)+,D0');
end;


{-------------------------------------------------------------}
{ Recognize and Translate a Divide }

procedure Divide;
begin
   Match('/');
   Factor;
   EmitLn('MOVE (SP)+,D1');
   EmitLn('EXS.L D0');
   EmitLn('DIVS D1,D0');
end;


{---------------------------------------------------------------}
{ Completion of Term Processing  (called by Term and FirstTerm }

procedure Term1;
begin
   while IsMulop(Look) do begin
      EmitLn('MOVE D0,-(SP)');
      case Look of
       '*': Multiply;
       '/': Divide;
      end;
   end;
end;
{---------------------------------------------------------------}
{ Parse and Translate a Math Term }

procedure Term;
begin
   Factor;
   Term1;
end;


{---------------------------------------------------------------}
{ Parse and Translate a Math Term with Possible Leading Sign }

procedure FirstTerm;
begin
   SignedFactor;
   Term1;
end;


{---------------------------------------------------------------}
{ Recognize and Translate an Add }

procedure Add;
begin
   Match('+');
   Term;
   EmitLn('ADD (SP)+,D0');
end;


{---------------------------------------------------------------}
{ Recognize and Translate a Subtract }

procedure Subtract;
begin
   Match('-');
   Term;
   EmitLn('SUB (SP)+,D0');
   EmitLn('NEG D0');
end;


{---------------------------------------------------------------}
{ Parse and Translate an Expression }

procedure Expression;
begin
   FirstTerm;
   while IsAddop(Look) do begin
      EmitLn('MOVE D0,-(SP)');
      case Look of
       '+': Add;
       '-': Subtract;
      end;
   end;
end;


{---------------------------------------------------------------}
{ Parse and Translate a Boolean Condition }
{ This version is a dummy }

Procedure Condition;
begin
   EmitLn('Condition');
end;


{---------------------------------------------------------------}
{ Recognize and Translate an IF Construct }

procedure Block; Forward;


procedure DoIf;
var L1, L2: string;
begin
   Condition;
   L1 := NewLabel;
   L2 := L1;
   EmitLn('BEQ ' + L1);
   Block;
   if Token = 'l' then begin
      L2 := NewLabel;
      EmitLn('BRA ' + L2);
      PostLabel(L1);
      Block;
   end;
   PostLabel(L2);
   MatchString('ENDIF');
end;


{--------------------------------------------------------------}
{ Parse and Translate an Assignment Statement }

procedure Assignment;
var Name: string;
begin
   Name := Value;
   Match('=');
   Expression;
   EmitLn('LEA ' + Name + '(PC),A0');
   EmitLn('MOVE D0,(A0)');
end;


{--------------------------------------------------------------}
{ Recognize and Translate a Statement Block }

procedure Block;
begin
   Scan;
   while not (Token in ['e', 'l']) do begin
      case Token of
       'i': DoIf;
       else Assignment;
      end;
      Scan;
   end;
end;


{--------------------------------------------------------------}

{ Parse and Translate a Program }

procedure DoProgram;
begin
   Block;
   MatchString('END');
   EmitLn('END')
end;


{--------------------------------------------------------------}

{ Initialize }

procedure Init;
begin
   LCount := 0;
   GetChar;
end;


{--------------------------------------------------------------}
{ Main Program }

begin
   Init;
   DoProgram;
end.
{--------------------------------------------------------------}


Compare this program with its  single-character  counterpart.   I
think you will agree that the differences are minor.


CONCLUSION

At this point, you have learned how to parse  and  generate  code
for expressions,  Boolean  expressions,  and  control structures.
You have now learned how to develop lexical scanners, and  how to
incorporate their elements into a translator.  You have still not
seen ALL the elements combined into one program, but on the basis
of  what  we've  done before you should find it a straightforward
matter to extend our earlier programs to include scanners.

We are very  close  to  having  all  the elements that we need to
build a real, functional compiler.  There are still a  few things
missing, notably procedure  calls  and type definitions.  We will
deal with  those  in  the  next  few  sessions.  Before doing so,
however, I thought it  would  be fun to turn the translator above
into a true compiler.  That's what we'll  be  doing  in  the next
installment.

Up till now, we've taken  a rather bottom-up approach to parsing,
beginning with low-level constructs and working our way  up.   In
the next installment,  I'll  also  be  taking a look from the top
down,  and  we'll  discuss how the structure of the translator is
altered by changes in the language definition.

See you then.


































                     LET'S BUILD A COMPILER!

                                By

                     Jack W. Crenshaw, Ph.D.

                           2 April 1989


                  Part VIII: A LITTLE PHILOSOPHY





INTRODUCTION

This is going to be a  different  kind of session than the others
in our series on  parsing  and  compiler  construction.  For this
session, there won't be  any  experiments to do or code to write.
This  once,  I'd  like  to  just  talk  with  you  for  a  while.
Mercifully, it will be a short  session,  and then we can take up
where we left off, hopefully with renewed vigor.

When  I  was  in college, I found that I could  always  follow  a
prof's lecture a lot better if I knew where he was going with it.
I'll bet you were the same.

So I thought maybe it's about  time  I told you where we're going
with this series: what's coming up in future installments, and in
general what all  this  is  about.   I'll also share some general
thoughts concerning the usefulness of what we've been doing.


THE ROAD HOME

So far, we've  covered  the parsing and translation of arithmetic
expressions,  Boolean expressions, and combinations connected  by
relational  operators.    We've also done the  same  for  control
constructs.    In  all of this we've leaned heavily on the use of
top-down, recursive  descent  parsing,  BNF  definitions  of  the
syntax, and direct generation of assembly-language code.  We also
learned the value of  such  tricks  as single-character tokens to
help  us  see  the  forest  through  the  trees.    In  the  last
installment  we dealt with lexical scanning,  and  I  showed  you
simple but powerful ways to remove the single-character barriers.

Throughout the whole study, I've emphasized  the  KISS philosophy
... Keep It Simple, Sidney ... and I hope by now  you've realized
just  how  simple  this stuff can really be.  While there are for
sure areas of compiler  theory  that  are truly intimidating, the
ultimate message of this series is that in practice you  can just
politely  sidestep   many  of  these  areas.    If  the  language
definition  cooperates  or,  as in this series, if you can define
the language as you go, it's possible to write down  the language
definition in BNF with reasonable ease.  And, as we've  seen, you
can crank out parse procedures from the BNF just about as fast as
you can type.

As our compiler has taken form, it's gotten more parts,  but each
part  is  quite small and simple, and  very  much  like  all  the
others.

At this point, we have many  of  the makings of a real, practical
compiler.  As a matter of  fact,  we  already have all we need to
build a toy  compiler  for  a  language as powerful as, say, Tiny
BASIC.  In the next couple of installments, we'll  go  ahead  and
define that language.

To round out  the  series,  we  still  have a few items to cover.
These include:

   o Procedure calls, with and without parameters

   o Local and global variables

   o Basic types, such as character and integer types

   o Arrays

   o Strings

   o User-defined types and structures

   o Tree-structured parsers and intermediate languages

   o Optimization

These will all be  covered  in  future  installments.  When we're
finished, you'll have all the tools you need to design  and build
your own languages, and the compilers to translate them.

I can't  design  those  languages  for  you,  but I can make some
comments  and  recommendations.    I've  already  sprinkled  some
throughout past installments.    You've  seen,  for  example, the
control constructs I prefer.

These constructs are going  to  be part of the languages I build.
I  have  three  languages in mind at this point, two of which you
will see in installments to come:

TINY - A  minimal,  but  usable  language  on the order  of  Tiny
       BASIC or Tiny C.  It won't be very practical, but  it will
       have enough power to let you write and  run  real programs
       that do something worthwhile.

KISS - The  language  I'm  building for my  own  use.    KISS  is
       intended to be  a  systems programming language.  It won't
       have strong typing  or  fancy data structures, but it will
       support most of  the  things  I  want to do with a higher-
       order language (HOL), except perhaps writing compilers.
                              
I've also  been  toying  for  years  with  the idea of a HOL-like
assembler,  with  structured  control  constructs   and  HOL-like
assignment statements.  That, in  fact, was the impetus behind my
original foray into the jungles of compiler theory.  This one may
never be built, simply  because  I've  learned that it's actually
easier to implement a language like KISS, that only uses a subset
of the CPU instructions.    As you know, assembly language can be
bizarre  and  irregular  in the extreme, and a language that maps
one-for-one onto it can be a real challenge.  Still,  I've always
felt that the syntax used  in conventional assemblers is dumb ...
why is

     MOVE.L A,B

better, or easier to translate, than

     B=A ?

I  think  it  would  be  an  interesting  exercise to  develop  a
"compiler" that  would give the programmer complete access to and
control over the full complement  of the CPU instruction set, and
would allow you to generate  programs  as  efficient  as assembly
language, without the pain  of  learning a set of mnemonics.  Can
it be done?  I don't  know.  The  real question may be, "Will the
resulting language be any  easier  to  write  than assembly"?  If
not, there's no point in it.  I think that it  can  be  done, but
I'm not completely sure yet how the syntax should look.

Perhaps you have some  comments  or suggestions on this one.  I'd
love to hear them.

You probably won't be surprised to learn that I've already worked
ahead in most  of the areas that we will cover.  I have some good
news:  Things  never  get  much  harder than they've been so far.
It's  possible  to  build a complete, working compiler for a real
language, using nothing  but  the same kinds of techniques you've
learned so far.  And THAT brings up some interesting questions.


WHY IS IT SO SIMPLE?

Before embarking  on this series, I always thought that compilers
were just naturally complex computer  programs  ...  the ultimate
challenge.  Yet the things we have done here have  usually turned
out to be quite simple, sometimes even trivial.

For awhile, I thought  is  was simply because I hadn't yet gotten
into the meat  of  the  subject.    I had only covered the simple
parts.  I will freely admit  to  you  that, even when I began the
series,  I  wasn't  sure how far we would be able  to  go  before
things got too complex to deal with in the ways  we  have so far.
But at this point I've already  been  down the road far enough to
see the end of it.  Guess what?
                              

                     THERE ARE NO HARD PARTS!


Then, I thought maybe it was because we were not  generating very
good object  code.    Those  of  you  who have been following the
series and trying sample compiles know that, while the code works
and  is  rather  foolproof,  its  efficiency is pretty awful.   I
figured that if we were  concentrating on turning out tight code,
we would soon find all that missing complexity.

To  some  extent,  that one is true.  In particular, my first few
efforts at trying to improve efficiency introduced  complexity at
an alarming rate.  But since then I've been tinkering around with
some simple optimizations and I've found some that result in very
respectable code quality, WITHOUT adding a lot of complexity.

Finally, I thought that  perhaps  the  saving  grace was the "toy
compiler" nature of the study.   I  have made no pretense that we
were  ever  going  to be able to build a compiler to compete with
Borland and Microsoft.  And yet, again, as I get deeper into this
thing the differences are starting to fade away.

Just  to make sure you get the message here, let me state it flat
out:

   USING THE TECHNIQUES WE'VE USED  HERE,  IT  IS  POSSIBLE TO
   BUILD A PRODUCTION-QUALITY, WORKING COMPILER WITHOUT ADDING
   A LOT OF COMPLEXITY TO WHAT WE'VE ALREADY DONE.


Since  the series began I've received  some  comments  from  you.
Most of them echo my own thoughts:  "This is easy!    Why  do the
textbooks make it seem so hard?"  Good question.

Recently, I've gone back and looked at some of those texts again,
and even bought and read some new ones.  Each  time,  I come away
with the same feeling: These guys have made it seem too hard.

What's going on here?  Why does the whole thing seem difficult in
the texts, but easy to us?    Are  we that much smarter than Aho,
Ullman, Brinch Hansen, and all the rest?

Hardly.  But we  are  doing some things differently, and more and
more  I'm  starting  to appreciate the value of our approach, and
the way that  it  simplifies  things.    Aside  from  the obvious
shortcuts that I outlined in Part I, like single-character tokens
and console I/O, we have  made some implicit assumptions and done
some things differently from those who have designed compilers in
the past. As it turns out, our approach makes life a lot easier.

So why didn't all those other guys use it?

You have to remember the context of some of the  earlier compiler
development.  These people were working with very small computers
of  limited  capacity.      Memory  was  very  limited,  the  CPU
instruction  set  was  minimal, and programs ran  in  batch  mode
rather  than  interactively.   As it turns out, these caused some
key design decisions that have  really  complicated  the designs.
Until recently,  I hadn't realized how much of classical compiler
design was driven by the available hardware.

Even in cases where these  limitations  no  longer  apply, people
have  tended  to  structure their programs in the same way, since
that is the way they were taught to do it.

In  our case, we have started with a blank sheet of paper.  There
is a danger there, of course,  that  you will end up falling into
traps that other people have long since learned to avoid.  But it
also has allowed us to  take different approaches that, partly by
design  and partly by pure dumb luck, have  allowed  us  to  gain
simplicity.

Here are the areas that I think have  led  to  complexity  in the
past:

  o  Limited RAM Forcing Multiple Passes

     I  just  read  "Brinch  Hansen  on  Pascal   Compilers"  (an
     excellent book, BTW).  He  developed a Pascal compiler for a
     PC, but he started the effort in 1981 with a 64K system, and
     so almost every design decision  he made was aimed at making
     the compiler fit  into  RAM.    To do this, his compiler has
     three passes, one of which is the lexical scanner.  There is
     no way he could, for  example, use the distributed scanner I
     introduced  in  the last installment,  because  the  program
     structure wouldn't allow it.  He also required  not  one but
     two intermediate  languages,  to  provide  the communication
     between phases.

     All the early compiler writers  had to deal with this issue:
     Break the compiler up into enough parts so that it  will fit
     in memory.  When  you  have multiple passes, you need to add
     data structures to support the  information  that  each pass
     leaves behind for the next.   That adds complexity, and ends
     up driving the  design.    Lee's  book,  "The  Anatomy  of a
     Compiler,"  mentions a FORTRAN compiler developed for an IBM
     1401.  It had no fewer than 63 separate passes!  Needless to
     say,  in a compiler like this  the  separation  into  phases
     would dominate the design.

     Even in  situations  where  RAM  is  plentiful,  people have
     tended  to  use  the same techniques because  that  is  what
     they're familiar with.   It  wasn't  until Turbo Pascal came
     along that we found how simple a compiler could  be  if  you
     started with different assumptions.


  o  Batch Processing
                              
     In the early days, batch  processing was the only choice ...
     there was no interactive computing.   Even  today, compilers
     run in essentially batch mode.

     In a mainframe compiler as  well  as  many  micro compilers,
     considerable effort is expended on error recovery ... it can
     consume as much as 30-40%  of  the  compiler  and completely
     drive the design.  The idea is to avoid halting on the first
     error, but rather to keep going at all costs,  so  that  you
     can  tell  the  programmer about as many errors in the whole
     program as possible.

     All of that harks back to the days of the  early mainframes,
     where turnaround time was measured  in hours or days, and it
     was important to squeeze every last ounce of information out
     of each run.

     In this series, I've been very careful to avoid the issue of
     error recovery, and instead our compiler  simply  halts with
     an error message on  the  first error.  I will frankly admit
     that it was mostly because I wanted to take the easy way out
     and keep things simple.   But  this  approach,  pioneered by
     Borland in Turbo Pascal, also has a lot going for it anyway.
     Aside from keeping the  compiler  simple,  it also fits very
     well  with   the  idea  of  an  interactive  system.    When
     compilation is  fast, and especially when you have an editor
     such as Borland's that  will  take you right to the point of
     the error, then it makes a  lot  of sense to stop there, and
     just restart the compilation after the error is fixed.


  o  Large Programs

     Early compilers were designed to handle  large  programs ...
     essentially infinite ones.    In those days there was little
     choice;  the  idea  of  subroutine  libraries  and  separate
     compilation  were  still  in  the  future.      Again,  this
     assumption led to  multi-pass designs and intermediate files
     to hold the results of partial processing.

     Brinch Hansen's  stated goal was that the compiler should be
     able to compile itself.   Again, because of his limited RAM,
     this drove him to a multi-pass design.  He needed  as little
     resident compiler code as possible,  so  that  the necessary
     tables and other data structures would fit into RAM.

     I haven't stated this one yet, because there  hasn't  been a
     need  ... we've always just read and  written  the  data  as
     streams, anyway.  But  for  the  record,  my plan has always
     been that, in  a  production compiler, the source and object
     data should all coexist  in  RAM with the compiler, a la the
     early Turbo Pascals.  That's why I've been  careful  to keep
     routines like GetChar  and  Emit  as  separate  routines, in
     spite of their small size.   It  will be easy to change them
     to read to and write from memory.


  o  Emphasis on Efficiency

     John  Backus has stated that, when  he  and  his  colleagues
     developed the original FORTRAN compiler, they KNEW that they
     had to make it produce tight code.  In those days, there was
     a strong sentiment against HOLs  and  in  favor  of assembly
     language, and  efficiency was the reason.  If FORTRAN didn't
     produce very good  code  by  assembly  standards,  the users
     would simply refuse to use it.  For the record, that FORTRAN
     compiler turned out to  be  one  of  the most efficient ever
     built, in terms of code quality.  But it WAS complex!

     Today,  we have CPU power and RAM size  to  spare,  so  code
     efficiency is not  so  much  of  an  issue.    By studiously
     ignoring this issue, we  have  indeed  been  able to Keep It
     Simple.    Ironically,  though, as I have said, I have found
     some optimizations that we can  add  to  the  basic compiler
     structure, without having to add a lot of complexity.  So in
     this  case we get to have our cake and eat it too:  we  will
     end up with reasonable code quality, anyway.


  o  Limited Instruction Sets

     The early computers had primitive instruction sets.   Things
     that  we  take  for granted, such as  stack  operations  and
     indirect addressing, came only with great difficulty.

     Example: In most compiler designs, there is a data structure
     called the literal pool.  The compiler  typically identifies
     all literals used in the program, and collects  them  into a
     single data structure.    All references to the literals are
     done  indirectly  to  this  pool.    At  the   end   of  the
     compilation, the  compiler  issues  commands  to  set  aside
     storage and initialize the literal pool.

     We haven't had to address that  issue  at all.  When we want
     to load a literal, we just do it, in line, as in

          MOVE #3,D0

     There is something to be said for the use of a literal pool,
     particularly on a machine like  the 8086 where data and code
     can  be separated.  Still, the whole  thing  adds  a  fairly
     large amount of complexity with little in return.

     Of course, without the stack we would be lost.  In  a micro,
     both  subroutine calls and temporary storage depend  heavily
     on the stack, and  we  have used it even more than necessary
     to ease expression parsing.


  o  Desire for Generality

     Much of the content of the typical compiler text is taken up
     with issues we haven't addressed here at all ... things like
     automated  translation  of  grammars,  or generation of LALR
     parse tables.  This is not simply because  the  authors want
     to impress you.  There are good, practical  reasons  why the
     subjects are there.

     We have been concentrating on the use of a recursive-descent
     parser to parse a  deterministic  grammar,  i.e.,  a grammar
     that is not ambiguous and, therefore, can be parsed with one
     level of lookahead.  I haven't made much of this limitation,
     but  the  fact  is  that  this represents a small subset  of
     possible grammars.  In fact,  there is an infinite number of
     grammars that we can't parse using our techniques.    The LR
     technique is a more powerful one, and can deal with grammars
     that we can't.

     In compiler theory, it's important  to know how to deal with
     these  other  grammars,  and  how  to  transform  them  into
     grammars  that  are  easier to deal with.  For example, many
     (but not all) ambiguous  grammars  can  be  transformed into
     unambiguous ones.  The way to do this is not always obvious,
     though, and so many people  have  devoted  years  to develop
     ways to transform them automatically.

     In practice, these  issues  turn out to be considerably less
     important.  Modern languages tend  to be designed to be easy
     to parse, anyway.   That  was a key motivation in the design
     of Pascal.   Sure,  there are pathological grammars that you
     would be hard pressed to write unambiguous BNF  for,  but in
     the  real  world  the best answer is probably to avoid those
     grammars!

     In  our  case,  of course, we have sneakily let the language
     evolve  as  we  go, so we haven't painted ourselves into any
     corners here.  You may not always have that luxury.   Still,
     with a little  care  you  should  be able to keep the parser
     simple without having to resort to automatic  translation of
     the grammar.


We have taken  a  vastly  different  approach in this series.  We
started with a clean sheet  of  paper,  and  developed techniques
that work in the context that  we  are in; that is, a single-user
PC  with  rather  ample CPU power and RAM space.  We have limited
ourselves to reasonable grammars that  are easy to parse, we have
used the instruction set of the CPU to advantage, and we have not
concerned ourselves with efficiency.  THAT's why it's been easy.

Does this mean that we are forever doomed  to  be  able  to build
only toy compilers?   No, I don't think so.  As I've said, we can
add  certain   optimizations   without   changing   the  compiler
structure.  If we want to process large files, we can  always add
file  buffering  to do that.  These  things  do  not  affect  the
overall program design.

And I think  that's  a  key  factor.   By starting with small and
limited  cases,  we  have been able to concentrate on a structure
for  the  compiler  that is natural  for  the  job.    Since  the
structure naturally fits the job, it is almost bound to be simple
and transparent.   Adding  capability doesn't have to change that
basic  structure.    We  can  simply expand things like the  file
structure or add an optimization layer.  I guess  my  feeling  is
that, back when resources were tight, the structures people ended
up  with  were  artificially warped to make them work under those
conditions, and weren't optimum  structures  for  the  problem at
hand.


CONCLUSION

Anyway, that's my arm-waving  guess  as to how we've been able to
keep things simple.  We started with something simple and  let it
evolve  naturally,  without  trying  to   force   it   into  some
traditional mold.

We're going to  press on with this.  I've given you a list of the
areas  we'll  be  covering in future installments.    With  those
installments, you  should  be  able  to  build  complete, working
compilers for just about any occasion, and build them simply.  If
you REALLY want to build production-quality compilers,  you'll be
able to do that, too.

For those of you who are chafing at the bit for more parser code,
I apologize for this digression.  I just thought  you'd  like  to
have things put  into  perspective  a  bit.  Next time, we'll get
back to the mainstream of the tutorial.

So far, we've only looked at pieces of compilers,  and  while  we
have  many  of  the  makings  of a complete language, we  haven't
talked about how to put  it  all  together.    That  will  be the
subject of our next  two  installments.  Then we'll press on into
the new subjects I listed at the beginning of this installment.

See you then.































                     LET'S BUILD A COMPILER!

                                By

                     Jack W. Crenshaw, Ph.D.

                          16 April 1989


                       Part IX: A TOP VIEW





INTRODUCTION

In  the  previous  installments,  we  have  learned  many of  the
techniques required to  build  a full-blown compiler.  We've done
both  assignment   statements   (with   Boolean   and  arithmetic
expressions),  relational operators, and control constructs.   We
still haven't  addressed procedure or function calls, but even so
we  could  conceivably construct a  mini-language  without  them.
I've  always  thought  it would be fun to see just  how  small  a
language  one  could  build  that  would still be useful.   We're
ALMOST in a position to do that now.  The  problem  is: though we
know  how  to  parse and translate the constructs, we still don't
know quite how to put them all together into a language.

In those earlier installments, the  development  of  our programs
had  a decidedly bottom-up flavor.  In  the  case  of  expression
parsing,  for  example,  we  began  with  the  very lowest  level
constructs, the individual constants  and  variables,  and worked
our way up to more complex expressions.

Most people regard  the  top-down design approach as being better
than  the  bottom-up  one.  I do too,  but  the  way  we  did  it
certainly seemed natural enough for the kinds of  things  we were
parsing.

You mustn't get  the  idea, though, that the incremental approach
that  we've  been  using  in  all these tutorials  is  inherently
bottom-up.  In  this  installment  I'd  like to show you that the
approach can work just as well when applied from the top down ...
maybe better.  We'll consider languages such as C and Pascal, and
see how complete compilers can be built starting from the top.

In the next installment, we'll  apply the same technique to build
a  complete  translator  for a subset of the KISS language, which
I'll be  calling  TINY.    But one of my goals for this series is
that you will  not only be able to see how a compiler for TINY or
KISS  works,  but  that you will also be able to design and build
compilers for your own languages.  The C and Pascal examples will
help.    One  thing I'd like you  to  see  is  that  the  natural
structure of the compiler depends very much on the language being
translated, so the simplicity and  ease  of  construction  of the
compiler  depends  very  much  on  letting the language  set  the
program structure.
                              
It's  a bit much to produce a full C or Pascal compiler here, and
we won't try.   But we can flesh out the top levels far enough so
that you can see how it goes.

Let's get started.


THE TOP LEVEL

One of the biggest  mistakes  people make in a top-down design is
failing  to start at the true top.  They think they know what the
overall structure of the  design  should be, so they go ahead and
write it down.

Whenever  I  start a new design, I always like to do  it  at  the
absolute beginning.   In  program design language (PDL), this top
level looks something like:


     begin
        solve the problem
     end


OK, I grant  you that this doesn't give much of a hint as to what
the next level is, but I  like  to  write it down anyway, just to
give me that warm feeling that I am indeed starting at the top.

For our problem, the overall function of a compiler is to compile
a complete program.  Any definition of the  language,  written in
BNF,  begins here.  What does the top level BNF look like?  Well,
that depends quite a bit on the language to be translated.  Let's
take a look at Pascal.


THE STRUCTURE OF PASCAL

Most  texts  for  Pascal  include  a   BNF   or  "railroad-track"
definition of the language.  Here are the first few lines of one:


     <program> ::= <program-header> <block> '.'

     <program-header> ::= PROGRAM <ident>

     <block> ::= <declarations> <statements>


We can write recognizers  to  deal  with  each of these elements,
just as we've done before.  For each one, we'll use  our familiar
single-character tokens to represent the input, then flesh things
out a little at a time.    Let's begin with the first recognizer:
the program itself.
                              
To translate this, we'll  start  with a fresh copy of the Cradle.
Since we're back to single-character  names, we'll just use a 'p'
to stand for 'PROGRAM.'

To a fresh copy of the cradle, add the following code, and insert
a call to it from the main program:


{--------------------------------------------------------------}
{ Parse and Translate A Program }

procedure Prog;
var  Name: char;
begin
   Match('p');            { Handles program header part }
   Name := GetName;
   Prolog(Name);
   Match('.');
   Epilog(Name);
end;
{--------------------------------------------------------------}


The procedures  Prolog and Epilog perform whatever is required to
let the program interface with the operating system,  so  that it
can execute as a program.  Needless to  say,  this  part  will be
VERY OS-dependent.  Remember, I've been emitting code for a 68000
running under the OS I use, which is SK*DOS.   I  realize most of
you are using PC's  and  would rather see something else, but I'm
in this thing too deep to change now!

Anyhow, SK*DOS is a  particularly  easy OS to interface to.  Here
is the code for Prolog and Epilog:


{--------------------------------------------------------------}
{ Write the Prolog }

procedure Prolog;
begin
   EmitLn('WARMST EQU $A01E');
end;


{--------------------------------------------------------------}
{ Write the Epilog }

procedure Epilog(Name: char);
begin
   EmitLn('DC WARMST');
   EmitLn('END ' + Name);
end;
{--------------------------------------------------------------}
                              
As usual, add  this  code  and  try  out the "compiler."  At this
point, there is only one legal input:


     px.   (where x is any single letter, the program name)


Well,  as  usual  our first effort is rather unimpressive, but by
now  I'm sure you know that things  will  get  more  interesting.
There is one important thing to  note:   THE OUTPUT IS A WORKING,
COMPLETE, AND EXECUTABLE PROGRAM (at least after it's assembled).

This  is  very  important.  The  nice  feature  of  the  top-down
approach is that at any stage you can  compile  a  subset  of the
complete language and get  a  program that will run on the target
machine.    From here on, then, we  need  only  add  features  by
fleshing out the language constructs.  It's all  very  similar to
what we've been doing all along, except that we're approaching it
from the other end.


FLESHING IT OUT

To flesh out  the  compiler,  we  only have to deal with language
features  one by one.  I like to start with a stub procedure that
does  nothing, then add detail in  incremental  fashion.    Let's
begin  by  processing  a block, in accordance with its PDL above.
We can do this in two stages.  First, add the null procedure:


{--------------------------------------------------------------}
{ Parse and Translate a Pascal Block }

procedure DoBlock(Name: char);
begin
end;
{--------------------------------------------------------------}


and modify Prog to read:


{--------------------------------------------------------------}
{ Parse and Translate A Program }

procedure Prog;
var  Name: char;
begin
   Match('p');
   Name := GetName;
   Prolog;
   DoBlock(Name);
   Match('.');
   Epilog(Name);
end;
{--------------------------------------------------------------}


That certainly  shouldn't change the behavior of the program, and
it doesn't.  But now the  definition  of Prog is complete, and we
can proceed to flesh out DoBlock.  That's done right from its BNF
definition:


{--------------------------------------------------------------}
{ Parse and Translate a Pascal Block }

procedure DoBlock(Name: char);
begin
   Declarations;
   PostLabel(Name);
   Statements;
end;
{--------------------------------------------------------------}


The  procedure  PostLabel  was  defined  in  the  installment  on
branches.  Copy it into your cradle.

I probably need to  explain  the  reason  for inserting the label
where I have.  It has to do with the operation of SK*DOS.  Unlike
some OS's,  SK*DOS allows the entry point to the main  program to
be  anywhere in the program.  All you have to do is to give  that
point a name.  The call  to  PostLabel puts that name just before
the first executable statement  in  the  main  program.  How does
SK*DOS know which of the many labels is the entry point, you ask?
It's the one that matches the END statement  at  the  end  of the
program.

OK,  now  we  need  stubs  for  the  procedures Declarations  and
Statements.  Make them null procedures as we did before.

Does the program  still run the same?  Then we can move on to the
next stage.


DECLARATIONS

The BNF for Pascal declarations is:


     <declarations> ::= ( <label list>    |
                          <constant list> |
                          <type list>     |
                          <variable list> |
                          <procedure>     |
                          <function>         )*
                              

(Note  that  I'm  using the more liberal definition used by Turbo
Pascal.  In the standard Pascal definition, each  of  these parts
must be in a specific order relative to the rest.)

As  usual,  let's  let a single character represent each of these
declaration types.  The new form of Declarations is:


{--------------------------------------------------------------}
{ Parse and Translate the Declaration Part }

procedure Declarations;
begin
   while Look in ['l', 'c', 't', 'v', 'p', 'f'] do
      case Look of
       'l': Labels;
       'c': Constants;
       't': Types;
       'v': Variables;
       'p': DoProcedure;
       'f': DoFunction;
      end;
end;
{--------------------------------------------------------------}


Of course, we need stub  procedures for each of these declaration
types.  This time,  they  can't  quite  be null procedures, since
otherwise we'll end up with an infinite While loop.  At  the very
least, each recognizer must  eat  the  character that invokes it.
Insert the following procedures:


{--------------------------------------------------------------}
{ Process Label Statement }

procedure Labels;
begin
   Match('l');
end;


{--------------------------------------------------------------}
{ Process Const Statement }

procedure Constants;
begin
   Match('c');
end;


{--------------------------------------------------------------}
{ Process Type Statement }
procedure Types;
begin
   Match('t');
end;


{--------------------------------------------------------------}
{ Process Var Statement }

procedure Variables;
begin
   Match('v');
end;


{--------------------------------------------------------------}
{ Process Procedure Definition }

procedure DoProcedure;
begin
   Match('p');
end;


{--------------------------------------------------------------}
{ Process Function Definition }

procedure DoFunction;
begin
   Match('f');
end;
{--------------------------------------------------------------}


Now try out the  compiler  with a few representative inputs.  You
can  mix  the  declarations any way you like, as long as the last
character  in  the  program is'.' to  indicate  the  end  of  the
program.  Of course,  none  of  the declarations actually declare
anything, so you don't need  (and can't use) any characters other
than those standing for the keywords.

We can flesh out the statement  part  in  a similar way.  The BNF
for it is:


     <statements> ::= <compound statement>

     <compound statement> ::= BEGIN <statement>
                                   (';' <statement>) END


Note that statements can  begin  with  any identifier except END.
So the first stub form of procedure Statements is:
                              

{--------------------------------------------------------------}
{ Parse and Translate the Statement Part }

procedure Statements;
begin
   Match('b');
   while Look <> 'e' do
      GetChar;
   Match('e');
end;
{--------------------------------------------------------------}


At  this  point  the  compiler   will   accept   any   number  of
declarations, followed by the  BEGIN  block  of the main program.
This  block  itself  can contain any characters at all (except an
END), but it must be present.

The simplest form of input is now

     'pxbe.'

Try  it.    Also  try  some  combinations  of  this.   Make  some
deliberate errors and see what happens.

At this point you should be beginning to see the drill.  We begin
with a stub translator to process a program, then  we  flesh  out
each procedure in turn,  based  upon its BNF definition.  Just as
the lower-level BNF definitions add detail and elaborate upon the
higher-level ones, the lower-level  recognizers  will  parse more
detail  of  the  input  program.    When  the  last stub has been
expanded,  the  compiler  will  be  complete.    That's  top-down
design/implementation in its purest form.

You might note that even though we've been adding procedures, the
output of the program hasn't changed.  That's as  it  should  be.
At these  top  levels  there  is  no  emitted code required.  The
recognizers are  functioning as just that: recognizers.  They are
accepting input sentences, catching bad ones, and channeling good
input to the right places, so  they  are  doing their job.  If we
were to pursue this a bit longer, code would start to appear.

The  next  step  in our expansion should  probably  be  procedure
Statements.  The Pascal definition is:


    <statement> ::= <simple statement> | <structured statement>

    <simple statement> ::= <assignment> | <procedure call> | null

    <structured statement> ::= <compound statement> |
                               <if statement>       |
                               <case statement>     |
                               <while statement>    |
                               <repeat statement>   |
                               <for statement>      |
                               <with statement>


These  are  starting  to look familiar.  As a matter of fact, you
have already gone  through  the process of parsing and generating
code for both assignment statements and control structures.  This
is where the top level meets our bottom-up  approach  of previous
sessions.  The constructs will be a little  different  from those
we've  been  using  for KISS, but the differences are nothing you
can't handle.

I  think  you can get the picture now as to the  procedure.    We
begin with a complete BNF  description of the language.  Starting
at  the  top  level, we code  up  the  recognizer  for  that  BNF
statement, using stubs  for  the next-level recognizers.  Then we
flesh those lower-level statements out one by one.

As it happens, the definition of Pascal is  very  compatible with
the  use of BNF, and BNF descriptions  of  the  language  abound.
Armed  with  such   a   description,  you  will  find  it  fairly
straightforward to continue the process we've begun.

You  might  have  a go at fleshing a few of these constructs out,
just  to get a feel for it.  I don't expect you  to  be  able  to
complete a Pascal compiler  here  ...  there  are too many things
such  as  procedures  and types that we haven't addressed yet ...
but  it  might  be helpful to try some of the more familiar ones.
It will do  you  good  to  see executable programs coming out the
other end.

If I'm going to address those issues that we haven't covered yet,
I'd rather  do  it  in  the context of KISS.  We're not trying to
build a complete Pascal  compiler  just yet, so I'm going to stop
the expansion of Pascal here.    Let's  take  a  look  at  a very
different language.


THE STRUCTURE OF C

The C language is quite another matter, as you'll see.   Texts on
C  rarely  include  a BNF definition of  the  language.  Probably
that's because the language is quite hard to write BNF for.

One reason I'm showing you these structures now is so that  I can
impress upon you these two facts:

 (1) The definition of  the  language drives the structure of the
     compiler.  What works for one language may be a disaster for
     another.    It's  a very bad idea to try to  force  a  given
     structure upon the compiler.  Rather, you should let the BNF
     drive the structure, as we have done here.
                             
 (2) A language that is hard to write BNF for  will  probably  be
     hard  to  write  a compiler for, as well.  C  is  a  popular
     language,  and  it  has  a  reputation  for  letting you  do
     virtually  anything that is possible to  do.    Despite  the
     success of Small C, C is _NOT_ an easy language to parse.


A C program has  less  structure than its Pascal counterpart.  At
the top level, everything in C is a static declaration, either of
data or of a function.  We can capture this thought like this:


     <program> ::= ( <global declaration> )*

     <global declaration> ::= <data declaration>  |
                              <function>

In Small C, functions  can  only have the default type int, which
is not declared.  This makes  the  input easy to parse: the first
token is either "int," "char," or the name  of  a  function.   In
Small  C, the preprocessor commands are  also  processed  by  the
compiler proper, so the syntax becomes:


     <global declaration> ::= '#' <preprocessor command>  |
                              'int' <data list>           |
                              'char' <data list>          |
                              <ident> <function body>     |


Although we're really more interested in full C  here,  I'll show
you the  code corresponding to this top-level structure for Small
C.


{--------------------------------------------------------------}
{ Parse and Translate A Program }

procedure Prog;
begin
   while Look <> ^Z do begin
      case Look of
       '#': PreProc;
       'i': IntDecl;
       'c': CharDecl;
      else DoFunction(Int);
      end;
   end;
end;
{--------------------------------------------------------------}

Note that I've had to use a ^Z to indicate the end of the source.
C has no keyword such as END or the '.' to otherwise indicate the
end.
                             
With full C,  things  aren't  even  this easy.  The problem comes
about because in full C, functions can also have types.   So when
the compiler sees a  keyword  like  "int,"  it still doesn't know
whether to expect a  data  declaration  or a function definition.
Things get more  complicated  since  the  next token may not be a
name  ... it may start with an '*' or '(', or combinations of the
two.

More specifically, the BNF for full C begins with:


     <program> ::= ( <top-level decl> )*

     <top-level decl> ::= <function def> | <data decl>

     <data decl> ::= [<class>] <type> <decl-list>

     <function def> ::= [<class>] [<type>] <function decl>


You  can  now  see the problem:   The  first  two  parts  of  the
declarations for data and functions can be the same.   Because of
the  ambiguity  in  the grammar as  written  above,  it's  not  a
suitable  grammar  for  a  recursive-descent  parser.     Can  we
transform it into one that is suitable?  Yes, with a little work.
Suppose we write it this way:


     <top-level decl> ::= [<class>] <decl>

     <decl> ::= <type> <typed decl> | <function decl>

     <typed decl> ::= <data list> | <function decl>


We  can  build  a  parsing  routine  for  the   class   and  type
definitions, and have them store away their findings  and  go on,
without their ever having to "know" whether a function or  a data
declaration is being processed.

To begin, key in the following version of the main program:


{--------------------------------------------------------------}
{ Main Program }

begin
   Init;
   while Look <> ^Z do begin
      GetClass;
      GetType;
      TopDecl;
   end;
end.

{--------------------------------------------------------------}


For the first round, just make the three procedures stubs that do
nothing _BUT_ call GetChar.

Does this program work?  Well, it would be hard put NOT to, since
we're not really asking it to do anything.  It's been said that a
C compiler will accept virtually any input without choking.  It's
certainly true of THIS  compiler,  since in effect all it does is
to eat input characters until it finds a ^Z.

Next, let's make  GetClass  do something worthwhile.  Declare the
global variable


     var Class: char;


and change GetClass to do the following:


{--------------------------------------------------------------}
{  Get a Storage Class Specifier }

Procedure GetClass;
begin
   if Look in ['a', 'x', 's'] then begin
      Class := Look;
      GetChar;
      end
   else Class := 'a';
end;
{--------------------------------------------------------------}


Here, I've used three  single  characters  to represent the three
storage classes "auto," "extern,"  and  "static."   These are not
the only three possible classes ... there are also "register" and
"typedef," but this should  give  you the picture.  Note that the
default class is "auto."

We  can  do  a  similar  thing  for  types.   Enter the following
procedure next:


{--------------------------------------------------------------}
{  Get a Type Specifier }

procedure GetType;
begin
   Typ := ' ';
   if Look = 'u' then begin
      Sign := 'u';
      Typ := 'i';
      GetChar;
      end
   else Sign := 's';
   if Look in ['i', 'l', 'c'] then begin
      Typ := Look;
      GetChar;
   end;
end;
{--------------------------------------------------------------}

Note that you must add two more global variables, Sign and Typ.

With these two procedures in place, the compiler will process the
class and type definitions and store away their findings.  We can
now process the rest of the declaration.

We  are by no means out of the woods yet, because there are still
many complexities just in the definition of the  type,  before we
even get to the actual data or function names.  Let's pretend for
the moment that we have passed all those gates, and that the next
thing in the  input stream is a name.  If the name is followed by
a left paren, we have a function declaration.  If not, we have at
least one data item,  and  possibly a list, each element of which
can have an initializer.

Insert the following version of TopDecl:


{--------------------------------------------------------------}
{ Process a Top-Level Declaration }

procedure TopDecl;
var Name: char;
begin
   Name := Getname;
   if Look = '(' then
      DoFunc(Name)
   else
      DoData(Name);
end;
{--------------------------------------------------------------}


(Note that, since we have already read the name, we must  pass it
along to the appropriate routine.)

Finally, add the two procedures DoFunc and DoData:


{--------------------------------------------------------------}
{ Process a Function Definition }

procedure DoFunc(n: char);
begin
   Match('(');
   Match(')');
   Match('{');
   Match('}');
   if Typ = ' ' then Typ := 'i';
   Writeln(Class, Sign, Typ, ' function ', n);
end;

{--------------------------------------------------------------}
{ Process a Data Declaration }

procedure DoData(n: char);
begin
   if Typ = ' ' then Expected('Type declaration');
   Writeln(Class, Sign, Typ, ' data ', n);
   while Look = ',' do begin
      Match(',');
      n := GetName;
      WriteLn(Class, Sign, Typ, ' data ', n);
   end;
   Match(';');
end;
{--------------------------------------------------------------}


Since  we're  still  a long way from producing executable code, I
decided to just have these two routines tell us what they found.

OK, give this program a try.    For data declarations, it's OK to
give a list separated by commas.  We  can't  process initializers
as yet.  We also can't process argument lists for  the functions,
but the "(){}" characters should be there.

We're still a _VERY_ long way from having a C compiler,  but what
we have is starting to process the right kinds of inputs,  and is
recognizing both good  and  bad  inputs.    In  the  process, the
natural structure of the compiler is starting to take form.

Can we continue this until we have something that acts  more like
a compiler. Of course we can.  Should we?  That's another matter.
I don't know about you, but I'm beginning to get dizzy, and we've
still  got  a  long  way  to  go  to  even  get  past   the  data
declarations.

At  this  point,  I think you can see how the  structure  of  the
compiler evolves from the language  definition.    The structures
we've seen for our  two  examples, Pascal and C, are as different
as night and day.  Pascal was designed at least partly to be easy
to parse, and that's  reflected  in the compiler.  In general, in
Pascal there is more structure and we have a better idea  of what
kinds of constructs to expect at any point.  In  C,  on the other
hand,  the  program  is  essentially  a  list   of  declarations,
terminated only by the end of file.

We  could  pursue  both  of  these structures much  farther,  but
remember that our purpose here is  not  to  build a Pascal or a C
compiler, but rather to study compilers in general.  For those of
you  who DO want to deal with Pascal or C, I hope I've given  you
enough of a start so that you can  take  it  from  here (although
you'll soon need some of the stuff we still haven't  covered yet,
such as typing and procedure calls).    For the rest of you, stay
with me through the next installment.  There, I'll be leading you
through the development of a complete compiler for TINY, a subset
of KISS.

See you then.































                     LET'S BUILD A COMPILER!

                                By

                     Jack W. Crenshaw, Ph.D.

                           21 May 1989


                    Part X: INTRODUCING "TINY"





INTRODUCTION

In the last installment, I showed you the general  idea  for  the
top-down development of  a  compiler.    I gave you the first few
steps  of  the process for compilers for  Pascal  and  C,  but  I
stopped  far  short  of  pushing  it through to completion.   The
reason was simple: if we're going to produce  a  real, functional
compiler  for  any  language, I'd rather  do  it  for  KISS,  the
language that I've been defining in this tutorial series.

In this installment, we're going to do just that, for a subset of
KISS which I've chosen to call TINY.

The process  will be essentially that outlined in Installment IX,
except  for  one  notable  difference.   In that  installment,  I
suggested  that  you  begin  with  a full BNF description of  the
language.  That's fine for something like Pascal or C,  for which
the language definition is  firm.   In the case of TINY, however,
we don't yet have a full  description  ... we seem to be defining
the language as we go.  That's OK.    In  fact,  it's preferable,
since we can tailor the language  slightly  as we go, to keep the
parsing easy.

So in the development  that  follows,  we'll  actually be doing a
top-down development of BOTH the  language and its compiler.  The
BNF description will grow along with the compiler.

In this process, there will be a number of decisions to  be made,
each of which will influence the BNF and therefore the  nature of
the language.   At  each  decision  point I'll try to remember to
explain  the  decision  and the rationale behind my choice.  That
way, if you happen to hold a different opinion and would prefer a
different option, you can choose it instead.  You  now  have  the
background  to  do  that.  I guess the important thing to note is
that  nothing  we  do  here  is  cast  in  concrete.  When YOU'RE
designing YOUR language, you should feel free to do it YOUR way.

Many of you may be asking at this point: Why bother starting over
from  scratch?  We had a working subset of KISS as the outcome of
Installment  VII  (lexical  scanning).  Why not just extend it as
needed?  The  answer  is  threefold.    First of all, I have been
making  a  number  of changes to further simplify the program ...
changes  like  encapsulating  the  code generation procedures, so
that  we  can  convert to a different target machine more easily.
Second, I want you to see how the development can indeed  be done
from the top down as outlined in the last installment.   Finally,
we both need the practice.  Each time I go through this exercise,
I get a little better at it, and you will, also.


GETTING STARTED

Many  years  ago  there were languages called  Tiny  BASIC,  Tiny
Pascal, and Tiny C, each of which was a subset of its parent full
language.  Tiny BASIC,  for  example,  had  only single-character
variable names and global variables.   It supported only a single
data type.  Sound familiar?  At this point we have almost all the
tools we need to build a compiler like that.

Yet a language called Tiny-anything  still  carries  some baggage
inherited from its parent language.   I've often wondered if this
is a  good  idea.    Granted,  a  language based upon some parent
language will have the  advantage  of  familiarity, but there may
also  be  some  peculiar syntax carried over from the parent that
may tend  to add unnecessary complexity to the compiler. (Nowhere
is this more true than in Small C.)

I've wondered just how small and simple a compiler could  be made
and  still  be  useful, if it were designed from the outset to be
both easy to use and to  parse.    Let's find out.  This language
will just be called "TINY," period.  It's a subset of KISS, which
I  also  haven't  fully  defined,  so  that  at  least  makes  us
consistent (!).  I suppose you could call it TINY KISS.  But that
opens  up a whole can of worms involving  cuter  and  cuter  (and
perhaps more risque) names, so let's just stick with TINY.

The main limitations  of  TINY  will  be because of the things we
haven't yet covered, such as data types.  Like its cousins Tiny C
and Tiny BASIC,  TINY  will  have  only one data type, the 16-bit
integer.    The  first  version  we  develop  will also  have  no
procedure  calls  and  will  use single-character variable names,
although as you will see we can remove these restrictions without
much effort.

The language I have in mind will share some of the  good features
of  Pascal,  C,  and Ada.  Taking a lesson from the comparison of
the Pascal and  C  compilers in the previous installment, though,
TINY will have a decided Pascal flavor.  Wherever  feasible,    a
language structure will  be  bracketed by keywords or symbols, so
that  the parser will know where it's  going  without  having  to
guess.

One other ground rule:  As we go, I'd like  to  keep the compiler
producing real, executable code.  Even though it may not  DO much
at the beginning, it will at least do it correctly.

Finally,  I'll  use  a couple of Pascal  restrictions  that  make
sense:  All data and procedures must be declared before  they are
used.  That makes good sense,  even  though for now the only data
type we'll use  is a word.  This rule in turn means that the only
reasonable place to put the  executable code for the main program
is at the end of the listing.

The top-level definition will be similar to Pascal:


     <program> ::= PROGRAM <top-level decl> <main> '.'


Already, we've reached a decision point.  My first thought was to
make the main block optional.   It  doesn't seem to make sense to
write a "program" with no main program, but it does make sense if
we're  allowing  for  multiple modules, linked together.    As  a
matter of fact,  I intend to allow for this in KISS.  But then we
begin  to open up a can of worms that I'd rather leave closed for
now.  For example, the  term "PROGRAM" really becomes a misnomer.
The MODULE of Modula-2 or the Unit of Turbo Pascal would  be more
appropriate.  Second,  what  about  scope  rules?    We'd  need a
convention for  dealing  with  name  visibility  across  modules.
Better  for  now  to  just  keep  it  simple  and ignore the idea
altogether.

There's also a decision in choosing to require  the  main program
to  be  last.    I  toyed  with  the idea of making its  position
optional,  as  in  C.  The nature of SK*DOS, the OS I'm compiling
for, make this very easy to do.   But  this  doesn't  really make
much sense in view of the Pascal-like requirement  that  all data
and procedures  be declared before they're referenced.  Since the
main  program can only call procedures  that  have  already  been
declared, the only position that makes sense is at the end,  a la
Pascal.

Given  the  BNF  above, let's write a parser that just recognizes
the brackets:


{--------------------------------------------------------------}
{  Parse and Translate a Program }

procedure Prog;
begin
   Match('p');
   Header;
   Prolog;
   Match('.');
   Epilog;
end;
{--------------------------------------------------------------}


The procedure Header just emits  the startup code required by the
assembler:
                              

{--------------------------------------------------------------}
{ Write Header Info }

procedure Header;
begin
   WriteLn('WARMST', TAB, 'EQU $A01E');
end;
{--------------------------------------------------------------}


The procedures Prolog and  Epilog  emit  the code for identifying
the main program, and for returning to the OS:


{--------------------------------------------------------------}
{ Write the Prolog }

procedure Prolog;
begin
   PostLabel('MAIN');
end;


{--------------------------------------------------------------}
{ Write the Epilog }

procedure Epilog;
begin
   EmitLn('DC WARMST');
   EmitLn('END MAIN');
end;
{--------------------------------------------------------------}


The  main program just calls Prog, and then  looks  for  a  clean
ending:


{--------------------------------------------------------------}
{ Main Program }

begin
   Init;
   Prog;
   if Look <> CR then Abort('Unexpected data after ''.''');
end.
{--------------------------------------------------------------}


At this point, TINY  will  accept  only  one input "program," the
null program:


     PROGRAM .   (or 'p.' in our shorthand.)

Note, though, that the  compiler  DOES  generate correct code for
this program.  It will run, and do  what  you'd  expect  the null
program to do, that is, nothing but return gracefully to the OS.

As a matter of interest, one of my  favorite  compiler benchmarks
is to compile, link,  and  execute  the  null program in whatever
language   is   involved.     You  can  learn  a  lot  about  the
implementation by measuring  the  overhead  in  time  required to
compile what should be a trivial case.  It's also  interesting to
measure the amount of code produced.  In many compilers, the code
can be fairly large, because they always include  the  whole run-
time  library whether they need it or not.    Early  versions  of
Turbo Pascal produced a 12K object file for  this  case.    VAX C
generates 50K!

The  smallest  null  programs  I've  seen are those  produced  by
Modula-2 compilers, and they run about 200-800 bytes.

In the case of TINY, we HAVE no run-time library  as  yet, so the
object code is indeed tiny:  two  bytes.    That's  got  to  be a
record, and it's  likely  to  remain  one since it is the minimum
size required by the OS.

The  next step is to process the code for the main program.  I'll
use the Pascal BEGIN-block:


     <main> ::= BEGIN <block> END


Here,  again,  we  have made a decision.  We could have chosen to
require a "PROCEDURE MAIN" sort of declaration, similar to C.   I
must  admit  that  this  is  not  a bad idea at all ...  I  don't
particularly  like  the  Pascal  approach  since I tend  to  have
trouble locating the main  program  in a Pascal listing.  But the
alternative is a little awkward, too, since you have to deal with
the  error condition where the user omits  the  main  program  or
misspells its name.  Here I'm taking the easy way out.

Another solution to the "where is the main program" problem might
be to require a name for  the  program, and then bracket the main
by


     BEGIN <name>
     END <name>


similar to the convention of  Modula  2.    This  adds  a  bit of
"syntactic sugar" to the language.  Things like this are  easy to
add or change to your liking, if the language is your own design.

To parse this definition of a main block,  change  procedure Prog
to read:

{--------------------------------------------------------------}
{  Parse and Translate a Program }

procedure Prog;
begin
   Match('p');
   Header;
   Main;
   Match('.');
end;
{--------------------------------------------------------------}


and add the new procedure:


{--------------------------------------------------------------}
{ Parse and Translate a Main Program }

procedure Main;
begin
   Match('b');
   Prolog;
   Match('e');
   Epilog;
end;
{--------------------------------------------------------------}


Now, the only legal program is:


     PROGRAM BEGIN END . (or 'pbe.')


Aren't we making progress???  Well, as usual it gets better.  You
might try some deliberate errors here, like omitting  the  'b' or
the 'e', and see what happens.  As always,  the  compiler  should
flag all illegal inputs.


DECLARATIONS

The obvious next step is to decide what we mean by a declaration.
My  intent  here  is to have two kinds of declarations: variables
and  procedures/functions.    At  the  top  level,   only  global
declarations are allowed, just as in C.

For now, there  can  only be variable declarations, identified by
the keyword VAR (abbreviated 'v'):


     <top-level decls> ::= ( <data declaration> )*

     <data declaration> ::= VAR <var-list>


Note that since there is only one variable type, there is no need
to  declare the type.  Later on, for full KISS, we can easily add
a type description.

The procedure Prog becomes:


{--------------------------------------------------------------}
{  Parse and Translate a Program }

procedure Prog;
begin
   Match('p');
   Header;
   TopDecls;
   Main;
   Match('.');
end;
{--------------------------------------------------------------}


Now, add the two new procedures:


{--------------------------------------------------------------}
{ Process a Data Declaration }

procedure Decl;
begin
   Match('v');
   GetChar;
end;


{--------------------------------------------------------------}
{ Parse and Translate Global Declarations }

procedure TopDecls;
begin
   while Look <> 'b' do
      case Look of
        'v': Decl;
      else Abort('Unrecognized Keyword ''' + Look + '''');
      end;
end;
{--------------------------------------------------------------}


Note that at this point, Decl  is  just  a stub.  It generates no
code, and it doesn't process a list ... every variable must occur
in a separate VAR statement.

OK,  now  we  can have any  number  of  data  declarations,  each
starting with a 'v' for VAR,  before  the BEGIN-block.  Try a few
cases and see what happens.


DECLARATIONS AND SYMBOLS

That looks pretty good, but  we're still only generating the null
program  for  output.    A  real compiler would  issue  assembler
directives to allocate storage for  the  variables.    It's about
time we actually produced some code.

With  a  little  extra  code,  that's  an  easy  thing to do from
procedure Decl.  Modify it as follows:


{--------------------------------------------------------------}
{ Parse and Translate a Data Declaration }

procedure Decl;
var Name: char;
begin
   Match('v');
   Alloc(GetName);
end;
{--------------------------------------------------------------}


The procedure Alloc just  issues  a  command  to the assembler to
allocate storage:


{--------------------------------------------------------------}
{ Allocate Storage for a Variable }

procedure Alloc(N: char);
begin
   WriteLn(N, ':', TAB, 'DC 0');
end;
{--------------------------------------------------------------}


Give  this  one  a  whirl.    Try  an  input  that declares  some
variables, such as:

     pvxvyvzbe.

See how the storage is allocated?    Simple, huh?  Note also that
the entry point, "MAIN," comes out in the right place.

For the record, a "real" compiler would also have a  symbol table
to record the variables being used.  Normally,  the  symbol table
is necessary to record the type  of  each variable.  But since in
this case  all  variables  have  the  same  type, we don't need a
symbol  table  for  that reason.  As it turns out, we're going to
find a symbol  necessary  even without different types, but let's
postpone that need until it arises.

Of course, we haven't really parsed the correct syntax for a data
declaration, since it involves a variable list.  Our version only
permits a single variable.  That's easy to fix, too.

The BNF for <var-list> is


     <var-list> ::= <ident> (, <ident>)*


Adding this syntax to Decl gives this new version:


{--------------------------------------------------------------}
{ Parse and Translate a Data Declaration }

procedure Decl;
var Name: char;
begin
   Match('v');
   Alloc(GetName);
   while Look = ',' do begin
      GetChar;
      Alloc(GetName);
   end;
end;
{--------------------------------------------------------------}


OK, now compile this code and give it  a  try.    Try a number of
lines of VAR declarations, try a list of several variables on one
line, and try combinations of the two.  Does it work?


INITIALIZERS

As long as we're dealing with data declarations, one thing that's
always  bothered  me  about  Pascal  is  that  it  doesn't  allow
initializing  data items in the declaration.    That  feature  is
admittedly sort of a frill, and it  may  be  out  of  place  in a
language that purports to  be  a minimal language.  But it's also
SO easy to add that it seems a shame not  to  do  so.    The  BNF
becomes:


     <var-list> ::= <var> ( <var> )*

     <var> ::= <ident> [ = <integer> ]

Change Alloc as follows:


{--------------------------------------------------------------}
{ Allocate Storage for a Variable }

procedure Alloc(N: char);
begin
   Write(N, ':', TAB, 'DC ');
   if Look = '=' then begin
      Match('=');
      WriteLn(GetNum);
      end
   else
      WriteLn('0');
end;
{--------------------------------------------------------------}


There you are: an initializer with six added lines of Pascal.

OK, try this  version  of  TINY  and verify that you can, indeed,
give the variables initial values.

By golly, this thing is starting to look  real!    Of  course, it
still doesn't DO anything, but it looks good, doesn't it?

Before leaving this section, I should point out  that  we've used
two versions of function GetNum.  One, the earlier one, returns a
character value, a single digit.  The other accepts a multi-digit
integer and returns an integer value.  Either one will work here,
since WriteLn will handle either type.  But there's no  reason to
limit ourselves  to  single-digit  values  here,  so  the correct
version to use is the one that returns an integer.  Here it is:


{--------------------------------------------------------------}
{ Get a Number }

function GetNum: integer;
var Val: integer;
begin
   Val := 0;
   if not IsDigit(Look) then Expected('Integer');
   while IsDigit(Look) do begin
      Val := 10 * Val + Ord(Look) - Ord('0');
      GetChar;
   end;
   GetNum := Val;
end;
{--------------------------------------------------------------}

As a matter  of  fact,  strictly  speaking  we  should  allow for
expressions in the data field of the initializer, or at  the very
least  for  negative  values.  For  now,  let's  just  allow  for
negative values by changing the code for Alloc as follows:


{--------------------------------------------------------------}
{ Allocate Storage for a Variable }

procedure Alloc(N: char);
begin
   if InTable(N) then Abort('Duplicate Variable Name ' + N);
   ST[N] := 'v';
   Write(N, ':', TAB, 'DC ');
   if Look = '=' then begin
      Match('=');
      If Look = '-' then begin
         Write(Look);
         Match('-');
      end;
      WriteLn(GetNum);
      end
   else
      WriteLn('0');
end;
{--------------------------------------------------------------}


Now  you should be able to  initialize  variables  with  negative
and/or multi-digit values.


THE SYMBOL TABLE

There's one problem  with  the  compiler  as it stands so far: it
doesn't do anything to record a variable when we declare it.   So
the compiler is perfectly content to allocate storage for several
variables with the same name.  You can easily verify this with an
input like


     pvavavabe.


Here we've declared the variable A three times.  As you  can see,
the compiler will  cheerfully  accept  that,  and  generate three
identical labels.  Not good.

Later on,  when we start referencing variables, the compiler will
also let us reference variables  that don't exist.  The assembler
will  catch  both  of these error conditions, but it doesn't seem
friendly at all to pass such errors along to the assembler.   The
compiler should catch such things at the source language level.

So even though we don't need a symbol table to record data types,
we ought to install  one  just to check for these two conditions.
Since at this  point  we are still restricted to single-character
variable names, the symbol table can be trivial.  To  provide for
it, first add the following  declaration at the beginning of your
program:


     var ST: array['A'..'Z'] of char;


and insert the following function:


{--------------------------------------------------------------}
{ Look for Symbol in Table }

function InTable(n: char): Boolean;
begin
   InTable := ST[n] <> ' ';
end;
{--------------------------------------------------------------}


We  also  need  to initialize the  table  to  all  blanks.    The
following lines in Init will do the job:


var i: char;
begin
   for i := 'A' to 'Z' do
      ST[i] := ' ';
   ...


Finally,  insert  the  following two lines at  the  beginning  of
Alloc:


   if InTable(N) then Abort('Duplicate Variable Name ' + N);
   ST[N] := 'v';


That  should  do  it.  The  compiler  will  now  catch  duplicate
declarations.  Later, we can  also  use  InTable  when generating
references to the variables.


EXECUTABLE STATEMENTS

At this point, we can generate a null program that has  some data
variables  declared  and  possibly initialized.  But  so  far  we
haven't arranged to generate the first line of executable code.

Believe  it or not, though, we almost  have  a  usable  language!
What's missing is the executable code that must go into  the main
program.  But that code is just assignment statements and control
statements ... all stuff we have done before.   So  it  shouldn't
take us long to provide for them, as well.

The BNF definition given earlier  for the main program included a
statement block, which we have so far ignored:


     <main> ::= BEGIN <block> END


For now,  we  can  just  consider  a  block  to  be  a  series of
assignment statements:


     <block> ::= (Assignment)*


Let's start things off by adding  a  parser for the block.  We'll
begin with a stub for the assignment statement:


{--------------------------------------------------------------}
{ Parse and Translate an Assignment Statement }

procedure Assignment;
begin
   GetChar;
end;


{--------------------------------------------------------------}
{ Parse and Translate a Block of Statements }

procedure Block;
begin
   while Look <> 'e' do
      Assignment;
end;
{--------------------------------------------------------------}


Modify procedure Main to call Block as shown below:


{--------------------------------------------------------------}
{ Parse and Translate a Main Program }

procedure Main;
begin
   Match('b');
   Prolog;
   Block;
   Match('e');
   Epilog;
end;
{--------------------------------------------------------------}


This version still won't generate any code for  the   "assignment
statements" ... all it does is to eat characters  until  it  sees
the 'e' for 'END.'  But it sets the stage for what is to follow.

The  next  step,  of  course,  is  to  flesh out the code for  an
assignment statement.  This  is  something  we've done many times
before,  so  I  won't belabor it.  This time, though, I'd like to
deal with the code generation a little differently.  Up till now,
we've always just inserted the Emits that generate output code in
line with  the parsing routines.  A little unstructured, perhaps,
but it seemed the most straightforward approach, and made it easy
to see what kind of code would be emitted for each construct.

However, I realize that most of you are using an  80x86 computer,
so  the 68000 code generated is of little use to you.  Several of
you have asked me if the CPU-dependent code couldn't be collected
into one spot  where  it  would  be easier to retarget to another
CPU.  The answer, of course, is yes.

To  accomplish  this,  insert  the  following  "code  generation"
routines:


{---------------------------------------------------------------}
{ Clear the Primary Register }

procedure Clear;
begin
   EmitLn('CLR D0');
end;


{---------------------------------------------------------------}
{ Negate the Primary Register }

procedure Negate;
begin
   EmitLn('NEG D0');
end;


{---------------------------------------------------------------}
{ Load a Constant Value to Primary Register }

procedure LoadConst(n: integer);
begin
   Emit('MOVE #');
   WriteLn(n, ',D0');
end;


{---------------------------------------------------------------}
{ Load a Variable to Primary Register }

procedure LoadVar(Name: char);
begin
   if not InTable(Name) then Undefined(Name);
   EmitLn('MOVE ' + Name + '(PC),D0');
end;


{---------------------------------------------------------------}
{ Push Primary onto Stack }

procedure Push;
begin
   EmitLn('MOVE D0,-(SP)');
end;


{---------------------------------------------------------------}
{ Add Top of Stack to Primary }

procedure PopAdd;
begin
   EmitLn('ADD (SP)+,D0');
end;


{---------------------------------------------------------------}
{ Subtract Primary from Top of Stack }

procedure PopSub;
begin
   EmitLn('SUB (SP)+,D0');
   EmitLn('NEG D0');
end;


{---------------------------------------------------------------}
{ Multiply Top of Stack by Primary }

procedure PopMul;
begin
   EmitLn('MULS (SP)+,D0');
end;


{---------------------------------------------------------------}
{ Divide Top of Stack by Primary }

procedure PopDiv;
begin
   EmitLn('MOVE (SP)+,D7');
   EmitLn('EXT.L D7');
   EmitLn('DIVS D0,D7');
   EmitLn('MOVE D7,D0');
end;


{---------------------------------------------------------------}
{ Store Primary to Variable }

procedure Store(Name: char);
begin
   if not InTable(Name) then Undefined(Name);
   EmitLn('LEA ' + Name + '(PC),A0');
   EmitLn('MOVE D0,(A0)')
end;
{---------------------------------------------------------------}


The  nice  part  of  this  approach,  of  course,  is that we can
retarget  the compiler to a new CPU  simply  by  rewriting  these
"code generator" procedures.  In  addition,  we  will  find later
that we can improve the code quality by tweaking these routines a
bit, without having to modify the compiler proper.

Note that both LoadVar  and  Store check the symbol table to make
sure that the variable is defined.  The  error  handler Undefined
simply calls Abort:


{--------------------------------------------------------------}
{ Report an Undefined Identifier }

procedure Undefined(n: string);
begin
   Abort('Undefined Identifier ' + n);
end;
{--------------------------------------------------------------}


OK, we are now finally ready to begin processing executable code.
We'll  do  that  by  replacing  the  stub  version  of  procedure
Assignment.

We've been down this  road  many times before, so this should all
be familiar to you.    In fact, except for the changes associated
with the code generation, we  could just copy the procedures from
Part  VII.    Since we are making some changes, I won't just copy
them, but we will go a little faster than usual.

The BNF for the assignment statement is:

     <assignment> ::= <ident> = <expression>

     <expression> ::= <first term> ( <addop> <term> )*

     <first term> ::= <first factor> <rest>

     <term> ::= <factor> <rest>

     <rest> ::= ( <mulop> <factor> )*

     <first factor> ::= [ <addop> ] <factor>

     <factor> ::= <var> | <number> | ( <expression> )


This version of the BNF is  also  a bit different than we've used
before ... yet another "variation on the theme of an expression."
This particular version  has  what  I  consider  to  be  the best
treatment  of  the  unary minus.  As you'll see later, it lets us
handle   negative  constant  values  efficiently.    It's   worth
mentioning  here  that  we  have  often  seen  the advantages  of
"tweaking"  the  BNF  as we go, to help make the language easy to
parse.    What  you're looking at here is a bit different:  we've
tweaked  the  BNF  to make the CODE  GENERATION  more  efficient!
That's a first for this series.

Anyhow, the following code implements the BNF:


{---------------------------------------------------------------}
{ Parse and Translate a Math Factor }

procedure Expression; Forward;

procedure Factor;
begin
   if Look = '(' then begin
      Match('(');
      Expression;
      Match(')');
      end
   else if IsAlpha(Look) then
      LoadVar(GetName)
   else
      LoadConst(GetNum);
end;


{--------------------------------------------------------------}
{ Parse and Translate a Negative Factor }

procedure NegFactor;
begin
   Match('-');
   if IsDigit(Look) then
      LoadConst(-GetNum)
   else begin
      Factor;
      Negate;
   end;
end;


{--------------------------------------------------------------}
{ Parse and Translate a Leading Factor }

procedure FirstFactor;
begin
   case Look of
     '+': begin
             Match('+');
             Factor;
          end;
     '-': NegFactor;
   else  Factor;
   end;
end;


{--------------------------------------------------------------}
{ Recognize and Translate a Multiply }

procedure Multiply;
begin
   Match('*');
   Factor;
   PopMul;
end;


{-------------------------------------------------------------}
{ Recognize and Translate a Divide }

procedure Divide;
begin
   Match('/');
   Factor;
   PopDiv;
end;


{---------------------------------------------------------------}
{ Common Code Used by Term and FirstTerm }

procedure Term1;
begin
   while IsMulop(Look) do begin
      Push;
      case Look of
       '*': Multiply;
       '/': Divide;
      end;
   end;
end;


{---------------------------------------------------------------}
{ Parse and Translate a Math Term }

procedure Term;
begin
   Factor;
   Term1;
end;


{---------------------------------------------------------------}
{ Parse and Translate a Leading Term }

procedure FirstTerm;
begin
   FirstFactor;
   Term1;
end;


{--------------------------------------------------------------}
{ Recognize and Translate an Add }

procedure Add;
begin
   Match('+');
   Term;
   PopAdd;
end;


{-------------------------------------------------------------}
{ Recognize and Translate a Subtract }

procedure Subtract;
begin
   Match('-');
   Term;
   PopSub;
end;


{---------------------------------------------------------------}
{ Parse and Translate an Expression }

procedure Expression;
begin
   FirstTerm;
   while IsAddop(Look) do begin
      Push;
      case Look of
       '+': Add;
       '-': Subtract;
      end;
   end;
end;


{--------------------------------------------------------------}
{ Parse and Translate an Assignment Statement }

procedure Assignment;
var Name: char;
begin
   Name := GetName;
   Match('=');
   Expression;
   Store(Name);
end;
{--------------------------------------------------------------}


OK, if you've  got  all  this  code inserted, then compile it and
check  it out.  You should  be  seeing  reasonable-looking  code,
representing a complete program that will  assemble  and execute.
We have a compiler!


BOOLEANS

The next step should also  be  familiar  to  you.    We  must add
Boolean  expressions  and relational operations.    Again,  since
we've already dealt with them more than once,  I  won't elaborate
much on them, except  where  they  are  different from what we've
done before.  Again, we won't just copy from other  files because
I've changed a few things just a bit.  Most  of  the changes just
involve encapsulating the machine-dependent parts as  we  did for
the   arithmetic  operations.    I've  also  modified   procedure
NotFactor  somewhat,  to  parallel  the structure of FirstFactor.
Finally,  I  corrected  an  error  in  the  object code  for  the
relational operators:  The Scc instruction I used  only  sets the
low 8 bits of D0.  We want all 16 bits set for a logical true, so
I've added an instruction to sign-extend the low byte.

To begin, we're going to need some more recognizers:


{--------------------------------------------------------------}
{ Recognize a Boolean Orop }

function IsOrop(c: char): boolean;
begin
   IsOrop := c in ['|', '~'];
end;


{--------------------------------------------------------------}
{ Recognize a Relop }

function IsRelop(c: char): boolean;
begin
   IsRelop := c in ['=', '#', '<', '>'];
end;
{--------------------------------------------------------------}


Also, we're going to need some more code generation routines:


{---------------------------------------------------------------}
{ Complement the Primary Register }

procedure NotIt;
begin
   EmitLn('NOT D0');
end;
{---------------------------------------------------------------}
.
.
.
{---------------------------------------------------------------}
{ AND Top of Stack with Primary }

procedure PopAnd;
begin
   EmitLn('AND (SP)+,D0');
end;


{---------------------------------------------------------------}
{ OR Top of Stack with Primary }

procedure PopOr;
begin
   EmitLn('OR (SP)+,D0');
end;


{---------------------------------------------------------------}
{ XOR Top of Stack with Primary }

procedure PopXor;
begin
   EmitLn('EOR (SP)+,D0');
end;


{---------------------------------------------------------------}
{ Compare Top of Stack with Primary }

procedure PopCompare;
begin
   EmitLn('CMP (SP)+,D0');
end;


{---------------------------------------------------------------}
{ Set D0 If Compare was = }

procedure SetEqual;
begin
   EmitLn('SEQ D0');
   EmitLn('EXT D0');
end;


{---------------------------------------------------------------}
{ Set D0 If Compare was != }

procedure SetNEqual;
begin
   EmitLn('SNE D0');
   EmitLn('EXT D0');
end;


{---------------------------------------------------------------}
{ Set D0 If Compare was > }

procedure SetGreater;
begin
   EmitLn('SLT D0');
   EmitLn('EXT D0');
end;


{---------------------------------------------------------------}
{ Set D0 If Compare was < }

procedure SetLess;
begin
   EmitLn('SGT D0');
   EmitLn('EXT D0');
end;
{---------------------------------------------------------------}

All of this  gives us the tools we need.  The BNF for the Boolean
expressions is:


     <bool-expr> ::= <bool-term> ( <orop> <bool-term> )*

     <bool-term> ::= <not-factor> ( <andop> <not-factor> )*

     <not-factor> ::= [ '!' ] <relation>

     <relation> ::= <expression> [ <relop> <expression> ]


Sharp-eyed readers might  note  that this syntax does not include
the non-terminal  "bool-factor" used in earlier versions.  It was
needed then because I also allowed for the Boolean constants TRUE
and FALSE.   But  remember  that  in TINY there is no distinction
made between Boolean and arithmetic  types ... they can be freely
intermixed.   So there is really no  need  for  these  predefined
values ... we can just use -1 and 0, respectively.

In C terminology, we could always use the defines:


     #define TRUE -1
     #define FALSE 0


(That is, if TINY had a  preprocessor.)   Later on, when we allow
for  declarations  of  constants,  these  two   values   will  be
predefined by the language.

The reason that I'm harping on this is that  I've  already  tried
the alternative, which is to  include TRUE and FALSE as keywords.
The problem with that approach is that it  then  requires lexical
scanning for EVERY variable name  in every expression.  If you'll
recall,  I pointed out in Installment VII  that  this  slows  the
compiler  down considerably.  As long as  keywords  can't  be  in
expressions, we need to do the scanning only at the  beginning of
every  new  statement  ...  quite  an improvement.  So using  the
syntax above not only simplifies the parsing, but  speeds  up the
scanning as well.

OK, given that we're  all  satisfied  with  the syntax above, the
corresponding code is shown below:


{---------------------------------------------------------------}
{ Recognize and Translate a Relational "Equals" }

procedure Equals;
begin
   Match('=');
   Expression;
   PopCompare;
   SetEqual;
end;


{---------------------------------------------------------------}
{ Recognize and Translate a Relational "Not Equals" }

procedure NotEquals;
begin
   Match('#');
   Expression;
   PopCompare;
   SetNEqual;
end;


{---------------------------------------------------------------}
{ Recognize and Translate a Relational "Less Than" }

procedure Less;
begin
   Match('<');
   Expression;
   PopCompare;
   SetLess;
end;


{---------------------------------------------------------------}
{ Recognize and Translate a Relational "Greater Than" }

procedure Greater;
begin
   Match('>');
   Expression;
   PopCompare;
   SetGreater;
end;


{---------------------------------------------------------------}
{ Parse and Translate a Relation }


procedure Relation;
begin
   Expression;
   if IsRelop(Look) then begin
      Push;
      case Look of
       '=': Equals;
       '#': NotEquals;
       '<': Less;
       '>': Greater;
      end;
   end;
end;


{---------------------------------------------------------------}
{ Parse and Translate a Boolean Factor with Leading NOT }

procedure NotFactor;
begin
   if Look = '!' then begin
      Match('!');
      Relation;
      NotIt;
      end
   else
      Relation;
end;


{---------------------------------------------------------------}
{ Parse and Translate a Boolean Term }

procedure BoolTerm;
begin
   NotFactor;
   while Look = '&' do begin
      Push;
      Match('&');
      NotFactor;
      PopAnd;
   end;
end;


{--------------------------------------------------------------}
{ Recognize and Translate a Boolean OR }

procedure BoolOr;
begin
   Match('|');
   BoolTerm;
   PopOr;
end;


{--------------------------------------------------------------}
{ Recognize and Translate an Exclusive Or }

procedure BoolXor;
begin
   Match('~');
   BoolTerm;
   PopXor;
end;


{---------------------------------------------------------------}
{ Parse and Translate a Boolean Expression }

procedure BoolExpression;
begin
   BoolTerm;
   while IsOrOp(Look) do begin
      Push;
      case Look of
       '|': BoolOr;
       '~': BoolXor;
      end;
   end;
end;
{--------------------------------------------------------------}


To tie it all together, don't forget to change the  references to
Expression in  procedures Factor and Assignment so that they call
BoolExpression instead.

OK, if  you've  got  all  that typed in, compile it and give it a
whirl.    First,  make  sure  you  can  still parse  an  ordinary
arithmetic expression.  Then, try a Boolean one.    Finally, make
sure  that you can assign the results of  relations.    Try,  for
example:

     pvx,y,zbx=z>ye.

which stands for:

     PROGRAM
     VAR X,Y,Z
     BEGIN
     X = Z > Y
     END.


See how this assigns a Boolean value to X?

CONTROL STRUCTURES

We're almost home.   With  Boolean  expressions  in place, it's a
simple  matter  to  add control structures.  For TINY, we'll only
allow two kinds of them, the IF and the WHILE:


     <if> ::= IF <bool-expression> <block> [ ELSE <block>] ENDIF

     <while> ::= WHILE <bool-expression> <block> ENDWHILE

Once  again,  let  me  spell  out the decisions implicit in  this
syntax, which departs strongly from that of C or Pascal.  In both
of those languages, the "body" of an IF or WHILE is regarded as a
single  statement.  If you intend to use a block of more than one
statement, you have to build a compound statement using BEGIN-END
(in Pascal) or  '{}' (in C).  In TINY (and KISS) there is no such
thing as a compound statement  ... single or multiple they're all
just blocks to these languages.

In KISS, all the control structures will have explicit and unique
keywords  bracketing  the  statement block, so there  can  be  no
confusion as to where things begin  and  end.  This is the modern
approach, used in such respected languages as Ada  and  Modula 2,
and it completely eliminates the problem of the "dangling else."

Note  that I could have chosen to use the same keyword END to end
all  the constructs, as is done in Pascal.  (The closing '}' in C
serves the same purpose.)  But this has always led  to confusion,
which is why Pascal programmers tend to write things like


     end { loop }

or   end { if }


As I explained in  Part  V,  using  unique terminal keywords does
increase  the  size  of the keyword list and therefore slows down
the  scanning, but in this case it seems a small price to pay for
the added insurance.   Better  to find the errors at compile time
rather than run time.

One last thought:  The two constructs above each  have  the  non-
terminals


      <bool-expression> and <block>


juxtaposed with no separating keyword.  In Pascal we would expect
the keywords THEN and DO in these locations.

I have no problem with leaving out these keywords, and the parser
has no trouble either, ON CONDITION that we make no errors in the
bool-expression part.  On  the  other hand, if we were to include
these extra keywords we would get yet one more level of insurance
at very little  cost,  and  I  have no problem with that, either.
Use your best judgment as to which way to go.

OK, with that bit of explanation let's proceed.  As  usual, we're
going to need some new  code generation routines.  These generate
the code for conditional and unconditional branches:

{---------------------------------------------------------------}
{ Branch Unconditional  }

procedure Branch(L: string);
begin
   EmitLn('BRA ' + L);
end;


{---------------------------------------------------------------}
{ Branch False }

procedure BranchFalse(L: string);
begin
   EmitLn('TST D0');
   EmitLn('BEQ ' + L);
end;
{--------------------------------------------------------------}


Except for the encapsulation of  the code generation, the code to
parse the control constructs is the same as you've seen before:


{---------------------------------------------------------------}
{ Recognize and Translate an IF Construct }

procedure Block; Forward;


procedure DoIf;
var L1, L2: string;
begin
   Match('i');
   BoolExpression;
   L1 := NewLabel;
   L2 := L1;
   BranchFalse(L1);
   Block;
   if Look = 'l' then begin
      Match('l');
      L2 := NewLabel;
      Branch(L2);
      PostLabel(L1);
      Block;
   end;
   PostLabel(L2);
   Match('e');
end;


{--------------------------------------------------------------}
{ Parse and Translate a WHILE Statement }

procedure DoWhile;
var L1, L2: string;
begin
   Match('w');
   L1 := NewLabel;
   L2 := NewLabel;
   PostLabel(L1);
   BoolExpression;
   BranchFalse(L2);
   Block;
   Match('e');
   Branch(L1);
   PostLabel(L2);
end;
{--------------------------------------------------------------}


To tie everything  together,  we need only modify procedure Block
to recognize the "keywords" for the  IF  and WHILE.  As usual, we
expand the definition of a block like so:


     <block> ::= ( <statement> )*


where


     <statement> ::= <if> | <while> | <assignment>


The corresponding code is:


{--------------------------------------------------------------}
{ Parse and Translate a Block of Statements }

procedure Block;
begin
   while not(Look in ['e', 'l']) do begin
      case Look of
       'i': DoIf;
       'w': DoWhile;
      else Assignment;
      end;
   end;
end;
{--------------------------------------------------------------}


OK,  add the routines I've given, compile and  test  them.    You
should be able to parse the single-character versions  of  any of
the control constructs.  It's looking pretty good!

As a matter  of  fact, except for the single-character limitation
we've got a virtually complete version of TINY.  I call  it, with
tongue planted firmly in cheek, TINY Version 0.1.


LEXICAL SCANNING

Of course, you know what's next:  We have to convert  the program
so that  it can deal with multi-character keywords, newlines, and
whitespace.   We have just gone through all  that  in  Part  VII.
We'll use the distributed scanner  technique that I showed you in
that  installment.    The  actual  implementation  is   a  little
different because the way I'm handling newlines is different.

To begin with, let's simply  allow for whitespace.  This involves
only adding calls to SkipWhite at the end of the  three routines,
GetName, GetNum, and Match.    A call to SkipWhite in Init primes
the pump in case there are leading spaces.

Next, we need to deal with  newlines.   This is really a two-step
process,  since  the  treatment  of  the  newlines  with  single-
character tokens is different from that for multi-character ones.
We can eliminate some work by doing both  steps  at  once,  but I
feel safer taking things one step at a time.

Insert the new procedure:


{--------------------------------------------------------------}
{ Skip Over an End-of-Line }

procedure NewLine;
begin
   while Look = CR do begin
      GetChar;
      if Look = LF then GetChar;
      SkipWhite;
   end;
end;
{--------------------------------------------------------------}


Note that  we  have  seen  this  procedure  before in the form of
Procedure Fin.  I've changed the name since this  new  one  seems
more descriptive of the actual function.  I've  also  changed the
code  to  allow  for multiple newlines and lines with nothing but
white space.

The next step is to insert calls to NewLine wherever we  decide a
newline is permissible.  As I've pointed out before, this  can be
very different in different languages.   In TINY, I've decided to
allow them virtually anywhere.  This means that we need  calls to
NewLine at the BEGINNING (not the end, as with SkipWhite)  of the
procedures GetName, GetNum, and Match.

For procedures that have while loops, such as TopDecl, we  need a
call  to NewLine at the beginning of the  procedure  AND  at  the
bottom  of  each  loop.  That way, we can be assured that NewLine
has just been called at the beginning of each  pass  through  the
loop.

If you've got all this done, try the program out and  verify that
it will indeed handle white space and newlines.

If it does, then we're  ready to deal with multi-character tokens
and keywords.   To begin, add the additional declarations (copied
almost verbatim from Part VII):


{--------------------------------------------------------------}
{ Type Declarations }

type Symbol = string[8];

     SymTab = array[1..1000] of Symbol;

     TabPtr = ^SymTab;


{--------------------------------------------------------------}
{ Variable Declarations }

var Look : char;             { Lookahead Character }
    Token: char;             { Encoded Token       }
    Value: string[16];       { Unencoded Token     }

    ST: Array['A'..'Z'] of char;

{--------------------------------------------------------------}
{ Definition of Keywords and Token Types }

const NKW =   9;
      NKW1 = 10;

const KWlist: array[1..NKW] of Symbol =
              ('IF', 'ELSE', 'ENDIF', 'WHILE', 'ENDWHILE',
               'VAR', 'BEGIN', 'END', 'PROGRAM');

const KWcode: string[NKW1] = 'xilewevbep';
{--------------------------------------------------------------}


Next, add the three procedures, also from Part VII:


{--------------------------------------------------------------}
{ Table Lookup }

function Lookup(T: TabPtr; s: string; n: integer): integer;
var i: integer;
    found: Boolean;
begin
   found := false;
   i := n;
   while (i > 0) and not found do
      if s = T^[i] then
         found := true
      else
         dec(i);
   Lookup := i;
end;
{--------------------------------------------------------------}
.
.
{--------------------------------------------------------------}
{ Get an Identifier and Scan it for Keywords }

procedure Scan;
begin
   GetName;
   Token := KWcode[Lookup(Addr(KWlist), Value, NKW) + 1];
end;
{--------------------------------------------------------------}
.
.
{--------------------------------------------------------------}
{ Match a Specific Input String }

procedure MatchString(x: string);
begin
   if Value <> x then Expected('''' + x + '''');
end;
{--------------------------------------------------------------}


Now, we have to make a  fairly  large number of subtle changes to
the remaining procedures.  First,  we  must  change  the function
GetName to a procedure, again as we did in Part VII:


{--------------------------------------------------------------}
{ Get an Identifier }

procedure GetName;
begin
   NewLine;
   if not IsAlpha(Look) then Expected('Name');
   Value := '';
   while IsAlNum(Look) do begin
      Value := Value + UpCase(Look);
      GetChar;
   end;
   SkipWhite;
end;
{--------------------------------------------------------------}


Note that this procedure leaves its result in  the  global string
Value.

Next, we have to change every reference to GetName to reflect its
new form. These occur in Factor, Assignment, and Decl:


{---------------------------------------------------------------}
{ Parse and Translate a Math Factor }

procedure BoolExpression; Forward;

procedure Factor;
begin
   if Look = '(' then begin
      Match('(');
      BoolExpression;
      Match(')');
      end
   else if IsAlpha(Look) then begin
      GetName;
      LoadVar(Value[1]);
      end
   else
      LoadConst(GetNum);
end;
{--------------------------------------------------------------}
.
.
{--------------------------------------------------------------}
{ Parse and Translate an Assignment Statement }

procedure Assignment;
var Name: char;
begin
   Name := Value[1];
   Match('=');
   BoolExpression;
   Store(Name);
end;
{---------------------------------------------------------------}
.
.
{--------------------------------------------------------------}
{ Parse and Translate a Data Declaration }

procedure Decl;
begin
   GetName;
   Alloc(Value[1]);
   while Look = ',' do begin
      Match(',');
      GetName;
      Alloc(Value[1]);
   end;
end;
{--------------------------------------------------------------}


(Note that we're still  only  allowing  single-character variable
names,  so we take the easy way out here and simply use the first
character of the string.)

Finally, we must make the changes to use Token instead of Look as
the  test  character  and to call Scan at the appropriate places.
Mostly, this  involves  deleting  calls  to  Match,  occasionally
replacing calls to  Match  by calls to MatchString, and Replacing
calls  to  NewLine  by  calls  to  Scan.    Here are the affected
routines:

{---------------------------------------------------------------}
{ Recognize and Translate an IF Construct }

procedure Block; Forward;


procedure DoIf;
var L1, L2: string;
begin
   BoolExpression;
   L1 := NewLabel;
   L2 := L1;
   BranchFalse(L1);
   Block;
   if Token = 'l' then begin
      L2 := NewLabel;
      Branch(L2);
      PostLabel(L1);
      Block;
   end;
   PostLabel(L2);
   MatchString('ENDIF');
end;


{--------------------------------------------------------------}
{ Parse and Translate a WHILE Statement }

procedure DoWhile;
var L1, L2: string;
begin
   L1 := NewLabel;
   L2 := NewLabel;
   PostLabel(L1);
   BoolExpression;
   BranchFalse(L2);
   Block;
   MatchString('ENDWHILE');
   Branch(L1);
   PostLabel(L2);
end;


{--------------------------------------------------------------}
{ Parse and Translate a Block of Statements }

procedure Block;
begin
   Scan;
   while not(Token in ['e', 'l']) do begin
      case Token of
       'i': DoIf;
       'w': DoWhile;
      else Assignment;
      end;
      Scan;
   end;
end;


{--------------------------------------------------------------}
{ Parse and Translate Global Declarations }

procedure TopDecls;
begin
   Scan;
   while Token <> 'b' do begin
      case Token of
        'v': Decl;
      else Abort('Unrecognized Keyword ' + Value);
      end;
      Scan;
   end;
end;


{--------------------------------------------------------------}
{ Parse and Translate a Main Program }

procedure Main;
begin
   MatchString('BEGIN');
   Prolog;
   Block;
   MatchString('END');
   Epilog;
end;

{--------------------------------------------------------------}
{  Parse and Translate a Program }

procedure Prog;
begin
   MatchString('PROGRAM');
   Header;
   TopDecls;
   Main;
   Match('.');
end;


{--------------------------------------------------------------}
{ Initialize }

procedure Init;
var i: char;
begin
   for i := 'A' to 'Z' do
      ST[i] := ' ';
   GetChar;
   Scan;
end;
{--------------------------------------------------------------}


That should do  it.    If  all  the changes got in correctly, you
should now be parsing programs that look like programs.   (If you
didn't  make  it  through all the  changes,  don't  despair.    A
complete listing of the final form is given later.)

Did it work?  If so, then we're just about home.  In fact, with a
few minor  exceptions we've already got a compiler that's usable.
There are still a few areas that need improvement.


MULTI-CHARACTER VARIABLE NAMES

One of those is  the  restriction  that  we still have, requiring
single-character variable names.    Now that we can handle multi-
character keywords, this one  begins  to  look  very much like an
arbitrary  and  unnecessary  limitation.    And  indeed   it  is.
Basically, its only virtue is  that it permits a trivially simple
implementation  of  the   symbol   table.    But  that's  just  a
convenience to the compiler writers, and needs to be eliminated.

We've done this step before.  This time, as usual, I'm doing it a
little differently.  I think  the approach used here keeps things
just about as simple as possible.

The natural  way  to  implement  a  symbol  table in Pascal is by
declaring a record type, and making the symbol table an  array of
such records.  Here, though, we don't really need  a  type  field
yet  (there is only one kind of entry allowed so far), so we only
need an array of symbols.  This has the advantage that we can use
the existing procedure Lookup to  search the symbol table as well
as the  keyword  list.    As it turns out, even when we need more
fields we can still use the same approach, simply by  storing the
other fields in separate arrays.

OK, here are the changes that  need  to  be made.  First, add the
new typed constant:


      NEntry: integer = 0;


Then change the definition of the symbol table as follows:


const MaxEntry = 100;

var ST   : array[1..MaxEntry] of Symbol;


(Note that ST is _NOT_ declared as a SymTab.  That declaration is
a phony one to get Lookup to work.  A SymTab  would  take  up too
much RAM space, and so one is never actually allocated.)

Next, we need to replace InTable:


{--------------------------------------------------------------}
{ Look for Symbol in Table }

function InTable(n: Symbol): Boolean;
begin
   InTable := Lookup(@ST, n, MaxEntry) <> 0;
end;
{--------------------------------------------------------------}


We also need a new procedure, AddEntry, that adds a new  entry to
the table:


{--------------------------------------------------------------}
{ Add a New Entry to Symbol Table }

procedure AddEntry(N: Symbol; T: char);
begin
   if InTable(N) then Abort('Duplicate Identifier ' + N);
   if NEntry = MaxEntry then Abort('Symbol Table Full');
   Inc(NEntry);
   ST[NEntry] := N;
   SType[NEntry] := T;
end;
{--------------------------------------------------------------}


This procedure is called by Alloc:


{--------------------------------------------------------------}
{ Allocate Storage for a Variable }

procedure Alloc(N: Symbol);
begin
   if InTable(N) then Abort('Duplicate Variable Name ' + N);
   AddEntry(N, 'v');
.
.
.
{--------------------------------------------------------------}


Finally, we must change all the routines that currently treat the
variable name as a single character.  These include   LoadVar and
Store (just change the  type  from  char  to string), and Factor,
Assignment, and Decl (just change Value[1] to Value).

One  last  thing:  change  procedure  Init to clear the array  as
shown:


{--------------------------------------------------------------}
{ Initialize }

procedure Init;
var i: integer;
begin
   for i := 1 to MaxEntry do begin
      ST[i] := '';
      SType[i] := ' ';
   end;
   GetChar;
   Scan;
end;
{--------------------------------------------------------------}


That should do it.  Try it out and verify  that  you can, indeed,
use multi-character variable names.


MORE RELOPS

We still have one remaining single-character restriction: the one
on relops.  Some of the relops are indeed single  characters, but
others  require two.  These are '<=' and '>='.  I also prefer the
Pascal '<>' for "not equals,"  instead of '#'.

If you'll recall, in Part VII I pointed out that the conventional
way  to  deal  with  relops  is  to  include them in the list  of
keywords, and let the  lexical  scanner  find  them.  But, again,
this requires scanning throughout the expression parsing process,
whereas so far we've been able to limit the use of the scanner to
the beginning of a statement.

I mentioned then that we can still get away with this,  since the
multi-character relops are so few  and so limited in their usage.
It's easy to just treat them as special cases and handle  them in
an ad hoc manner.

The changes required affect only the code generation routines and
procedures Relation and friends.   First, we're going to need two
more code generation routines:


{---------------------------------------------------------------}
{ Set D0 If Compare was <= }

procedure SetLessOrEqual;
begin
   EmitLn('SGE D0');
   EmitLn('EXT D0');
end;


{---------------------------------------------------------------}
{ Set D0 If Compare was >= }

procedure SetGreaterOrEqual;
begin
   EmitLn('SLE D0');
   EmitLn('EXT D0');
end;
{---------------------------------------------------------------}


Then, modify the relation parsing routines as shown below:


{---------------------------------------------------------------}
{ Recognize and Translate a Relational "Less Than or Equal" }

procedure LessOrEqual;
begin
   Match('=');
   Expression;
   PopCompare;
   SetLessOrEqual;
end;


{---------------------------------------------------------------}
{ Recognize and Translate a Relational "Not Equals" }

procedure NotEqual;
begin
   Match('>');
   Expression;
   PopCompare;
   SetNEqual;
end;


{---------------------------------------------------------------}
{ Recognize and Translate a Relational "Less Than" }

procedure Less;
begin
   Match('<');
   case Look of
     '=': LessOrEqual;
     '>': NotEqual;
   else begin
           Expression;
           PopCompare;
           SetLess;
        end;
   end;
end;


{---------------------------------------------------------------}
{ Recognize and Translate a Relational "Greater Than" }

procedure Greater;
begin
   Match('>');
   if Look = '=' then begin
      Match('=');
      Expression;
      PopCompare;
      SetGreaterOrEqual;
      end
   else begin
      Expression;
      PopCompare;
      SetGreater;
   end;
end;
{---------------------------------------------------------------}


That's all it takes.  Now  you  can  process all the relops.  Try
it.


INPUT/OUTPUT

We  now  have  a complete, working language, except for one minor
embarassment: we have no way to get data in or out.  We need some
I/O.

Now, the convention these days, established in C and continued in
Ada and Modula 2, is to leave I/O statements out of  the language
itself,  and  just  include them in the subroutine library.  That
would  be  fine, except that so far  we  have  no  provision  for
subroutines.  Anyhow, with this approach you run into the problem
of variable-length argument lists.  In Pascal, the I/O statements
are built into the language because they are the  only  ones  for
which  the  argument  list can have a variable number of entries.
In C, we settle for kludges like scanf and printf, and  must pass
the argument count to the called procedure.  In Ada and  Modula 2
we must use the  awkward  (and SLOW!) approach of a separate call
for each argument.

So I think I prefer the  Pascal  approach of building the I/O in,
even though we don't need to.

As  usual,  for  this we need some more code generation routines.
These turn out  to be the easiest of all, because all we do is to
call library procedures to do the work:


{---------------------------------------------------------------}
{ Read Variable to Primary Register }

procedure ReadVar;
begin
   EmitLn('BSR READ');
   Store(Value);
end;


{---------------------------------------------------------------}
{ Write Variable from Primary Register }

procedure WriteVar;
begin
   EmitLn('BSR WRITE');
end;
{--------------------------------------------------------------}


The idea is that READ loads the value from input  to  the D0, and
WRITE outputs it from there.

These two procedures represent  our  first  encounter with a need
for library procedures ... the components of a  Run  Time Library
(RTL).    Of  course, someone (namely  us)  has  to  write  these
routines, but they're not  part  of the compiler itself.  I won't
even bother  showing the routines here, since these are obviously
very much OS-dependent.   I  _WILL_  simply  say that for SK*DOS,
they  are  particularly  simple ... almost trivial.  One reason I
won't show them here is that  you  can add all kinds of fanciness
to the things, for  example  by prompting in READ for the inputs,
and by giving the user a chance to reenter a bad input.

But that is really separate from compiler design, so for now I'll
just assume that a library call TINYLIB.LIB exists.  Since we now
need  it  loaded,  we need to add a statement to  include  it  in
procedure Header:


{--------------------------------------------------------------}
{ Write Header Info }

procedure Header;
begin

   WriteLn('WARMST', TAB, 'EQU $A01E');
   EmitLn('LIB TINYLIB');
end;
{--------------------------------------------------------------}

That takes care of that part.  Now, we also need to recognize the
read  and  write  commands.  We can do this by  adding  two  more
keywords to our list:


{--------------------------------------------------------------}
{ Definition of Keywords and Token Types }

const NKW =   11;
      NKW1 = 12;

const KWlist: array[1..NKW] of Symbol =
              ('IF', 'ELSE', 'ENDIF', 'WHILE', 'ENDWHILE',
               'READ',    'WRITE',    'VAR',    'BEGIN',   'END',
'PROGRAM');

const KWcode: string[NKW1] = 'xileweRWvbep';
{--------------------------------------------------------------}


(Note how I'm using upper case codes here to avoid  conflict with
the 'w' of WHILE.)

Next, we need procedures for processing the  read/write statement
and its argument list:


{--------------------------------------------------------------}
{ Process a Read Statement }
procedure DoRead;
begin
   Match('(');
   GetName;
   ReadVar;
   while Look = ',' do begin
      Match(',');
      GetName;
      ReadVar;
   end;
   Match(')');
end;


{--------------------------------------------------------------}
{ Process a Write Statement }

procedure DoWrite;
begin
   Match('(');
   Expression;
   WriteVar;
   while Look = ',' do begin
      Match(',');
      Expression;
      WriteVar;
   end;
   Match(')');
end;
{--------------------------------------------------------------}


Finally,  we  must  expand  procedure  Block  to  handle the  new
statement types:


{--------------------------------------------------------------}
{ Parse and Translate a Block of Statements }

procedure Block;
begin
   Scan;
   while not(Token in ['e', 'l']) do begin
      case Token of
       'i': DoIf;
       'w': DoWhile;
       'R': DoRead;
       'W': DoWrite;
      else Assignment;
      end;
      Scan;
   end;
end;
{--------------------------------------------------------------}

That's all there is to it.  _NOW_ we have a language!


CONCLUSION

At this point we have TINY completely defined.  It's not much ...
actually a toy  compiler.    TINY  has  only one data type and no
subroutines  ... but it's a complete,  usable  language.    While
you're not likely to be able to write another compiler in  it, or
do anything else very seriously, you could write programs to read
some input, perform calculations,  and  output  the results.  Not
too bad for a toy.

Most importantly, we have a firm base upon which to build further
extensions.  I know you'll be glad to hear this: this is the last
time  I'll  start  over in building a parser ... from  now  on  I
intend to just add features to  TINY  until it becomes KISS.  Oh,
there'll be other times we will  need  to try things out with new
copies  of  the  Cradle, but once we've found out how to do those
things they'll be incorporated into TINY.

What  will  those  features  be?    Well,  for starters  we  need
subroutines and functions.    Then  we  need to be able to handle
different types, including arrays, strings, and other structures.
Then we need to deal with the idea of pointers.  All this will be
upcoming in future installments.

See you then.

For references purposes, the complete listing of TINY Version 1.0
is shown below:


{--------------------------------------------------------------}
program Tiny10;

{--------------------------------------------------------------}
{ Constant Declarations }

const TAB = ^I;
      CR  = ^M;
      LF  = ^J;

      LCount: integer = 0;
      NEntry: integer = 0;


{--------------------------------------------------------------}
{ Type Declarations }

type Symbol = string[8];

     SymTab = array[1..1000] of Symbol;
     TabPtr = ^SymTab;


{--------------------------------------------------------------}
{ Variable Declarations }

var Look : char;             { Lookahead Character }
    Token: char;             { Encoded Token       }
    Value: string[16];       { Unencoded Token     }


const MaxEntry = 100;

var ST   : array[1..MaxEntry] of Symbol;
    SType: array[1..MaxEntry] of char;


{--------------------------------------------------------------}
{ Definition of Keywords and Token Types }

const NKW =   11;
      NKW1 = 12;

const KWlist: array[1..NKW] of Symbol =
              ('IF', 'ELSE', 'ENDIF', 'WHILE', 'ENDWHILE',
               'READ',    'WRITE',    'VAR',    'BEGIN',   'END',
'PROGRAM');

const KWcode: string[NKW1] = 'xileweRWvbep';


{--------------------------------------------------------------}
{ Read New Character From Input Stream }

procedure GetChar;
begin
   Read(Look);
end;

{--------------------------------------------------------------}
{ Report an Error }

procedure Error(s: string);
begin
   WriteLn;
   WriteLn(^G, 'Error: ', s, '.');
end;


{--------------------------------------------------------------}
{ Report Error and Halt }

procedure Abort(s: string);
begin
   Error(s);
   Halt;
end;


{--------------------------------------------------------------}
{ Report What Was Expected }

procedure Expected(s: string);
begin
   Abort(s + ' Expected');
end;

{--------------------------------------------------------------}
{ Report an Undefined Identifier }

procedure Undefined(n: string);
begin
   Abort('Undefined Identifier ' + n);
end;


{--------------------------------------------------------------}
{ Recognize an Alpha Character }

function IsAlpha(c: char): boolean;
begin
   IsAlpha := UpCase(c) in ['A'..'Z'];
end;


{--------------------------------------------------------------}
{ Recognize a Decimal Digit }

function IsDigit(c: char): boolean;
begin
   IsDigit := c in ['0'..'9'];
end;


{--------------------------------------------------------------}
{ Recognize an AlphaNumeric Character }

function IsAlNum(c: char): boolean;
begin
   IsAlNum := IsAlpha(c) or IsDigit(c);
end;


{--------------------------------------------------------------}
{ Recognize an Addop }

function IsAddop(c: char): boolean;
begin
   IsAddop := c in ['+', '-'];
end;


{--------------------------------------------------------------}
{ Recognize a Mulop }

function IsMulop(c: char): boolean;
begin
   IsMulop := c in ['*', '/'];
end;


{--------------------------------------------------------------}
{ Recognize a Boolean Orop }

function IsOrop(c: char): boolean;
begin
   IsOrop := c in ['|', '~'];
end;


{--------------------------------------------------------------}
{ Recognize a Relop }

function IsRelop(c: char): boolean;
begin
   IsRelop := c in ['=', '#', '<', '>'];
end;


{--------------------------------------------------------------}
{ Recognize White Space }

function IsWhite(c: char): boolean;
begin
   IsWhite := c in [' ', TAB];
end;


{--------------------------------------------------------------}
{ Skip Over Leading White Space }

procedure SkipWhite;
begin
   while IsWhite(Look) do
      GetChar;
end;


{--------------------------------------------------------------}
{ Skip Over an End-of-Line }

procedure NewLine;
begin
   while Look = CR do begin
      GetChar;
      if Look = LF then GetChar;
      SkipWhite;
   end;
end;


{--------------------------------------------------------------}
{ Match a Specific Input Character }

procedure Match(x: char);
begin
   NewLine;
   if Look = x then GetChar
   else Expected('''' + x + '''');
   SkipWhite;
end;


{--------------------------------------------------------------}
{ Table Lookup }

function Lookup(T: TabPtr; s: string; n: integer): integer;
var i: integer;
    found: Boolean;
begin
   found := false;
   i := n;
   while (i > 0) and not found do
      if s = T^[i] then
         found := true
      else
         dec(i);
   Lookup := i;
end;


{--------------------------------------------------------------}
{ Locate a Symbol in Table }
{ Returns the index of the entry.  Zero if not present. }

function Locate(N: Symbol): integer;
begin
   Locate := Lookup(@ST, n, MaxEntry);
end;


{--------------------------------------------------------------}
{ Look for Symbol in Table }

function InTable(n: Symbol): Boolean;
begin
   InTable := Lookup(@ST, n, MaxEntry) <> 0;
end;


{--------------------------------------------------------------}
{ Add a New Entry to Symbol Table }

procedure AddEntry(N: Symbol; T: char);
begin
   if InTable(N) then Abort('Duplicate Identifier ' + N);
   if NEntry = MaxEntry then Abort('Symbol Table Full');
   Inc(NEntry);
   ST[NEntry] := N;
   SType[NEntry] := T;
end;


{--------------------------------------------------------------}
{ Get an Identifier }

procedure GetName;
begin
   NewLine;
   if not IsAlpha(Look) then Expected('Name');
   Value := '';
   while IsAlNum(Look) do begin
      Value := Value + UpCase(Look);
      GetChar;
   end;
   SkipWhite;
end;


{--------------------------------------------------------------}
{ Get a Number }

function GetNum: integer;
var Val: integer;
begin
   NewLine;
   if not IsDigit(Look) then Expected('Integer');
   Val := 0;
   while IsDigit(Look) do begin
      Val := 10 * Val + Ord(Look) - Ord('0');
      GetChar;
   end;
   GetNum := Val;
   SkipWhite;
end;


{--------------------------------------------------------------}
{ Get an Identifier and Scan it for Keywords }

procedure Scan;
begin
   GetName;
   Token := KWcode[Lookup(Addr(KWlist), Value, NKW) + 1];
end;


{--------------------------------------------------------------}
{ Match a Specific Input String }

procedure MatchString(x: string);
begin
   if Value <> x then Expected('''' + x + '''');
end;


{--------------------------------------------------------------}
{ Output a String with Tab }

procedure Emit(s: string);
begin
   Write(TAB, s);
end;


{--------------------------------------------------------------}
{ Output a String with Tab and CRLF }

procedure EmitLn(s: string);
begin
   Emit(s);
   WriteLn;
end;


{--------------------------------------------------------------}
{ Generate a Unique Label }

function NewLabel: string;
var S: string;
begin
   Str(LCount, S);
   NewLabel := 'L' + S;
   Inc(LCount);
end;


{--------------------------------------------------------------}
{ Post a Label To Output }

procedure PostLabel(L: string);
begin
   WriteLn(L, ':');
end;


{---------------------------------------------------------------}
{ Clear the Primary Register }

procedure Clear;
begin
   EmitLn('CLR D0');
end;


{---------------------------------------------------------------}
{ Negate the Primary Register }

procedure Negate;
begin
   EmitLn('NEG D0');
end;


{---------------------------------------------------------------}
{ Complement the Primary Register }

procedure NotIt;
begin
   EmitLn('NOT D0');
end;


{---------------------------------------------------------------}
{ Load a Constant Value to Primary Register }

procedure LoadConst(n: integer);
begin
   Emit('MOVE #');
   WriteLn(n, ',D0');
end;


{---------------------------------------------------------------}
{ Load a Variable to Primary Register }

procedure LoadVar(Name: string);
begin
   if not InTable(Name) then Undefined(Name);
   EmitLn('MOVE ' + Name + '(PC),D0');
end;


{---------------------------------------------------------------}
{ Push Primary onto Stack }

procedure Push;
begin
   EmitLn('MOVE D0,-(SP)');
end;


{---------------------------------------------------------------}
{ Add Top of Stack to Primary }

procedure PopAdd;
begin
   EmitLn('ADD (SP)+,D0');
end;


{---------------------------------------------------------------}
{ Subtract Primary from Top of Stack }

procedure PopSub;
begin
   EmitLn('SUB (SP)+,D0');
   EmitLn('NEG D0');
end;


{---------------------------------------------------------------}
{ Multiply Top of Stack by Primary }

procedure PopMul;
begin
   EmitLn('MULS (SP)+,D0');
end;


{---------------------------------------------------------------}
{ Divide Top of Stack by Primary }

procedure PopDiv;
begin
   EmitLn('MOVE (SP)+,D7');
   EmitLn('EXT.L D7');
   EmitLn('DIVS D0,D7');
   EmitLn('MOVE D7,D0');
end;


{---------------------------------------------------------------}
{ AND Top of Stack with Primary }

procedure PopAnd;
begin
   EmitLn('AND (SP)+,D0');
end;


{---------------------------------------------------------------}
{ OR Top of Stack with Primary }

procedure PopOr;
begin
   EmitLn('OR (SP)+,D0');
end;


{---------------------------------------------------------------}
{ XOR Top of Stack with Primary }

procedure PopXor;
begin
   EmitLn('EOR (SP)+,D0');
end;


{---------------------------------------------------------------}
{ Compare Top of Stack with Primary }

procedure PopCompare;
begin
   EmitLn('CMP (SP)+,D0');
end;


{---------------------------------------------------------------}
{ Set D0 If Compare was = }

procedure SetEqual;
begin
   EmitLn('SEQ D0');
   EmitLn('EXT D0');
end;


{---------------------------------------------------------------}
{ Set D0 If Compare was != }

procedure SetNEqual;
begin
   EmitLn('SNE D0');
   EmitLn('EXT D0');
end;


{---------------------------------------------------------------}
{ Set D0 If Compare was > }

procedure SetGreater;
begin
   EmitLn('SLT D0');
   EmitLn('EXT D0');
end;


{---------------------------------------------------------------}
{ Set D0 If Compare was < }

procedure SetLess;
begin
   EmitLn('SGT D0');
   EmitLn('EXT D0');
end;


{---------------------------------------------------------------}
{ Set D0 If Compare was <= }

procedure SetLessOrEqual;
begin
   EmitLn('SGE D0');
   EmitLn('EXT D0');
end;


{---------------------------------------------------------------}
{ Set D0 If Compare was >= }

procedure SetGreaterOrEqual;
begin
   EmitLn('SLE D0');
   EmitLn('EXT D0');
end;


{---------------------------------------------------------------}
{ Store Primary to Variable }

procedure Store(Name: string);
begin
   if not InTable(Name) then Undefined(Name);
   EmitLn('LEA ' + Name + '(PC),A0');
   EmitLn('MOVE D0,(A0)')
end;


{---------------------------------------------------------------}
{ Branch Unconditional  }

procedure Branch(L: string);
begin
   EmitLn('BRA ' + L);
end;


{---------------------------------------------------------------}
{ Branch False }

procedure BranchFalse(L: string);
begin
   EmitLn('TST D0');
   EmitLn('BEQ ' + L);
end;


{---------------------------------------------------------------}
{ Read Variable to Primary Register }

procedure ReadVar;
begin
   EmitLn('BSR READ');
   Store(Value[1]);
end;


{ Write Variable from Primary Register }

procedure WriteVar;
begin
   EmitLn('BSR WRITE');
end;


{--------------------------------------------------------------}
{ Write Header Info }

procedure Header;
begin
   WriteLn('WARMST', TAB, 'EQU $A01E');
end;


{--------------------------------------------------------------}
{ Write the Prolog }

procedure Prolog;
begin
   PostLabel('MAIN');
end;


{--------------------------------------------------------------}
{ Write the Epilog }

procedure Epilog;
begin
   EmitLn('DC WARMST');
   EmitLn('END MAIN');
end;


{---------------------------------------------------------------}
{ Parse and Translate a Math Factor }

procedure BoolExpression; Forward;

procedure Factor;
begin
   if Look = '(' then begin
      Match('(');
      BoolExpression;
      Match(')');
      end
   else if IsAlpha(Look) then begin
      GetName;
      LoadVar(Value);
      end
   else
      LoadConst(GetNum);
end;


{--------------------------------------------------------------}
{ Parse and Translate a Negative Factor }

procedure NegFactor;
begin
   Match('-');
   if IsDigit(Look) then
      LoadConst(-GetNum)
   else begin
      Factor;
      Negate;
   end;
end;


{--------------------------------------------------------------}
{ Parse and Translate a Leading Factor }

procedure FirstFactor;
begin
   case Look of
     '+': begin
             Match('+');
             Factor;
          end;
     '-': NegFactor;
   else  Factor;
   end;
end;


{--------------------------------------------------------------}
{ Recognize and Translate a Multiply }

procedure Multiply;
begin
   Match('*');
   Factor;
   PopMul;
end;


{-------------------------------------------------------------}
{ Recognize and Translate a Divide }

procedure Divide;
begin
   Match('/');
   Factor;
   PopDiv;
end;


{---------------------------------------------------------------}
{ Common Code Used by Term and FirstTerm }

procedure Term1;
begin
   while IsMulop(Look) do begin
      Push;
      case Look of
       '*': Multiply;
       '/': Divide;
      end;
   end;
end;


{---------------------------------------------------------------}
{ Parse and Translate a Math Term }

procedure Term;
begin
   Factor;
   Term1;
end;


{---------------------------------------------------------------}
{ Parse and Translate a Leading Term }

procedure FirstTerm;
begin
   FirstFactor;
   Term1;
end;


{--------------------------------------------------------------}
{ Recognize and Translate an Add }

procedure Add;
begin
   Match('+');
   Term;
   PopAdd;
end;


{-------------------------------------------------------------}
{ Recognize and Translate a Subtract }

procedure Subtract;
begin
   Match('-');
   Term;
   PopSub;
end;


{---------------------------------------------------------------}
{ Parse and Translate an Expression }

procedure Expression;
begin
   FirstTerm;
   while IsAddop(Look) do begin
      Push;
      case Look of
       '+': Add;
       '-': Subtract;
      end;
   end;
end;


{---------------------------------------------------------------}
{ Recognize and Translate a Relational "Equals" }

procedure Equal;
begin
   Match('=');
   Expression;
   PopCompare;
   SetEqual;
end;


{---------------------------------------------------------------}
{ Recognize and Translate a Relational "Less Than or Equal" }

procedure LessOrEqual;
begin
   Match('=');
   Expression;
   PopCompare;
   SetLessOrEqual;
end;


{---------------------------------------------------------------}
{ Recognize and Translate a Relational "Not Equals" }

procedure NotEqual;
begin
   Match('>');
   Expression;
   PopCompare;
   SetNEqual;
end;


{---------------------------------------------------------------}
{ Recognize and Translate a Relational "Less Than" }

procedure Less;
begin
   Match('<');
   case Look of
     '=': LessOrEqual;
     '>': NotEqual;
   else begin
           Expression;
           PopCompare;
           SetLess;
        end;
   end;
end;


{---------------------------------------------------------------}
{ Recognize and Translate a Relational "Greater Than" }

procedure Greater;
begin
   Match('>');
   if Look = '=' then begin
      Match('=');
      Expression;
      PopCompare;
      SetGreaterOrEqual;
      end
   else begin
      Expression;
      PopCompare;
      SetGreater;
   end;
end;


{---------------------------------------------------------------}
{ Parse and Translate a Relation }


procedure Relation;
begin
   Expression;
   if IsRelop(Look) then begin
      Push;
      case Look of
       '=': Equal;
       '<': Less;
       '>': Greater;
      end;
   end;
end;


{---------------------------------------------------------------}
{ Parse and Translate a Boolean Factor with Leading NOT }

procedure NotFactor;
begin
   if Look = '!' then begin
      Match('!');
      Relation;
      NotIt;
      end
   else
      Relation;
end;


{---------------------------------------------------------------}
{ Parse and Translate a Boolean Term }

procedure BoolTerm;
begin
   NotFactor;
   while Look = '&' do begin
      Push;
      Match('&');
      NotFactor;
      PopAnd;
   end;
end;


{--------------------------------------------------------------}
{ Recognize and Translate a Boolean OR }

procedure BoolOr;
begin
   Match('|');
   BoolTerm;
   PopOr;
end;


{--------------------------------------------------------------}
{ Recognize and Translate an Exclusive Or }

procedure BoolXor;
begin
   Match('~');
   BoolTerm;
   PopXor;
end;


{---------------------------------------------------------------}
{ Parse and Translate a Boolean Expression }

procedure BoolExpression;
begin
   BoolTerm;
   while IsOrOp(Look) do begin
      Push;
      case Look of
       '|': BoolOr;
       '~': BoolXor;
      end;
   end;
end;


{--------------------------------------------------------------}
{ Parse and Translate an Assignment Statement }

procedure Assignment;
var Name: string;
begin
   Name := Value;
   Match('=');
   BoolExpression;
   Store(Name);
end;


{---------------------------------------------------------------}
{ Recognize and Translate an IF Construct }

procedure Block; Forward;


procedure DoIf;
var L1, L2: string;
begin
   BoolExpression;
   L1 := NewLabel;
   L2 := L1;
   BranchFalse(L1);
   Block;
   if Token = 'l' then begin
      L2 := NewLabel;
      Branch(L2);
      PostLabel(L1);
      Block;
   end;
   PostLabel(L2);
   MatchString('ENDIF');
end;


{--------------------------------------------------------------}
{ Parse and Translate a WHILE Statement }

procedure DoWhile;
var L1, L2: string;
begin
   L1 := NewLabel;
   L2 := NewLabel;
   PostLabel(L1);
   BoolExpression;
   BranchFalse(L2);
   Block;
   MatchString('ENDWHILE');
   Branch(L1);
   PostLabel(L2);
end;


{--------------------------------------------------------------}
{ Process a Read Statement }

procedure DoRead;
begin
   Match('(');
   GetName;
   ReadVar;
   while Look = ',' do begin
      Match(',');
      GetName;
      ReadVar;
   end;
   Match(')');
end;


{--------------------------------------------------------------}
{ Process a Write Statement }

procedure DoWrite;
begin
   Match('(');
   Expression;
   WriteVar;
   while Look = ',' do begin
      Match(',');
      Expression;
      WriteVar;
   end;
   Match(')');
end;


{--------------------------------------------------------------}
{ Parse and Translate a Block of Statements }

procedure Block;
begin
   Scan;
   while not(Token in ['e', 'l']) do begin
      case Token of
       'i': DoIf;
       'w': DoWhile;
       'R': DoRead;
       'W': DoWrite;
      else Assignment;
      end;
      Scan;
   end;
end;


{--------------------------------------------------------------}
{ Allocate Storage for a Variable }

procedure Alloc(N: Symbol);
begin
   if InTable(N) then Abort('Duplicate Variable Name ' + N);
   AddEntry(N, 'v');
   Write(N, ':', TAB, 'DC ');
   if Look = '=' then begin
      Match('=');
      If Look = '-' then begin
         Write(Look);
         Match('-');
      end;
      WriteLn(GetNum);
      end
   else
      WriteLn('0');
end;


{--------------------------------------------------------------}
{ Parse and Translate a Data Declaration }

procedure Decl;
begin
   GetName;
   Alloc(Value);
   while Look = ',' do begin
      Match(',');
      GetName;
      Alloc(Value);
   end;
end;


{--------------------------------------------------------------}
{ Parse and Translate Global Declarations }

procedure TopDecls;
begin
   Scan;
   while Token <> 'b' do begin
      case Token of
        'v': Decl;
      else Abort('Unrecognized Keyword ' + Value);
      end;
      Scan;
   end;
end;


{--------------------------------------------------------------}
{ Parse and Translate a Main Program }

procedure Main;
begin
   MatchString('BEGIN');
   Prolog;
   Block;
   MatchString('END');
   Epilog;
end;


{--------------------------------------------------------------}
{  Parse and Translate a Program }

procedure Prog;
begin
   MatchString('PROGRAM');
   Header;
   TopDecls;
   Main;
   Match('.');
end;


{--------------------------------------------------------------}
{ Initialize }

procedure Init;
var i: integer;
begin
   for i := 1 to MaxEntry do begin
      ST[i] := '';
      SType[i] := ' ';
   end;
   GetChar;
   Scan;
end;


{--------------------------------------------------------------}
{ Main Program }

begin
   Init;
   Prog;
   if Look <> CR then Abort('Unexpected data after ''.''');
end.
{--------------------------------------------------------------}
































                     LET'S BUILD A COMPILER!

                                By

                     Jack W. Crenshaw, Ph.D.

                           3 June 1989


                 Part XI: LEXICAL SCAN REVISITED





INTRODUCTION

I've got some  good news and some bad news.  The bad news is that
this installment is  not  the  one  I promised last time.  What's
more, the one after this one won't be, either.

The good news is the reason for this installment:  I've  found  a
way  to simplify and improve the lexical  scanning  part  of  the
compiler.  Let me explain.


BACKGROUND

If  you'll remember, we talked at length  about  the  subject  of
lexical  scanners in Part VII, and I left you with a design for a
distributed scanner that I felt was about as simple  as  I  could
make it ... more than most that I've  seen  elsewhere.    We used
that idea in Part X.  The compiler structure  that  resulted  was
simple, and it got the job done.

Recently, though, I've begun  to  have  problems, and they're the
kind that send a message that you might be doing something wrong.

The  whole thing came to a head when I tried to address the issue
of  semicolons.  Several people have asked  me  about  them,  and
whether or not KISS will have them separating the statements.  My
intention has been NOT to  use semicolons, simply because I don't
like them and, as you can see, they have not proved necessary.

But I know that many of you, like me, have  gotten  used to them,
and so  I  set  out  to write a short installment to show you how
they could easily be added, if you were so inclined.

Well, it  turned  out  that  they weren't easy to add at all.  In
fact it was darned difficult.

I guess I should have  realized that something was wrong, because
of the issue  of  newlines.    In the last couple of installments
we've addressed that issue,  and  I've shown you how to deal with
newlines with a  procedure called, appropriately enough, NewLine.
In  TINY  Version  1.0,  I  sprinkled calls to this procedure  in
strategic spots in the code.

It  seems  that  every time I've addressed the issue of newlines,
though,  I've found it to be tricky,  and  the  resulting  parser
turned out to be quite fragile ... one addition or  deletion here
or  there and things tended to go to pot.  Looking back on it,  I
realize that  there  was  a  message  in  this that I just wasn't
paying attention to.

When I tried to add semicolons  on  top of the newlines, that was
the last straw.   I ended up with much too complex a solution.  I
began to realize that something fundamental had to change.

So,  in  a  way this installment will cause us to backtrack a bit
and revisit the issue of scanning all over again.    Sorry  about
that.  That's the price you pay for watching me  do  this in real
time.  But the new version is definitely an improvement, and will
serve us well for what is to come.

As  I said, the scanner we used in Part X was about as simple  as
one can get.  But anything can be improved.   The  new scanner is
more like the classical  scanner,  and  not  as simple as before.
But the overall  compiler  structure is even simpler than before.
It's also more robust, and easier to add  to  and/or  modify.   I
think that's worth the time spent in this digression.  So in this
installment, I'll be showing  you  the  new  structure.  No doubt
you'll  be  happy  to  know  that, while the changes affect  many
procedures, they aren't very profound  and so we lose very little
of what's been done so far.

Ironically, the new scanner  is  much  more conventional than the
old one, and is very much like the more generic scanner  I showed
you  earlier  in  Part VII.  Then I started trying to get clever,
and I almost clevered myself clean out of business.   You'd think
one day I'd learn: K-I-S-S!


THE PROBLEM

The problem begins to show  itself in procedure Block, which I've
reproduced below:


{--------------------------------------------------------------}
{ Parse and Translate a Block of Statements }

procedure Block;
begin
   Scan;
   while not(Token in ['e', 'l']) do begin
      case Token of
       'i': DoIf;
       'w': DoWhile;
       'R': DoRead;
       'W': DoWrite;
      else Assignment;
      end;
      Scan;
   end;
end;
{--------------------------------------------------------------}


As  you   can  see,  Block  is  oriented  to  individual  program
statements.  At each pass through  the  loop, we know that we are
at  the beginning of a statement.  We exit the block when we have
scanned an END or an ELSE.

But suppose that we see a semicolon instead.   The  procedure  as
it's shown above  can't  handle that, because procedure Scan only
expects and can only accept tokens that begin with a letter.

I  tinkered  around for quite awhile to come up with a  fix.    I
found many possible approaches, but none were very satisfying.  I
finally figured out the reason.

Recall that when we started with our single-character parsers, we
adopted a convention that the lookahead character would always be
prefetched.    That   is,   we  would  have  the  character  that
corresponds to our  current  position in the input stream fetched
into the global character Look, so that we could  examine  it  as
many  times  as  needed.    The  rule  we  adopted was that EVERY
recognizer, if it found its target token, would  advance  Look to
the next character in the input stream.

That simple and fixed convention served us very well when  we had
single-character tokens, and it still does.  It would make  a lot
of sense to apply the same rule to multi-character tokens.

But when we got into lexical scanning, I began  to  violate  that
simple rule.  The scanner of Part X  did  indeed  advance  to the
next token if it found an identifier or keyword, but it DIDN'T do
that if it found a carriage return, a whitespace character, or an
operator.

Now, that sort of mixed-mode  operation gets us into deep trouble
in procedure Block, because whether or not the  input  stream has
been advanced depends upon the kind of token we  encounter.    If
it's  a keyword or the target of  an  assignment  statement,  the
"cursor," as defined by the contents of Look,  has  been advanced
to  the next token OR to the beginning of whitespace.  If, on the
other  hand,  the  token  is  a  semicolon,  or if we have hit  a
carriage return, the cursor has NOT advanced.

Needless to say, we can add enough logic  to  keep  us  on track.
But it's tricky, and makes the whole parser very fragile.

There's a much  better  way,  and  that's just to adopt that same
rule that's worked so well before, to apply to TOKENS as  well as
single characters.  In other words, we'll prefetch tokens just as
we've always done for  characters.   It seems so obvious once you
think about it that way.

Interestingly enough, if we do things this way  the  problem that
we've had with newline characters goes away.  We  can  just  lump
them in as  whitespace  characters, which means that the handling
of  newlines  becomes  very trivial, and MUCH less prone to error
than we've had to deal with in the past.


THE SOLUTION

Let's  begin  to  fix  the  problem  by  re-introducing  the  two
procedures:

{--------------------------------------------------------------}
{ Get an Identifier }

procedure GetName;
begin
   SkipWhite;
   if Not IsAlpha(Look) then Expected('Identifier');
   Token := 'x';
   Value := '';
   repeat
      Value := Value + UpCase(Look);
      GetChar;
   until not IsAlNum(Look);
end;


{--------------------------------------------------------------}
{ Get a Number }

procedure GetNum;
begin
   SkipWhite;
   if not IsDigit(Look) then Expected('Number');
   Token := '#';
   Value := '';
   repeat
      Value := Value + Look;
      GetChar;
   until not IsDigit(Look);
end;
{--------------------------------------------------------------}


These two procedures are  functionally  almost  identical  to the
ones  I  showed  you in Part VII.  They each  fetch  the  current
token, either an identifier or a number, into  the  global string
Value.    They  also  set  the  encoded  version, Token,  to  the
appropriate code.  The input  stream is left with Look containing
the first character NOT part of the token.

We  can do the same thing  for  operators,  even  multi-character
operators, with a procedure such as:


{--------------------------------------------------------------}
{ Get an Operator }

procedure GetOp;
begin
   Token := Look;
   Value := '';
   repeat
      Value := Value + Look;
      GetChar;
   until IsAlpha(Look) or IsDigit(Look) or IsWhite(Look);
end;
{--------------------------------------------------------------}

Note  that  GetOp  returns,  as  its  encoded  token,  the  FIRST
character of the operator.  This is important,  because  it means
that we can now use that single character to  drive  the  parser,
instead of the lookahead character.

We need to tie these  procedures together into a single procedure
that can handle all three  cases.  The  following  procedure will
read any one of the token types and always leave the input stream
advanced beyond it:


{--------------------------------------------------------------}
{ Get the Next Input Token }

procedure Next;
begin
   SkipWhite;
   if IsAlpha(Look) then GetName
   else if IsDigit(Look) then GetNum
   else GetOp;
end;
{--------------------------------------------------------------}



than after.  This means that, in general, the variable  Look will
NOT have a meaningful value in it, and therefore  we  should  NOT
use it as a test value for parsing, as we have been doing so far.
That's the big departure from our normal approach.

Now, remember that before I was careful not to treat the carriage
return (CR) and line  feed  (LF) characters as white space.  This
was  because,  with  SkipWhite  called  as the last thing in  the
scanner, the encounter with  LF  would  trigger a read statement.
If we were on the last line of the program,  we  couldn't get out
until we input another line with a non-white  character.   That's
why I needed the second procedure, NewLine, to handle the CRLF's.

But now, with the call  to SkipWhite coming first, that's exactly
the behavior we want.    The  compiler  must know there's another
token coming or it wouldn't be calling Next.  In other words,  it
hasn't found the terminating  END  yet.  So we're going to insist
on more data until we find something.

All this means that we can greatly simplify both the  program and
the concepts, by treating CR and LF as whitespace characters, and
eliminating NewLine.  You  can  do  that  simply by modifying the
function IsWhite:


{--------------------------------------------------------------}
{ Recognize White Space }

function IsWhite(c: char): boolean;
begin
   IsWhite := c in [' ', TAB, CR, LF];
end;
{--------------------------------------------------------------}


We've already tried similar routines in Part VII,  but  you might
as well try these new ones out.  Add them to a copy of the Cradle
and call Next with the following main program:


{--------------------------------------------------------------}
{ Main Program }

begin
   Init;
   repeat
      Next;
      WriteLn(Token, ' ', Value);
   until Token = '.';
end.
{--------------------------------------------------------------}


Compile  it and verify that you can separate  a  program  into  a
series of tokens, and that you get the right  encoding  for  each
token.

This ALMOST works,  but  not  quite.    There  are  two potential
problems:    First,  in KISS/TINY almost all of our operators are
single-character operators.  The only exceptions  are  the relops
>=, <=, and <>.  It seems  a  shame  to  treat  all  operators as
strings and do a  string  compare,  when  only a single character
compare  will  almost  always  suffice.   Second, and  much  more
important, the  thing  doesn't  WORK  when  two  operators appear
together, as in (a+b)*(c+d).  Here the string following 'b' would
be interpreted as a single operator ")*(."

It's possible to fix that problem.  For example,  we  could  just
give GetOp a  list  of  legal  characters, and we could treat the
parentheses as different operator types  than  the  others.   But
this begins to get messy.

Fortunately, there's a  better  way that solves all the problems.
Since almost  all the operators are single characters, let's just
treat  them  that  way, and let GetOp get only one character at a
time.  This not only simplifies GetOp, but also speeds  things up
quite a  bit.    We  still have the problem of the relops, but we
were treating them as special cases anyway.

So here's the final version of GetOp:


{--------------------------------------------------------------}
{ Get an Operator }

procedure GetOp;
begin
   SkipWhite;
   Token := Look;
   Value := Look;
   GetChar;
end;
{--------------------------------------------------------------}


Note that I still give the string Value a value.  If you're truly
concerned about efficiency, you could leave this out.  When we're
expecting an operator, we will only be testing  Token  anyhow, so
the  value of the string won't matter.  But to me it seems to  be
good practice to give the thing a value just in case.

Try  this  new  version with some realistic-looking  code.    You
should  be  able  to  separate  any program into  its  individual
tokens, with the  caveat  that the two-character relops will scan
into two separate tokens.  That's OK ... we'll  parse  them  that
way.

Now, in Part VII the function of Next was combined with procedure
Scan,  which  also  checked every identifier against  a  list  of
keywords and encoded each one that was found.  As I  mentioned at
the time, the last thing we would want  to  do  is  to use such a
procedure in places where keywords  should not appear, such as in
expressions.  If we  did  that, the keyword list would be scanned
for every identifier appearing in the code.  Not good.

The  right  way  to  deal  with  that  is  to simply separate the
functions  of  fetching  tokens and looking for  keywords.    The
version of Scan shown below  does NOTHING but check for keywords.
Notice that it operates on the current token and does NOT advance
the input stream.


{--------------------------------------------------------------}
{ Scan the Current Identifier for Keywords }

procedure Scan;
begin
   if Token = 'x' then
      Token := KWcode[Lookup(Addr(KWlist), Value, NKW) + 1];
end;
{--------------------------------------------------------------}


There is one last detail.  In the compiler there are a few places
that we must  actually  check  the  string  value  of  the token.
Mainly, this  is done to distinguish between the different END's,
but there are a couple  of  other  places.    (I  should  note in
passing that we could always  eliminate the need for matching END
characters by encoding each one  to a different character.  Right
now we are definitely taking the lazy man's route.)

The  following  version  of MatchString takes the  place  of  the
character-oriented Match.  Note that, like Match, it DOES advance
the input stream.


{--------------------------------------------------------------}
{ Match a Specific Input String }

procedure MatchString(x: string);
begin
   if Value <> x then Expected('''' + x + '''');
   Next;
end;
{--------------------------------------------------------------}


FIXING UP THE COMPILER

Armed with these new scanner procedures, we can now begin  to fix
the compiler to  use  them  properly.   The changes are all quite
minor,  but  there  are quite a  few  places  where  changes  are
necessary.  Rather than  showing  you each place, I will give you
the general idea and then just give the finished product.


First of all, the code for procedure Block doesn't change, though
its function does:


{--------------------------------------------------------------}
{ Parse and Translate a Block of Statements }

procedure Block;
begin
   Scan;
   while not(Token in ['e', 'l']) do begin
      case Token of
       'i': DoIf;
       'w': DoWhile;
       'R': DoRead;
       'W': DoWrite;
      else Assignment;
      end;
      Scan;
   end;
end;
{--------------------------------------------------------------}


Remember that the new version of Scan doesn't  advance  the input
stream, it only  scans  for  keywords.   The input stream must be
advanced by each procedure that Block calls.

In general, we have to replace every test on Look with  a similar
test on Token.  For example:


{---------------------------------------------------------------}
{ Parse and Translate a Boolean Expression }

procedure BoolExpression;
begin
   BoolTerm;
   while IsOrOp(Token) do begin
      Push;
      case Token of
       '|': BoolOr;
       '~': BoolXor;
      end;
   end;
end;
{--------------------------------------------------------------}


In procedures like Add, we don't  have  to use Match anymore.  We
need only call Next to advance the input stream:


{--------------------------------------------------------------}
{ Recognize and Translate an Add }

procedure Add;
begin
   Next;
   Term;
   PopAdd;
end;
{-------------------------------------------------------------}


Control  structures  are  actually simpler.  We just call Next to
advance over the control keywords:


{---------------------------------------------------------------}
{ Recognize and Translate an IF Construct }

procedure Block; Forward;

procedure DoIf;
var L1, L2: string;
begin
   Next;
   BoolExpression;
   L1 := NewLabel;
   L2 := L1;
   BranchFalse(L1);
   Block;
   if Token = 'l' then begin
      Next;
      L2 := NewLabel;
      Branch(L2);
      PostLabel(L1);
      Block;
   end;
   PostLabel(L2);
   MatchString('ENDIF');
end;
{--------------------------------------------------------------}


That's about the extent of the REQUIRED changes.  In  the listing
of TINY  Version  1.1  below,  I've  also  made a number of other
"improvements" that  aren't really required.  Let me explain them
briefly:

 (1)  I've deleted the two procedures Prog and Main, and combined
      their functions into the main program.  They didn't seem to
      add  to program clarity ... in fact  they  seemed  to  just
      muddy things up a little.

 (2)  I've  deleted  the  keywords  PROGRAM  and  BEGIN  from the
      keyword list.  Each  one  only occurs in one place, so it's
      not necessary to search for it.

 (3)  Having been  bitten  by  an  overdose  of  cleverness, I've
      reminded myself that TINY  is  supposed  to be a minimalist
      program.  Therefore I've  replaced  the  fancy  handling of
      unary minus with the dumbest one I could think of.  A giant
      step backwards in code quality, but a  great simplification
      of the compiler.  KISS is the right place to use  the other
      version.

 (4)  I've added some  error-checking routines such as CheckTable
      and CheckDup, and  replaced  in-line code by calls to them.
      This cleans up a number of routines.

 (5)  I've  taken  the  error  checking  out  of  code generation
      routines  like Store, and put it in  the  parser  where  it
      belongs.  See Assignment, for example.

 (6)  There was an error in InTable and Locate  that  caused them
      to search all locations  instead  of  only those with valid
      data  in them.  They now search only  valid  cells.    This
      allows us to eliminate  the  initialization  of  the symbol
      table, which was done in Init.

 (7)  Procedure AddEntry now has two  arguments,  which  helps to
      make things a bit more modular.

 (8)  I've cleaned up the  code  for  the relational operators by
      the addition of the  new  procedures  CompareExpression and
      NextExpression.

 (9)  I fixed an error in the Read routine ... the  earlier value
      did not check for a valid variable name.


 CONCLUSION

The resulting compiler for  TINY  is given below.  Other than the
removal  of  the  keyword PROGRAM, it parses the same language as
before.    It's  just  a  bit cleaner, and more importantly  it's
considerably more robust.  I feel good about it.

The next installment will be another  digression:  the discussion
of  semicolons  and  such that got me into this mess in the first
place.  THEN we'll press on  into  procedures and types.  Hang in
there with me.  The addition of those features will go a long way
towards removing KISS from  the  "toy  language" category.  We're
getting very close to being able to write a serious compiler.


TINY VERSION 1.1


{--------------------------------------------------------------}
program Tiny11;

{--------------------------------------------------------------}
{ Constant Declarations }

const TAB = ^I;
      CR  = ^M;
      LF  = ^J;

      LCount: integer = 0;
      NEntry: integer = 0;


{--------------------------------------------------------------}
{ Type Declarations }

type Symbol = string[8];

     SymTab = array[1..1000] of Symbol;

     TabPtr = ^SymTab;


{--------------------------------------------------------------}
{ Variable Declarations }

var Look : char;             { Lookahead Character }
    Token: char;             { Encoded Token       }
    Value: string[16];       { Unencoded Token     }


const MaxEntry = 100;

var ST   : array[1..MaxEntry] of Symbol;
    SType: array[1..MaxEntry] of char;


{--------------------------------------------------------------}
{ Definition of Keywords and Token Types }

const NKW =   9;
      NKW1 = 10;

const KWlist: array[1..NKW] of Symbol =
              ('IF', 'ELSE', 'ENDIF', 'WHILE', 'ENDWHILE',
               'READ', 'WRITE', 'VAR', 'END');

const KWcode: string[NKW1] = 'xileweRWve';


{--------------------------------------------------------------}
{ Read New Character From Input Stream }

procedure GetChar;
begin
   Read(Look);
end;

{--------------------------------------------------------------}
{ Report an Error }

procedure Error(s: string);
begin
   WriteLn;
   WriteLn(^G, 'Error: ', s, '.');
end;


{--------------------------------------------------------------}
{ Report Error and Halt }

procedure Abort(s: string);
begin
   Error(s);
   Halt;
end;


{--------------------------------------------------------------}
{ Report What Was Expected }

procedure Expected(s: string);
begin
   Abort(s + ' Expected');
end;

{--------------------------------------------------------------}
{ Report an Undefined Identifier }

procedure Undefined(n: string);
begin
   Abort('Undefined Identifier ' + n);
end;


{--------------------------------------------------------------}
{ Report a Duplicate Identifier }

procedure Duplicate(n: string);
begin
   Abort('Duplicate Identifier ' + n);
end;


{--------------------------------------------------------------}
{ Check to Make Sure the Current Token is an Identifier }

procedure CheckIdent;
begin
   if Token <> 'x' then Expected('Identifier');
end;


{--------------------------------------------------------------}
{ Recognize an Alpha Character }

function IsAlpha(c: char): boolean;
begin
   IsAlpha := UpCase(c) in ['A'..'Z'];
end;


{--------------------------------------------------------------}
{ Recognize a Decimal Digit }

function IsDigit(c: char): boolean;
begin
   IsDigit := c in ['0'..'9'];
end;


{--------------------------------------------------------------}
{ Recognize an AlphaNumeric Character }

function IsAlNum(c: char): boolean;
begin
   IsAlNum := IsAlpha(c) or IsDigit(c);
end;


{--------------------------------------------------------------}
{ Recognize an Addop }

function IsAddop(c: char): boolean;
begin
   IsAddop := c in ['+', '-'];
end;


{--------------------------------------------------------------}
{ Recognize a Mulop }

function IsMulop(c: char): boolean;
begin
   IsMulop := c in ['*', '/'];
end;


{--------------------------------------------------------------}
{ Recognize a Boolean Orop }

function IsOrop(c: char): boolean;
begin
   IsOrop := c in ['|', '~'];
end;


{--------------------------------------------------------------}
{ Recognize a Relop }

function IsRelop(c: char): boolean;
begin
   IsRelop := c in ['=', '#', '<', '>'];
end;


{--------------------------------------------------------------}
{ Recognize White Space }

function IsWhite(c: char): boolean;
begin
   IsWhite := c in [' ', TAB, CR, LF];
end;


{--------------------------------------------------------------}
{ Skip Over Leading White Space }

procedure SkipWhite;
begin
   while IsWhite(Look) do
      GetChar;
end;


{--------------------------------------------------------------}
{ Table Lookup }

function Lookup(T: TabPtr; s: string; n: integer): integer;
var i: integer;
    found: Boolean;
begin
   found := false;
   i := n;
   while (i > 0) and not found do
      if s = T^[i] then
         found := true
      else
         dec(i);
   Lookup := i;
end;


{--------------------------------------------------------------}
{ Locate a Symbol in Table }
{ Returns the index of the entry.  Zero if not present. }

function Locate(N: Symbol): integer;
begin
   Locate := Lookup(@ST, n, NEntry);
end;


{--------------------------------------------------------------}
{ Look for Symbol in Table }

function InTable(n: Symbol): Boolean;
begin
   InTable := Lookup(@ST, n, NEntry) <> 0;
end;


{--------------------------------------------------------------}
{ Check to See if an Identifier is in the Symbol Table         }
{ Report an error if it's not. }


procedure CheckTable(N: Symbol);
begin
   if not InTable(N) then Undefined(N);
end;


{--------------------------------------------------------------}
{ Check the Symbol Table for a Duplicate Identifier }
{ Report an error if identifier is already in table. }


procedure CheckDup(N: Symbol);
begin
   if InTable(N) then Duplicate(N);
end;


{--------------------------------------------------------------}
{ Add a New Entry to Symbol Table }

procedure AddEntry(N: Symbol; T: char);
begin
   CheckDup(N);
   if NEntry = MaxEntry then Abort('Symbol Table Full');
   Inc(NEntry);
   ST[NEntry] := N;
   SType[NEntry] := T;
end;


{--------------------------------------------------------------}
{ Get an Identifier }

procedure GetName;
begin
   SkipWhite;
   if Not IsAlpha(Look) then Expected('Identifier');
   Token := 'x';
   Value := '';
   repeat
      Value := Value + UpCase(Look);
      GetChar;
   until not IsAlNum(Look);
end;


{--------------------------------------------------------------}
{ Get a Number }

procedure GetNum;
begin
   SkipWhite;
   if not IsDigit(Look) then Expected('Number');
   Token := '#';
   Value := '';
   repeat
      Value := Value + Look;
      GetChar;
   until not IsDigit(Look);
end;


{--------------------------------------------------------------}
{ Get an Operator }

procedure GetOp;
begin
   SkipWhite;
   Token := Look;
   Value := Look;
   GetChar;
end;


{--------------------------------------------------------------}
{ Get the Next Input Token }

procedure Next;
begin
   SkipWhite;
   if IsAlpha(Look) then GetName
   else if IsDigit(Look) then GetNum
   else GetOp;
end;


{--------------------------------------------------------------}
{ Scan the Current Identifier for Keywords }

procedure Scan;
begin
   if Token = 'x' then
      Token := KWcode[Lookup(Addr(KWlist), Value, NKW) + 1];
end;


{--------------------------------------------------------------}
{ Match a Specific Input String }

procedure MatchString(x: string);
begin
   if Value <> x then Expected('''' + x + '''');
   Next;
end;


{--------------------------------------------------------------}
{ Output a String with Tab }

procedure Emit(s: string);
begin
   Write(TAB, s);
end;


{--------------------------------------------------------------}
{ Output a String with Tab and CRLF }

procedure EmitLn(s: string);
begin
   Emit(s);
   WriteLn;
end;


{--------------------------------------------------------------}
{ Generate a Unique Label }

function NewLabel: string;
var S: string;
begin
   Str(LCount, S);
   NewLabel := 'L' + S;
   Inc(LCount);
end;


{--------------------------------------------------------------}
{ Post a Label To Output }

procedure PostLabel(L: string);
begin
   WriteLn(L, ':');
end;


{---------------------------------------------------------------}
{ Clear the Primary Register }

procedure Clear;
begin
   EmitLn('CLR D0');
end;


{---------------------------------------------------------------}
{ Negate the Primary Register }

procedure Negate;
begin
   EmitLn('NEG D0');
end;


{---------------------------------------------------------------}
{ Complement the Primary Register }

procedure NotIt;
begin
   EmitLn('NOT D0');
end;


{---------------------------------------------------------------}
{ Load a Constant Value to Primary Register }

procedure LoadConst(n: string);
begin
   Emit('MOVE #');
   WriteLn(n, ',D0');
end;


{---------------------------------------------------------------}
{ Load a Variable to Primary Register }

procedure LoadVar(Name: string);
begin
   if not InTable(Name) then Undefined(Name);
   EmitLn('MOVE ' + Name + '(PC),D0');
end;


{---------------------------------------------------------------}
{ Push Primary onto Stack }

procedure Push;
begin
   EmitLn('MOVE D0,-(SP)');
end;


{---------------------------------------------------------------}
{ Add Top of Stack to Primary }

procedure PopAdd;
begin
   EmitLn('ADD (SP)+,D0');
end;


{---------------------------------------------------------------}
{ Subtract Primary from Top of Stack }

procedure PopSub;
begin
   EmitLn('SUB (SP)+,D0');
   EmitLn('NEG D0');
end;


{---------------------------------------------------------------}
{ Multiply Top of Stack by Primary }

procedure PopMul;
begin
   EmitLn('MULS (SP)+,D0');
end;


{---------------------------------------------------------------}
{ Divide Top of Stack by Primary }

procedure PopDiv;
begin
   EmitLn('MOVE (SP)+,D7');
   EmitLn('EXT.L D7');
   EmitLn('DIVS D0,D7');
   EmitLn('MOVE D7,D0');
end;


{---------------------------------------------------------------}
{ AND Top of Stack with Primary }

procedure PopAnd;
begin
   EmitLn('AND (SP)+,D0');
end;


{---------------------------------------------------------------}
{ OR Top of Stack with Primary }

procedure PopOr;
begin
   EmitLn('OR (SP)+,D0');
end;


{---------------------------------------------------------------}
{ XOR Top of Stack with Primary }

procedure PopXor;
begin
   EmitLn('EOR (SP)+,D0');
end;


{---------------------------------------------------------------}
{ Compare Top of Stack with Primary }

procedure PopCompare;
begin
   EmitLn('CMP (SP)+,D0');
end;


{---------------------------------------------------------------}
{ Set D0 If Compare was = }

procedure SetEqual;
begin
   EmitLn('SEQ D0');
   EmitLn('EXT D0');
end;


{---------------------------------------------------------------}
{ Set D0 If Compare was != }

procedure SetNEqual;
begin
   EmitLn('SNE D0');
   EmitLn('EXT D0');
end;


{---------------------------------------------------------------}
{ Set D0 If Compare was > }

procedure SetGreater;
begin
   EmitLn('SLT D0');
   EmitLn('EXT D0');
end;


{---------------------------------------------------------------}
{ Set D0 If Compare was < }

procedure SetLess;
begin
   EmitLn('SGT D0');
   EmitLn('EXT D0');
end;


{---------------------------------------------------------------}
{ Set D0 If Compare was <= }

procedure SetLessOrEqual;
begin
   EmitLn('SGE D0');
   EmitLn('EXT D0');
end;


{---------------------------------------------------------------}
{ Set D0 If Compare was >= }

procedure SetGreaterOrEqual;
begin
   EmitLn('SLE D0');
   EmitLn('EXT D0');
end;


{---------------------------------------------------------------}
{ Store Primary to Variable }

procedure Store(Name: string);
begin
   EmitLn('LEA ' + Name + '(PC),A0');
   EmitLn('MOVE D0,(A0)')
end;


{---------------------------------------------------------------}
{ Branch Unconditional  }

procedure Branch(L: string);
begin
   EmitLn('BRA ' + L);
end;


{---------------------------------------------------------------}
{ Branch False }

procedure BranchFalse(L: string);
begin
   EmitLn('TST D0');
   EmitLn('BEQ ' + L);
end;


{---------------------------------------------------------------}
{ Read Variable to Primary Register }

procedure ReadIt(Name: string);
begin
   EmitLn('BSR READ');
   Store(Name);
end;


{ Write from Primary Register }

procedure WriteIt;
begin
   EmitLn('BSR WRITE');
end;


{--------------------------------------------------------------}
{ Write Header Info }

procedure Header;
begin
   WriteLn('WARMST', TAB, 'EQU $A01E');
end;


{--------------------------------------------------------------}
{ Write the Prolog }

procedure Prolog;
begin
   PostLabel('MAIN');
end;


{--------------------------------------------------------------}
{ Write the Epilog }

procedure Epilog;
begin
   EmitLn('DC WARMST');
   EmitLn('END MAIN');
end;


{---------------------------------------------------------------}
{ Allocate Storage for a Static Variable }

procedure Allocate(Name, Val: string);
begin
   WriteLn(Name, ':', TAB, 'DC ', Val);
end;


{---------------------------------------------------------------}
{ Parse and Translate a Math Factor }

procedure BoolExpression; Forward;

procedure Factor;
begin
   if Token = '(' then begin
      Next;
      BoolExpression;
      MatchString(')');
      end
   else begin
      if Token = 'x' then
         LoadVar(Value)
      else if Token = '#' then
         LoadConst(Value)
      else Expected('Math Factor');
      Next;
   end;
end;


{--------------------------------------------------------------}
{ Recognize and Translate a Multiply }

procedure Multiply;
begin
   Next;
   Factor;
   PopMul;
end;


{-------------------------------------------------------------}
{ Recognize and Translate a Divide }

procedure Divide;
begin
   Next;
   Factor;
   PopDiv;
end;


{---------------------------------------------------------------}
{ Parse and Translate a Math Term }

procedure Term;
begin
   Factor;
   while IsMulop(Token) do begin
      Push;
      case Token of
       '*': Multiply;
       '/': Divide;
      end;
   end;
end;


{--------------------------------------------------------------}
{ Recognize and Translate an Add }

procedure Add;
begin
   Next;
   Term;
   PopAdd;
end;


{-------------------------------------------------------------}
{ Recognize and Translate a Subtract }

procedure Subtract;
begin
   Next;
   Term;
   PopSub;
end;


{---------------------------------------------------------------}
{ Parse and Translate an Expression }

procedure Expression;
begin
   if IsAddop(Token) then
      Clear
   else
      Term;
   while IsAddop(Token) do begin
      Push;
      case Token of
       '+': Add;
       '-': Subtract;
      end;
   end;
end;


{---------------------------------------------------------------}
{ Get Another Expression and Compare }

procedure CompareExpression;
begin
   Expression;
   PopCompare;
end;


{---------------------------------------------------------------}
{ Get The Next Expression and Compare }

procedure NextExpression;
begin
   Next;
   CompareExpression;
end;


{---------------------------------------------------------------}
{ Recognize and Translate a Relational "Equals" }

procedure Equal;
begin
   NextExpression;
   SetEqual;
end;


{---------------------------------------------------------------}
{ Recognize and Translate a Relational "Less Than or Equal" }

procedure LessOrEqual;
begin
   NextExpression;
   SetLessOrEqual;
end;


{---------------------------------------------------------------}
{ Recognize and Translate a Relational "Not Equals" }

procedure NotEqual;
begin
   NextExpression;
   SetNEqual;
end;


{---------------------------------------------------------------}
{ Recognize and Translate a Relational "Less Than" }

procedure Less;
begin
   Next;
   case Token of
     '=': LessOrEqual;
     '>': NotEqual;
   else begin
           CompareExpression;
           SetLess;
        end;
   end;
end;


{---------------------------------------------------------------}
{ Recognize and Translate a Relational "Greater Than" }

procedure Greater;
begin
   Next;
   if Token = '=' then begin
      NextExpression;
      SetGreaterOrEqual;
      end
   else begin
      CompareExpression;
      SetGreater;
   end;
end;


{---------------------------------------------------------------}
{ Parse and Translate a Relation }


procedure Relation;
begin
   Expression;
   if IsRelop(Token) then begin
      Push;
      case Token of
       '=': Equal;
       '<': Less;
       '>': Greater;
      end;
   end;
end;


{---------------------------------------------------------------}
{ Parse and Translate a Boolean Factor with Leading NOT }

procedure NotFactor;
begin
   if Token = '!' then begin
      Next;
      Relation;
      NotIt;
      end
   else
      Relation;
end;


{---------------------------------------------------------------}
{ Parse and Translate a Boolean Term }

procedure BoolTerm;
begin
   NotFactor;
   while Token = '&' do begin
      Push;
      Next;
      NotFactor;
      PopAnd;
   end;
end;


{--------------------------------------------------------------}
{ Recognize and Translate a Boolean OR }

procedure BoolOr;
begin
   Next;
   BoolTerm;
   PopOr;
end;


{--------------------------------------------------------------}
{ Recognize and Translate an Exclusive Or }

procedure BoolXor;
begin
   Next;
   BoolTerm;
   PopXor;
end;


{---------------------------------------------------------------}
{ Parse and Translate a Boolean Expression }

procedure BoolExpression;
begin
   BoolTerm;
   while IsOrOp(Token) do begin
      Push;
      case Token of
       '|': BoolOr;
       '~': BoolXor;
      end;
   end;
end;


{--------------------------------------------------------------}
{ Parse and Translate an Assignment Statement }

procedure Assignment;
var Name: string;
begin
   CheckTable(Value);
   Name := Value;
   Next;
   MatchString('=');
   BoolExpression;
   Store(Name);
end;


{---------------------------------------------------------------}
{ Recognize and Translate an IF Construct }

procedure Block; Forward;

procedure DoIf;
var L1, L2: string;
begin
   Next;
   BoolExpression;
   L1 := NewLabel;
   L2 := L1;
   BranchFalse(L1);
   Block;
   if Token = 'l' then begin
      Next;
      L2 := NewLabel;
      Branch(L2);
      PostLabel(L1);
      Block;
   end;
   PostLabel(L2);
   MatchString('ENDIF');
end;


{--------------------------------------------------------------}
{ Parse and Translate a WHILE Statement }

procedure DoWhile;
var L1, L2: string;
begin
   Next;
   L1 := NewLabel;
   L2 := NewLabel;
   PostLabel(L1);
   BoolExpression;
   BranchFalse(L2);
   Block;
   MatchString('ENDWHILE');
   Branch(L1);
   PostLabel(L2);
end;


{--------------------------------------------------------------}
{ Read a Single Variable }

procedure ReadVar;
begin
   CheckIdent;
   CheckTable(Value);
   ReadIt(Value);
   Next;
end;


{--------------------------------------------------------------}
{ Process a Read Statement }

procedure DoRead;
begin
   Next;
   MatchString('(');
   ReadVar;
   while Token = ',' do begin
      Next;
      ReadVar;
   end;
   MatchString(')');
end;


{--------------------------------------------------------------}
{ Process a Write Statement }

procedure DoWrite;
begin
   Next;
   MatchString('(');
   Expression;
   WriteIt;
   while Token = ',' do begin
      Next;
      Expression;
      WriteIt;
   end;
   MatchString(')');
end;


{--------------------------------------------------------------}
{ Parse and Translate a Block of Statements }

procedure Block;
begin
   Scan;
   while not(Token in ['e', 'l']) do begin
      case Token of
       'i': DoIf;
       'w': DoWhile;
       'R': DoRead;
       'W': DoWrite;
      else Assignment;
      end;
      Scan;
   end;
end;


{--------------------------------------------------------------}
{ Allocate Storage for a Variable }

procedure Alloc;
begin
   Next;
   if Token <> 'x' then Expected('Variable Name');
   CheckDup(Value);
   AddEntry(Value, 'v');
   Allocate(Value, '0');
   Next;
end;


{--------------------------------------------------------------}
{ Parse and Translate Global Declarations }

procedure TopDecls;
begin
   Scan;
   while Token = 'v' do
      Alloc;
      while Token = ',' do
         Alloc;
end;


{--------------------------------------------------------------}
{ Initialize }

procedure Init;
begin
   GetChar;
   Next;
end;


{--------------------------------------------------------------}
{ Main Program }

begin
   Init;
   MatchString('PROGRAM');
   Header;
   TopDecls;
   MatchString('BEGIN');
   Prolog;
   Block;
   MatchString('END');
   Epilog;
end.
{--------------------------------------------------------------}





























                     LET'S BUILD A COMPILER!

                                By

                     Jack W. Crenshaw, Ph.D.

                           5 June 1989


                       Part XII: MISCELLANY





INTRODUCTION

This installment is another one  of  those  excursions  into side
alleys  that  don't  seem to fit  into  the  mainstream  of  this
tutorial  series.    As I mentioned last time, it was while I was
writing this installment that I realized some changes  had  to be
made  to  the  compiler structure.  So I had to digress from this
digression long enough to develop the new structure  and  show it
to you.

Now that that's behind us, I can tell you what I  set  out  to in
the first place.  This shouldn't  take  long, and then we can get
back into the mainstream.

Several people have asked  me  about  things that other languages
provide, but so far I haven't addressed in this series.   The two
biggies are semicolons and  comments.    Perhaps  you've wondered
about them, too, and  wondered  how things would change if we had
to  deal with them.  Just so you can proceed with what's to come,
without being  bothered by that nagging feeling that something is
missing, we'll address such issues here.


SEMICOLONS

Ever since the introduction of Algol, semicolons have been a part
of  almost every modern language.  We've all  used  them  to  the
point that they are taken for  granted.   Yet I suspect that more
compilation errors have  occurred  due  to  misplaced  or missing
semicolons  than  any  other single cause.  And if we had a penny
for  every  extra  keystroke programmers have used  to  type  the
little rascals, we could pay off the national debt.

Having  been  brought  up with FORTRAN, it took me a long time to
get used to using semicolons, and to tell the  truth  I've  never
quite understood why they  were  necessary.    Since I program in
Pascal, and since the use of semicolons in Pascal is particularly
tricky,  that one little character is still  by  far  my  biggest
source of errors.

When  I  began  developing  KISS,  I resolved to  question  EVERY
construct in other languages, and to try to avoid the most common
problems that occur with them.  That puts the semicolon very high
on my hit list.

To  understand  the  role of the semicolon, you have to look at a
little history.

Early programming languages were line-oriented.  In  FORTRAN, for
example, various parts  of  the statement had specific columns or
fields that they had to appear in.  Since  some  statements  were
too  long for one line, the  "continuation  card"  mechanism  was
provided to let  the  compiler  know  that a given card was still
part of the previous  line.   The mechanism survives to this day,
even though punched cards are now things of the distant past.

When  other  languages  came  along,  they  also  adopted various
mechanisms for dealing with multiple-line statements.  BASIC is a
good  example.  It's important to  recognize,  though,  that  the
FORTRAN  mechanism  was   not   so  much  required  by  the  line
orientation of that  language,  as by the column-orientation.  In
those versions of FORTRAN  where  free-form  input  is permitted,
it's no longer needed.

When the fathers  of  Algol introduced that language, they wanted
to get away  from  line-oriented programs like FORTRAN and BASIC,
and allow for free-form input.   This included the possibility of
stringing multiple statements on a single line, as in


     a=b; c=d; e=e+1;


In cases like this,  the  semicolon is almost REQUIRED.  The same
line, without the semicolons, just looks "funny":


     a=b c= d e=e+1

I suspect that this is the major ... perhaps ONLY ...  reason for
semicolons: to keep programs from looking funny.

But  the  idea  of stringing multiple statements  together  on  a
single  line  is  a  dubious  one  at  best.  It's not very  good
programming  style,  and  harks back to  the  days  when  it  was
considered improtant to conserve cards.  In these  days  of CRT's
and indented code, the clarity of programs is  far  better served
by  keeping statements separate.  It's still  nice  to  have  the
OPTION  of  multiple  statements,  but  it seems a shame to  keep
programmers  in  slavery  to the semicolon, just to keep that one
rare case from "looking funny."

When I started in with KISS, I tried  to  keep  an  open mind.  I
decided that I would use  semicolons when it became necessary for
the parser, but not until then.  I figured this would happen just
about  the time I added the ability  to  spread  statements  over
multiple lines.  But, as you  can  see, that never happened.  The
TINY compiler is perfectly  happy  to  parse the most complicated
statement, spread over any number of lines, without semicolons.

Still, there are people  who  have  used  semicolons for so long,
they feel naked  without them.  I'm one of them.  Once I had KISS
defined sufficiently well, I began to write a few sample programs
in the language.    I  discovered,  somewhat to my horror, that I
kept  putting  semicolons  in anyway.   So  now  I'm  facing  the
prospect of a NEW  rash  of  compiler  errors, caused by UNWANTED
semicolons.  Phooey!

Perhaps more to the point, there are readers out  there  who  are
designing their own languages, which may  include  semicolons, or
who  want to use the techniques of  these  tutorials  to  compile
conventional languages like  C.    In  either case, we need to be
able to deal with semicolons.


SYNTACTIC SUGAR

This whole discussion brings  up  the  issue of "syntactic sugar"
... constructs that are added to a language, not because they are
needed, but because they help make the programs look right to the
programmer.    After  all, it's nice  to  have  a  small,  simple
compiler,    but  it  would  be  of  little  use if the resulting
language  were  cryptic  and hard to program.  The language FORTH
comes  to mind (a premature OUCH! for the  barrage  I  know  that
one's going to fetch me).  If we can add features to the language
that  make the programs easier to read  and  understand,  and  if
those features  help keep the programmer from making errors, then
we should do so.    Particularly if the constructs don't add much
to the complexity of the language or its compiler.

The  semicolon  could  be considered an example,  but  there  are
plenty of others, such as the 'THEN' in a IF-statement,  the 'DO'
in a WHILE-statement,  and  even the 'PROGRAM' statement, which I
came within a gnat's eyelash of leaving out  of  TINY.    None of
these tokens  add  much  to  the  syntax  of the language ... the
compiler can figure out  what's  going on without them.  But some
folks feel that they  DO  add to the readability of programs, and
that can be very important.

There are two schools of thought on this subject, which  are well
represented by two of our most popular languages, C and Pascal.

To  the minimalists, all such sugar should be  left  out.    They
argue that it clutters up the language and adds to the  number of
keystrokes  programmers  must type.   Perhaps  more  importantly,
every extra token or keyword represents a trap laying in wait for
the inattentive programmer.  If you leave out  a  token, misplace
it, or misspell it, the compiler  will  get you.  So these people
argue that the best approach is to get rid of such things.  These
folks tend to like C, which has a minimum of unnecessary keywords
and punctuation.

Those from the other school tend to like Pascal.  They argue that
having to type a few extra characters is a small price to pay for
legibility.    After  all, humans have to read the programs, too.
Their best argument is that each such construct is an opportunity
to tell the compiler that you really mean for it  to  do what you
said to.  The sugary tokens serve as useful landmarks to help you
find your way.

The differences are well represented by the two  languages.   The
most oft-heard complaint about  C  is  that  it is too forgiving.
When you make a mistake in C, the  erroneous  code  is  too often
another  legal  C  construct.    So  the  compiler  just  happily
continues to compile, and  leaves  you  to  find the error during
debug.    I guess that's why debuggers  are  so  popular  with  C
programmers.

On the  other  hand,  if  a  Pascal  program compiles, you can be
pretty  sure that the program will do what you told it.  If there
is an error at run time, it's probably a design error.

The  best  example  of  useful  sugar  is  the semicolon  itself.
Consider the code fragment:


     a=1+(2*b+c)   b...


Since there is no operator connecting the token 'b' with the rest
of the  statement, the compiler will conclude that the expression
ends  with  the  ')', and the 'b'  is  the  beginning  of  a  new
statement.    But  suppose  I  have simply left out the  intended
operator, and I really want to say:


     a=1+(2*b+c)*b...


In  this  case  the compiler will get an error, all right, but it
won't be very meaningful  since  it will be expecting an '=' sign
after the 'b' that really shouldn't be there.

If, on the other hand, I include a semicolon after the  'b', THEN
there  can  be no doubt where I  intend  the  statement  to  end.
Syntactic  sugar,  then,  can  serve  a  very  useful purpose  by
providing some additional insurance that we remain on track.

I find  myself  somewhere  in  the middle of all this.  I tend to
favor the Pascal-ers' view ... I'd much rather find  my  bugs  at
compile time rather than run time.  But I also hate to just throw
verbosity  in  for  no apparent reason, as in COBOL.  So far I've
consistently left most of the Pascal sugar out of KISS/TINY.  But
I certainly have no strong feelings either way, and  I  also  can
see the value of sprinkling a little sugar around  just  for  the
extra  insurance  that  it  brings.    If  you like  this  latter
approach, things like that are easy to add.  Just  remember that,
like  the semicolon, each item of sugar  is  something  that  can
potentially cause a compile error by its omission.


DEALING WITH SEMICOLONS

There  are  two  distinct  ways  in which semicolons are used  in
popular  languages.    In Pascal, the semicolon is regarded as an
statement SEPARATOR.  No semicolon  is  required  after  the last
statement in a block.  The syntax is:


     <block> ::= <statement> ( ';' <statement>)*

     <statement> ::= <assignment> | <if> | <while> ... | null


(The null statement is IMPORTANT!)

Pascal  also defines some semicolons in  other  places,  such  as
after the PROGRAM statement.

In  C  and  Ada, on the other hand, the semicolon is considered a
statement TERMINATOR,  and  follows  all  statements  (with  some
embarrassing and confusing  exceptions).   The syntax for this is
simply:


     <block> ::= ( <statement> ';')*


Of  the two syntaxes, the Pascal one seems on the face of it more
rational, but experience has shown  that it leads to some strange
difficulties.  People get  so  used  to  typing a semicolon after
every  statement  that  they tend to  type  one  after  the  last
statement in a block, also.  That usually doesn't cause  any harm
...  it  just gets treated as a  null  statement.    Many  Pascal
programmers, including yours truly,  do  just  that. But there is
one  place you absolutely CANNOT type  a  semicolon,  and  that's
right before an ELSE.  This little gotcha  has  cost  me  many an
extra  compilation,  particularly  when  the  ELSE  is  added  to
existing code.    So  the  C/Ada  choice  turns out to be better.
Apparently Nicklaus Wirth thinks so, too:  In his  Modula  2,  he
abandoned the Pascal approach.

Given either of these two syntaxes, it's an easy matter (now that
we've  reorganized  the  parser!) to add these  features  to  our
parser.  Let's take the last case first, since it's simpler.

To begin, I've made things easy by introducing a new recognizer:


{--------------------------------------------------------------}
{ Match a Semicolon }

procedure Semi;
begin
   MatchString(';');
end;
{--------------------------------------------------------------}


This procedure works very much like our old Match.  It insists on
finding a semicolon as the next token.  Having found it, it skips
to the next one.

Since a  semicolon follows a statement, procedure Block is almost
the only one we need to change:


{--------------------------------------------------------------}
{ Parse and Translate a Block of Statements }

procedure Block;
begin
   Scan;
   while not(Token in ['e', 'l']) do begin
      case Token of
       'i': DoIf;
       'w': DoWhile;
       'R': DoRead;
       'W': DoWrite;
       'x': Assignment;
      end;
      Semi;
      Scan;
   end;
end;
{--------------------------------------------------------------}


Note carefully the subtle change in the case statement.  The call
to  Assignment  is now guarded by a test on Token.   This  is  to
avoid calling Assignment when the  token  is  a  semicolon (which
could happen if the statement is null).

Since declarations are also  statements,  we  also  need to add a
call to Semi within procedure TopDecls:


{--------------------------------------------------------------}
{ Parse and Translate Global Declarations }

procedure TopDecls;
begin
   Scan;
   while Token = 'v' do begin
      Alloc;
      while Token = ',' do
         Alloc;
      Semi;
   end;
end;
{--------------------------------------------------------------}


Finally, we need one for the PROGRAM statement:


{--------------------------------------------------------------}
{ Main Program }

begin
   Init;
   MatchString('PROGRAM');
   Semi;
   Header;
   TopDecls;
   MatchString('BEGIN');
   Prolog;
   Block;
   MatchString('END');
   Epilog;
end.
{--------------------------------------------------------------}


It's as easy as that.  Try it with a copy of TINY and see how you
like it.

The Pascal version  is  a  little  trickier,  but  it  still only
requires  minor  changes,  and those only to procedure Block.  To
keep things as simple as possible, let's split the procedure into
two parts.  The following procedure handles just one statement:


{--------------------------------------------------------------}
{ Parse and Translate a Single Statement }

procedure Statement;
begin
   Scan;
   case Token of
    'i': DoIf;
    'w': DoWhile;
    'R': DoRead;
    'W': DoWrite;
    'x': Assignment;
   end;
end;
{--------------------------------------------------------------}


Using this procedure, we can now rewrite Block like this:


{--------------------------------------------------------------}
{ Parse and Translate a Block of Statements }

procedure Block;
begin
   Statement;
   while Token = ';' do begin
      Next;
      Statement;
   end;
end;
{--------------------------------------------------------------}


That  sure  didn't  hurt, did it?  We can now parse semicolons in
Pascal-like fashion.


A COMPROMISE

Now that we know how to deal with semicolons, does that mean that
I'm going to put them in KISS/TINY?  Well, yes and  no.    I like
the extra sugar and the security that comes with knowing for sure
where the  ends  of  statements  are.    But I haven't changed my
dislike for the compilation errors associated with semicolons.

So I have what I think is a nice compromise: Make them OPTIONAL!

Consider the following version of Semi:


{--------------------------------------------------------------}
{ Match a Semicolon }

procedure Semi;
begin
   if Token = ';' then Next;
end;
{--------------------------------------------------------------}


This procedure will ACCEPT a semicolon whenever it is called, but
it won't INSIST on one.  That means that when  you  choose to use
semicolons, the compiler  will  use the extra information to help
keep itself on track.  But if you omit one (or omit them all) the
compiler won't complain.  The best of both worlds.

Put this procedure in place in the first version of  your program
(the  one for C/Ada syntax), and you have  the  makings  of  TINY
Version 1.2.


COMMENTS

Up  until  now  I have carefully avoided the subject of comments.
You would think that this would be an easy subject ... after all,
the compiler doesn't have to deal with comments at all; it should
just ignore them.  Well, sometimes that's true.

Comments can be just about as easy or as difficult as  you choose
to make them.    At  one  extreme,  we can arrange things so that
comments  are  intercepted  almost  the  instant  they  enter the
compiler.  At the  other,  we can treat them as lexical elements.
Things  tend to get interesting when  you  consider  things  like
comment delimiters contained in quoted strings.


SINGLE-CHARACTER DELIMITERS

Here's an example.  Suppose we assume the  Turbo  Pascal standard
and use curly braces for comments.  In this case we  have single-
character delimiters, so our parsing is a little easier.

One  approach  is  to  strip  the  comments  out the  instant  we
encounter them in the input stream; that is,  right  in procedure
GetChar.    To  do  this,  first  change  the  name of GetChar to
something else, say GetCharX.  (For the record, this is  going to
be a TEMPORARY change, so best not do this with your only copy of
TINY.  I assume you understand that you should  always  do  these
experiments with a working copy.)

Now, we're going to need a  procedure  to skip over comments.  So
key in the following one:


{--------------------------------------------------------------}
{ Skip A Comment Field }

procedure SkipComment;
begin
   while Look <> '}' do
      GetCharX;
   GetCharX;
end;
{--------------------------------------------------------------}


Clearly, what this procedure is going to do is to simply read and
discard characters from the input  stream, until it finds a right
curly brace.  Then it reads one more character and returns  it in
Look.

Now we can  write  a  new  version of GetChar that SkipComment to
strip out comments:


{--------------------------------------------------------------}
{ Get Character from Input Stream }
{ Skip Any Comments }

procedure GetChar;
begin
   GetCharX;
   if Look = '{' then SkipComment;
end;
{--------------------------------------------------------------}


Code this up  and  give  it  a  try.    You'll find that you can,
indeed, bury comments anywhere you like.  The comments never even
get into the parser proper ... every call to GetChar just returns
any character that's NOT part of a comment.

As a matter of fact, while  this  approach gets the job done, and
may even be  perfectly  satisfactory  for  you, it does its job a
little  TOO  well.    First  of all, most  programming  languages
specify that a comment should be treated like a  space,  so  that
comments aren't allowed  to  be embedded in, say, variable names.
This current version doesn't care WHERE you put comments.

Second, since the  rest  of  the  parser can't even receive a '{'
character, you will not be allowed to put one in a quoted string.

Before you turn up your nose at this simplistic solution, though,
I should point out  that  as respected a compiler as Turbo Pascal
also won't allow  a  '{' in a quoted string.  Try it.  And as for
embedding a comment in an  identifier, I can't imagine why anyone
would want to do such a  thing,  anyway, so the question is moot.
For 99% of all  applications,  what I've just shown you will work
just fine.

But,  if  you  want  to  be  picky  about it  and  stick  to  the
conventional treatment, then we  need  to  move  the interception
point downstream a little further.

To  do  this,  first change GetChar back to the way  it  was  and
change the name called in SkipComment.  Then, let's add  the left
brace as a possible whitespace character:


{--------------------------------------------------------------}
{ Recognize White Space }

function IsWhite(c: char): boolean;
begin
   IsWhite := c in [' ', TAB, CR, LF, '{'];
end;
{--------------------------------------------------------------}


Now, we can deal with comments in procedure SkipWhite:


{--------------------------------------------------------------}
{ Skip Over Leading White Space }

procedure SkipWhite;
begin
   while IsWhite(Look) do begin
      if Look = '{' then
         SkipComment
      else
         GetChar;
   end;
end;
{--------------------------------------------------------------}


Note  that SkipWhite is written so that we  will  skip  over  any
combination of whitespace characters and comments, in one call.

OK, give this one a try, too.   You'll  find  that  it will let a
comment serve to delimit tokens.  It's worth mentioning that this
approach also gives us the  ability to handle curly braces within
quoted strings, since within such  strings we will not be testing
for or skipping over whitespace.

There's one last  item  to  deal  with:  Nested  comments.   Some
programmers like the idea  of  nesting  comments, since it allows
you to comment out code during debugging.  The  code  I've  given
here won't allow that and, again, neither will Turbo Pascal.

But the fix is incredibly easy.  All  we  need  to  do is to make
SkipComment recursive:


{--------------------------------------------------------------}
{ Skip A Comment Field }

procedure SkipComment;
begin
   while Look <> '}' do begin
      GetChar;
      if Look = '{' then SkipComment;
   end;
   GetChar;
end;
{--------------------------------------------------------------}


That does it.  As  sophisticated a comment-handler as you'll ever
need.


MULTI-CHARACTER DELIMITERS

That's all well and  good  for cases where a comment is delimited
by single  characters,  but  what  about  the  cases such as C or
standard Pascal, where two  characters  are  required?  Well, the
principles are still the same, but we have to change our approach
quite a bit.  I'm sure it won't surprise you to learn that things
get harder in this case.

For the multi-character situation, the  easiest thing to do is to
intercept the left delimiter  back  at the GetChar stage.  We can
"tokenize" it right there, replacing it by a single character.

Let's assume we're using the C delimiters '/*' and '*/'.   First,
we  need  to  go back to the "GetCharX' approach.  In yet another
copy of your compiler, rename  GetChar to GetCharX and then enter
the following new procedure GetChar:


{--------------------------------------------------------------}
{ Read New Character.  Intercept '/*' }

procedure GetChar;
begin
   if TempChar <> ' ' then begin
      Look := TempChar;
      TempChar := ' ';
      end
   else begin
      GetCharX;
      if Look = '/' then begin
         Read(TempChar);
         if TempChar = '*' then begin
            Look := '{';
            TempChar := ' ';
         end;
      end;
   end;
end;
{--------------------------------------------------------------}


As you can see, what this procedure does is  to  intercept  every
occurrence of '/'.  It then examines the NEXT  character  in  the
stream.  If the character  is  a  '*',  then  we  have  found the
beginning  of  a  comment,  and  GetChar  will  return  a  single
character replacement for it.   (For  simplicity,  I'm  using the
same '{' character  as I did for Pascal.  If you were writing a C
compiler, you'd no doubt want to pick some other character that's
not  used  elsewhere  in C.  Pick anything you like ... even $FF,
anything that's unique.)

If the character  following  the  '/'  is NOT a '*', then GetChar
tucks it away in the new global TempChar, and  returns  the  '/'.

Note that you need to declare this new variable and initialize it
to ' '.  I like to do  things  like  that  using the Turbo "typed
constant" construct:


     const TempChar: char = ' ';


Now we need a new version of SkipComment:


{--------------------------------------------------------------}
{ Skip A Comment Field }

procedure SkipComment;
begin
   repeat
      repeat
         GetCharX;
      until Look = '*';
      GetCharX;
   until Look = '/';
   GetChar;
end;
{--------------------------------------------------------------}


A  few  things  to  note:  first  of  all, function  IsWhite  and
procedure SkipWhite  don't  need  to  be  changed,  since GetChar
returns the '{' token.  If you change that token  character, then
of  course you also need to change the  character  in  those  two
routines.

Second, note that  SkipComment  doesn't call GetChar in its loop,
but  GetCharX.    That  means   that  the  trailing  '/'  is  not
intercepted and  is seen by SkipComment.  Third, although GetChar
is the  procedure  doing  the  work,  we  can still deal with the
comment  characters  embedded  in  a  quoted  string,  by calling
GetCharX  instead  of  GetChar  while  we're  within  the string.
Finally,  note  that  we can again provide for nested comments by
adding a single statement to SkipComment, just as we did before.


ONE-SIDED COMMENTS

So far I've shown you  how  to  deal  with  any  kind  of comment
delimited on the left and the  right.   That only leaves the one-
sided comments like those in assembler language or  in  Ada, that
are terminated by the end of the line.  In a  way,  that  case is
easier.   The only procedure that would need  to  be  changed  is
SkipComment, which must now terminate at the newline characters:


{--------------------------------------------------------------}
{ Skip A Comment Field }

procedure SkipComment;
begin
   repeat
      GetCharX;
   until Look = CR;
   GetChar;
end;
{--------------------------------------------------------------}


If the leading character is  a  single  one,  as  in  the  ';' of
assembly language, then we're essentially done.  If  it's  a two-
character token, as in the '--'  of  Ada, we need only modify the
tests  within  GetChar.   Either way, it's an easier problem than
the balanced case.


CONCLUSION

At this point we now have the ability to deal with  both comments
and semicolons, as well as other kinds of syntactic sugar.   I've
shown  you several ways to deal with  each,  depending  upon  the
convention  desired.    The  only  issue left is: which of  these
conventions should we use in KISS/TINY?

For the reasons that I've given as we went  along,  I'm  choosing
the following:


 (1) Semicolons are TERMINATORS, not separators

 (2) Semicolons are OPTIONAL

 (3) Comments are delimited by curly braces

 (4) Comments MAY be nested


Put the code corresponding to these cases into your copy of TINY.
You now have TINY Version 1.2.

Now that we  have  disposed  of  these  sideline  issues,  we can
finally get back into the mainstream.  In  the  next installment,
we'll talk  about procedures and parameter passing, and we'll add
these important features to TINY.  See you then.

































                     LET'S BUILD A COMPILER!

                                By

                     Jack W. Crenshaw, Ph.D.

                          27 August 1989


                      Part XIII: PROCEDURES





INTRODUCTION

At last we get to the good part!

At  this point we've studied almost all  the  basic  features  of
compilers  and  parsing.    We  have  learned  how  to  translate
arithmetic expressions, Boolean expressions, control  constructs,
data  declarations,  and  I/O  statements.    We  have defined  a
language, TINY 1.3, that embodies all these features, and we have
written  a  rudimentary  compiler that can translate  them.    By
adding some file I/O we could indeed have a working compiler that
could produce executable object files  from  programs  written in
TINY.  With such a compiler, we could write simple  programs that
could read integer data, perform calculations with it, and output
the results.

That's nice, but what we have is still only a  toy  language.  We
can't read or write even a single character of text, and we still
don't have procedures.

It's  the  features  to  be  discussed  in  the  next  couple  of
installments  that  separate  the men from the toys, so to speak.
"Real" languages have more than one data type,  and  they support
procedure calls.  More than any others, it's  these  two features
that give a language much of its character and personality.  Once
we  have  provided   for   them,  our  languages,  TINY  and  its
successors, will cease  to  become  toys  and  will  take  on the
character  of  real  languages,  suitable for serious programming
jobs.

For several installments now, I've been promising you sessions on
these  two  important  subjects.  Each time, other issues came up
that required me to  digress  and deal with them.  Finally, we've
been able to put all those issues to rest and can get on with the
mainstream  of  things.    In   this   installment,   I'll  cover
procedures.  Next time, we'll talk about the basic data types.


ONE LAST DIGRESSION

This has  been an extraordinarily difficult installment for me to
write.  The reason has nothing to do with the subject  itself ...
I've  known  what I wanted to say for some time, and  in  fact  I
presented  most  of  this at Software Development  '89,  back  in
February.  It has more to do with the approach.  Let me explain.

When I first  began  this  series,  I  told you that we would use
several "tricks" to  make  things  easy,  and to let us learn the
concepts without getting too bogged down in the  details.   Among
these tricks was the idea of looking at individual  pieces  of  a
compiler at  a time, i.e. performing experiments using the Cradle
as a base.  When we studied expressions, for  example,  we  dealt
with only that part of compiler theory.  When we  studied control
structures,  we wrote a different program,  still  based  on  the
Cradle, to do that part. We only incorporated these concepts into
a complete language fairly recently. These techniques have served
us very well indeed, and led us to the development of  a compiler
for TINY version 1.3.

When  I  first  began this session, I tried to build upon what we
had already done, and  just  add the new features to the existing
compiler.  That turned out to be a little awkward and  tricky ...
much too much to suit me.

I finally figured out why.  In this series of experiments,  I had
abandoned the very useful techniques that had allowed  us  to get
here, and  without  meaning  to  I  had  switched over into a new
method of  working, that involved incremental changes to the full
TINY compiler.

You  need  to  understand that what we are doing here is a little
unique.  There have been a number of articles, such as  the Small
C articles by Cain and Hendrix, that presented finished compilers
for one language or another.  This is different.  In  this series
of tutorials, you are  watching  me  design  and implement both a
language and a compiler, in real time.

In the experiments that I've been doing in  preparation  for this
article,  I  was  trying to inject  the  changes  into  the  TINY
compiler  in such a way that, at every step, we still had a real,
working  compiler.     In   other  words,  I  was  attempting  an
incremental enhancement of the language and  its  compiler, while
at the same time explaining to you what I was doing.

That's a tough act to pull off!  I finally  realized  that it was
dumb to try.    Having  gotten  this  far using the idea of small
experiments   based   on   single-character  tokens  and  simple,
special-purpose  programs,  I  had  abandoned  them  in  favor of
working with the full compiler.  It wasn't working.

So we're going to go back to our  roots,  so  to  speak.  In this
installment and the next, I'll be  using  single-character tokens
again as we study the concepts of procedures,  unfettered  by the
other baggage  that we have accumulated in the previous sessions.
As a  matter  of  fact,  I won't even attempt, at the end of this
session, to merge the constructs into the TINY  compiler.   We'll
save that for later.

After all this time, you don't need more buildup  than  that,  so
let's waste no more time and dive right in.


THE BASICS

All modern  CPU's provide direct support for procedure calls, and
the  68000  is no exception.  For the 68000, the call  is  a  BSR
(PC-relative version) or JSR, and the return is RTS.  All we have
to do is to arrange for  the  compiler to issue these commands at
the proper place.

Actually, there are really THREE things we have to address.   One
of  them  is  the  call/return  mechanism.    The second  is  the
mechanism  for  DEFINING  the procedure in the first place.  And,
finally, there is the issue of passing parameters  to  the called
procedure.  None of these things are really  very  difficult, and
we can of course borrow heavily on what people have done in other
languages ... there's no need to reinvent the wheel here.  Of the
three issues, that of parameter passing will occupy  most  of our
attention, simply because there are so many options available.


A BASIS FOR EXPERIMENTS

As always, we will need some software to  serve  as  a  basis for
what  we are doing.  We don't need the full TINY compiler, but we
do need enough of a program so that some of the  other constructs
are present.  Specifically, we need at least to be able to handle
statements of some sort, and data declarations.

The program shown below is that basis.  It's a vestigial  form of
TINY, with single-character tokens.   It  has  data declarations,
but only in their simplest form ... no lists or initializers.  It
has assignment statements, but only of the kind

     <ident> = <ident>

In  other  words,  the only legal expression is a single variable
name.    There  are no control  constructs  ...  the  only  legal
statement is the assignment.

Most of the program  is  just the standard Cradle routines.  I've
shown the whole thing here, just to make sure we're  all starting
from the same point:


{--------------------------------------------------------------}
program Calls;

{--------------------------------------------------------------}
{ Constant Declarations }

const TAB = ^I;
      CR  = ^M;
      LF  = ^J;

{--------------------------------------------------------------}
{ Variable Declarations }

var Look: char;              { Lookahead Character }

var ST: Array['A'..'Z'] of char;


{--------------------------------------------------------------}
{ Read New Character From Input Stream }

procedure GetChar;
begin
   Read(Look);
end;

{--------------------------------------------------------------}
{ Report an Error }

procedure Error(s: string);
begin
   WriteLn;
   WriteLn(^G, 'Error: ', s, '.');
end;


{--------------------------------------------------------------}
{ Report Error and Halt }

procedure Abort(s: string);
begin
   Error(s);
   Halt;
end;


{--------------------------------------------------------------}
{ Report What Was Expected }

procedure Expected(s: string);
begin
   Abort(s + ' Expected');
end;


{--------------------------------------------------------------}
{ Report an Undefined Identifier }

procedure Undefined(n: string);
begin
   Abort('Undefined Identifier ' + n);
end;


{--------------------------------------------------------------}
{ Report an Duplicate Identifier }

procedure Duplicate(n: string);
begin
     Abort('Duplicate Identifier ' + n);
end;


{--------------------------------------------------------------}
{ Get Type of Symbol }

function TypeOf(n: char): char;
begin
     TypeOf := ST[n];
end;


{--------------------------------------------------------------}
{ Look for Symbol in Table }

function InTable(n: char): Boolean;
begin
   InTable := ST[n] <> ' ';
end;


{--------------------------------------------------------------}
{ Add a New Symbol to Table }

procedure AddEntry(Name, T: char);
begin
     if Intable(Name) then Duplicate(Name);
     ST[Name] := T;
end;


{--------------------------------------------------------------}
{ Check an Entry to Make Sure It's a Variable }

procedure CheckVar(Name: char);
begin
     if not InTable(Name) then Undefined(Name);
     if  TypeOf(Name)  <>  'v'  then    Abort(Name  +  ' is not a
variable');
end;


{--------------------------------------------------------------}
{ Recognize an Alpha Character }

function IsAlpha(c: char): boolean;
begin
   IsAlpha := upcase(c) in ['A'..'Z'];
end;


{--------------------------------------------------------------}
{ Recognize a Decimal Digit }

function IsDigit(c: char): boolean;
begin
   IsDigit := c in ['0'..'9'];
end;


{--------------------------------------------------------------}
{ Recognize an AlphaNumeric Character }

function IsAlNum(c: char): boolean;
begin
   IsAlNum := IsAlpha(c) or IsDigit(c);
end;


{--------------------------------------------------------------}
{ Recognize an Addop }

function IsAddop(c: char): boolean;
begin
   IsAddop := c in ['+', '-'];
end;


{--------------------------------------------------------------}
{ Recognize a Mulop }

function IsMulop(c: char): boolean;
begin
   IsMulop := c in ['*', '/'];
end;


{--------------------------------------------------------------}
{ Recognize a Boolean Orop }

function IsOrop(c: char): boolean;
begin
   IsOrop := c in ['|', '~'];
end;


{--------------------------------------------------------------}
{ Recognize a Relop }

function IsRelop(c: char): boolean;
begin
   IsRelop := c in ['=', '#', '<', '>'];
end;


{--------------------------------------------------------------}
{ Recognize White Space }

function IsWhite(c: char): boolean;
begin
   IsWhite := c in [' ', TAB];
end;


{--------------------------------------------------------------}
{ Skip Over Leading White Space }

procedure SkipWhite;
begin
   while IsWhite(Look) do
      GetChar;
end;


{--------------------------------------------------------------}
{ Skip Over an End-of-Line }

procedure Fin;
begin
   if Look = CR then begin
      GetChar;
      if Look = LF then
         GetChar;
   end;
end;


{--------------------------------------------------------------}
{ Match a Specific Input Character }

procedure Match(x: char);
begin
   if Look = x then GetChar
     else Expected('''' + x + '''');
     SkipWhite;
end;


{--------------------------------------------------------------}
{ Get an Identifier }

function GetName: char;
begin
   if not IsAlpha(Look) then Expected('Name');
   GetName := UpCase(Look);
     GetChar;
     SkipWhite;
end;


{--------------------------------------------------------------}
{ Get a Number }

function GetNum: char;
begin
   if not IsDigit(Look) then Expected('Integer');
   GetNum := Look;
     GetChar;
     SkipWhite;
end;


{--------------------------------------------------------------}
{ Output a String with Tab }

procedure Emit(s: string);
begin
   Write(TAB, s);
end;


{--------------------------------------------------------------}
{ Output a String with Tab and CRLF }

procedure EmitLn(s: string);
begin
   Emit(s);
   WriteLn;
end;


{--------------------------------------------------------------}
{ Post a Label To Output }

procedure PostLabel(L: string);
begin
   WriteLn(L, ':');
end;


{--------------------------------------------------------------}
{ Load a Variable to the Primary Register }

procedure LoadVar(Name: char);
begin
     CheckVar(Name);
     EmitLn('MOVE ' + Name + '(PC),D0');
end;


{--------------------------------------------------------------}
{ Store the Primary Register }

procedure StoreVar(Name: char);
begin
     CheckVar(Name);
     EmitLn('LEA ' + Name + '(PC),A0');
   EmitLn('MOVE D0,(A0)')
end;


{--------------------------------------------------------------}
{ Initialize }

procedure Init;
var i: char;
begin
     GetChar;
     SkipWhite;
     for i := 'A' to 'Z' do
          ST[i] := ' ';
end;


{--------------------------------------------------------------}
{ Parse and Translate an Expression }
{ Vestigial Version }

procedure Expression;
begin
     LoadVar(GetName);
end;


{--------------------------------------------------------------}
{ Parse and Translate an Assignment Statement }

procedure Assignment;
var Name: char;
begin
     Name := GetName;
     Match('=');
     Expression;
     StoreVar(Name);
end;


{--------------------------------------------------------------}

                             






{ Parse and Translate a Block of Statements }

procedure DoBlock;
begin
     while not(Look in ['e']) do begin
          Assignment;
          Fin;
   end;
end;


{--------------------------------------------------------------}
{ Parse and Translate a Begin-Block }

procedure BeginBlock;
begin
     Match('b');
     Fin;
     DoBlock;
     Match('e');
     Fin;
end;


{--------------------------------------------------------------}
{ Allocate Storage for a Variable }

procedure Alloc(N: char);
begin
     if InTable(N) then Duplicate(N);
   ST[N] := 'v';
     WriteLn(N, ':', TAB, 'DC 0');
end;


{--------------------------------------------------------------}
{ Parse and Translate a Data Declaration }

procedure Decl;
var Name: char;
begin
   Match('v');
     Alloc(GetName);
end;


{--------------------------------------------------------------}
{ Parse and Translate Global Declarations }

procedure TopDecls;
begin
     while Look <> 'b' do begin
      case Look of
        'v': Decl;
      else Abort('Unrecognized Keyword ' + Look);
          end;
          Fin;
     end;
end;


{--------------------------------------------------------------}
{ Main Program }

begin
     Init;
     TopDecls;
     BeginBlock;
end.
{--------------------------------------------------------------}


Note  that we DO have a symbol table, and there is logic to check
a variable name to make sure it's a legal one.    It's also worth
noting that I  have  included  the  code  you've  seen  before to
provide for white space  and  newlines.    Finally, note that the
main program is delimited, as usual, by BEGIN-END brackets.

Once you've copied  the  program  to  Turbo, the first step is to
compile it and make sure it  works.   Give it a few declarations,
and then a begin-block.  Try something like:


     va             (for VAR A)
     vb             (for VAR B)
     vc             (for VAR C)
     b              (for BEGIN)
     a=b
     b=c
     e.             (for END.)


As usual, you should also make some deliberate errors, and verify
that the program catches them correctly.


DECLARING A PROCEDURE

If you're satisfied that our little program works, then it's time
to  deal  with  the  procedures.  Since we haven't  talked  about
                             






parameters yet, we'll begin by considering  only  procedures that
have no parameter lists.

As a start, let's consider a simple program with a procedure, and
think about the code we'd like to see generated for it:


     PROGRAM FOO;
     .
     .
     PROCEDURE BAR;                     BAR:
     BEGIN                                   .
     .                                       .
     .                                       .
     END;                                    RTS

     BEGIN { MAIN PROGRAM }             MAIN:
     .                                       .
     .                                       .
     FOO;                                    BSR BAR
     .                                       .
     .                                       .
     END.                                    END MAIN


Here I've shown  the  high-order language constructs on the left,
and the desired assembler code on the right.  The first  thing to
notice  is that we certainly don't have  much  code  to  generate
here!  For  the  great  bulk  of  both the procedure and the main
program,  our existing constructs take care of  the  code  to  be
generated.

The key to dealing with the body of the procedure is to recognize
that  although a procedure may be quite  long,  declaring  it  is
really no different than  declaring  a  variable.   It's just one
more kind of declaration.  We can write the BNF:


     <declaration> ::= <data decl> | <procedure>


This means that it should be easy to modify TopDecl to  deal with
procedures.  What about the syntax of a procedure?   Well, here's
a suggested syntax, which is essentially that of Pascal:


     <procedure> ::= PROCEDURE <ident> <begin-block>


There is practically no code generation required, other than that
generated within the begin-block.    We need only emit a label at
the beginning of the procedure, and an RTS at the end.

Here's the required code:

{--------------------------------------------------------------}
{ Parse and Translate a Procedure Declaration }

procedure DoProc;
var N: char;
begin
     Match('p');
     N := GetName;
     Fin;
     if InTable(N) then Duplicate(N);
     ST[N] := 'p';
     PostLabel(N);
     BeginBlock;
     Return;
end;
{--------------------------------------------------------------}


Note that I've added a new code generation routine, Return, which
merely emits an RTS instruction.  The creation of that routine is
"left as an exercise for the student."

To  finish  this  version, add the following line within the Case
statement in DoBlock:


            'p': DoProc;


I should mention that  this  structure  for declarations, and the
BNF that drives it, differs from standard Pascal.  In  the Jensen
& Wirth  definition of Pascal, variable declarations, in fact ALL
kinds of declarations,  must  appear in a specific sequence, i.e.
labels,   constants,  types,  variables,  procedures,  and   main
program.  To  follow  such  a  scheme, we should separate the two
declarations, and have code in the main program something like


     DoVars;
     DoProcs;
     DoMain;


However,  most implementations of Pascal, including Turbo,  don't
require  that  order  and  let  you  freely  mix up  the  various
declarations,  as  long  as  you  still  don't  try to  refer  to
something  before  it's  declared.    Although  it  may  be  more
aesthetically pleasing to declare all the global variables at the
top of the  program,  it  certainly  doesn't do any HARM to allow
them to be sprinkled around.   In  fact,  it may do some GOOD, in
the  sense  that it gives you the  opportunity  to  do  a  little
rudimentary  information  hiding.     Variables  that  should  be
accessed only by the main program, for example,  can  be declared
just before it and will thus be inaccessible by the procedures.

OK, try this new version out.  Note that we  can  declare as many
procedures as we choose (as long  as  we don't run out of single-
character names!), and the  labels  and RTS's all come out in the
right places.

It's  worth  noting  here  that  I  do  _NOT_  allow  for  nested
procedures.   In TINY, all procedures must  be  declared  at  the
global level,  the  same  as  in  C.    There  has  been  quite a
discussion about this point in  the  Computer  Language  Forum of
CompuServe.  It turns out that there is a significant  penalty in
complexity that must be paid for the luxury of nested procedures.
What's  more,  this  penalty gets paid at RUN TIME, because extra
code must be added and executed every time a procedure is called.
I also happen to believe that nesting is not a good  idea, simply
on the grounds that I have seen too many abuses of the feature.
Before going on to the next step, it's also worth noting that the
"main program" as it stands  is incomplete, since it doesn't have
the label and END statement.  Let's fix that little oversight:


{--------------------------------------------------------------}
{ Parse and Translate a Main Program }

procedure DoMain;
begin
     Match('b');
     Fin;
     Prolog;
     DoBlock;
     Epilog;
end;
{--------------------------------------------------------------}
.
.
.
{--------------------------------------------------------------}
{ Main Program }

begin
     Init;
     TopDecls;
     DoMain;
end.
{--------------------------------------------------------------}


Note  that  DoProc  and DoMain are not quite symmetrical.  DoProc
uses a call to BeginBlock, whereas DoMain cannot.  That's because
a procedure  is signaled by the keyword PROCEDURE (abbreviated by
a 'p' here), while the main program gets no  keyword  other  than
the BEGIN itself.

And _THAT_ brings up an interesting question: WHY?

If  we  look  at the structure of C programs, we  find  that  all
functions are treated just  alike,  except  that the main program
happens to be identified by its name, "main."  Since  C functions
can appear in any order, the main program can also be anywhere in
the compilation unit.

In Pascal, on the other hand, all variables  and  procedures must
be declared before they're  used,  which  means  that there is no
point putting anything after the  main program ... it could never
be accessed.  The "main program" is not identified at  all, other
than  being that part of the code that  comes  after  the  global
BEGIN.  In other words, if it ain't anything else, it must be the
main program.

This  causes  no  small  amount   of   confusion   for  beginning
programmers, and for big Pascal programs sometimes it's difficult
to  find the beginning of the main program at all.  This leads to
conventions such as identifying it in comments:


     BEGIN { of MAIN }


This  has  always  seemed  to  me to be a bit of a kludge.    The
question comes up:    Why  should  the main program be treated so
much  differently  than  a  procedure?   In fact, now that  we've
recognized that  procedure declarations are just that ... part of
the global declarations ... isn't  the main program just one more
declaration, also?

The answer is yes, and by  treating  it that way, we can simplify
the code and make  it  considerably  more  orthogonal.  I propose
that  we  use  an explicit keyword, PROGRAM, to identify the main
program (Note that this  means  that we can't start the file with
it, as in Pascal).  In this case, our BNF becomes:


     <declaration> ::= <data decl> | <procedure> | <main program>


     <procedure> ::= PROCEDURE <ident> <begin-block>


     <main program> ::= PROGRAM <ident> <begin-block>


The code  also  looks  much  better,  at  least in the sense that
DoMain and DoProc look more alike:


{--------------------------------------------------------------}
{ Parse and Translate a Main Program }

procedure DoMain;
var N: char;
begin
     Match('P');
     N := GetName;
     Fin;
     if InTable(N) then Duplicate(N);
     Prolog;
     BeginBlock;
end;
{--------------------------------------------------------------}
.
.
.
{--------------------------------------------------------------}
{ Parse and Translate Global Declarations }

procedure TopDecls;
begin
     while Look <> '.' do begin
      case Look of
            'v': Decl;
            'p': DoProc;
            'P': DoMain;
          else Abort('Unrecognized Keyword ' + Look);
          end;
          Fin;
     end;
end;


{--------------------------------------------------------------}
{ Main Program }

begin
     Init;
     TopDecls;
     Epilog;
end.
{--------------------------------------------------------------}


Since the declaration of the main program is now within  the loop
of  TopDecl,  that  does  present  some difficulties.  How do  we
ensure that it's  the last thing in the file?  And how do we ever
exit  from  the  loop?  My answer for the second question, as you
can see, was to bring back our old friend the  period.   Once the
parser sees that, we're done.

To  answer  the first question:  it  depends  on  how  far  we're
willing to go to  protect  the programmer from dumb mistakes.  In
the code that I've shown,  there's nothing to keep the programmer
from adding code after  the  main  program  ... even another main
program.   The code will just not be  accessible.    However,  we
COULD access it via a FORWARD statement, which we'll be providing
later. As a  matter  of fact, many assembler language programmers
like to use  the  area  just  after the program to declare large,
uninitialized data blocks, so there may indeed be  some  value in
not  requiring the main program to be last.  We'll leave it as it
is.

If we decide  that  we  should  give the programmer a little more
help than that, it's pretty easy to add some logic to kick us out
of the loop  once  the  main  program  has been processed.  Or we
could  at least flag an error if someone  tries  to  include  two
mains.


CALLING THE PROCEDURE

If you're satisfied that  things  are  working, let's address the
second half of the equation ... the call.

Consider the BNF for a procedure call:


     <proc_call> ::= <identifier>


for an assignment statement, on the other hand, the BNF is:


     <assignment> ::= <identifier> '=' <expression>


At this point we seem to  have  a problem. The two BNF statements
both begin on the  right-hand  side  with the token <identifier>.
How are we supposed to know, when we see the  identifier, whether
we have a procedure call or an assignment statement?   This looks
like a case where our  parser ceases being predictive, and indeed
that's exactly the case.  However, it turns  out  to  be  an easy
problem to fix, since all we have to do is to look at the type of
the identifier, as  recorded  in  the  symbol  table.    As we've
discovered before, a  minor  local  violation  of  the predictive
parsing rule can be easily handled as a special case.

Here's how to do it:


{--------------------------------------------------------------}
{ Parse and Translate an Assignment Statement }

procedure Assignment(Name: char);
begin
     Match('=');
     Expression;
     StoreVar(Name);
end;


{--------------------------------------------------------------}
{ Decide if a Statement is an Assignment or Procedure Call }

procedure AssignOrProc;
var Name: char;
begin
     Name := GetName;
     case TypeOf(Name) of
          ' ': Undefined(Name);
          'v': Assignment(Name);
          'p': CallProc(Name);
          else Abort('Identifier ' + Name +
                                   ' Cannot Be Used Here');
     end;
end;


{--------------------------------------------------------------}
{ Parse and Translate a Block of Statements }

procedure DoBlock;
begin
     while not(Look in ['e']) do begin
          AssignOrProc;
          Fin;
   end;
end;
{--------------------------------------------------------------}


As you can see, procedure Block now calls AssignOrProc instead of
Assignment.  The function of this new procedure is to simply read
the identifier,  determine  its  type,  and  then  call whichever
procedure  is  appropriate  for  that  type.  Since the name  has
already been read,  we  must  pass  it to the two procedures, and
modify Assignment to match.   Procedure CallProc is a simple code
generation routine:


{--------------------------------------------------------------}
{ Call a Procedure }

procedure CallProc(N: char);
begin
     EmitLn('BSR ' + N);
end;
{--------------------------------------------------------------}


Well,  at  this  point  we  have  a  compiler  that can deal with
procedures.    It's  worth  noting  that   procedures   can  call
procedures to any depth.  So even though we  don't  allow  nested
DECLARATIONS, there  is certainly nothing to keep us from nesting
CALLS, just as  we  would  expect  to  do in any language.  We're
getting there, and it wasn't too hard, was it?

Of course, so far we can  only  deal with procedures that have no
parameters.    The  procedures  can  only operate on  the  global
variables  by  their  global names.  So at this point we have the
equivalent of BASIC's GOSUB construct.  Not too bad ... after all
lots of serious programs were written using GOSUBs, but we can do
better, and we will.  That's the next step.


PASSING PARAMETERS

Again, we all know the basic idea of passed parameters, but let's
review them just to be safe.

In general the procedure is given a parameter list, for example

     PROCEDURE FOO(X, Y, Z)

In  the declaration of a procedure,  the  parameters  are  called
formal  parameters, and may be referred to in  the  body  of  the
procedure  by  those  names.    The  names  used for  the  formal
parameters  are  really  arbitrary.    Only  the  position really
counts.  In  the  example  above,  the name 'X' simply means "the
first parameter" wherever it is used.

When a procedure is called,  the "actual parameters" passed to it
are associated  with  the  formal  parameters,  on  a one-for-one
basis.

The BNF for the syntax looks something like this:


     <procedure> ::= PROCEDURE <ident>
                    '(' <param-list> ')' <begin-block>


     <param_list> ::= <parameter> ( ',' <parameter> )* | null

Similarly, the procedure call looks like:


     <proc call> ::= <ident> '(' <param-list> ')'


Note that there is already an implicit decision  built  into this
syntax.  Some languages, such as Pascal and Ada, permit parameter
lists to be  optional.    If  there are no parameters, you simply
leave off the parens  completely.    Other  languages, like C and
Modula 2, require the parens even if the list is empty.  Clearly,
the example we just finished corresponds to the  former  point of
view.  But to tell the truth I prefer the latter.  For procedures
alone, the  decision would seem to favor the "listless" approach.
The statement


     Initialize; ,


standing alone, can only  mean  a procedure call.  In the parsers
we've  been  writing,  we've  made  heavy  use  of  parameterless
procedures, and it would seem a  shame  to have to write an empty
pair of parens for each case.

But later on we're going to  be  using functions, too.  And since
functions  can  appear  in  the  same  places  as  simple  scalar
identifiers, you can't tell the  difference between the two.  You
have to go  back  to  the  declarations  to find out.  Some folks
consider  this to be an advantage.  Their  argument  is  that  an
identifier gets replaced by a value, and what do you care whether
it's done by  substitution  or  by  a function?  But we sometimes
_DO_ care, because the function may be quite time-consuming.  If,
by  writing  a  simple identifier into a given expression, we can
incur a heavy run-time penalty, it seems to  me  we  ought  to be
made aware of it.

Anyway,  Niklaus  Wirth  designed both Pascal and Modula 2.  I'll
give him the benefit of the doubt and assume that  he  had a good
reason for changing the rules the second time around!

Needless to say, it's an easy thing to accomodate either point of
view as we design a language, so this one is strictly a matter of
personal preference.  Do it whichever way you like best.

Before we go any further, let's alter the translator to  handle a
(possibly empty) parameter list.  For now we  won't  generate any
extra code ... just parse the syntax.  The  code  for  processing
the declaration has very  much  the  same  form we've seen before
when dealing with VAR-lists:


{--------------------------------------------------------------}
{ Process the Formal Parameter List of a Procedure }

procedure FormalList;
begin
     Match('(');
     if Look <> ')' then begin
          FormalParam;
          while Look = ',' do begin
               Match(',');
               FormalParam;
          end;
     end;
     Match(')');
end;
{--------------------------------------------------------------}


Procedure DoProc needs to have a line added to call FormalList:


{--------------------------------------------------------------}
{ Parse and Translate a Procedure Declaration }

procedure DoProc;
var N: char;
begin
     Match('p');
     N := GetName;
     FormalList;
     Fin;
     if InTable(N) then Duplicate(N);
     ST[N] := 'p';
     PostLabel(N);
     BeginBlock;
     Return;
end;
{--------------------------------------------------------------}


For now, the code for FormalParam is just a dummy one that simply
skips the parameter name:


{--------------------------------------------------------------}
{ Process a Formal Parameter }

procedure FormalParam;
var Name:  char;
begin
     Name := GetName;
end;
{--------------------------------------------------------------}


For  the actual procedure call, there must  be  similar  code  to
process the actual parameter list:


{--------------------------------------------------------------}
{ Process an Actual Parameter }

procedure Param;
var Name:  char;
begin
     Name := GetName;
end;


{--------------------------------------------------------------}
{ Process the Parameter List for a Procedure  Call }

procedure ParamList;
begin
     Match('(');
     if Look <> ')' then begin
          Param;
          while Look = ',' do begin
               Match(',');
               Param;
          end;
     end;
     Match(')');
end;


{--------------------------------------------------------------}
{ Process a Procedure Call }

procedure CallProc(Name: char);
begin
     ParamList;
     Call(Name);
end;
{--------------------------------------------------------------}


Note  here  that  CallProc  is  no  longer  just  a  simple  code
generation  routine.  It has some structure to  it.    To  handle
this, I've renamed the code  generation routine to just Call, and
called it from within CallProc.

OK, if you'll add all this code to  your  translator  and  try it
out, you'll find that you can indeed parse the syntax properly.
I'll note in  passing  that  there  is _NO_ checking to make sure
that  the  number  (and,  later,  types)  of  formal  and  actual
parameters match up.  In a production compiler, we must of course
do  this.  We'll ignore the issue now if for no other reason than
that the structure of our  symbol table doesn't currently give us
a place to store the necessary information.  Later on, we'll have
a place for that data and we can deal with the issue then.


THE SEMANTICS OF PARAMETERS

So  far we've dealt with the SYNTAX  of  parameter  passing,  and
we've got the parsing mechanisms in place to handle it.  Next, we
have to look at the SEMANTICS, i.e., the actions to be taken when
we encounter parameters. This brings  us  square  up  against the
issue of the different ways parameters can be passed.

There is more than one way to pass a parameter, and the way we do
it can have a  profound  effect on the character of the language.
So  this is another of those areas where I can't just give you my
solution.  Rather, it's important that we spend some time looking
at the  alternatives  so  that  you  can  go another route if you
choose to.

There are two main ways parameters are passed:

     o By value
     o By reference (address)

The differences are best seen in the light of a little history.

The old FORTRAN compilers passed all parameters by reference.  In
other  words, what was actually passed was  the  address  of  the
parameter.  This meant  that  the  called  subroutine was free to
either read or  write  that  parameter,  as often as it chose to,
just  as though it were a global variable.    This  was  actually
quite an efficient  way  to  do  things, and it was pretty simple
since  the  same  mechanism  was  used  in  all cases,  with  one
exception that I'll get to shortly.

There were problems, though.  Many people felt  that  this method
created entirely too much coupling between the  called subroutine
and  its  caller.    In  effect, it gave the subroutine  complete
access to all variables that appeared in the parameter list.

Many  times,  we  didn't want to actually change a parameter, but
only use it as an input.  For example, we  might  pass an element
count  to a subroutine, and wish we could  then  use  that  count
within a DO-loop.    To  avoid  changing the value in the calling
program, we had to make a local copy of the input  parameter, and
operate only on the  copy.    Some  FORTRAN programmers, in fact,
made it a practice to copy ALL parameters except those  that were
to be used as return values.    Needless to say, all this copying
defeated  a  good  bit  of  the  efficiency  associated with  the
approach.

There was, however, an even more insidious problem, which was not
really just the fault of  the "pass by reference" convention, but
a bad convergence of several implementation decisions.

Suppose we have a subroutine:


     SUBROUTINE FOO(X, Y, N)


where N is some kind of  input  count  or flag.  Many times, we'd
like  to be able to pass a literal or even an expression in place
of a variable, such as:


     CALL FOO(A, B, J + 1)


Here the third  parameter  is  not  a  variable, and so it has no
address.    The  earliest FORTRAN compilers did  not  allow  such
things, so we had to resort to subterfuges like:


     K = J + 1
     CALL FOO(A, B, K)


Here again, there was copying required, and the burden was on the
programmer to do it.  Not good.

Later  FORTRAN  implementations  got  rid  of  this  by  allowing
expressions  as  parameters.   What they  did  was  to  assign  a
compiler-generated variable, store the value of the expression in
the variable, and then pass the address of the expression.

So far, so good.    Even if the subroutine mistakenly altered the
anonymous variable, who was to know  or  care?  On the next call,
it would be recalculated anyway.

The  problem  arose  when  someone  decided to make  things  more
efficient.  They  reasoned,  rightly enough, that the most common
kind of "expression" was a single integer value, as in:


     CALL FOO(A, B, 4)


It seemed inefficient to go to the trouble of "computing" such an
integer and storing it  in  a temporary variable, just to pass it
through  the  calling  list.  Since we had to pass the address of
the  thing  anyway,  it seemed to make lots of sense to just pass
the address of the literal integer, 4 in the example above.

To make matters  more  interesting, most compilers, then and now,
identify all literals and store  them  separately  in  a "literal
pool,"  so that we only have to store one  value  for each unique
literal.    That  combination  of  design  decisions:     passing
expressions, optimization for literals as a special case, and use
of a literal pool, is what led to disaster.

To  see  how  it works, imagine that we call subroutine FOO as in
the example above, passing  it  a literal 4.  Actually, what gets
passed  is  the  address of the literal 4, which is stored in the
literal pool.   This address corresponds to the formal parameter,
K, in the subroutine itself.

Now suppose that, unbeknownst to the  programmer,  subroutine FOO
actually modifies K to be, say, -7.  Suddenly, that literal  4 in
the literal pool  gets  CHANGED,  to  a  -7.  From then on, every
expression that uses  a  4  and  every subroutine that passes a 4
will be using the value of -7 instead!  Needless to say, this can
lead to some  bizarre  and difficult-to-find behavior.  The whole
thing gave  the concept of pass-by-reference a bad name, although
as we have seen, it was really a combination of  design decisions
that led to the problem.

In spite of  the  problem,  the  FORTRAN  approach  had  its good
points.    Chief  among them is the fact that we  don't  have  to
support  multiple  mechanisms.    The  same  scheme,  passing the
address of  the argument, works for EVERY case, including arrays.
So the size of the compiler can be reduced.

Partly because of the FORTRAN  gotcha, and partly just because of
the reduced coupling involved, modern languages  like  C, Pascal,
Ada, and Modula 2 generally pass scalars by value.

This means that the value of the scalar is COPIED into a separate
value  used only for the call.  Since the value passed is a copy,
the called procedure can use it as a local variable and modify it
any way it likes.  The value in the caller will not be changed.

It may seem at first that  this  is a bit inefficient, because of
the need to copy the parameter.  But remember that we're going to
have  to  fetch SOME value to pass  anyway,  whether  it  be  the
parameter  itself  or  an address for it.  Inside the subroutine,
using  pass-by-value  is  definitely  more  efficient,  since  we
eliminate one level of indirection.  Finally, we saw earlier that
with  FORTRAN,  it  was often necessary to make copies within the
subroutine anyway, so pass-by-value reduces the  number  of local
variables.  All in all, pass-by-value is better.

Except for one small little detail:  if all parameters are passed
by value, there is no way for a called to  procedure  to return a
result to its caller!  The parameter passed is NOT altered in the
caller,  only  in  the called procedure.  Clearly, that won't get
the job done.

There  have  been   two   answers  to  this  problem,  which  are
equivalent.   In Pascal, Wirth provides for VAR parameters, which
are  passed-by-reference.    What a VAR parameter is, in fact, is
none other than our old friend the FORTRAN parameter, with  a new
name and paint job for disguise.  Wirth neatly  gets  around  the
"changing a literal"  problem  as  well  as  the  "address  of an
expression" problem, by  the  simple expedient of allowing only a
variable to be the actual parameter.  In other  words,  it's  the
same restriction that the earliest FORTRANs imposed.

C does the same thing, but explicitly.  In  C,  _ALL_  parameters
are passed  by  value.    One  kind  of variable that C supports,
however, is the pointer.  So  by  passing a pointer by value, you
in effect pass what it points to by reference.  In some ways this
works even better yet,  because  even  though  you can change the
variable  pointed to all you like, you  still  CAN'T  change  the
pointer itself.  In a function such as strcpy, for example, where
the  pointers are incremented as the string  is  copied,  we  are
really only incrementing copies of the pointers, so the values of
those  pointers in the calling procedure  still  remain  as  they
were.  To modify a  pointer,  you  must  pass  a  pointer  to the
pointer.

Since we are simply  performing  experiments  here, we'll look at
BOTH pass-by-value and pass-by-reference.    That  way,  we'll be
able to use either one as we need to.  It's worth mentioning that
it's  going  to  be tough to use the C approach to pointers here,
since a pointer is a different type and we haven't  studied types
yet!


PASS-BY-VALUE

Let's just try some simple-minded  things and see where they lead
us.    Let's begin with the pass-by-value  case.    Consider  the
procedure call:


     FOO(X, Y)


Almost the only reasonable way to pass the data  is  through  the
CPU stack.  So the code we'd like  to  see  generated  might look
something like this:


     MOVE X(PC),-(SP)    ; Push X
     MOVE Y(PC),-(SP)    ; Push Y
     BSR FOO             ; Call FOO


That certainly doesn't seem too complex!

When the BSR is executed, the CPU pushes the return  address onto
the stack and jumps to FOO.    At  this point the stack will look
like this:

          .
          .
          Value of X (2 bytes)
          Value of Y (2 bytes)
  SP -->  Return Address (4 bytes)


So the values of  the  parameters  have  addresses that are fixed
offsets from the stack pointer.  In this  example,  the addresses
are:


     X:  6(SP)
     Y:  4(SP)


Now consider what the called procedure might look like:


     PROCEDURE FOO(A, B)
     BEGIN
          A = B
     END

(Remember, the names  of  the formal parameters are arbitrary ...
only the positions count.)

The desired output code might look like:


     FOO: MOVE 4(SP),D0
          MOVE D0,6(SP)
          RTS


Note that, in order to address the formal parameters, we're going
to have to know  which  position they have in the parameter list.
This means some changes to the symbol table stuff.  In  fact, for
our single-character case it's best to just create  a  new symbol
table for the formal parameters.

Let's begin by declaring a new table:


     var Params: Array['A'..'Z'] of integer;


We  also  will  need to keep track of how many parameters a given
procedure has:


     var NumParams: integer;


And we need to initialize the new table.  Now, remember  that the
formal parameter list  will  be different for each procedure that
we process, so we'll need to initialize that table anew  for each
procedure.  Here's the initializer:


{--------------------------------------------------------------}
{ Initialize Parameter Table to Null }

procedure ClearParams;
var i: char;
begin
     for i := 'A' to 'Z' do
          Params[i] := 0;
     NumParams := 0;
end;
{--------------------------------------------------------------}


We'll put a call to this procedure in Init, and  also  at the end
of DoProc:


{--------------------------------------------------------------}
{ Initialize }

procedure Init;
var i: char;
begin
     GetChar;
     SkipWhite;
     for i := 'A' to 'Z' do
          ST[i] := ' ';
     ClearParams;
end;
{--------------------------------------------------------------}
.
.
.
{--------------------------------------------------------------}
{ Parse and Translate a Procedure Declaration }

procedure DoProc;
var N: char;
begin
     Match('p');
     N := GetName;
     FormalList;
     Fin;
     if InTable(N) then Duplicate(N);
     ST[N] := 'p';
     PostLabel(N);
     BeginBlock;
     Return;
     ClearParams;
end;
{--------------------------------------------------------------}


Note that the call  within  DoProc ensures that the table will be
clear when we're in the main program.


OK, now  we  need  a  few procedures to work with the table.  The
next few functions are  essentially  copies  of  InTable, TypeOf,
etc.:


{--------------------------------------------------------------}
{ Find the Parameter Number }

function ParamNumber(N: char): integer;
begin
     ParamNumber := Params[N];
end;


{--------------------------------------------------------------}
{ See if an Identifier is a Parameter }

function IsParam(N: char): boolean;
begin
     IsParam := Params[N] <> 0;
end;


{--------------------------------------------------------------}
{ Add a New Parameter to Table }

procedure AddParam(Name: char);
begin
     if IsParam(Name) then Duplicate(Name);
     Inc(NumParams);
     Params[Name] := NumParams;
end;
{--------------------------------------------------------------}


Finally, we need some code generation routines:


{--------------------------------------------------------------}
{ Load a Parameter to the Primary Register }

procedure LoadParam(N: integer);
var Offset: integer;
begin
     Offset := 4 + 2 * (NumParams - N);
     Emit('MOVE ');
     WriteLn(Offset, '(SP),D0');
end;


{--------------------------------------------------------------}
{ Store a Parameter from the Primary Register }

procedure StoreParam(N: integer);
var Offset: integer;
begin
     Offset := 4 + 2 * (NumParams - N);
     Emit('MOVE D0,');
     WriteLn(Offset, '(SP)');
end;


{--------------------------------------------------------------}
{ Push The Primary Register to the Stack }

procedure Push;
begin
     EmitLn('MOVE D0,-(SP)');
end;
{--------------------------------------------------------------}


( The last routine is one we've seen  before,  but  it  wasn't in
this vestigial version of the program.)

With those preliminaries in place, we're ready to  deal  with the
semantics of procedures with calling lists (remember, the code to
deal with the syntax is already in place).

Let's begin by processing a formal parameter.  All we have  to do
is to add each parameter to the parameter symbol table:


{--------------------------------------------------------------}
{ Process a Formal Parameter }

procedure FormalParam;
begin
     AddParam(GetName);
end;
{--------------------------------------------------------------}


Now, what about dealing with a formal parameter  when  it appears
in the body of the procedure?  That takes a little more work.  We
must first determine that it IS a formal parameter.  To  do this,
I've written a modified version of TypeOf:


{--------------------------------------------------------------}
{ Get Type of Symbol }

function TypeOf(n: char): char;
begin
     if IsParam(n) then
          TypeOf := 'f'
     else
          TypeOf := ST[n];
end;
{--------------------------------------------------------------}


(Note that, since  TypeOf  now  calls  IsParam, it may need to be
relocated in your source.)

We also must modify AssignOrProc to deal with this new type:


{--------------------------------------------------------------}
{ Decide if a Statement is an Assignment or Procedure Call }

procedure AssignOrProc;
var Name: char;
begin
     Name := GetName;
     case TypeOf(Name) of
          ' ': Undefined(Name);
          'v', 'f': Assignment(Name);
          'p': CallProc(Name);
          else  Abort('Identifier ' + Name +  '  Cannot  Be  Used
Here');
     end;
end;
{--------------------------------------------------------------}


Finally,  the  code  to process an assignment  statement  and  an
expression must be extended:


{--------------------------------------------------------------}
{ Parse and Translate an Expression }
{ Vestigial Version }

procedure Expression;
var Name: char;
begin
     Name := GetName;
     if IsParam(Name) then
          LoadParam(ParamNumber(Name))
     else
          LoadVar(Name);
end;


{--------------------------------------------------------------}
{ Parse and Translate an Assignment Statement }

procedure Assignment(Name: char);
begin
     Match('=');
     Expression;
     if IsParam(Name) then
          StoreParam(ParamNumber(Name))
     else
          StoreVar(Name);
end;
{--------------------------------------------------------------}


As you can see, these procedures will treat  every  variable name
encountered as either a  formal  parameter  or a global variable,
depending  on  whether  or not it appears in the parameter symbol
table.   Remember  that  we  are  using  only a vestigial form of
Expression.  In the  final  program,  the  change shown here will
have to be added to Factor, not Expression.

The rest is easy.  We need only add the  semantics  to the actual
procedure call, which we can do with one new line of code:


{--------------------------------------------------------------}
{ Process an Actual Parameter }

procedure Param;
begin
     Expression;
     Push;
end;
{--------------------------------------------------------------}


That's  it.  Add these changes to your program and give it a try.
Try declaring one or two procedures, each with a formal parameter
list.  Then do some assignments, using combinations of global and
formal  parameters.    You  can  call one procedure  from  within
another, but you cannot DECLARE a nested procedure.  You can even
pass formal parameters from one procedure to another.  If  we had
the  full  syntax  of the language here, you'd also be able to do
things like read  or  write  formal  parameters  or  use  them in
complicated expressions.


WHAT'S WRONG?

At this point, you might be thinking: Surely there's more to this
than a few pushes and  pops.    There  must  be  more  to passing
parameters than this.

You'd  be  right.    As  a  matter  of fact, the code that  we're
generating here leaves a lot to be desired in several respects.

The most glaring oversight is that it's wrong!   If  you'll  look
back at the code for a procedure call, you'll see that the caller
pushes each actual parameter onto the stack before  it  calls the
procedure.  The  procedure  USES that information, but it doesn't
change the stack  pointer.    That  means that the stuff is still
there when we return. SOMEBODY needs to clean up  the  stack,  or
we'll soon be in very hot water!

Fortunately,  that's  easily fixed.  All we  have  to  do  is  to
increment the stack pointer when we're finished.

Should  we  do  that  in  the  calling  program,  or  the  called
procedure?   Some folks let the called  procedure  clean  up  the
stack,  since  that  requires less code to be generated per call,
and since the procedure, after  all,  knows  how  many parameters
it's got.   But  that  means  that  it must do something with the
return address so as not to lose it.

I prefer letting  the  caller  clean  up, so that the callee need
only execute a return.  Also, it seems a bit more balanced, since
the caller is  the  one  who  "messed  up" the stack in the first
place.  But  THAT  means  that  the caller must remember how many
items  it  pushed.    To  make  things  easy, I've  modified  the
procedure  ParamList to be a function  instead  of  a  procedure,
returning the number of bytes pushed:


{--------------------------------------------------------------}
{ Process the Parameter List for a Procedure  Call }

function ParamList: integer;
var N: integer;
begin
     N := 0;
     Match('(');
     if Look <> ')' then begin
          Param;
          inc(N);
          while Look = ',' do begin
               Match(',');
               Param;
               inc(N);
          end;
     end;
     Match(')');
     ParamList := 2 * N;
end;
{--------------------------------------------------------------}


Procedure CallProc then uses this to clean up the stack:


{--------------------------------------------------------------}
{ Process a Procedure Call }

procedure CallProc(Name: char);
var N: integer;
begin
     N := ParamList;
     Call(Name);
     CleanStack(N);
end;
{--------------------------------------------------------------}


Here I've created yet another code generation procedure:


{--------------------------------------------------------------}
{ Adjust the Stack Pointer Upwards by N Bytes }

procedure CleanStack(N: integer);
begin
     if N > 0 then begin
          Emit('ADD #');
          WriteLn(N, ',SP');
     end;
end;
{--------------------------------------------------------------}


OK, if you'll add this code to your compiler, I think you'll find
that the stack is now under control.

The next problem has to do with our way of addressing relative to
the stack pointer.  That works fine in our simple examples, since
with our rudimentary  form  of expressions nobody else is messing
with the stack.  But consider a different example as simple as:


     PROCEDURE FOO(A, B)
     BEGIN
          A = A + B
     END


The code generated by a simple-minded parser might be:


     FOO: MOVE 6(SP),D0       ; Fetch A
          MOVE D0,-(SP)       ; Push it
          MOVE 4(SP),D0       ; Fetch B
          ADD (SP)+,D0        ; Add A
          MOVE D0,6(SP)       : Store A
          RTS


This  would  be  wrong.  When we push the first argument onto the
stack, the offsets for the two formal parameters are no  longer 4
and 6, but are 6 and 8.  So the second fetch would fetch A again,
not B.

This is not  the  end of the world.  I think you can see that all
we really have to do is to alter the offset every  time  we  do a
push, and that in fact is what's done if the  CPU  has no support
for other methods.

Fortunately,   though,   the   68000   does  have  such  support.
Recognizing that this CPU  would  be  used  a lot with high-order
language compilers, Motorola decided to  add  direct  support for
this kind of thing.

The problem, as you  can  see, is that as the procedure executes,
the stack  pointer  bounces  up  and  down,  and so it becomes an
awkward  thing  to  use  as  a  reference  to access  the  formal
parameters.  The solution is to define some _OTHER_ register, and
use  it instead.  This register is typically  set  equal  to  the
original stack pointer, and is called the frame pointer.

The  68000 instruction set LINK lets you  declare  such  a  frame
pointer, and  sets  it  equal  to  the  stack pointer, all in one
instruction.  As a matter of  fact,  it does even more than that.
Since this register may have been in use for  something  else  in
the calling procedure, LINK also pushes the current value of that
register onto the stack.  It  can  also  add a value to the stack
pointer, to make room for local variables.

The complement of LINK is UNLK, which simply  restores  the stack
pointer and pops the old value back into the register.

Using these two  instructions,  the code for the previous example
becomes:


     FOO: LINK A6,#0
          MOVE 10(A6),D0      ; Fetch A
          MOVE D0,-(SP)       ; Push it
          MOVE 8(A6),D0       ; Fetch B
          ADD (SP)+,D0        ; Add A
          MOVE D0,10(A6)      : Store A
          UNLK A6
          RTS


Fixing the compiler to generate this code is a lot easier than it
is  to  explain  it.    All we need to do is to modify  the  code
generation created by DoProc.  Since that makes the code a little
more than one line, I've created new procedures to deal  with it,
paralleling the Prolog and Epilog procedures called by DoMain:


{--------------------------------------------------------------}
{ Write the Prolog for a Procedure }

procedure ProcProlog(N: char);
begin
     PostLabel(N);
     EmitLn('LINK A6,#0');
end;


{--------------------------------------------------------------}
{ Write the Epilog for a Procedure }

procedure ProcEpilog;
begin
     EmitLn('UNLK A6');
     EmitLn('RTS');
end;
{--------------------------------------------------------------}


Procedure DoProc now just calls these:


{--------------------------------------------------------------}
{ Parse and Translate a Procedure Declaration }

procedure DoProc;
var N: char;
begin
     Match('p');
     N := GetName;
     FormalList;
     Fin;
     if InTable(N) then Duplicate(N);
     ST[N] := 'p';
     ProcProlog(N);
     BeginBlock;
     ProcEpilog;
     ClearParams;
end;
{--------------------------------------------------------------}


Finally, we need to  change  the  references  to SP in procedures
LoadParam and StoreParam:


{--------------------------------------------------------------}
{ Load a Parameter to the Primary Register }

procedure LoadParam(N: integer);
var Offset: integer;
begin
     Offset := 8 + 2 * (NumParams - N);
     Emit('MOVE ');
     WriteLn(Offset, '(A6),D0');
end;


{--------------------------------------------------------------}
{ Store a Parameter from the Primary Register }

procedure StoreParam(N: integer);
var Offset: integer;
begin
     Offset := 8 + 2 * (NumParams - N);
     Emit('MOVE D0,');
     WriteLn(Offset, '(A6)');
end;
{--------------------------------------------------------------}


(Note that the Offset computation  changes to allow for the extra
push of A6.)

That's all it takes.  Try this out and see how you like it.

At this point we  are  generating  some  relatively nice code for
procedures and procedure calls.  Within the limitation that there
are no local variables  (yet)  and  that  no procedure nesting is
allowed, this code is just what we need.

There is still just one little small problem remaining:


     WE HAVE NO WAY TO RETURN RESULTS TO THE CALLER!


But  that,  of course, is not a  limitation  of  the  code  we're
generating, but  one  inherent  in  the  call-by-value  protocol.
Notice that we CAN use formal parameters in any  way  inside  the
procedure.  We  can  calculate  new  values for them, use them as
loop counters (if we had loops, that is!), etc.   So  the code is
doing what it's supposed to.   To  get over this last problem, we
need to look at the alternative protocol.


CALL-BY-REFERENCE

This  one is easy, now that we have  the  mechanisms  already  in
place.    We  only  have  to  make  a few  changes  to  the  code
generation.  Instead of  pushing  a value onto the stack, we must
push an address.  As it turns out, the 68000 has  an instruction,
PEA, that does just that.

We'll be  making  a  new  version  of  the test program for this.
Before we do anything else,

>>>> MAKE A COPY <<<<

of  the program as it now stands, because  we'll  be  needing  it
again later.

Let's begin by looking at the code we'd like to see generated for
the new case. Using the same example as before, we need the call


     FOO(X, Y)


to be translated to:


     PEA X(PC)           ; Push the address of X
     PEA Y(PC)           ; Push Y the address of Y
     BSR FOO             ; Call FOO


That's a simple matter of a slight change to Param:


{--------------------------------------------------------------}
{ Process an Actual Parameter }

procedure Param;
begin
     EmitLn('PEA ' + GetName + '(PC)');
end;
{--------------------------------------------------------------}


(Note that with pass-by-reference, we can't  have  expressions in
the calling list, so Param can just read the name directly.)

At the other end, the references to the formal parameters must be
given one level of indirection:


     FOO: LINK A6,#0
          MOVE.L 12(A6),A0    ; Fetch the address of A
          MOVE (A0),D0        ; Fetch A
          MOVE D0,-(SP)       ; Push it
          MOVE.L 8(A6),A0     ; Fetch the address of B
          MOVE (A0),D0        ; Fetch B
          ADD (SP)+,D0        ; Add A
          MOVE.L 12(A6),A0    ; Fetch the address of A
          MOVE D0,(A0)        : Store A
          UNLK A6
          RTS


All  of  this  can  be   handled  by  changes  to  LoadParam  and
StoreParam:


{--------------------------------------------------------------}
{ Load a Parameter to the Primary Register }

procedure LoadParam(N: integer);
var Offset: integer;
begin
     Offset := 8 + 4 * (NumParams - N);
     Emit('MOVE.L ');
     WriteLn(Offset, '(A6),A0');
     EmitLn('MOVE (A0),D0');
end;


{--------------------------------------------------------------}
{ Store a Parameter from the Primary Register }

procedure StoreParam(N: integer);
var Offset: integer;
begin
     Offset := 8 + 4 * (NumParams - N);
     Emit('MOVE.L ');
     WriteLn(Offset, '(A6),A0');
     EmitLn('MOVE D0,(A0)');
end;
{--------------------------------------------------------------}

To  get  the  count  right,  we  must  also  change  one line  in
ParamList:


     ParamList := 4 * N;


That  should  do it.  Give it a try and see  if  it's  generating
reasonable-looking code.  As  you  will  see,  the code is hardly
optimal,  since  we  reload  the  address register every  time  a
parameter  is  needed.    But  that's  consistent  with our  KISS
approach  here,  of  just being sure to generate code that works.
We'll  just  make  a  little  note here, that here's yet  another
candidate for optimization, and press on.

Now we've learned to process parameters  using  pass-by-value and
pass-by-reference.  In the real world, of course, we'd like to be
able  to  deal  with BOTH methods.  We can't do that yet, though,
because we have not yet had a session on types,  and  that has to
come first.

If  we can only have ONE method, then of course it has to be  the
good ol' FORTRAN method of  pass-by-reference,  since  that's the
only way procedures can ever return values to their caller.

This, in fact, will be one of the differences  between  TINY  and
KISS.  In the next version of TINY,  we'll  use pass-by-reference
for all parameters.  KISS will support both methods.


LOCAL VARIABLES

So  far,  we've  said  nothing  about  local  variables, and  our
definition of procedures doesn't allow  for  them.    Needless to
say, that's a big gap in our language, and one  that  needs to be
corrected.

Here again we are faced with a choice: Static or dynamic storage?

In those  old FORTRAN programs, local variables were given static
storage just like global ones.  That is, each local  variable got
a  name  and  allocated address, like any other variable, and was
referenced by that name.

That's easy for us to do, using the allocation mechanisms already
in place.  Remember,  though,  that local variables can have  the
same  names as global ones.  We need to somehow deal with that by
assigning unique names for these variables.

The characteristic of static storage, of course, is that the data
survives  a procedure call and return.   When  the  procedure  is
called  again,  the  data will still be there.  That  can  be  an
advantage in some applications.    In the FORTRAN days we used to
do tricks like initialize a flag, so that you could tell when you
were entering a  procedure  for  the  first time and could do any
one-time initialization that needed to be done.

Of  course,  the  same  "feature"  is also what  makes  recursion
impossible with static storage.  Any new call to a procedure will
overwrite the data already in the local variables.

The alternative is dynamic storage, in which storage is allocated
on the stack just as for passed parameters.    We  also  have the
mechanisms  already  for  doing this.  In fact, the same routines
that  deal with passed (by value) parameters  on  the  stack  can
easily deal  with  local  variables  as  well  ... the code to be
generated  is  the  same.  The purpose of the offset in the 68000
LINK instruction is there just for that reason:  we can use it to
adjust the stack  pointer  to  make  room  for  locals.   Dynamic
storage, of course, inherently supports recursion.

When  I  first  began  planning  TINY,  I  must  admit  to  being
prejudiced in favor of static  storage.    That's  simply because
those old FORTRAN  programs  were pretty darned efficient ... the
early FORTRAN compilers  produced  a quality of code that's still
rarely matched by modern compilers.   Even today, a given program
written  in  FORTRAN  is likely to outperform  the  same  program
written in C or Pascal, sometimes  by  wide margins. (Whew!  Am I
going to hear about THAT statement!)

I've always supposed that the reason had to do with the  two main
differences  between  FORTRAN  implementations  and  the  others:
static  storage  and  pass-by-reference.    I  know  that dynamic
storage  supports  recursion,  but it's always seemed to me a bit
peculiar to be willing to accept slower code in the 95%  of cases
that don't need recursion, just to get that feature when you need
it.  The idea is that, with static storage, you can  use absolute
addressing  rather than indirect addressing, which should  result
in faster code.

More recently, though, several folks  have pointed out to me that
there really is no performance  penalty  associated  with dynamic
storage.  With the 68000, for example, you shouldn't use absolute
addressing  anyway  ...  most  operating systems require position
independent code.  And the 68000 instruction

     MOVE 8(A6),D0

has exactly the same timing as

     MOVE X(PC),D0.

So  I'm  convinced,  now, that there is no good reason NOT to use
dynamic storage.

Since this use of local variables fits so well into the scheme of
pass-by-value  parameters,  we'll  use   that   version   of  the
translator to illustrate it. (I _SURE_ hope you kept a copy!)

The general idea is to keep track of how  many  local  parameters
there  are.    Then we use the integer in the LINK instruction to
adjust the stack pointer downward to make room for them.   Formal
parameters are  addressed  as  positive  offsets  from  the frame
pointer, and locals as negative offsets.  With a  little  bit  of
work, the same procedures we've  already created can take care of
the whole thing.

Let's start by creating a new variable, Base:


     var Base: integer;

We'll use this  variable,  instead of NumParams, to compute stack
offsets.  That means changing  the two references to NumParams in
LoadParam and StoreParam:


{--------------------------------------------------------------}
{ Load a Parameter to the Primary Register }

procedure LoadParam(N: integer);
var Offset: integer;
begin
     Offset := 8 + 2 * (Base - N);
     Emit('MOVE ');
     WriteLn(Offset, '(A6),D0');
end;


{--------------------------------------------------------------}
{ Store a Parameter from the Primary Register }

procedure StoreParam(N: integer);
var Offset: integer;
begin
     Offset := 8 + 2 * (Base - N);
     Emit('MOVE D0,');
     WriteLn(Offset, '(A6)');
end;
{--------------------------------------------------------------}


The idea is that the value of Base will be  frozen  after we have
processed the formal parameters, and  won't  increase  further as
the new, local variables, are inserted in the symbol table.  This
is taken care of at the end of FormalList:


{--------------------------------------------------------------}
{ Process the Formal Parameter List of a Procedure }

procedure FormalList;
begin
     Match('(');
     if Look <> ')' then begin
          FormalParam;
          while Look = ',' do begin
               Match(',');
               FormalParam;
          end;
     end;
     Match(')');
     Fin;
     Base := NumParams;
     NumParams := NumParams + 4;
end;
{--------------------------------------------------------------}


(We add four words to make allowances for the return  address and
old frame pointer, which end up between the formal parameters and
the locals.)

About all we  need  to  do  next  is to install the semantics for
declaring local variables into the parser.  The routines are very
similar to Decl and TopDecls:


{--------------------------------------------------------------}
{ Parse and Translate a Local Data Declaration }

procedure LocDecl;
var Name: char;
begin
   Match('v');
     AddParam(GetName);
     Fin;
end;


{--------------------------------------------------------------}


{ Parse and Translate Local Declarations }

function LocDecls: integer;
var n: integer;
begin
     n := 0;
     while Look = 'v' do begin
          LocDecl;
          inc(n);
     end;
     LocDecls := n;
end;
{--------------------------------------------------------------}


Note that LocDecls is a  FUNCTION, returning the number of locals
to DoProc.

Next, we modify DoProc to use this information:


{--------------------------------------------------------------}
{ Parse and Translate a Procedure Declaration }

procedure DoProc;
var N: char;
      k: integer;
begin
     Match('p');
     N := GetName;
     if InTable(N) then Duplicate(N);
     ST[N] := 'p';
     FormalList;
     k := LocDecls;
     ProcProlog(N, k);
     BeginBlock;
     ProcEpilog;
     ClearParams;
end;
{--------------------------------------------------------------}


(I've  made   a  couple  of  changes  here  that  weren't  really
necessary.  Aside from rearranging things a bit, I moved the call
to  Fin  to  within FormalList, and placed one inside LocDecls as
well.   Don't forget to put one at the end of FormalList, so that
we're together here.)

Note the change in the call  to  ProcProlog.  The new argument is
the number of WORDS (not bytes) to allocate space  for.    Here's
the new version of ProcProlog:


{--------------------------------------------------------------}
{ Write the Prolog for a Procedure }

procedure ProcProlog(N: char; k: integer);
begin
     PostLabel(N);
     Emit('LINK A6,#');
     WriteLn(-2 * k)
end;
{--------------------------------------------------------------}


That should do it.  Add these changes and see how they work.


CONCLUSION

At this point you know  how to compile procedure declarations and
procedure calls,  with  parameters  passed  by  reference  and by
value.  You can also handle local variables.  As you can see, the
hard part is not  in  providing  the  mechanisms, but in deciding
just which mechanisms to use.  Once we make these  decisions, the
code to translate the constructs is really not that difficult.
I didn't  show  you  how  to  deal  with the combination of local
parameters   and  pass-by-reference  parameters,  but  that's   a
straightforward extension to  what  you've already seen.  It just
gets a little more messy, that's all, since we  need  to  support
both mechanisms instead of just one at a  time.    I'd  prefer to
save  that  one  until after we've  dealt  with  ways  to  handle
different variable types.

That will be the next installment, which will be coming soon to a
Forum near you.  See you then.




































                     LET'S BUILD A COMPILER!

                                By

                     Jack W. Crenshaw, Ph.D.

                           26 May 1990


                         Part XIV: TYPES





INTRODUCTION

In the  last installment (Part XIII: PROCEDURES) I mentioned that
in that part and this one,  we  would cover the two features that
tend  to  separate  the toy language from a real, usable one.  We
covered  procedure  calls  in that installment.  Many of you have
been  waiting patiently, since August '89, for  me  to  drop  the
other shoe.  Well, here it is.

In this installment, we'll talk  about how to deal with different
data types.  As I did in the last segment, I will NOT incorporate
these  features directly into the TINY  compiler  at  this  time.
Instead, I'll be using the same approach that has worked  so well
for  us  in the past: using only  fragments  of  the  parser  and
single-character  tokens.    As  usual,  this  allows  us to  get
directly to the  heart  of  the  matter  without  having  to wade
through a lot of  unnecessary  code.  Since the major problems in
dealing with multiple types occur in  the  arithmetic operations,
that's where we'll concentrate our focus.

A  few words of warning:  First, there are some types that I will
NOT  be  covering in this installment.   Here  we  will  ONLY  be
talking about the simple, predefined types.  We  won't  even deal
with arrays, pointers or strings  in  this  installment;  I'll be
covering them in the next few.

Second, we also will not discuss user-defined types.    That will
not come until  much  later,  for  the simple reason that I still
haven't convinced myself  that  user-defined  types  belong  in a
language named KISS.  In later installments, I do intend to cover
at least the general  concepts  of  user-defined  types, records,
etc., just so that the series  will  be complete.  But whether or
not they will be included as part of KISS is still an open issue.
I am open to comments or suggestions on this question.

Finally,  I  should  warn you: what we are about to  do  CAN  add
considerable  extra  complication  to  both  the  parser  and the
generated  code.    Handling  variables  of  different  types  is
straightforward enough.  The complexity  comes  in  when  you add
rules about conversion between types.  In general,  you  can make
the  compiler  as  simple or as complex as you choose to make it,
depending upon the  way  you  define  the  type-conversion rules.
Even if you decide not to allow ANY type conversions (as  in Ada,
for example) the problem is still there, and is  built  into  the
mathematics.  When  you  multiply two short numbers, for example,
you can get a long result.

I've approached this problem very  carefully,  in  an  attempt to
Keep It Simple.  But we can't avoid the complexity entirely.   As
has so often has happened, we end up having to trade code quality
against complexity,  and  as  usual  I  will  tend to opt for the
simplest approach.


WHAT'S COMING NEXT?

Before diving into the tutorial, I think you'd like to know where
we are going  from  here  ...  especially since it's been so long
since the last installment.

I have not been idle in  the  meantime.   What I've been doing is
reorganizing  the  compiler  itself into Turbo Units.  One of the
problems I've encountered is that  as we've covered new areas and
thereby added features to  the  TINY  compiler, it's been getting
longer and longer.  I realized a couple of installments back that
this was causing trouble, and that's why I've gone back  to using
only compiler fragments for  the  last  installment and this one.
The problem is that it just  seems  dumb to have to reproduce the
code  for,  say,  processing  boolean  exclusive  OR's,  when the
subject of the discussion is parameter passing.

The obvious way  to have our cake and eat it, too, is to break up
the compiler into separately compilable  modules,  and  of course
the Turbo Unit is an ideal  vehicle  for doing this.  This allows
us to hide some fairly complex code (such as the  full arithmetic
and boolean expression parsing) into a single unit, and just pull
it in whenever it's needed.  In that way, the only code I'll have
to reproduce in these installments will be the code that actually
relates to the issue under discussion.

I've  also  been  toying with Turbo 5.5, which of course includes
the Borland object-oriented  extensions  to  Pascal.    I haven't
decided whether to make use of these features,  for  two reasons.
First of all, many of you who have been following this series may
still not have 5.5, and I certainly don't want to force anyone to
have to go out and  buy  a  new  compiler  just  to  complete the
series.  Secondly, I'm not convinced that the O-O extensions have
all that much value for this application.  We've been having some
discussions  about that in CompuServe's CLM  forum,  and  so  far
we've  not found any compelling reason  to  use  O-O  constructs.
This is another of those areas where I could  use  some  feedback
from you readers.  Anyone want to vote for Turbo 5.5 and O-O?

In any case, after  the  next few installments in the series, the
plan  is  to  upload to you a complete set of Units, and complete
functioning compilers as  well.    The  plan, in fact, is to have
THREE compilers:  One for  a single-character version of TINY (to
use  for  our  experiments), one for TINY and one for KISS.  I've
pretty much isolated the differences between TINY and KISS, which
are these:

   o TINY will support only two data types: The character and the
     16-bit  integer.    I may also  try  to  do  something  with
     strings, since  without  them  a  compiler  would  be pretty
     useless.   KISS will support all  the  usual  simple  types,
     including arrays and even floating point.

   o TINY will only have two control constructs, the  IF  and the
     WHILE.  KISS will  support  a  very  rich set of constructs,
     including one we haven't discussed here before ... the CASE.

   o KISS will support separately compilable modules.

One caveat: Since I still don't know much  about  80x86 assembler
language, all these compiler modules  will  still  be  written to
support 68000 code.  However, for the programs I plan  to upload,
all the code generation  has  been  carefully encapsulated into a
single unit, so that any enterprising student should  be  able to
easily retarget to any other processor.  This task is "left as an
exercise for the  student."    I'll  make an offer right here and
now:  For the person who provides us the first robust retarget to
80x86, I will be happy to discuss shared copyrights and royalties
from the book that's upcoming.

But enough talk.  Let's get on with  the  study  of  types.  As I
said  earlier,  we'll  do  this  one  as  we  did  in   the  last
installment:  by  performing experiments  using  single-character
tokens.


THE SYMBOL TABLE

It should be apparent that, if we're going to deal with variables
of different types, we're going  to need someplace to record what
those  types are.  The obvious vehicle for  that  is  the  symbol
table, and we've already  used  it  that  way to distinguish, for
example,   between  local  and  global  variables,  and   between
variables and procedures.

The  symbol  table   structure  for  single-character  tokens  is
particularly simple, and we've used  it several times before.  To
deal with it, we'll steal some procedures that we've used before.

First, we need to declare the symbol table itself:


{--------------------------------------------------------------}
{ Variable Declarations }

var Look: char;              { Lookahead Character }

    ST: Array['A'..'Z'] of char;   {  *** ADD THIS LINE ***}
{--------------------------------------------------------------}


Next, we need to make sure it's initialized as part  of procedure
Init:


{--------------------------------------------------------------}
{ Initialize }

procedure Init;
var i: char;
begin
   for i := 'A' to 'Z' do
      ST[i] := '?';
   GetChar;
end;
{--------------------------------------------------------------}


We don't really need  the  next procedure, but it will be helpful
for debugging.  All it does is to dump the contents of the symbol
table:


{--------------------------------------------------------------}
{ Dump the Symbol Table }

procedure DumpTable;
var i: char;
begin
   for i := 'A' to 'Z' do
      WriteLn(i, ' ', ST[i]);
end;
{--------------------------------------------------------------}


It really doesn't matter much where you put this procedure  ... I
plan to cluster all the symbol table routines together, so  I put
mine just after the error reporting procedures.

If  you're  the  cautious type (as I am), you might want to begin
with a test program that does nothing but initializes, then dumps
the table.  Just to be sure that we're all on the same wavelength
here, I'm reproducing the entire program below, complete with the
new  procedures.  Note that this  version  includes  support  for
white space:


{--------------------------------------------------------------}
program Types;

{--------------------------------------------------------------}
{ Constant Declarations }

const TAB = ^I;
      CR  = ^M;
      LF  = ^J;

{--------------------------------------------------------------}
{ Variable Declarations }

var Look: char;              { Lookahead Character }

    ST: Array['A'..'Z'] of char;


{--------------------------------------------------------------}
{ Read New Character From Input Stream }

procedure GetChar;
begin
   Read(Look);
end;


{--------------------------------------------------------------}
{ Report an Error }

procedure Error(s: string);
begin
   WriteLn;
   WriteLn(^G, 'Error: ', s, '.');
end;


{--------------------------------------------------------------}
{ Report Error and Halt }

procedure Abort(s: string);
begin
   Error(s);
   Halt;
end;


{--------------------------------------------------------------}
{ Report What Was Expected }

procedure Expected(s: string);
begin
   Abort(s + ' Expected');
end;


{--------------------------------------------------------------}
{ Dump the Symbol Table }

procedure DumpTable;
var i: char;
begin
   for i := 'A' to 'Z' do
        WriteLn(i, ' ', ST[i]);
end;


{--------------------------------------------------------------}
{ Recognize an Alpha Character }

function IsAlpha(c: char): boolean;
begin
   IsAlpha := UpCase(c) in ['A'..'Z'];
end;


{--------------------------------------------------------------}
{ Recognize a Decimal Digit }

function IsDigit(c: char): boolean;
begin
   IsDigit := c in ['0'..'9'];
end;


{--------------------------------------------------------------}
{ Recognize an AlphaNumeric Character }

function IsAlNum(c: char): boolean;
begin
   IsAlNum := IsAlpha(c) or IsDigit(c);
end;


{--------------------------------------------------------------}
{ Recognize an Addop }

function IsAddop(c: char): boolean;
begin
   IsAddop := c in ['+', '-'];
end;


{--------------------------------------------------------------}
{ Recognize a Mulop }

function IsMulop(c: char): boolean;
begin
   IsMulop := c in ['*', '/'];
end;


{--------------------------------------------------------------}
{ Recognize a Boolean Orop }

function IsOrop(c: char): boolean;
begin
   IsOrop := c in ['|', '~'];
end;


{--------------------------------------------------------------}
{ Recognize a Relop }

function IsRelop(c: char): boolean;
begin
   IsRelop := c in ['=', '#', '<', '>'];
end;


{--------------------------------------------------------------}
{ Recognize White Space }

function IsWhite(c: char): boolean;
begin
   IsWhite := c in [' ', TAB];
end;


{--------------------------------------------------------------}
{ Skip Over Leading White Space }

procedure SkipWhite;
begin
   while IsWhite(Look) do
      GetChar;
end;


{--------------------------------------------------------------}
{ Skip Over an End-of-Line }

procedure Fin;
begin
   if Look = CR then begin
      GetChar;
      if Look = LF then
         GetChar;
   end;
end;


{--------------------------------------------------------------}
{ Match a Specific Input Character }

procedure Match(x: char);
begin
   if Look = x then GetChar
   else Expected('''' + x + '''');
   SkipWhite;
end;


{--------------------------------------------------------------}
{ Get an Identifier }

function GetName: char;
begin
   if not IsAlpha(Look) then Expected('Name');
   GetName := UpCase(Look);
   GetChar;
   SkipWhite;
end;


{--------------------------------------------------------------}
{ Get a Number }

function GetNum: char;
begin
   if not IsDigit(Look) then Expected('Integer');
   GetNum := Look;
   GetChar;
   SkipWhite;
end;


{--------------------------------------------------------------}
{ Output a String with Tab }

procedure Emit(s: string);
begin
   Write(TAB, s);
end;


{--------------------------------------------------------------}
{ Output a String with Tab and CRLF }

procedure EmitLn(s: string);
begin
   Emit(s);
   WriteLn;
end;


{--------------------------------------------------------------}
{ Initialize }

procedure Init;
var i: char;
begin
   for i := 'A' to 'Z' do
      ST[i] := '?';
   GetChar;
   SkipWhite;
end;


{--------------------------------------------------------------}
{ Main Program }

begin
   Init;
   DumpTable;
end.
{--------------------------------------------------------------}


OK, run this program.  You  should  get a (very fast) printout of
all the letters of  the  alphabet  (potential  identifiers), each
followed by  a  question  mark.    Not  very exciting, but it's a
start.

Of course, in general we  only  want  to  see  the  types  of the
variables that have been defined.  We can eliminate the others by
modifying DumpTable with an IF test.  Change the loop to read:


  for i := 'A' to 'Z' do
     if ST[i] <> '?' then
         WriteLn(i, ' ', ST[i]);


Now, run the program again.  What did you get?

Well, that's even more  boring  than before!  There was no output
at all, since at this point NONE of the names have been declared.
We  can  spice  things up a  bit  by  inserting  some  statements
declaring some entries in the main program.  Try these:


     ST['A'] := 'a';
     ST['P'] := 'b';
     ST['X'] := 'c';


This time, when  you  run  the  program, you should get an output
showing that the symbol table is working right.


ADDING ENTRIES

Of course, writing to the table directly is pretty poor practice,
and not one that will  help  us  much  later.   What we need is a
procedure to add entries to the table.  At the same time, we know
that  we're going to need to test the table, to make sure that we
aren't redeclaring a variable that's already in use  (easy  to do
with only 26 choices!).  To handle all this, enter  the following
new procedures:


{--------------------------------------------------------------}
{ Report Type of a Variable }


function TypeOf(N: char): char;
begin
   TypeOf := ST[N];
end;


{--------------------------------------------------------------}
{ Report if a Variable is in the Table }


function InTable(N: char): boolean;
begin
   InTable := TypeOf(N) <> '?';
end;


{--------------------------------------------------------------}
{ Check for a Duplicate Variable Name }

procedure CheckDup(N: char);
begin
   if InTable(N) then Abort('Duplicate Name ' + N);
end;


{--------------------------------------------------------------}
{ Add Entry to Table }

procedure AddEntry(N, T: char);
begin
   CheckDup(N);
   ST[N] := T;
end;
{--------------------------------------------------------------}


Now change the three lines in the main program to read:


     AddEntry('A', 'a');
     AddEntry('P', 'b');
     AddEntry('X', 'c');
                             

and run the program again.  Did it work?  Then we have the symbol
table routines needed to support our work on types.  In  the next
section, we'll actually begin to use them.


ALLOCATING STORAGE

In  other programs like this one,  including  the  TINY  compiler
itself, we have  already  addressed the issue of declaring global
variables, and the  code  generated  for  them.    Let's  build a
vestigial version of a "compiler" here, whose only function is to
allow  us   declare  variables.    Remember,  the  syntax  for  a
declaration is:


     <data decl> ::= VAR <identifier>


Again, we can lift a lot of the code from previous programs.  The
following are stripped-down versions of those  procedures.   They
are greatly simplified  since  I  have  eliminated  niceties like
variable lists and  initializers.   In procedure Alloc, note that
the  new call to AddEntry will also  take  care  of  checking for
duplicate declarations:


{--------------------------------------------------------------}
{ Allocate Storage for a Variable }

procedure Alloc(N: char);
begin
   AddEntry(N, 'v');
   WriteLn(N, ':', TAB, 'DC 0');
end;


{--------------------------------------------------------------}
{ Parse and Translate a Data Declaration }

procedure Decl;
var Name: char;
begin
   Match('v');
   Alloc(GetName);
end;


{--------------------------------------------------------------}
{ Parse and Translate Global Declarations }

procedure TopDecls;
begin
   while Look <> '.' do begin
      case Look of
        'v': Decl;
      else Abort('Unrecognized Keyword ' + Look);
      end;
      Fin;
   end;
end;
{--------------------------------------------------------------}


Now, in the  main  program,  add  a  call to TopDecls and run the
program.  Try allocating a  few variables, and note the resulting
code generated.  This is old stuff for you, so the results should
look familiar.  Note from the code for TopDecls that  the program
is ended by a terminating period.

While you're at it,  try  declaring  two  variables with the same
name, and verify that the parser catches the error.


DECLARING TYPES


Allocating storage of different sizes  is  as  easy  as modifying
procedure TopDecls to recognize more than one keyword.  There are
a  number  of  decisions to be made here, in terms  of  what  the
syntax should be, etc., but for now I'm  going  to  duck  all the
issues and simply declare by  executive fiat that our syntax will
be:


     <data decl> ::= <typename>  <identifier>

where:


     <typename> ::= BYTE | WORD | LONG


(By  an amazing coincidence, the first  letters  of  these  names
happen  to  be  the  same  as  the  68000  assembly  code  length
specifications, so this choice saves us a little work.)

We can create the code to take care of  these  declarations  with
only slight modifications.  In the routines below, note that I've
separated  the  code  generation parts of Alloc  from  the  logic
parts.  This  is  in  keeping  with our desire to encapsulate the
machine-dependent part of the compiler.


{--------------------------------------------------------------}
{ Generate Code for Allocation of a Variable }

procedure AllocVar(N, T: char);
begin
   WriteLn(N, ':', TAB, 'DC.', T, ' 0');
end;


{--------------------------------------------------------------}
{ Allocate Storage for a Variable }

procedure Alloc(N, T: char);
begin
   AddEntry(N, T);
   AllocVar(N, T);
end;


{--------------------------------------------------------------}
{ Parse and Translate a Data Declaration }

procedure Decl;
var Typ: char;
begin
   Typ := GetName;
   Alloc(GetName, Typ);
end;


{--------------------------------------------------------------}
{ Parse and Translate Global Declarations }

procedure TopDecls;
begin
   while Look <> '.' do begin
      case Look of
        'b', 'w', 'l': Decl;
      else Abort('Unrecognized Keyword ' + Look);
      end;
      Fin;
   end;
end;
{--------------------------------------------------------------}


Make the changes shown to these procedures, and give the  thing a
try.    Use  the  single  characters  'b',  'w',  and 'l' for the
keywords (they must be lower case,  for  now).  You will see that
in each case, we are allocating the proper storage  size.    Note
from the dumped symbol table that the sizes are also recorded for
later use.  What later use?  Well, that's the subject of the rest
of this installment.


ASSIGNMENTS

Now that we can declare variables of different  sizes,  it stands
to reason that we ought to be able  to  do  something  with them.
For our first trick, let's just try loading them into our working
register, D0.  It makes sense to use the same  idea  we used for
Alloc; that is, make a load procedure that can load more than one
size.    We  also  want  to continue to encapsulate the  machine-
dependent stuff.  The load procedure looks like this:


{---------------------------------------------------------------}
{ Load a Variable to Primary Register }

procedure LoadVar(Name, Typ: char);
begin
   Move(Typ, Name + '(PC)', 'D0');
end;
{---------------------------------------------------------------}


On  the  68000,  at least, it happens that many instructions turn
out to be MOVE's.  It turns out to be useful to create a separate
code generator just for these instructions, and then  call  it as
needed:


{---------------------------------------------------------------}
{ Generate a Move Instruction }

procedure Move(Size: char; Source, Dest: String);
begin
   EmitLn('MOVE.' + Size + ' ' + Source + ',' + Dest);
end;
{---------------------------------------------------------------}


Note that these  two  routines are strictly code generators; they
have no error-checking or other  logic.  To complete the picture,
we need one more layer of software that provides these functions.

First of all, we need to make sure that the  type  we are dealing
with is a  loadable  type.    This  sounds like a job for another
recognizer:


{--------------------------------------------------------------}
{ Recognize a Legal Variable Type }

function IsVarType(c: char): boolean;
begin
   IsVarType := c in ['B', 'W', 'L'];
end;
{--------------------------------------------------------------}


Next, it would be nice to have a routine that will fetch the type
of a variable from the symbol table, while checking  it  to  make
sure it's valid:


{--------------------------------------------------------------}
{ Get a Variable Type from the Symbol Table }

function VarType(Name: char): char;
var Typ: char;
begin
   Typ := TypeOf(Name);
   if not IsVarType(Typ) then Abort('Identifier ' + Name +
                                        ' is not a variable');
   VarType := Typ;
end;
{--------------------------------------------------------------}


Armed with these  tools,  a  procedure  to cause a variable to be
loaded becomes trivial:


{--------------------------------------------------------------}
{ Load a Variable to the Primary Register }

procedure Load(Name: char);
begin
     LoadVar(Name, VarType(Name));
end;
{--------------------------------------------------------------}


(NOTE to the  concerned:  I  know,  I  know, all this is all very
inefficient.  In a production  program,  we  probably  would take
steps to avoid such deep nesting of procedure calls.  Don't worry
about it.  This is an EXERCISE, remember?  It's more important to
get it  right  and  understand  it, than it is to make it get the
wrong  answer,  quickly.   If you get your compiler completed and
find that you're unhappy  with  the speed, feel free to come back
and hack the code to speed it up!)

It would be a good idea to test the program at this point.  Since
we don't have a  procedure  for  dealing  with assignments yet, I
just added the lines:


     Load('A');
     Load('B');
     Load('C');
     Load('X');


to  the main program.  Thus, after  the  declaration  section  is
complete, they will be executed to generate code  for  the loads.
You can play around with  this, and try different combinations of
declarations to see how the errors are handled.

I'm sure you won't be surprised to learn  that  storing variables
is a lot like  loading  them.  The necessary procedures are shown
next:


{---------------------------------------------------------------}
{ Store Primary to Variable }

procedure StoreVar(Name, Typ: char);
begin
   EmitLn('LEA ' + Name + '(PC),A0');
   Move(Typ, 'D0', '(A0)');
end;


{--------------------------------------------------------------}
{ Store a Variable from the Primary Register }

procedure Store(Name: char);
begin
   StoreVar(Name, VarType(Name));
end;
{--------------------------------------------------------------}


You can test this one the same way as the loads.

Now, of course, it's a RATHER  small  step to use these to handle
assignment  statements.  What we'll do is  to  create  a  special
version   of  procedure  Block  that  supports  only   assignment
statements, and also a  special  version  of Expression that only
supports single variables as legal expressions.  Here they are:


{---------------------------------------------------------------}
{ Parse and Translate an Expression }

procedure Expression;
var Name: char;
begin
   Load(GetName);
end;


{--------------------------------------------------------------}
{ Parse and Translate an Assignment Statement }

procedure Assignment;
var Name: char;
begin
   Name := GetName;
   Match('=');
   Expression;
   Store(Name);
end;


{--------------------------------------------------------------}
{ Parse and Translate a Block of Statements }

procedure Block;
begin
   while Look <> '.' do begin
      Assignment;
      Fin;
   end;
end;
{--------------------------------------------------------------}


(It's worth noting that, if  anything,  the  new  procedures that
permit us to manipulate types  are, if anything, even simpler and
cleaner than what we've seen before.  This is  mostly  thanks  to
our efforts to encapsulate the code generator procedures.)

There is one small, nagging problem.  Before, we used  the Pascal
terminating period to get us out of procedure TopDecls.   This is
now the wrong  character  ...  it's  used to terminate Block.  In
previous programs, we've used the BEGIN symbol  (abbreviated 'b')
to get us out.  But that is now used as a type symbol.

The solution, while somewhat of a kludge, is easy enough.   We'll
use  an  UPPER CASE 'B' to stand for the BEGIN.   So  change  the
character in the WHILE loop within TopDecls, from '.' to 'B', and
everything will be fine.

Now, we can  complete  the  task  by changing the main program to
read:


{--------------------------------------------------------------}
{ Main Program }

begin
   Init;
   TopDecls;
   Match('B');
   Fin;
   Block;
   DumpTable;
end.
{--------------------------------------------------------------}


(Note  that I've had to sprinkle a few calls to Fin around to get
us out of Newline troubles.)

OK, run this program.  Try the input:


     ba        { byte a }   *** DON'T TYPE THE COMMENTS!!! ***
     wb        { word b }
     lc        { long c }
     B         { begin  }
     a=a
     a=b
     a=c
     b=a
     b=b
     b=c
     c=a
     c=b
     c=c
     .


For  each  declaration,  you  should  get  code   generated  that
allocates storage.  For each assignment, you should get code that
loads a variable of the correct size, and stores one, also of the
correct size.

There's only one small  little  problem:    The generated code is
WRONG!

Look at the code for a=c above.  The code is:


     MOVE.L    C(PC),D0
     LEA       A(PC),A0
     MOVE.B    D0,(A0)


This code is correct.  It will cause the lower eight bits of C to
be stored into A, which is a reasonable behavior.  It's about all
we can expect to happen.

But now, look at the opposite case.  For c=a, the  code generated
is:


     MOVE.B A(PC),D0
     LEA  C(PC),A0
     MOVE.L D0,(A0)


This is  NOT  correct.    It will cause the byte variable A to be
stored into the lower eight bits  of  D0.  According to the rules
for the 68000 processor,  the  upper 24 bits are unchanged.  This
means  that when we store the entire 32  bits  into  C,  whatever
garbage  that  was  in those high bits will also get stored.  Not
good.

So what  we  have  run  into here, early on, is the issue of TYPE
CONVERSION, or COERCION.

Before we do anything with  variables of different types, even if
it's just to  copy  them, we have to face up to the issue.  It is
not the most easy part of a compiler.  Most of  the  bugs  I have
seen in production compilers  have  had to do with errors in type
conversion for  some obscure combination of arguments.  As usual,
there is a tradeoff between compiler complexity and the potential
quality of the  generated  code,  and  as usual, we will take the
path that keeps the  compiler  simple.  I think you'll find that,
with this approach, we can keep the potential complexity in check
rather nicely.


THE COWARD'S WAY OUT

Before we get into the details (and potential complexity) of type
conversion,  I'd  like  you to see that there is one super-simple
way to solve the problem: simply promote every variable to a long
integer when we load it!

This takes the addition of only one line to LoadVar,  although if
we  are  not  going to COMPLETELY ignore efficiency, it should be
guarded by an IF test.  Here is the modified version:


{---------------------------------------------------------------}
{ Load a Variable to Primary Register }

procedure LoadVar(Name, Typ: char);
begin
   if Typ <> 'L' then
      EmitLn('CLR.L D0');
   Move(Typ, Name + '(PC)', 'D0');
end;
{---------------------------------------------------------------}


(Note that StoreVar needs no similar change.)

If you run some tests with  this  new version, you will find that
everything  works correctly now, albeit sometimes  inefficiently.
For example, consider the case  a=b  (for  the  same declarations
shown above).  Now the generated code turns out to be:


     CLR.L D0
     MOVE.W B(PC),D0
     LEA  A(PC),A0
     MOVE.B D0,(A0)


In  this  case,  the CLR turns out not to be necessary, since the
result is going into a byte-sized variable.  With a little bit of
work, we can do better.  Still, this is not  bad,  and it typical
of the kinds of inefficiencies  that we've seen before in simple-
minded compilers.

I should point out that, by setting the high bits to zero, we are
in effect treating the numbers as UNSIGNED integers.  If  we want
to treat them as signed ones instead (the more  likely  case)  we
should do a  sign  extension  after  the load, instead of a clear
before it. Just  to  tie  this  part  of the discussion up with a
nice, red ribbon, let's change LoadVar as shown below:


{---------------------------------------------------------------}
{ Load a Variable to Primary Register }

procedure LoadVar(Name, Typ: char);
begin
   if Typ = 'B' then
      EmitLn('CLR.L D0');
   Move(Typ, Name + '(PC)', 'D0');
   if Typ = 'W' then
      EmitLn('EXT.L D0');
end;
{---------------------------------------------------------------}


With this version, a byte is treated as unsigned  (as  in  Pascal
and C), while a word is treated as signed.


A MORE REASONABLE SOLUTION

As we've seen, promoting  every  variable  to  long while it's in
memory solves the problem, but it can hardly be called efficient,
and  probably wouldn't be acceptable even for  those  of  us  who
claim be unconcerned about  efficiency.    It  will mean that all
arithmetic operations will be done to 32-bit accuracy, which will
DOUBLE the run time  for  most operations, and make it even worse
for multiplication  and division.  For those operations, we would
need to call subroutines to do  them,  even if the data were byte
or  word types.  The whole thing is sort of a cop-out, too, since
it ducks all the real issues.

OK, so that solution's no good.  Is there still a relatively easy
way to get data conversion?  Can we still Keep It Simple?

Yes, indeed.   All we have to do is to make the conversion at the
other end ... that is, we convert on the way _OUT_, when the data
is stored, rather than on the way in.

But, remember, the storage part  of the assignment is pretty much
independent of the data load, which is taken care of by procedure
Expression.    In  general  the  expression  may  be  arbitrarily
complex, so how can procedure Assignment know what  type  of data
is left in register D0?

Again,  the  answer  is  simple:    We'll  just  _ASK_  procedure
Expression!  The answer can be returned as a function value.

All of this requires several procedures to be  modified,  but the
mods, like the method, are quite simple.  First of all,  since we
aren't requiring LoadVar to do  all the work of conversion, let's
go back to the simple version:


{---------------------------------------------------------------}
{ Load a Variable to Primary Register }

procedure LoadVar(Name, Typ: char);
begin
   Move(Typ, Name + '(PC)', 'D0');
end;
{--------------------------------------------------------------}


Next, let's add a  new  procedure that will convert from one type
to another:


{---------------------------------------------------------------}
{ Convert a Data Item from One Type to Another }


procedure Convert(Source, Dest: char);
begin
   if Source <> Dest then begin
      if Source  = 'B' then
         EmitLn('AND.W #$FF,D0');
      if Dest = 'L' then
         EmitLn('EXT.L D0');
   end;
end;
{--------------------------------------------------------------}


Next, we need to do  the  logic  required  to  load  and  store a
variable of any type.  Here are the routines for that:


{---------------------------------------------------------------}
{ Load a Variable to the Primary Register }

function Load(Name: char): char;
var Typ : char;
begin
   Typ := VarType(Name);
   LoadVar(Name, Typ);
   Load := Typ;
end;


{--------------------------------------------------------------}
{ Store a Variable from the Primary Register }

procedure Store(Name, T1: char);
var T2: char;
begin
   T2 := VarType(Name);
   Convert(T1, T2);
   StoreVar(Name, T2);
end;
{--------------------------------------------------------------}


Note that Load is a function, which not only emits the code for a
load, but also returns the variable type.  In this way, we always
know what type of data we  are  dealing  with.  When we execute a
Store,  we pass it the current type of the variable in D0.  Since
Store also knows the  type  of  the  destination variable, it can
convert as necessary.

Armed  with all these new routines,  the  implementation  of  our
rudimentary   assignment   statement  is   essentially   trivial.
Procedure Expression now becomes a  function,  which  returns its
type to procedure Assignment:


{---------------------------------------------------------------}
{ Parse and Translate an Expression }

function Expression: char;
begin
   Expression := Load(GetName);
end;


{--------------------------------------------------------------}
{ Parse and Translate an Assignment Statement }

procedure Assignment;
var Name: char;
begin
   Name := GetName;
   Match('=');
   Store(Name, Expression);
end;
{--------------------------------------------------------------}

Again, note how  incredibly  simple these two routines are. We've
encapsulated  all the type logic into Load  and  Store,  and  the
trick of  passing  the  type  around  makes  the rest of the work
extremely easy.    Of  course,  all  of  this is for our special,
trivial case of Expression.  Naturally, for the  general  case it
will have to get more complex.  But  you're  looking  now  at the
FINAL version of procedure Assignment!

All this seems like a very  simple  and clean solution, and it is
indeed.   Compile this program and run the  same  test  cases  as
before.    You will see that all  types  of  data  are  converted
properly, and there are few if any wasted instructions.  Only the
byte-to-long conversion uses two instructions where one would do,
and we could easily modify Convert to handle this case, too.

Although we haven't considered unsigned variables in this case, I
think you can see  that  we could easily fix up procedure Convert
to deal with these types as well.  This is  "left  as an exercise
for the student."


LITERAL ARGUMENTS

Sharp-eyed readers might have noticed, though, that we don't even
have a proper form of a simple factor yet, because we don't allow
for loading literal constants,  only  variables.   Let's fix that
now.

To begin with, we'll need a GetNum function.  We've  seen several
versions of this, some returning  only a single character, some a
string, and some an integer.   The  one needed here will return a
LongInt, so that it can handle anything we  throw  at  it.   Note
that no type information is returned here: GetNum doesn't concern
itself with how the number will be used:


{--------------------------------------------------------------}
{ Get a Number }

function GetNum: LongInt;
var Val: LongInt;
begin
   if not IsDigit(Look) then Expected('Integer');
   Val := 0;
   while IsDigit(Look) do begin
      Val := 10 * Val + Ord(Look) - Ord('0');
      GetChar;
   end;
   GetNum := Val;
   SkipWhite;
end;
{---------------------------------------------------------------}


Now, when dealing with  literal  data,  we  have one little small
problem.   With variables, we know what  type  things  should  be
because they've been declared to be  that  type.  We have no such
type information for  literals.   When the programmer says, "-1,"
does that mean a byte, word, or longword  version?    We  have no
clue.  The obvious thing to do would be to  use  the largest type
possible, i.e. a longword.    But that's a bad idea, because when
we get to more complex expressions, we'll find that it will cause
every expression involving literals  to  be  promoted to long, as
well.

A better approach is to select a type based upon the value of the
literal, as shown next:


{--------------------------------------------------------------}
{ Load a Constant to the Primary Register }

function LoadNum(N: LongInt): char;
var Typ : char;
begin
   if abs(N) <= 127 then
      Typ := 'B'
   else if abs(N) <= 32767 then
      Typ := 'W'
   else Typ := 'L';
   LoadConst(N, Typ);
   LoadNum := Typ;
end;
{---------------------------------------------------------------}


(I know, I know, the number base isn't really symmetric.  You can
store -128 in a single byte,  and  -32768  in a word.  But that's
easily fixed, and not  worth  the time or the added complexity to
fool with it here.  It's the thought that counts.)

Note  that  LoadNum  calls  a  new version of the code  generator
routine  LoadConst, which has an added  argument  to  define  the
type:


{---------------------------------------------------------------}
{ Load a Constant to the Primary Register }

procedure LoadConst(N: LongInt; Typ: char);
var temp:string;
begin
   Str(N, temp);
   Move(Typ, '#' + temp, 'D0');
end;
{--------------------------------------------------------------}


Now  we can modify procedure Expression  to  accomodate  the  two
possible kinds of factors:


{---------------------------------------------------------------}
{ Parse and Translate an Expression }

function Expression: char;
begin
   if IsAlpha(Look) then
      Expression := Load(GetName)
   else
      Expression := LoadNum(GetNum);
end;
{--------------------------------------------------------------}


(Wow, that sure didn't hurt too bad!  Just a  few  extra lines do
the job.)

OK,  compile  this code into your program  and  give  it  a  try.
You'll see that it now works for either variables or constants as
valid expressions.


ADDITIVE EXPRESSIONS

If you've been following this series from the beginning, I'm sure
you  know  what's coming next:  We'll  expand  the  form  for  an
expression   to   handle   first   additive   expressions,   then
multiplicative, then general expressions with parentheses.

The nice part is that we already have a pattern for  dealing with
these more complex expressions.  All we have  to  do  is  to make
sure that  all the procedures called by Expression (Term, Factor,
etc.)  always  return a type identifier.   If  we  do  that,  the
program structure gets changed hardly at all.

The  first  step  is  easy:  We can rename our existing  function
Expression  to  Term,  as  we've  done so many times before,  and
create the new version of Expression:


{---------------------------------------------------------------}
{ Parse and Translate an Expression }

function Expression: char;
var Typ: char;
begin
   if IsAddop(Look) then
      Typ := Unop
   else
      Typ := Term;
   while IsAddop(Look) do begin
      Push(Typ);
      case Look of
       '+': Typ := Add(Typ);
       '-': Typ := Subtract(Typ);
      end;
   end;
   Expression := Typ;
end;
{--------------------------------------------------------------}


Note  in  this  routine how each  procedure  call  has  become  a
function call, and how  the  local  variable  Typ gets updated at
each pass.

Note also the new call to a function  Unop,  which  lets  us deal
with a leading unary minus.  This change is not necessary  ... we
could  still  use  a form more like what we've done before.  I've
chosen  to  introduce  UnOp as a separate routine because it will
make it easier, later, to produce somewhat better code than we've
been  doing.    In other words, I'm looking ahead to optimization
issues.

For  this  version,  though, we'll retain the same dumb old code,
which makes the new routine trivial:


{---------------------------------------------------------------}
{ Process a Term with Leading Unary Operator }

function Unop: char;
begin
   Clear;
   Unop := 'W';
end;
{---------------------------------------------------------------}


Procedure  Push  is  a code-generator routine, and now has a type
argument:


{---------------------------------------------------------------}
{ Push Primary onto Stack }

procedure Push(Size: char);
begin
   Move(Size, 'D0', '-(SP)');
end;
{---------------------------------------------------------------}


Now, let's take a look at functions Add  and  Subtract.    In the
older versions of these routines, we let them call code generator
routines PopAdd and PopSub.    We'll  continue  to do that, which
makes the functions themselves extremely simple:


{---------------------------------------------------------------}
{ Recognize and Translate an Add }

function Add(T1: char): char;
begin
   Match('+');
   Add := PopAdd(T1, Term);
end;


{-------------------------------------------------------------}
{ Recognize and Translate a Subtract }

function Subtract(T1: char): char;
begin
   Match('-');
   Subtract := PopSub(T1, Term);
end;
{---------------------------------------------------------------}


The simplicity is  deceptive,  though, because what we've done is
to defer all the logic to PopAdd and PopSub, which are  no longer
just code generation routines.    They must also now take care of
the type conversions required.

And just what conversion is that?  Simple: Both arguments must be
of the same size, and the result  is  also  of  that  size.   The
smaller of the two arguments must be "promoted" to  the  size  of
the larger one.

But  this  presents a bit of a problem.  If the  argument  to  be
promoted is the second argument  (i.e.  in  the  primary register
D0), we  are  in  great  shape.  If it's not, however, we're in a
fix: we can't change the size of the  information  that's already
been pushed onto the stack.

The solution is simple but a little painful: We must abandon that
lovely  "pop  the  data and do something  with  it"  instructions
thoughtfully provided by Motorola.

The alternative is to assign  a  secondary  register,  which I've
chosen to be R7.  (Why not R1?  Because I  have  later  plans for
the other registers.)

The  first  step in this new structure  is  to  introduce  a  Pop
procedure analogous to the Push.   This procedure will always Pop
the top element of the stack into D7:


{---------------------------------------------------------------}
{ Pop Stack into Secondary Register }

procedure Pop(Size: char);
begin
   Move(Size, '(SP)+', 'D7');
end;
{---------------------------------------------------------------}


The general idea is that all the "Pop-Op" routines can  call this
one.    When  this is done, we will then have  both  operands  in
registers, so we can promote whichever  one  we need to.  To deal
with this, procedure Convert needs another argument, the register
name:


{---------------------------------------------------------------}
{ Convert a Data Item from One Type to Another }

procedure Convert(Source, Dest: char; Reg: String);
begin
   if Source <> Dest then begin
      if Source  = 'B' then
         EmitLn('AND.W #$FF,' + Reg);
      if Dest = 'L' then
         EmitLn('EXT.L ' + Reg);
   end;
end;
{---------------------------------------------------------------}


The next function does a conversion, but only if the current type
T1  is  smaller  in size than the desired  type  T2.    It  is  a
function, returning the final type to let us know what it decided
to do:


{---------------------------------------------------------------}
{ Promote the Size of a Register Value }

function Promote(T1, T2: char; Reg: string): char;
var Typ: char;
begin
   Typ := T1;
   if T1 <> T2 then
      if (T1 = 'B') or ((T1 = 'W') and (T2 = 'L')) then begin
         Convert(T1, T2, Reg);
         Typ := T2;
      end;
   Promote := Typ;
end;
{---------------------------------------------------------------}


Finally, the following function forces the two registers to be of
the same type:


{---------------------------------------------------------------}
{ Force both Arguments to Same Type }

function SameType(T1, T2: char): char;
begin
   T1 := Promote(T1, T2, 'D7');
   SameType := Promote(T2, T1, 'D0');
end;
{---------------------------------------------------------------}


These new routines give us the ammunition we need  to  flesh  out
PopAdd and PopSub:


{---------------------------------------------------------------}
{ Generate Code to Add Primary to the Stack }

function PopAdd(T1, T2: char): char;
begin
   Pop(T1);
   T2 := SameType(T1, T2);
   GenAdd(T2);
   PopAdd := T2;
end;


{---------------------------------------------------------------}
{ Generate Code to Subtract Primary from the Stack }

function PopSub(T1, T2: char): char;
begin
   Pop(T1);
   T2 := SameType(T1, T2);
   GenSub(T2);
   PopSub := T2;
end;
{---------------------------------------------------------------}


After  all   the   buildup,   the   final   results   are  almost
anticlimactic.  Once  again,  you can see that the logic is quite
simple.  All the two routines do is to pop the  top-of-stack into
D7, force the two operands to be the same size, and then generate
the code.

Note  the  new  code generator routines GenAdd and GenSub.  These
are vestigial forms of the ORIGINAL PopAdd and PopSub.   That is,
they  are pure code generators, producing a  register-to-register
add or subtract:


{---------------------------------------------------------------}
{ Add Top of Stack to Primary }

procedure GenAdd(Size: char);
begin
   EmitLn('ADD.' + Size + ' D7,D0');
end;


{---------------------------------------------------------------}
{ Subtract Primary from Top of Stack }

procedure GenSub(Size: char);
begin
   EmitLn('SUB.' + Size + ' D7,D0');
   EmitLn('NEG.' + Size + ' D0');
end;
{---------------------------------------------------------------}


OK,  I grant you:  I've thrown a lot of routines at you since  we
last tested the code.   But  you  have  to  admit  that  each new
routine is pretty simple and transparent.  If you (like me) don't
like to test so many new  routines  at  once, that's OK.  You can
stub out routines like Convert, Promote, and SameType, since they
don't  read  any inputs.  You won't  get  the  correct  code,  of
course, but things should work.  Then flesh  them  out  one  at a
time.

When testing the program,  don't  forget  that  you first have to
declare some variables, and then  start the "body" of the program
with an upper-case  'B'  (for  BEGIN).   You should find that the
parser  will  handle  any  additive  expressions.  Once  all  the
conversion routines are in, you should see that the  correct code
is  generated,  with  type  conversions inserted where necessary.
Try mixing up variables  of  different  sizes, and also literals.
Make sure that everything's working properly.  As  usual,  it's a
good  idea  to  try  some  erroneous expressions and see how  the
compiler handles them.


WHY SO MANY PROCEDURES?

At this point, you may think  I've  pretty much gone off the deep
end in terms of deeply nested procedures.  There is  admittedly a
lot of overhead here.  But there's a method in my madness.  As in
the case of UnOp, I'm looking ahead to the time when  we're going
to want better code  generation.   The way the code is organized,
we can achieve  this  without major modifications to the program.
For example, in cases where the value pushed onto the  stack does
_NOT_ have to be converted, it's still better to use the "pop and
add"  instruction.    If we choose to test for such cases, we can
embed the extra tests into  PopAdd  and  PopSub  without changing
anything else much.


MULTIPLICATIVE EXPRESSIONS

The procedure for dealing with multiplicative  operators  is much
the  same.    In  fact,  at  the  first  level,  they are  almost
identical, so I'll just show them here without much fanfare.  The
first  one  is  our  general  form  for  Factor,  which  includes
parenthetical subexpressions:


{---------------------------------------------------------------}
{ Parse and Translate a Factor }

function Expression: char; Forward;

function Factor: char;
begin
   if Look = '(' then begin
      Match('(');
      Factor := Expression;
      Match(')');
      end
   else if IsAlpha(Look) then
      Factor := Load(GetName)
   else
      Factor := LoadNum(GetNum);
end;


{--------------------------------------------------------------}
{ Recognize and Translate a Multiply }

Function Multiply(T1: char): char;
begin
   Match('*');
   Multiply := PopMul(T1, Factor);
end;


{--------------------------------------------------------------}
{ Recognize and Translate a Divide }

function Divide(T1: char): char;
begin
   Match('/');
   DIvide := PopDiv(T1, Factor);
end;


{---------------------------------------------------------------}
{ Parse and Translate a Math Term }

function Term: char;
var Typ: char;
begin
   Typ := Factor;
   while IsMulop(Look) do begin
      Push(Typ);
      case Look of
       '*': Typ := Multiply(Typ);
       '/': Typ := Divide(Typ);
      end;
   end;
   Term := Typ;
end;
{---------------------------------------------------------------}


These routines parallel the additive  ones  almost  exactly.   As
before, the complexity is encapsulated within PopMul  and PopDiv.
If  you'd  like  to test the program before we get into that, you
can build dummy versions of them, similar to  PopAdd  and PopSub.
Again, the code won't be correct at this point,  but  the  parser
should handle expressions of arbitrary complexity.


MULTIPLICATION

Once you've  convinced yourself that the parser itself is working
properly, we need to figure out what it will take to generate the
right code.  This is where  things  begin to get a little sticky,
because the rules are more complex.

Let's take the case of multiplication first.   This  operation is
similar to the "addops" in that both operands should  be  of  the
same size.  It differs in two important respects:


  o  The type of the product is typically not the same as that of
     the  two  operands.   For the product of two words, we get a
     longword result.

  o  The 68000 does  not support a 32 x 32 multiply, so a call to
     a software routine is needed.  This routine will become part
     of the run-time library.

  o  It also does  not  support  an  8  x 8 multiply, so all byte
     operands must be promoted to words.


The actions that we have to take are best shown in  the following
table:

  T1 -->  |                 |                 |                 |
          |                 |                 |                 |
      |   |        B        |        W        |       L         |
  T2  V   |                 |                 |                 |
-----------------------------------------------------------------
          |                 |                 |                 |
                             






     B    | Convert D0 to W | Convert D0 to W | Convert D0 to L |
          | Convert D7 to W |                 |                 |
          | MULS            | MULS            | JSR MUL32       |
          | Result = W      | Result = L      | Result = L      |
          |                 |                 |                 |
-----------------------------------------------------------------
          |                 |                 |                 |
     W    | Convert D7 to W |                 | Convert D0 to L |
          | MULS            | MULS            | JSR MUL32       |
          | Result = L      | Result = L      | Result = L      |
          |                 |                 |                 |
-----------------------------------------------------------------
          |                 |                 |                 |
     L    | Convert D7 to L | Convert D7 to L |                 |
          | JSR MUL32       | JSR MUL32       | JSR MUL32       |
          | Result = L      | Result = L      | Result = L      |
          |                 |                 |                 |
-----------------------------------------------------------------

This table shows the actions to be taken for each  combination of
operand types.  There are three things to note: First,  we assume
a library routine  MUL32  which  performs  a  32  x  32 multiply,
leaving a >> 32-bit << (not 64-bit) product.    If  there  is any
overflow in the process,  we  choose to ignore it and return only
the lower 32 bits.

Second, note that the  table  is  symmetric  ... the two operands
enter in the same way.  Finally, note that the product  is ALWAYS
a longword, except when  both  operands  are  bytes.  (It's worth
noting, in passing, that  this  means  that many expressions will
end up being longwords, whether we  like  it or not.  Perhaps the
idea  of  just  promoting  them  all  up  front wasn't  all  that
outrageous, after all!)

Now, clearly, we are going to have to generate different code for
the 16-bit and 32-bit multiplies.  This is best  done  by  having
separate code generator routines for the two cases:


{---------------------------------------------------------------}
{ Multiply Top of Stack by Primary (Word) }

procedure GenMult;
begin
   EmitLn('MULS D7,D0')
end;


{---------------------------------------------------------------}
{ Multiply Top of Stack by Primary (Long) }

procedure GenLongMult;
begin
   EmitLn('JSR MUL32');
end;
{---------------------------------------------------------------}


An examination of the code below for PopMul  should  convince you
that the conditions in the table are met:


{---------------------------------------------------------------}
{ Generate Code to Multiply Primary by Stack }

function PopMul(T1, T2: char): char;
var T: char;
begin
   Pop(T1);
   T := SameType(T1, T2);
   Convert(T, 'W', 'D7');
   Convert(T, 'W', 'D0');
   if T = 'L' then
      GenLongMult
   else
      GenMult;
   if T = 'B' then
      PopMul := 'W'
   else
      PopMul:= 'L';
end;
{---------------------------------------------------------------}


As you can see, the routine starts off just like PopAdd.  The two
arguments are forced to the same type.  The two calls  to Convert
take  care  of  the case where both operands are bytes.  The data
themselves are promoted  to  words, but the routine remembers the
type so as to assign the correct type to the result.  Finally, we
call one of the two code generator routines, and then  assign the
result type.  Not too complicated, really.

At this point, I suggest that you go ahead and test  the program.
Try all combinations of operand sizes.


DIVISION

The case of division is not nearly so  symmetric.    I  also have
some bad news for you:

All  modern  16-bit   CPU's   support   integer   divide.     The
manufacturer's data  sheet  will  describe  this  operation  as a
32 x 16-bit divide, meaning that you can divide a 32-bit dividend
by a 16-bit divisor.  Here's the bad news:


                     THEY'RE LYING TO YOU!!!


If you don't believe  it,  try  dividing  any large 32-bit number
(meaning that it has non-zero bits  in  the upper 16 bits) by the
integer 1.  You are guaranteed to get an overflow exception.

The  problem is that the instruction  really  requires  that  the
resulting quotient fit into a 16-bit result.   This  won't happen
UNLESS the divisor is  sufficiently  large.    When any number is
divided by unity, the quotient will of course be the same  as the
dividend, which had better fit into a 16-bit word.

Since  the  beginning  of  time  (well,  computers,  anyway), CPU
architects have  provided  this  little  gotcha  in  the division
circuitry.  It provides a certain amount of  symmetry  in things,
since it is sort of the inverse of the way a multiply works.  But
since  unity  is  a perfectly valid (and rather common) number to
use as a divisor, the division as implemented  in  hardware needs
some help from us programmers.

The implications are as follows:

  o  The type of the quotient must always be the same as  that of
     the dividend.  It is independent of the divisor.

  o  In spite of  the  fact  that  the  CPU  supports  a longword
     dividend,  the hardware-provided  instruction  can  only  be
     trusted  for  byte  and  word  dividends.      For  longword
     dividends, we need another library routine that can return a
     long result.



This  looks  like  a job for  another  table,  to  summarize  the
required actions:

  T1 -->  |                 |                 |                 |
          |                 |                 |                 |
      |   |        B        |        W        |       L         |
  T2  V   |                 |                 |                 |
-----------------------------------------------------------------
          |                 |                 |                 |
     B    | Convert D0 to W | Convert D0 to W | Convert D0 to L |
          | Convert D7 to L | Convert D7 to L |                 |
          | DIVS            | DIVS            | JSR DIV32       |
          | Result = B      | Result = W      | Result = L      |
          |                 |                 |                 |
-----------------------------------------------------------------
          |                 |                 |                 |
     W    | Convert D7 to L | Convert D7 to L | Convert D0 to L |
          | DIVS            | DIVS            | JSR DIV32       |
          | Result = B      | Result = W      | Result = L      |
          |                 |                 |                 |
-----------------------------------------------------------------
          |                 |                 |                 |
     L    | Convert D7 to L | Convert D7 to L |                 |
          | JSR DIV32       | JSR DIV32       | JSR DIV32       |
          | Result = B      | Result = W      | Result = L      |
          |                 |                 |                 |
-----------------------------------------------------------------


(You may wonder why it's necessary to do a 32-bit  division, when
the  dividend is, say, only a byte in the first place.  Since the
number  of bits in the result can only be as many as that in  the
dividend,  why  bother?   The reason is that, if the divisor is a
longword,  and  there  are any high bits set in it, the result of
the division must  be zero.  We might not get that if we only use
the lower word of the divisor.)

The following code provides the correct function for PopDiv:


{---------------------------------------------------------------}
{ Generate Code to Divide Stack by the Primary }

function PopDiv(T1, T2: char): char;
begin
   Pop(T1);
   Convert(T1, 'L', 'D7');
   if (T1 = 'L') or (T2 = 'L') then begin
      Convert(T2, 'L', 'D0');
      GenLongDiv;
      PopDiv := 'L';
      end
   else begin
      Convert(T2, 'W', 'D0');
      GenDiv;
      PopDiv := T1;
   end;
end;
{---------------------------------------------------------------}


The two code generation procedures are:


{---------------------------------------------------------------}
{ Divide Top of Stack by Primary  (Word) }

procedure GenDiv;
begin
   EmitLn('DIVS D0,D7');
   Move('W', 'D7', 'D0');
end;


{---------------------------------------------------------------}
{ Divide Top of Stack by Primary (Long) }

procedure GenLongDiv;
begin
   EmitLn('JSR DIV32');
end;
{---------------------------------------------------------------}


Note  that  we  assume that DIV32 leaves the (longword) result in
D0.

OK, install the new  procedures  for division.  At this point you
should be able  to  generate  code  for  any  kind  of arithmetic
expression.  Give it a whirl!


BEGINNING TO WIND DOWN

At  last, in this installment, we've learned  how  to  deal  with
variables (and literals) of different types.  As you can  see, it
hasn't been too tough.  In  fact,  in  some ways most of the code
looks even more simple than it does in earlier  programs.    Only
the  multiplication  and  division  operators  require  a  little
thinking and planning.

The main concept that  made  things  easy  was that of converting
procedures such as Expression into functions that return the type
of the result.  Once this  was  done,  we were able to retain the
same general structure of the compiler.

I won't pretend that  we've  covered  every  single aspect of the
issue.  I conveniently  ignored  unsigned  arithmetic.  From what
we've  done, I think you can see that to include them adds no new
challenges, just extra possibilities to test for.

I've also ignored the  logical  operators And, Or, etc.  It turns
out  that  these are pretty easy to  handle.    All  the  logical
operators are  bitwise  operations,  so  they  are  symmetric and
therefore work  in  the  same  fashion  as  PopAdd.  There is one
difference,  however:    if  it  is necessary to extend the  word
length for a logical variable, the extension should be done as an
UNSIGNED  number.      Floating   point   numbers,   again,   are
straightforward  to  handle  ... just a few more procedures to be
added to the run-time library, or perhaps instructions for a math
chip.

Perhaps more importantly, I have also skirted the  issue  of type
CHECKING,  as  opposed  to  conversion.   In other  words,  we've
allowed for operations between variables of  all  combinations of
types.  In general this will not be true ... certainly  you don't
want to add an integer, for example, to a string.  Most languages
also don't allow you to mix up character and integer variables.

Again, there are  really  no  new  issues to be addressed in this
case.  We are already checking the types of the two  operands ...
much  of this checking gets done  in  procedures  like  SameType.
It's  pretty  straightforward  to  include  a  call  to an  error
handler, if the types of the two operands are incompatible.

In the general  case,  we  can  think of every single operator as
being handled by  a  different procedure, depending upon the type
of the two operands.  This is straightforward, though tedious, to
implement simply by implementing  a  jump  table with the operand
types  as indices.  In Pascal,  the  equivalent  operation  would
involve nested Case statements.    Some  of the called procedures
could then be simple  error  routines,  while others could effect
whatever kind of conversion we need.  As more  types  are  added,
the number of procedures goes up by a square-law rule, but that's
still not an unreasonably large number of procedures.

What  we've  done  here is to collapse such a jump table into far
fewer  procedures, simply by making use  of  symmetry  and  other
simplifying rules.


TO COERCE OR NOT TO COERCE

In case you haven't gotten this message yet, it sure appears that
TINY and KISS will  probably  _NOT_  be strongly typed languages,
since I've allowed for  automatic  mixing  and conversion of just
about any type.  Which brings up the next issue:

                Is this really what we want to do?

The answer depends on what kind of language you want, and the way
you'd like it to behave.  What we have not addressed is the issue
of when to allow and when to deny the use of operations involving
different  data  types.   In other  words,  what  should  be  the
SEMANTICS of our compiler?   Do we want automatic type conversion
for all cases, for some cases, or not at all?

Let's pause here to think about this a bit more.   To  do  so, it
will help to look at a bit of history.

FORTRAN  II supported only two simple  data  types:  Integer  and
Real.    It  allowed implicit type conversion  between  real  and
integer types during assignment, but not within expressions.  All
data items (including literal constants) on  the  right-hand side
of an assignment statement had to be of the same type.  That made
things pretty easy  ...  much  simpler  than what we've had to do
here.

This  was  changed  in  FORTRAN   IV   to   support  "mixed-mode"
arithmetic.  If an expression had any real data items in it, they
were all converted to reals and the expression  itself  was real.
To round out  the  picture, functions were provided to explicitly
convert  from  one  type to the other, so that you could force an
expression to end up as either type.

This  led to two things:  code that was easier to write, and code
that was less efficient.  That's because sloppy programmers would
write expressions with simple  constants  like  0  and 1 in them,
which  the  compiler  would  dutifully  compile  to   convert  at
execution  time.  Still, the system  worked  pretty  well,  which
would  tend  to  indicate that implicit type conversion is a Good
Thing.

C is also a weakly typed language, though it  supports  a  larger
number  of types.  C won't complain if you try to add a character
to an integer,  for  example.    Partly,  this is helped by the C
convention of promoting every char  to integer when it is loaded,
or  passed  through  a  parameter  list.    This  simplifies  the
conversions quite a  bit.    In  fact, in subset C compilers that
don't support long or float types,  we  end up back where we were
in our earlier,  simple-minded  first try: every variable has the
same representation, once loaded into  a  register.    Makes life
pretty easy!

The  ultimate  language  in  the  direction  of   automatic  type
conversion is PL/I.   This  language  supports  a large number of
data types, and you can mix them all  freely.    If  the implicit
conversions of FORTRAN seemed good,  then  those  of  PL/I should
have been Heaven, but it turned  out  to  be more like Hell!  The
problem was that with so many data types, there had to be a large
number  of  different conversions, AND  a  correspondingly  large
number of rules about how  mixed  operands  should  be converted.
These rules became so  complex  that  no  one could remember what
they  were!  A lot of the errors in PL/I programs had to do  with
unexpected and unwanted type  conversions.    Too  much of a Good
Thing can be bad for you!

Pascal,  on  the  other hand, is a  language  which  is "strongly
typed," which means that in general you can't mix types,  even if
they differ only in _NAME_, and yet have the same base type!
Niklaus Wirth made Pascal strongly typed to help keep programmers
out of trouble, and  the  restrictions  have  indeed saved many a
programmer from himself, because the compiler kept him from doing
something dumb.  Better  to  find  the  bug in compilation rather
than  the  debug  phase.    The same restrictions can also  cause
frustration when you really  WANT  to mix types, and they tend to
drive an ex-C-programmer up the wall.

Even so, Pascal does permit some implicit conversions.    You can
assign  an integer to a real value.  You can also mix integer and
real types in  expressions  of  type  Real.  The integers will be
automatically coerced to real, just as in FORTRAN  (and  with the
same hidden cost in run-time overhead).

You can't, however, convert the  other way, from real to integer,
without applying an explicit  conversion  function,  Trunc.   The
theory here is that,  since  the numerical value of a real number
is  necessarily  going  to  be  changed  by  the conversion  (the
fractional  part will be lost), you really  shouldn't  do  it  in
"secret."

In the spirit of strong typing, Pascal will not allow you  to mix
Char  and  Integer   variables,  without  applying  the  explicit
coercion functions Chr and Ord.

Turbo Pascal also includes the  types  Byte,  Word,  and LongInt.
The first two are basically the same as unsigned  integers.    In
Turbo,  these can be freely intermixed  with  variables  of  type
Integer,  and  Turbo will automatically  handle  the  conversion.
There are run-time  checks,  though, to keep you from overflowing
or otherwise getting the wrong  answer. Note that you still can't
mix Byte and Char types, even though they  are  stored internally
in the same representation.

The ultimate in a  strongly-typed  language  is Ada, which allows
_NO_  implicit  type  conversions at all, and also will not allow
mixed-mode  arithmetic.    Jean   Ichbiah's   position   is  that
conversions cost  execution time, and you shouldn't be allowed to
build in such cost in a hidden manner.  By forcing the programmer
to  explicitly  request  a  type  conversion,  you  make it  more
apparent that there could be a cost involved.

I have been using another strongly-typed  language,  a delightful
little  language  called  Whimsical,  by  John  Spray.   Although
Whimsical is  intended as a systems programming language, it also
requires explicit conversion EVERY time.    There  are  NEVER any
automatic conversions, even the ones supported by Pascal.

This approach does  have  certain advantages:  The compiler never
has to guess what to do: the programmer always tells it precisely
what  he  wants.  As a result, there tends to be  a  more  nearly
one-to-one correspondence between  source code and compiled code,
and John's compiler produces VERY tight code.

On the other hand, I sometimes find the  explicit  conversions to
be a pain.  If I want, for example, to add one to a character, or
AND it with a mask, there are a lot of conversions to make.  If I
get  it  wrong,  the  only   error  message  is  "Types  are  not
compatible."  As it happens, John's particular  implementation of
the language in his compiler doesn't tell you exactly WHICH types
are not compatible ... it only tells you which LINE the  error is
in.

I must admit that most of my errors with this compiler tend to be
errors of this type, and  I've  spent  a  lot  of  time  with the
Whimsical compiler, trying to figure out just WHERE  in  the line
I've offended it.   The only real way to fix the error is to keep
trying things until something works.

So what should we do in TINY and KISS?  For the first one, I have
the answer:  TINY  will  support only the types Char and Integer,
and  we'll  use  the  C  trick  of  promoting Chars  to  Integers
internally.  That means  that  the  TINY  compiler will be _MUCH_
simpler  than  what  we've  already  done.    Type conversion  in
expressions is sort of moot, since none will be required!   Since
longwords will not be supported, we also won't need the MUL32 and
DIV32 run-time routines, nor the logic to figure out when to call
them.  I _LIKE_ it!

KISS, on the other hand, will support the type Long.

Should it support both signed and unsigned arithmetic?    For the
sake of simplicity I'd rather not.    It  does add quite a bit to
the  complexity  of  type conversions.  Even  Niklaus  Wirth  has
eliminated  unsigned  (Cardinal) numbers from  his  new  language
Oberon, with the argument that  32-bit  integers  should  be long
enough for anybody, in either case.

But KISS is supposed to  be a systems programming language, which
means that we should  be  able to do whatever operations that can
be done in assembler.    Since the 68000 supports both flavors of
integers, I guess KISS  should,  also.    We've seen that logical
operations  need to be able to extend  integers  in  an  unsigned
fashion, so the unsigned conversion  procedures  are  required in
any case.


CONCLUSION

That wraps up our session on type conversions.  Sorry you  had to
wait  so  long for it, but hope you feel that it  was  worth  the
wait.

In  the  next  few installments, we'll extend the simple types to
include arrays and pointers, and we'll have a look at what  to do
about  strings.    That should pretty well wrap up the mainstream
part of the series.  After  that,  I'll give you the new versions
of the TINY and KISS compilers,  and  then we'll start to look at
optimization issues.

See you then.




































                     LET'S BUILD A COMPILER!

                                By

                     Jack W. Crenshaw, Ph.D.

                           5 March 1994


Part 15: BACK TO THE FUTURE






INTRODUCTION

Can it really have been four years since I wrote installment 
fourteen of this series?  Is it really possible that six long 
years have passed since I began it?  Funny how time flies when 
you're having fun, isn't it?  

I won't spend a lot of time making excuses; only point out that 
things happen, and priorities change.  In the four years since 
installment fourteen, I've managed to get laid off, get divorced, 
have a nervous breakdown, begin a new career as a writer, begin 
another one as a consultant, move, work on two real-time systems, 
and raise fourteen baby birds, three pigeons, six possums, and a 
duck.  For awhile there, the parsing of source code was not high 
on my list of priorities.  Neither was writing stuff for free, 
instead of writing stuff for pay.  But I do try to be faithful, 
and I do recognize and feel my responsibility to you, the reader, 
to finish what I've started.  As the tortoise said in one of my 
son's old stories, I may be slow, but I'm sure.  I'm  sure that 
there are people out there anxious to see the last reel of this 
film, and I intend to give it to them.  So, if you're one of those 
who's been waiting, more or less patiently, to see how this thing 
comes out, thanks for your patience.  I apologize for the delay.  
Let's move on.


NEW STARTS, OLD DIRECTIONS

Like many other things, programming languages and programming 
styles change with time.  In 1994, it seems a little anachronistic 
to be programming in Turbo Pascal, when the rest of the world 
seems  to have gone bananas over C++.  It also seems a little 
strange to be programming in a classical style when the rest of 
the world has switched to object-oriented methods.  Still, in 
spite of the four-year hiatus, it would be entirely too wrenching 
a change, at this point, to switch to, say, C++ with object-
orientation .  Anyway, Pascal is still not only a powerful 
programming language (more than ever, in fact), but it's a 
wonderful medium for teaching.  C is a notoriously difficult 
language to read ... it's often been accused, along with Forth, of 
being a "write-only language."  When I program in C++, I find 
myself spending at least 50% of my time struggling with language 
syntax rather than with concepts.  A stray "&" or "*" can not only 
change the functioning of the program, but its correctness as 
well.  By contrast, Pascal code is usually quite transparent and 
easy to read, even if you don't know the language. What you see is 
almost always what you get, and we can concentrate on concepts 
rather than implementation details.  I've said from the beginning 
that the purpose of this tutorial series was not to generate the 
world's fastest compiler, but to teach the fundamentals of 
compiler technology, while spending the least amount of time 
wrestling with language syntax or other aspects of software 
implementation. Finally, since a lot of what we do in this course 
amounts to software experimentation, it's important to have a 
compiler and associated environment that compiles quickly and with 
no fuss.  In my opinion, by far the most significant time measure 
in software development is the speed of the edit/compile/test 
cycle.  In this department, Turbo Pascal is king.  The compilation 
speed is blazing fast, and continues to get faster in every 
release (how do they keep doing that?).  Despite vast improvements 
in C compilation speed over the years, even Borland's fastest 
C/C++ compiler is still no match for Turbo Pascal.  Further, the 
editor built into their IDE, the make facility, and even their 
superb smart linker, all complement each other to produce a 
wonderful environment for quick turnaround.  For all of these 
reasons, I intend to stick with Pascal for the duration of this 
series. We'll be using Turbo Pascal for Windows, one of the 
compilers provided Borland Pascal with Objects, version 7.0.  If 
you don't have this compiler, don't worry ... nothing we do here 
is going to count on your having the latest version. Using the 
Windows version helps me a lot, by allowing me to use the 
Clipboard to copy code from the compiler's editor into these 
documents.  It should also help you at least as much, copying the 
code in the other direction.  

I've thought long and hard about whether or not to introduce 
objects to our discussion.  I'm a big advocate of object-oriented 
methods for all uses, and such methods definitely have their place 
in compiler technology.  In fact, I've written papers on just this 
subject (Refs. 1-3).  But the architecture of a compiler which is 
based on object-oriented approaches is vastly different than that 
of the more classical compiler we've been building.  Again, it 
would seem to be entirely too much to change these horses in mid-
stream.  As I said, programming styles change.  Who knows, it may 
be another six years before we finish this thing, and if we keep 
changing the code every time programming style changes, we may 
NEVER finish.

So for now, at least, I've determined to continue the classical 
style in Pascal, though we might indeed discuss objects and object 
orientation as we go.  Likewise, the target machine will remain 
the Motorola 68000 family.  Of all the decisions to be made here, 
this one has been the easiest.  Though I know that many of you 
would like to see code for the 80x86, the 68000 has become, if 
anything, even more popular as a platform for embedded systems, 
and it's to that application that this whole effort began in the 
first place.  Compiling for the PC, MSDOS platform, we'd have to 
deal with all the issues of DOS system calls, DOS linker formats, 
the PC file system and hardware, and all those other complications 
of a DOS environment.  An embedded system, on the other hand, must 
run standalone, and it's for this kind of application, as an 
alternative to assembly language, that I've always imagined that a 
language like KISS would thrive. Anyway, who wants to deal with 
the 80x86 architecture if they don't have to?

The one feature of Turbo Pascal that I'm going to be making heavy 
use of is units.  In the past, we've had to make compromises 
between code size and complexity, and program functionality.  A 
lot of our work has been in the nature of computer 
experimentation, looking at only one aspect of compiler technology 
at a time. We did this to avoid to avoid having to carry around 
large programs, just to investigate simple concepts.  In the 
process, we've re-invented the wheel and re-programmed the same 
functions more times than I'd like to count.  Turbo units provide 
a wonderful way to get functionality and simplicity at the same 
time:  You write reusable code, and invoke it with a single line.  
Your test program stays small, but it can do powerful things.

One feature of Turbo Pascal units is their initialization block.  
As with an Ada package, any code in the main begin-end block of a 
unit gets executed as the program is initialized.  As you'll see 
later, this sometimes gives us neat simplifications in the code.  
Our procedure Init, which has been with us since Installment 1, 
goes away entirely when we use units.  The various routines in the 
Cradle, another key features of our approach, will get distributed 
among the units.

The concept of units, of course, is no different than that of C 
modules.  However, in C (and C++), the interface between modules 
comes via preprocessor include statements and header files.  As 
someone who's had to read a lot of other people's C programs, I've 
always found this rather bewildering.  It always seems that 
whatever data structure you'd like to know about is in some other 
file.  Turbo units are simpler for the very reason that they're 
criticized by some:  The function interfaces and their 
implementation are included in the same file.  While this 
organization may create problems with code security, it also 
reduces the number of files by half, which isn't half bad.  
Linking of the object files is also easy, because the Turbo 
compiler takes care of it without the need for make files or other 
mechanisms.


STARTING OVER?

Four years ago, in Installment 14, I promised you that our days of 
re-inventing the wheel, and recoding the same software over and 
over for each lesson, were over, and that from now on we'd stick 
to more complete programs that we would simply add new features 
to.  I still intend to keep that promise; that's one of the main 
purposes for using units.  However, because of the long time since 
Installment 14, it's natural to want to at least do some review, 
and anyhow, we're going to have to make rather sweeping changes in 
the code to make the transition to units.  Besides, frankly, after 
all this time I can't remember all the neat ideas I had in my head 
four years ago.  The best way for me to recall them is to retrace 
some of the steps we took to arrive at Installment 14.  So I hope 
you'll be understanding and bear with me as we go back to our 
roots, in a sense, and rebuild the core of the software, 
distributing the routines among the various units, and 
bootstrapping ourselves back up to the point we were at lo, those 
many moons ago. As has always been the case, you're going to get  
to see me make all the mistakes and execute changes of direction, 
in real time.  Please bear with me ... we'll start getting to the 
new stuff before you know it.

Since we're going to be using multiple modules in our new 
approach, we have to address the issue of file management.  If 
you've followed all the other sections of this tutorial, you know 
that, as our programs evolve, we're going to be replacing older, 
more simple-minded units with more capable ones. This brings us to 
an issue of version control. There will almost certainly be times 
when we will overlay a simple file (unit), but later wish we had 
the simple one again.  A case in point is embodied in our 
predilection for using single-character variable names, keywords, 
etc., to test concepts without getting bogged down in the details 
of a lexical scanner.  Thanks to the use of units, we will be 
doing much less of this in the future.  Still, I not only suspect, 
but am certain that we will need to save some older versions of 
files, for special purposes, even though they've been replaced by 
newer, more capable ones.

To deal with this problem, I suggest that you create different 
directories, with different versions of the units as needed.  If 
we do this properly, the code in each directory will remain self-
consistent.  I've tentatively created four directories:  SINGLE 
(for single-character experimentation), MULTI (for, of course, 
multi-character versions), TINY, and KISS.

Enough said about philosophy and details.  Let's get on with the 
resurrection of the software.


THE INPUT UNIT

A key concept that we've used since Day 1 has been the idea of an 
input stream with one lookahead character.  All the parsing 
routines examine this character, without changing it, to decide 
what they should do next.  (Compare this approach with the C/Unix 
approach using getchar and unget, and I think you'll agree that 
our approach is simpler). We'll begin our hike into the future by 
translating this concept into our new, unit-based organization.  
The first unit, appropriately called Input, is shown below:


{--------------------------------------------------------------}
unit Input;
{--------------------------------------------------------------}
interface
var Look: char;              	{ Lookahead character }
procedure GetChar;            { Read new character  }

{--------------------------------------------------------------}
implementation

{--------------------------------------------------------------}
{ Read New Character From Input Stream }

procedure GetChar;
begin
	Read(Look);
end;

{--------------------------------------------------------------}
{ Unit Initialization }
begin
	GetChar;
end.
{--------------------------------------------------------------}


As you can see, there's nothing very profound, and certainly 
nothing complicated, about this unit, since it consists of only a 
single procedure.  But already, we can see how the use of units 
gives us advantages.  Note the executable code in the 
initialization block.  This code "primes the pump" of the input 
stream for us, something we've always had to do before, by 
inserting the call to GetChar in line, or in procedure Init.  This 
time, the call happens without any special reference to it on our 
part, except within the unit itself. As I predicted earlier, this 
mechanism is going to make our lives much simpler as we proceed.
I consider it to be one of the most useful features of Turbo 
Pascal, and I lean on it heavily. 

Copy this unit into your compiler's IDE, and compile it. To test 
the software, of course, we always need a main program.  I used 
the following, really complex test program, which we'll later 
evolve into the Main for our compiler:


{--------------------------------------------------------------}
program Main;
uses WinCRT, Input;
begin
	WriteLn(Look);
end.
{--------------------------------------------------------------}


Note the use of the Borland-supplied unit, WinCRT.  This unit is 
necessary if you intend to use the standard Pascal I/O routines, 
Read, ReadLn, Write, and WriteLn, which of course we intend to do.
If you forget to include this unit in the "uses" clause, you will 
get a really bizarre and indecipherable error message at run time.

Note also that we can access the lookahead character, even though 
it's not declared in the main program.  All variables declared 
within the interface section of a unit are global, but they're 
hidden from prying eyes; to that extent, we get a modicum of 
information hiding.  Of course, if we were writing in an object-
oriented fashion, we should not allow outside modules to access 
the units internal variables.  But, although Turbo units have a 
lot in common with objects, we're not doing object-oriented design 
or code here, so our use of Look is appropriate.

Go ahead and save the test program as Main.pas.  To make life 
easier as we get more and more files, you might want to take this 
opportunity to declare this file as the compiler's Primary file.  
That way, you can execute the program from any file.  Otherwise, 
if you press Cntl-F9 to compile and run from one of the units, 
you'll get an error message.  You set the primary file using the 
main submenu, "Compile," in the Turbo IDE.

I hasten to point out, as I've done before, that the function of 
unit Input is, and always has been, considered to be a dummy 
version of the real thing.  In a production version of a compiler, 
the input stream will, of course, come from a file rather than 
from the keyboard.  And it will almost certainly include line 
buffering, at the very least, and more likely, a rather large text 
buffer to support efficient disk I/O.  The nice part about the 
unit approach is that, as with objects, we can modify the code in 
the unit to be as simple or as sophisticated as we like. As long 
as the interface, as embodied in the public procedures and the 
lookahead character, don't change, the rest of the program is 
totally unaffected.  And since units are compiled, rather than 
merely included, the time required to link with them is virtually 
nil.  Again, the result is that we can get all the benefits of 
sophisticated implementations, without having to carry the code 
around as so much baggage.

In later installments, I intend to provide a full-blown IDE for 
the KISS compiler, using a true Windows application generated by 
Borland's OWL applications framework.  For now, though, we'll obey 
my #1 rule to live by:  Keep It Simple.



THE OUTPUT UNIT

Of course, every decent program should have output, and ours is no 
exception.  Our output routines included the Emit functions.  The 
code for the corresponding output unit is shown next:


{--------------------------------------------------------------}
unit Output;
{--------------------------------------------------------------}
interface
procedure Emit(s: string);			{ Emit an instruction 	}
procedure EmitLn(s: string);		{ Emit an instruction line }

{--------------------------------------------------------------}
implementation
const TAB = ^I;

{--------------------------------------------------------------}
{ Emit an Instruction }

procedure Emit(s: string);
begin
	Write(TAB, s);
end;

{--------------------------------------------------------------}
{ Emit an Instruction, Followed By a Newline }

procedure EmitLn(s: string);
begin
	Emit(s);
	WriteLn;
end;

end.
{--------------------------------------------------------------}


(Notice that this unit has no initialization clause, so it needs 
no begin-block.)
 
Test this unit with the following main program:

{--------------------------------------------------------------}
program Test;
uses WinCRT, Input, Output, Scanner, Parser;
begin
	WriteLn('MAIN:");
	EmitLn('Hello, world!');
end.
{--------------------------------------------------------------}

Did you see anything that surprised you?  You may have been 
surprised to see that you needed to type something, even though 
the main program requires no input.  That's because of the 
initialization in unit Input, which still requires something to 
put into the lookahead character.  Sorry, there's no way out of 
that box, or rather, we don't _WANT_ to get out. Except for simple 
test cases such as this, we will always want a valid lookahead 
character, so the right thing to do about this "problem" is ... 
nothing.

Perhaps more surprisingly, notice that the TAB character had no 
effect; our line of "instructions" begins at column 1, same as the 
fake label.  That's right:  WinCRT doesn't support tabs. We have a 
problem.

There are a few ways we can deal with this problem. The one thing 
we can't do is to simply ignore it.  Every assembler I've ever 
used reserves column 1 for labels, and will rebel to see 
instructions starting there.  So, at the very least, we must space 
the instructions over one column to keep the assembler happy.  .  
That's easy enough to do:  Simply change, in procedure Emit, the 
line:

	Write(TAB, s);
	
by:

	Write(' ', s);

I must admit that I've wrestled with this problem before, and find 
myself changing my mind as often as a chameleon changes color.  
For the purposes we're going to be using, 99% of which will be 
examining the output code as it's displayed on a CRT, it would be 
nice to see neatly blocked out "object" code.  The line:

SUB1:		MOVE	#4,D0

just plain looks neater than the different, but functionally 
identical code,

SUB1:
 MOVE #4,D0

In test versions of my code, I included a more sophisticated 
version of the procedure PostLabel, that avoids having labels on 
separate lines, but rather defers the printing of a label so it 
can end up on the same line as the associated instruction.  As 
recently as an hour ago, my version of unit Output provided full 
support for tabs, using an internal column count variable and 
software to manage it.  I had, if I do say so myself, some rather 
elegant code to support the tab mechanism, with a minimum of code 
bloat. It was awfully tempting to show you the "prettyprint" 
version, if for no other reason than to show off the elegance.

Nevertheless, the code of the "elegant" version was considerably 
more complex and larger.  Since then, I've had second thoughts. In 
spite of our desire to see pretty output, the inescapable fact is 
that the two versions of the MAIN: code fragment shown above are 
functionally identical; the assembler, which is the ultimate 
destination of the code, couldn't care less which version it gets, 
except that the prettier version will contain more characters, 
therefore will use more disk space and take longer to assemble.   
but the prettier one not only takes more code to generate, but 
will create a larger output file, with many more space characters 
than the minimum needed.  When you look at it that way, it's not 
very hard to decide which approach to use, is it?

What finally clinched the issue for me was a reminder to consider 
my own first commandment: KISS.  Although I was pretty proud of 
all my elegant little tricks to implement tabbing, I had to remind 
myself that, to paraphrase Senator Barry Goldwater, elegance in 
the pursuit of complexity is no virtue.  Another wise man once 
wrote, "Any idiot can design a Rolls-Royce. It takes a genius to 
design a VW."  So the elegant, tab-friendly version of Output is 
history, and what you see is the simple, compact, VW version.


THE ERROR UNIT

Our next set of routines are those that handle errors.  To refresh 
your memory, we take the approach, pioneered by Borland in Turbo 
Pascal, of halting on the first error.  Not only does this greatly 
simplify our code, by completely avoiding the sticky issue of 
error recovery, but it also makes much more sense, in my opinion, 
in an interactive environment.  I know this may be an extreme 
position, but I consider the practice of reporting all errors in a 
program to be an anachronism, a holdover from the days of batch 
processing.  It's time to scuttle the practice.  So there.

In our original Cradle, we had two error-handling procedures: 
Error, which didn't halt, and Abort, which did.  But I don't think 
we ever found a use for the procedure that didn't halt, so in the 
new, lean and mean unit Errors, shown next, procedure Error takes 
the place of Abort.


{--------------------------------------------------------------}
unit Errors;
{--------------------------------------------------------------}
interface
procedure Error(s: string);
procedure Expected(s: string);

{--------------------------------------------------------------}
implementation

{--------------------------------------------------------------}
{ Write error Message and Halt }

procedure Error(s: string);
begin
	WriteLn;
	WriteLn(^G, 'Error: ', s, '.');
	Halt;
end;

{--------------------------------------------------------------}
{ Write "<something> Expected" }

procedure Expected(s: string);
begin
	Error(s + ' Expected');
end;

end.
{--------------------------------------------------------------}


As usual, here's a test program:




{--------------------------------------------------------------}
program Test;
uses WinCRT, Input, Output, Errors;

begin
	Expected('Integer');
end.
{--------------------------------------------------------------}

Have you noticed that the "uses" line in our main program keeps 
getting longer?  That's OK. In the final version, the main program 
will only call procedures in our parser, so its use clause will 
only have a couple of entries. But for now, it's probably best to 
include all the units so we can test procedures in them.


SCANNING AND PARSING

The classical compiler architecture consists of separate modules 
for the lexical scanner, which supplies tokens in the language, 
and the parser, which tries to make sense of the tokens as syntax 
elements.  If you can still remember what we did in earlier 
installments, you'll recall that we didn't do things that way.  
Because we're using a predictive parser, we can almost always tell 
what language element is coming next, just by examining the 
lookahead character.  Therefore, we found no need to prefetch 
tokens, as a scanner would do.

But, even though there is no functional procedure called 
"Scanner," it still makes sense to separate the scanning functions 
from the parsing functions.  So I've created two more units 
called, amazingly enough, Scanner and Parser.  The Scanner unit 
contains all of the routines known as recognizers.  Some of these, 
such as IsAlpha, are pure boolean routines which operate on the 
lookahead character only.  The other routines are those which 
collect tokens, such as identifiers and numeric constants. The 
Parser unit will contain all of the routines making up the 
recursive-descent parser.  The general rule should be that unit 
Parser contains all of the information that is language-specific; 
in other words, the syntax of the language should be wholly 
contained in Parser.  In an ideal world, this rule should be true 
to the extent that we can change the compiler to compile a 
different language, merely by replacing the single unit, Parser. 

In practice, things are almost never this pure.  There's always a 
small amount of "leakage" of syntax rules into the scanner as 
well.  For example, the rules concerning what makes up a legal 
identifier or constant may vary from language to language.  In 
some languages, the rules concerning comments permit them to be 
filtered by the scanner, while in others they do not. So in 
practice, both units are likely to end up having language-
dependent components, but the changes required to the scanner 
should be relatively trivial. 

Now, recall that we've used two versions of the scanner routines: 
One that handled only single-character tokens, which we used for a 
number of our tests, and another that provided full support for 
multi-character tokens.  Now that we have our software separated 
into units, I don't anticipate getting much use out of the single-
character version, but it doesn't cost us much to provide for 
both.  I've created two versions of the Scanner unit.  The first 
one, called Scanner1, contains the single-digit version of the 
recognizers:


{--------------------------------------------------------------}
unit Scanner1;
{--------------------------------------------------------------}
interface
uses Input, Errors;

function IsAlpha(c: char): boolean;
function IsDigit(c: char): boolean;
function IsAlNum(c: char): boolean;
function IsAddop(c: char): boolean;
function IsMulop(c: char): boolean;

procedure Match(x: char);
function GetName: char;
function GetNumber: char;

{--------------------------------------------------------------}
implementation

{--------------------------------------------------------------}
{ Recognize an Alpha Character }

function IsAlpha(c: char): boolean;
begin
	IsAlpha := UpCase(c) in ['A'..'Z'];
end;

{--------------------------------------------------------------}
{ Recognize a Numeric Character }

function IsDigit(c: char): boolean;
begin
	IsDigit := c in ['0'..'9'];
end;

{--------------------------------------------------------------}
{ Recognize an Alphanumeric Character }

function IsAlnum(c: char): boolean;
begin
	IsAlnum := IsAlpha(c) or IsDigit(c);
end;

{--------------------------------------------------------------}
{ Recognize an Addition Operator }

function IsAddop(c: char): boolean;
begin
	IsAddop := c in ['+','-'];
end;

{--------------------------------------------------------------}
{ Recognize a Multiplication Operator }

function IsMulop(c: char): boolean;
begin
	IsMulop := c in ['*','/'];
end;

{--------------------------------------------------------------}
{ Match One Character }

procedure Match(x: char);
begin
	if Look = x then GetChar
	else Expected('''' + x + '''');
end;

{--------------------------------------------------------------}
{ Get an Identifier }

function GetName: char;
begin
	if not IsAlpha(Look) then Expected('Name');
	GetName := UpCase(Look);
	GetChar;
end;

{--------------------------------------------------------------}
{ Get a Number }

function GetNumber: char;
begin
	if not IsDigit(Look) then Expected('Integer');
	GetNumber := Look;
	GetChar;
end;

end.
{--------------------------------------------------------------}


The following code fragment of the main program provides a good 
test of the scanner.  For brevity, I'll only include the 
executable code here; the rest remains the same.  Don't forget, 
though, to add the name Scanner1 to the "uses" clause.

	Write(GetName);
	Match('=');
	Write(GetNumber);
	Match('+');
	WriteLn(GetName);

This code will recognize all sentences of the form:

	x=0+y

where x and y can be any single-character variable names, and 0 
any digit.  The code should reject all other sentences, and give a 
meaningful error message. If it did, you're in good shape and we 
can proceed.


THE SCANNER UNIT

The next, and by far the most important, version of the scanner is 
the one that handles the multi-character tokens that all real 
languages must have.  Only the two functions, GetName and 
GetNumber, change between the two units, but just to be sure there 
are no mistakes, I've reproduced the entire unit here.  This is 
unit Scanner:


{--------------------------------------------------------------}
unit Scanner;
{--------------------------------------------------------------}
interface
uses Input, Errors;

function IsAlpha(c: char): boolean;
function IsDigit(c: char): boolean;
function IsAlNum(c: char): boolean;
function IsAddop(c: char): boolean;
function IsMulop(c: char): boolean;

procedure Match(x: char);
function GetName: string;
function GetNumber: longint;

{--------------------------------------------------------------}
implementation

{--------------------------------------------------------------}
{ Recognize an Alpha Character }

function IsAlpha(c: char): boolean;
begin
	IsAlpha := UpCase(c) in ['A'..'Z'];
end;

{--------------------------------------------------------------}
{ Recognize a Numeric Character }

function IsDigit(c: char): boolean;
begin
	IsDigit := c in ['0'..'9'];
end;

{--------------------------------------------------------------}
{ Recognize an Alphanumeric Character }

function IsAlnum(c: char): boolean;
begin
	IsAlnum := IsAlpha(c) or IsDigit(c);
end;

{--------------------------------------------------------------}
{ Recognize an Addition Operator }

function IsAddop(c: char): boolean;
begin
	IsAddop := c in ['+','-'];
end;

{--------------------------------------------------------------}
{ Recognize a Multiplication Operator }

function IsMulop(c: char): boolean;
begin
	IsMulop := c in ['*','/'];
end;

{--------------------------------------------------------------}
{ Match One Character }

procedure Match(x: char);
begin
	if Look = x then GetChar
	else Expected('''' + x + '''');
end;

{--------------------------------------------------------------}
{ Get an Identifier }

function GetName: string;
var n: string;
begin
	n := '';
	if not IsAlpha(Look) then Expected('Name');
	while IsAlnum(Look) do begin
		n := n + Look;
		GetChar;
	end;
	GetName := n;
end;

{--------------------------------------------------------------}
{ Get a Number }

function GetNumber: string;
var n: string;
begin
	n := '';
	if not IsDigit(Look) then Expected('Integer');
	while IsDigit(Look) do begin
		n := n + Look;
		GetChar;
	end;
	GetNumber := n;
end;

end.
{--------------------------------------------------------------}


The same test program will test this scanner, also. Simply change 
the "uses" clause to use Scanner instead of Scanner1.  Now you 
should be able to type multi-character names and numbers.


DECISIONS, DECISIONS

In spite of the relative simplicity of both scanners, a lot of 
thought has gone into them, and a lot of decisions had to be made.  
I'd like to share those thoughts with you now so you can make your 
own educated decision, appropriate for your application.  First, 
note that both versions of GetName translate the input characters 
to upper case.  Obviously, there was a design decision made here, 
and this is one of those cases where the language syntax splatters 
over into the scanner.  In the C language, the case of characters 
in identifiers is significant.  For such a language, we obviously 
can't map the characters to upper case.  The design I'm using 
assumes a language like Pascal, where the case of characters 
doesn't matter.  For such languages, it's easier to go ahead and 
map all identifiers to upper case in the scanner, so we don't have 
to worry later on when we're comparing strings for equality.

We could have even gone a step further, and map the characters to 
upper case right as they come in, in GetChar.  This approach works 
too, and I've used it in the past, but it's too confining. 
Specifically, it will also map characters that may be part of 
quoted strings, which is not a good idea.  So if you're going to 
map to upper case at all, GetName is the proper place to do it.

Note that the function GetNumber in this scanner returns a string, 
just as GetName does.  This is another one of those things I've 
oscillated about almost daily, and the last swing was all of ten 
minutes ago.  The alternative approach, and one I've used many 
times in past installments, returns an integer result.

Both approaches have their good points. Since we're fetching a 
number, the approach that immediately comes to mind is to return 
it as an integer.  But bear in mind that the eventual use of the 
number will be in a write statement that goes back to the outside 
world.  Someone -- either us or the code hidden inside the write 
statement -- is going to have to convert the number back to a 
string again.  Turbo Pascal includes such string conversion 
routines, but why use them if we don't have to?  Why convert a 
number from string to integer form, only to convert it right back 
again in the code generator, only a few statements later?

Furthermore, as you'll soon see, we're going to need a temporary 
storage spot for the value of the token we've fetched. If we treat 
the number in its string form, we can store the value of either a 
variable or a number in the same string.  Otherwise, we'll have to 
create a second, integer variable.

On the other hand, we'll find that carrying the number as a string 
virtually eliminates any chance of optimization later on.  As we 
get to the point where we are beginning to concern ourselves with 
code generation, we'll encounter cases in which we're doing 
arithmetic on constants.  For such cases, it's really foolish to 
generate code that performs the constant arithmetic at run time.  
Far better to let the parser do the arithmetic at compile time, 
and merely code the result.  To do that, we'll wish we had the 
constants stored as integers rather than strings.

What finally swung me back over to the string approach was an 
aggressive application of the KISS test, plus reminding myself 
that we've studiously avoided issues of code efficiency.  One of 
the things that makes our simple-minded parsing work, without the 
complexities of a "real" compiler, is that we've said up front 
that we aren't concerned about code efficiency.  That gives us a 
lot of freedom to do things the easy way rather than the efficient 
one, and it's a freedom we must be careful not to abandon 
voluntarily, in spite of the urges for efficiency shouting in our 
ear.  In addition to being a big believer in the KISS philosophy, 
I'm also an advocate of "lazy programming," which in this context 
means, don't program anything until you need it.  As P.J. Plauger 
says, "Never put off until tomorrow what you can put off 
indefinitely."  Over the years, much code has been written to 
provide for eventualities that never happened.  I've learned that 
lesson myself, from bitter experience.  So the bottom line is:  We 
won't convert to an integer here because we don't need to.  It's 
as simple as that.

For those of you who still think we may need the integer version 
(and indeed we may), here it is:


{--------------------------------------------------------------}
{ Get a Number (integer version) }

function GetNumber: longint;
var n: longint;
begin
	n := 0;
	if not IsDigit(Look) then Expected('Integer');
	while IsDigit(Look) do begin
		n := 10 * n + (Ord(Look) - Ord('0'));
		GetChar;
	end;
	GetNumber := n;
end;
{--------------------------------------------------------------}

You might file this one away, as I intend to, for a rainy day.


PARSING

At this point, we have distributed all the routines that made up 
our Cradle into units that we can draw upon as we need them.  
Obviously, they will evolve further as we continue the process of 
bootstrapping ourselves up again, but for the most part their 
content, and certainly the architecture that they imply, is 
defined.  What remains is to embody the language syntax into the 
parser unit.  We won't do much of that in this installment, but I 
do want to do a little, just to leave us with the good feeling 
that we still know what we're doing.  So before we go, let's 
generate just enough of a parser to process single factors in an 
expression.  In the process, we'll also, by necessity, find we 
have created a code generator unit, as well.

Remember the very first installment of this series?  We read an 
integer value, say n, and generated the code to load it into the 
D0 register via an immediate move:

	MOVE #n,D0

Shortly afterwards, we repeated the process for a variable, 

	MOVE X(PC),D0

and then for a factor that could be either constant or variable.
For old times sake, let's revisit that process.  Define the 
following new unit:


{--------------------------------------------------------------}
unit Parser;
{--------------------------------------------------------------}
interface
uses Input, Scanner, Errors, CodeGen;
procedure Factor;

{--------------------------------------------------------------}
implementation

{--------------------------------------------------------------}
{ Parse and Translate a Factor }

procedure Factor;
begin
	LoadConstant(GetNumber);
end;

end.
{--------------------------------------------------------------}


As you can see, this unit calls a procedure, LoadConstant, which 
actually effects the output of the assembly-language code.  The 
unit also uses a new unit, CodeGen.  This step represents the last 
major change in our architecture, from earlier installments: The 
removal of the machine-dependent code to a separate unit. If I 
have my way, there will not be a single line of code, outside of 
CodeGen, that betrays the fact that we're targeting the 68000 CPU.  
And this is one place I think that having my way is quite 
feasible.  

For those of you who wish I were using the 80x86 architecture (or 
any other one) instead of the 68000, here's your answer:  Merely 
replace CodeGen with one suitable for your CPU of choice.

So far, our code generator has only one procedure in it.  Here's 
the unit:


{--------------------------------------------------------------}
unit CodeGen;

{--------------------------------------------------------------}
interface
uses Output;
procedure LoadConstant(n: string);

{--------------------------------------------------------------}
implementation

{--------------------------------------------------------------}
{ Load the Primary Register with a Constant }

procedure LoadConstant(n: string);
begin
	EmitLn('MOVE #' + n + ',D0' );
end;

end.
{--------------------------------------------------------------}


Copy and compile this unit, and execute the following main 
program:

{--------------------------------------------------------------}
program Main;
uses WinCRT, Input, Output, Errors, Scanner, Parser;
begin
	Factor;
end.
{--------------------------------------------------------------}


There it is, the generated code, just as we hoped it would be.

Now, I hope you can begin to see the advantage of the unit-based 
architecture of our new design.  Here we have a main program 
that's all of five lines long. That's all of the program we need 
to see, unless we choose to see more.  And yet, all those units 
are sitting there, patiently waiting to serve us.  We can have our 
cake and eat it too, in that we have simple and short code, but 
powerful allies.  What remains to be done is to flesh out the 
units to match the capabilities of earlier installments.  We'll do 
that in the next installment, but before I close, let's finish out 
the parsing of a factor, just to satisfy ourselves that we still 
know how.  The final version of CodeGen includes the new 
procedure, LoadVariable:

{--------------------------------------------------------------}
unit CodeGen;

{--------------------------------------------------------------}
interface
uses Output;
procedure LoadConstant(n: string);
procedure LoadVariable(Name: string);

{--------------------------------------------------------------}
implementation

{--------------------------------------------------------------}
{ Load the Primary Register with a Constant }

procedure LoadConstant(n: string);
begin
	EmitLn('MOVE #' + n + ',D0' );
end;

{--------------------------------------------------------------}
{ Load a Variable to the Primary Register }

procedure LoadVariable(Name: string);
begin
	EmitLn('MOVE ' + Name + '(PC),D0');
end;


end.
{--------------------------------------------------------------}


The parser unit itself doesn't change, but we have a more complex 
version of procedure Factor:

{--------------------------------------------------------------}
{ Parse and Translate a Factor }

procedure Factor;
begin
	if IsDigit(Look) then
		LoadConstant(GetNumber)
	else if IsAlpha(Look)then
		LoadVariable(GetName)
	else
		Error('Unrecognized character ' + Look);
end;
{--------------------------------------------------------------}

 
Now, without altering the main program, you should find that our 
program will process either a variable or a constant factor.  At 
this point, our architecture is almost complete; we have units to 
do all the dirty work, and enough code in the parser and code 
generator to demonstrate that everything works.  What remains is 
to flesh out the units we've defined, particularly the parser and 
code generator, to support the more complex syntax elements that 
make up a real language.  Since we've done this many times before 
in earlier installments, it shouldn't take long to get us back to 
where we were before the long hiatus.  We'll continue this process 
in Installment 16, coming soon.  See you then.



REFERENCES

1. Crenshaw, J.W., "Object-Oriented Design of Assemblers and 
Compilers," Proc. Software Development '91 Conference, Miller 
Freeman, San Francisco, CA, February 1991, pp. 143-155.

2. Crenshaw, J.W., "A Perfect Marriage," Computer Language, Volume 
8, #6, June 1991, pp. 44-55.

3. Crenshaw, J.W., "Syntax-Driven Object-Oriented Design," Proc. 
1991 Embedded Systems Conference, Miller Freeman, San 
Francisco, CA, September 1991, pp. 45-60.





 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
                       LET'S BUILD A COMPILER! 
 
                                 By 
 
                       Jack W. Crenshaw, Ph.D. 
 
                            29 May, 1995 
 
                     Part 16: UNIT CONSTRUCTION 
 



 
INTRODUCTION 
 
This series of tutorials promises to be perhaps one of the longest-
running mini-series in history, rivalled only by the delay in Volume IV 
of Knuth.  Begun in 1988, the series ran into a four-year hiatus in 1990 
when the "cares of this world," changes in priorities and interests, and 
the need to make a living seemed to stall it out after Installment 14.  
Those of you with loads of patience were finally rewarded, in the spring 
of last year, with the long-awaited Installment 15.  In it, I began to 
try to steer the series back on track, and in the process, to make it 
easier to continue on to the goal, which is to provide you with not only 
enough understanding of the difficult subject of compiler theory, but 
also enough tools, in the form of canned subroutines and concepts, so 
that you would be able to continue on your own and become proficient 
enough to build your own parsers and translators.  Because of that long 
hiatus, I thought it appropriate to go back and review the concepts we 
have covered so far, and to redo some of the software, as well.  In the 
past, we've never concerned ourselves much with the development of 
production-quality software tools ... after all, I was trying to teach 
(and learn) concepts, not production practice.  To do that, I tended to 
give you, not complete compilers or parsers, but only those snippets of 
code that illustrated the particular point we were considering at the 
moment. 
 
I still believe that's a good way to learn any subject; no one wants to 
have to make changes to 100,000 line programs just to try out a new 
idea.  But the idea of just dealing with code snippets, rather than 
complete programs, also has its drawbacks in that we often seemed to be 
writing the same code fragments over and over.  Although repetition has 
been thoroughly proven to be a good way to learn new ideas, it's also 
true that one can have too much of a good thing.  By the time I had 
completed Installment 14 I seemed to have reached the limits of my 
abilities to juggle multiple files and multiple versions of the same 
software functions.  Who knows, perhaps that's one reason I seemed to 
have run out of gas at that point. 
 
Fortunately, the later versions of Borland's Turbo Pascal allow us to 
have our cake and eat it too.  By using their concept of separately 
compilable units, we can still write small subroutines and functions, 
and keep our main programs and test programs small and simple.  But, 
once written, the code in the Pascal units will always be there for us 
to use, and linking them in is totally painless and transparent. 
 
Since, by now, most of you are programming in either C or C++, I know 
what you're thinking:  Borland, with their Turbo Pascal (TP), certainly 
didn't invent the concept of separately compilable modules.  And of 
course you're right.  But if you've not used TP lately, or ever, you may 
not realize just how painless the whole process is.  Even in C or C++, 
you still have to build a make file, either manually or by telling the 
compiler how to do so.  You must also list, using "extern" statements or 
header files, the functions you want to import.  In TP, you don't even 
have to do that.  You need only name the units you wish to use, and all 
of their procedures automatically become available.   
 
 
It's not my intention to get into a language-war debate here, so I won't 
pursue the subject any further.  Even I no longer use Pascal on my job 
... I use C at work and C++ for my articles in Embedded Systems 
Programming and other magazines.  Believe me, when I set out to 
resurrect this series, I thought long and hard about switching both 
languages and target systems to the ones that we're all using these 
days, C/C++ and PC architecture, and possibly object-oriented methods as 
well.  In the end, I felt it would cause more confusion than the hiatus 
itself has. And after all, Pascal still remains one of the best possible 
languages for teaching, not to mention production programming.  Finally, 
TP still compiles at the speed of light, much faster than competing 
C/C++ compilers. And Borland's smart linker, used in TP but not in their 
C++ products, is second to none.  Aside from being much faster than 
Microsoft-compatible linkers, the Borland smart linker will cull unused 
procedures and data items, even to the extent of trimming them out of 
defined objects if they're not needed.  For one of the few times in our 
lives, we don't have to compromise between completeness and efficiency.  
When we're writing a TP unit, we can make it as complete as we like, 
including any member functions and data items we may think we will ever 
need, confident that doing so will not create unwanted bloat in the 
compiled and linked executable. 
 
The point, really, is simply this:  By using TP's unit mechanism, we can 
have all the advantages and convenience of writing small, seemingly 
stand-alone test programs, without having to constantly rewrite the 
support functions that we need.  Once written, the TP units sit there, 
quietly waiting to do their duty and give us the support we need, when 
we need it. 
 
Using this principle, in Installment 15 I set out to minimize our 
tendency to re-invent the wheel by organizing  our code into separate 
Turbo Pascal units, each containing different parts of the compiler.  We 
ended up with the following units: 
 

 
Each of these units serves a different function, and encapsulates 
specific areas of functionality.  The Input and Output units, as their 
name implies, provide character stream I/O and the all-important 
lookahead character upon which our predictive parser is based.  The 
Errors unit, of course, provides standard error handling.  The Scanner 
unit contains all of our boolean functions such as IsAlpha, and the 
routines GetName and GetNumber, which process multi-character tokens. 
 
The two units we'll be working with the most, and the ones that most 
represent the personality of our compiler, are Parser and CodeGen.  
Theoretically, the Parser unit should encapsulate all aspects of the 
compiler that depend on the syntax of the compiled language (though, as 
we saw last time, a small amount of this syntax spills over into 
Scanner).  Similarly, the code generator unit, CodeGen, contains all of 
the code dependent upon the target machine.  In this installment, we'll 
be continuing with the development of the functions in these two all-
important units. 
 
 


JUST LIKE CLASSICAL? 
 
Before we proceed, however, I think I should clarify the relationship  
between, and the functionality of these units.  Those of you who are 
familiar with compiler theory as taught in universities will, of course, 
recognize the names, Scanner, Parser, and CodeGen, all of which are 
components of a classical compiler implementation.  You may be thinking 
that I've abandoned my commitment to the KISS philosophy, and drifted 
towards a more conventional architecture than we once had.  A closer 
look, however, should convince you that, while the names are similar, 
the functionalities are quite different. 
 
Together, the scanner and parser of a classical implementation comprise 
the so-called "front end," and the code generator, the back end.  The 
front end routines process the language-dependent, syntax-related 
aspects of the source language, while the code generator, or back end, 
deals with the target machine-dependent parts of the problem.  In 
classical compilers, the two ends communicate via a file of instructions 
written in an intermediate language (IL). 
 
Typically, a classical scanner is a single procedure, operating as a co-
procedure with the parser.  It "tokenizes" the source file, reading it 
character by character, recognizing language elements, translating them 
into tokens, and passing them along to the parser.  You can think of the 
parser as an abstract machine, executing "op codes," which are the 
tokens.  Similarly, the parser generates op codes of a second abstract 
machine, which mechanizes the IL.  Typically, the IL file is written to 
disk by the parser, and read back again by the code generator. 
 
Our organization is quite different.  We have no lexical scanner, in the 
classical sense;  our unit Scanner, though it has a similar name, is not 
a single procedure or co-procedure, but merely a set of separate 
subroutines which are called by the parser as needed.  
 
Similarly, the classical code generator, the back end,  is a translator 
in its own right, reading an IL "source" file, and emitting an object 
file.  Our code generator doesn't work that way.  In our compiler, there 
IS no intermediate language; every construct in the source language 
syntax is converted into assembly language as it is recognized by the 
parser.  Like Scanner, the unit CodeGen consists of individual 
procedures which are called by the parser as needed. 
 
This "code 'em as you find 'em" philosophy may not produce the world's 
most efficient code -- for example, we haven't provided (yet!) a 
convenient place for an optimizer to work its magic -- but it sure does 
simplify the compiler, doesn't it? 
 
And that observation prompts me to reflect, once again, on how we have 
managed to reduce a compiler's functions to such comparatively simple 
terms.  I've waxed eloquent on this subject in past installments, so I 
won't belabor the point too much here.  However, because of the time 
that's elapsed since those last soliloquies, I hope you'll grant me just 
a little time to remind myself, as well as you, how we got here.  We got 
here by applying several principles that writers of commercial compilers 
seldom have the luxury of using.  These are: 
 
o	The KISS philosophy -- Never do things the hard way without a 
reason 
 
o	Lazy coding -- Never put off until tomorrow what you can put 
of forever (with credits to P.J. Plauger) 
 
o	Skepticism -- Stubborn refusal to do something just because 
that's the way it's always been done. 
 
o	Acceptance of inefficient code 
 
o	Rejection of arbitrary constraints 
 
As I've reviewed the history of compiler construction, I've learned that 
virtually every production compiler in history has suffered from pre-
imposed conditions that strongly influenced its design. The original 
FORTRAN compiler of John Backus, et al, had to compete with assembly 
language, and therefore was constrained to produce extremely efficient 
code.  The IBM compilers for the minicomputers of the 70's had to run in 
the very small RAM memories then available -- as small as 4k.  The early 
Ada compiler had to compile itself.  Per Brinch Hansen decreed that his 
Pascal compiler developed for the IBM PC must execute in a 64k machine.  
Compilers developed in Computer Science courses had to compile the 
widest variety of languages, and therefore required LALR parsers. 
 
In each of these cases, these preconceived constraints literally 
dominated the design of the compiler.  
 
A good example is Brinch Hansen's compiler, described in his excellent 
book, "Brinch Hansen on Pascal Compilers" (highly recommended).  Though 
his compiler is one of the most clear and un-obscure compiler 
implementations I've seen, that one decision, to compile large files in 
a small RAM, totally drives the design, and he ends up with not just 
one, but many intermediate files, together with the drivers to write and 
read them. 
 
In time, the architectures resulting from such decisions have found 
their way into computer science lore as articles of faith. In this one 
man's opinion, it's time that they were re-examined critically.  The 
conditions, environments, and requirements that led to classical 
architectures are not the same as the ones we have today.  There's no 
reason to believe the solutions should be the same, either. 
 
In this tutorial, we've followed the leads of such pioneers in the world 
of small compilers for Pcs as Leor Zolman, Ron Cain, and James Hendrix, 
who didn't know enough compiler theory to know that they "couldn't do it 
that way."  We have resolutely refused to accept arbitrary constraints, 
but rather have done whatever was easy.  As a result, we have evolved an 
architecture that, while quite different from the classical one, gets 
the job done in very simple and straightforward fashion. 
 
I'll end this philosophizing with an observation re the notion of an 
intermediate language.  While I've noted before that we don't have one 
in our compiler, that's not exactly true; we _DO_ have one, or at least 
are evolving one, in the sense that we are defining code generation 
functions for the parser to call.  In essence, every call to a code 
generation procedure can be thought of as an instruction in an 
intermediate language.  Should we ever find it necessary to formalize an 
intermediate language, this is the way we would do it:  emit codes from 
the parser, each representing a call to one of the code generator 
procedures, and then process each code by calling those procedures in a 
separate pass, implemented in a back end. Frankly, I don't see that 
we'll ever find a need for this approach, but there is the connection, 
if you choose to follow it, between the classical and the current 
approaches. 
 


FLESHING OUT THE PARSER 
 
Though I promised you, somewhere along about Installment 14, that we'd 
never again write every single function from scratch, I ended up 
starting to do just that in Installment 15.  One reason: that long 
hiatus between the two installments made a review seem eminently 
justified ... even imperative, both for you and for me. More 
importantly, the decision to collect the procedures into modules 
(units), forced us to look at each one yet again, whether we wanted to 
or not.  And, finally and frankly, I've had some new ideas in the last 
four years that warranted a fresh look at some old friends.  When I 
first began this series, I was frankly amazed, and pleased, to learn 
just how simple parsing routines can be made.  But this last time 
around, I've surprised myself yet again, and been able to make them just 
that last little bit simpler, yet. 
 
Still, because of this total rewrite of the parsing modules, I was only 
able to include so much in the last installment.  Because of this, our 
hero, the parser, when last seen, was a shadow of his former self,  
consisting of only enough code to parse and process a factor consisting 
of either a variable or a constant.  The main effort of this current 
installment will be to help flesh out the parser to its former glory.  
In the process, I hope you'll bear with me if we sometimes cover ground 
we've long since been over and dealt with. 
 
First, let's take care of a problem that we've addressed before: Our 
current version of procedure Factor, as we left it in Installment 15,  
can't handle negative arguments.  To fix that, we'll introduce the 
procedure SignedFactor: 
 
 
{--------------------------------------------------------------} 
{ Parse and Translate a Factor with Optional Sign } 
 
procedure SignedFactor; 
var Sign: char; 
begin 
	Sign := Look; 
	if IsAddop(Look) then 
		GetChar; 
	Factor; 
	if Sign = '-' then Negate; 
end; 
{--------------------------------------------------------------}  
 
 
Note that this procedure calls a new code generation routine, Negate: 
 
 
{--------------------------------------------------------------} 
{ Negate Primary } 
 
procedure Negate; 
begin 
	EmitLn('NEG D0'); 
end; 
{--------------------------------------------------------------} 
 
 
(Here, and elsewhere in this series, I'm only going to show you the new 
routines. I'm counting on you to put them into the proper unit, which 
you should normally have no trouble identifying.  Don't forget to add 
the procedure's prototype to the interface section of the unit.) 
 
In the main program, simply change the procedure called from Factor to 
SignedFactor, and give the code a test.  Isn't it neat how the Turbo 
linker and make facility handle all the details? 
 
Yes, I know, the code isn't very efficient.  If we input a number, -3, 
the generated code is: 
 
	MOVE #3,D0 
	NEG D0 
 
which is really, really dumb.  We can do better, of course, by simply 
pre-appending a minus sign to the string passed to LoadConstant, but it 
adds a few lines of code to SignedFactor, and I'm applying the KISS 
philosophy very aggressively here. What's more,  to tell the truth, I 
think I'm subconsciously enjoying generating "really, really dumb" code, 
so I can have the pleasure of watching it get dramatically better when 
we get into optimization methods. 
 
Most of you have never heard of John Spray, so allow me to introduce him 
to you here.  John's from New Zealand, and used to teach computer 
science at one of its universities.  John wrote a compiler for the 
Motorola 6809, based on a delightful, Pascal-like language of his own 
design called "Whimsical."  He later ported the compiler to the 68000, 
and for awhile it was the only compiler I had for my homebrewed 68000 
system.   
 
For the record, one of my standard tests for any new compiler is to see 
how the compiler deals with a null program like: 
 
	program main; 
	begin 
	end. 
 
My test is to measure the time required to compile and link, and the 
size of the object file generated.  The undisputed _LOSER_ in the test 
is the DEC C compiler for the VAX, which took 60 seconds to compile, on 
a VAX 11/780, and generated a 50k object file.  John's compiler is the 
undisputed, once, future, and forever king in the code size department.  
Given the null program, Whimsical generates precisely two bytes of code, 
implementing the one instruction, 
 
	RET 
 
By setting a compiler option to generate an include file rather than a 
standalone program, John can even cut this size, from two bytes to zero!  
Sort of hard to beat a null object file, wouldn't you say? 
 
Needless to say, I consider John to be something of an expert on code 
optimization, and I like what he has to say: "The best way to optimize 
is not to have to optimize at all, but to produce good code in the first 
place." Words to live by.  When we get started on optimization, we'll 
follow John's advice, and our first step will not be to add a peephole 
optimizer or other after-the-fact device, but to improve the quality of 
the code emitted before optimization.  So make a note of SignedFactor as 
a good first candidate for attention, and for now we'll leave it be. 
 
TERMS AND EXPRESSIONS 
 
I'm sure you know what's coming next: We must, yet again, create the 
rest of the procedures that implement the recursive-descent parsing of 
an expression.  We all know that the hierarchy of procedures for 
arithmetic expressions is: 
 
expression 
	term 
		factor 
 
However, for now let's continue to do things one step at a time, 
and consider only expressions with additive terms in them.  The 
code to implement expressions, including a possibly signed first 
term, is shown next: 
 
 
{--------------------------------------------------------------} 
{ Parse and Translate an Expression } 
 
procedure Expression; 
begin 
	SignedFactor; 
	while IsAddop(Look) do 
		case Look of 
			'+': Add; 
			'-': Subtract; 
		end; 
end; 
{--------------------------------------------------------------} 
 
 
This procedure calls two other procedures to process the 
operations: 
 
 
{--------------------------------------------------------------} 
{ Parse and Translate an Addition Operation } 
 
procedure Add; 
begin 
	Match('+'); 
	Push; 
	Factor; 
	PopAdd; 
end; 
 
 
{--------------------------------------------------------------} 
{ Parse and Translate a Subtraction Operation } 
 
procedure Subtract; 
begin 
	Match('-'); 
	Push; 
	Factor; 
	PopSub; 
end; 
{--------------------------------------------------------------} 
 
 
The three procedures Push, PopAdd, and PopSub are new code generation 
routines.  As the name implies, procedure Push generates code to push 
the primary register (D0, in our 68000 implementation) to the stack.  
PopAdd and PopSub pop the top of the stack again, and add it to, or 
subtract it from, the primary register.  The code is shown next: 
 
 


{--------------------------------------------------------------} 
{ Push Primary to Stack } 
 
procedure Push; 
begin 
	EmitLn('MOVE D0,-(SP)'); 
end; 
 
{--------------------------------------------------------------} 
{ Add TOS to Primary } 
 
procedure PopAdd; 
begin 
	EmitLn('ADD (SP)+,D0'); 
end; 
 
{--------------------------------------------------------------} 
{ Subtract TOS from Primary } 
 
procedure PopSub; 
begin 
	EmitLn('SUB (SP)+,D0'); 
	Negate; 
end; 
{--------------------------------------------------------------} 
 
 
Add these routines to Parser and CodeGen, and change the main program to 
call Expression. Voila! 
 
The next step, of course, is to add the capability for dealing with 
multiplicative terms.  To that end, we'll add a procedure Term, and code 
generation procedures PopMul and PopDiv.  These code generation 
procedures are shown next: 
 
 
{--------------------------------------------------------------} 
{ Multiply TOS by Primary } 
 
procedure PopMul; 
begin 
	EmitLn('MULS (SP)+,D0'); 
end; 
 
{--------------------------------------------------------------} 
{ Divide Primary by TOS } 
 
procedure PopDiv; 
begin 
	EmitLn('MOVE (SP)+,D7'); 
	EmitLn('EXT.L D7'); 
	EmitLn('DIVS D0,D7'); 
	EmitLn('MOVE D7,D0'); 
end; 
{--------------------------------------------------------------} 
 
 
I admit, the division routine is a little busy, but there's no help for 
it.  Unfortunately, while the 68000 CPU allows a division using the top 
of stack (TOS), it wants the arguments in the wrong order, just as it 
does for subtraction.  So our only recourse is to pop the stack to a 
scratch register (D7), perform the division there, and then move the 
result back to our primary register, D0. Note the use of signed multiply 
and divide operations.  This follows an implied, but unstated, 
assumption, that all our variables will be signed 16-bit integers. This 
decision will come back to haunt us later, when we start looking at 
multiple data types, type conversions, etc. 
 
Our procedure Term is virtually a clone of Expression, and looks like 
this: 
 
 
{--------------------------------------------------------------} 
{ Parse and Translate a Term } 
 
procedure Term; 
begin 
	Factor; 
	while IsMulop(Look) do 
		case Look of 
			'*': Multiply; 
			'/': Divide; 
		end; 
end; 
{--------------------------------------------------------------} 
 
 
Our next step is to change some names.  SignedFactor now becomes 
SignedTerm, and the calls to Factor in Expression, Add, Subtract and 
SignedTerm get changed to call Term: 
 
 
{--------------------------------------------------------------} 
{ Parse and Translate a Term with Optional Leading Sign } 
 
procedure SignedTerm; 
var Sign: char; 
begin 
	Sign := Look; 
	if IsAddop(Look) then 
		GetChar; 
	Term; 
	if Sign = '-' then Negate; 
end; 
{--------------------------------------------------------------} 
... 
{--------------------------------------------------------------} 
{ Parse and Translate an Expression } 
 
procedure Expression; 
begin 
	SignedTerm; 
	while IsAddop(Look) do 
		case Look of 
			'+': Add; 
			'-': Subtract; 
		end; 
end; 
{--------------------------------------------------------------} 
 
 
If memory serves me correctly, we once had BOTH a procedure SignedFactor 
and a procedure SignedTerm. I had reasons for doing that at the time ... 
they had to do with the handling of Boolean algebra and, in particular, 
the Boolean "not" function.  But certainly, for arithmetic operations, 
that duplication isn't necessary.  In an expression like: 
 
	-x*y 
 
it's very apparent that the sign goes with the whole TERM, x*y, and not 
just the factor x, and that's the way Expression is coded.   
 
Test this new code by executing Main.  It still calls Expression, so you 
should now be able to deal with expressions containing any of the four 
arithmetic operators. 
 
Our last bit of business, as far as expressions goes, is to modify 
procedure Factor to allow for parenthetical expressions.  By using a 
recursive call to Expression, we can reduce the needed code to virtually 
nothing.  Five lines added to Factor do the job: 
 
 
{--------------------------------------------------------------} 
{ Parse and Translate a Factor } 
 
procedure Factor; 
begin 
	if Look ='(' then begin 
		Match('('); 
		Expression; 
		Match(')'); 
		end 
	else if IsDigit(Look) then 
		LoadConstant(GetNumber) 
	else if IsAlpha(Look)then 
		LoadVariable(GetName) 
	else 
		Error('Unrecognized character ' + Look); 
end; 
{--------------------------------------------------------------} 
 
 
At this point, your "compiler" should be able to handle any legal 
expression you can throw at it.  Better yet, it should reject all 
illegal ones! 
 
ASSIGNMENTS 
 
As long as we're this close, we might as well create the code to deal 
with an assignment statement.  This code needs only to remember the name 
of the target variable where we are to store the result of an 
expression, call Expression, then store the number.  The procedure is 
shown next: 
 
 
{--------------------------------------------------------------} 
{ Parse and Translate an Assignment Statement } 
 
procedure Assignment; 
var Name: string; 
begin 
	Name := GetName; 
	Match('='); 
	Expression; 
	StoreVariable(Name); 
end; 
{--------------------------------------------------------------} 
 
The assignment calls for yet another code generation routine: 
 


{--------------------------------------------------------------} 
{ Store the Primary Register to a Variable } 
 
procedure StoreVariable(Name: string); 
begin 
	EmitLn('LEA ' + Name + '(PC),A0'); 
	EmitLn('MOVE D0,(A0)'); 
end; 
{--------------------------------------------------------------} 
 
 
Now, change the call in Main to call Assignment, and you should see a 
full assignment statement being processed correctly.  Pretty neat, eh?  
And painless, too. 
 
In the past, we've always tried to show BNF relations to define the 
syntax we're developing. I haven't done that here, and it's high time I 
did.  Here's the BNF: 
 
 
<factor>      ::= <variable> | <constant> | '(' <expression> ')'	 
<signed_term> ::= [<addop>] <term> 
<term>        ::= <factor> (<mulop> <factor>)*	 
<expression>  ::= <signed_term> (<addop> <term>)* 
<assignment>  ::= <variable> '=' <expression> 
 
BOOLEANS 
 
The next step, as we've learned several times before, is to add Boolean 
algebra.  In the past, this step has at least doubled the amount of code 
we've had to write.  As I've gone over this step in my mind, I've found 
myself diverging more and more from what we did in previous 
installments.  To refresh your memory, I noted that Pascal treats the 
Boolean operators pretty much identically to the way it treats 
arithmetic ones.  A Boolean "and" has the same precedence level as 
multiplication, and the "or" as addition.  C, on the other hand, sets 
them at different precedence levels, and all told has a whopping 17 
levels.  In our earlier work, I chose something in between, with seven 
levels.  As a result, we ended up with things called Boolean 
expressions, paralleling in most details the arithmetic expressions, but 
at a different precedence level.  All of this, as it turned out, came 
about because I didn't like having to put parentheses around the Boolean 
expressions in statements like: 
 
	     IF (c >= 'A') and (c <= 'Z') then ... 
 
In retrospect, that seems a pretty petty reason to add many layers of 
complexity to the parser.  Perhaps more to the point, I'm not sure I was 
even able to avoid the parens.   
 
For kicks, let's start anew, taking a more Pascal-ish approach, and just 
treat the Boolean operators at the same precedence level as the 
arithmetic ones. We'll see where it leads us.  If it seems to be down 
the garden path, we can always backtrack to the earlier approach. 
 
For starters, we'll add the "addition-level" operators to Expression. 
That's easily done; first, modify the function IsAddop in unit Scanner 
to include two extra operators: '|' for "or," and '~' for "exclusive 
or": 
 
 


{--------------------------------------------------------------} 
function IsAddop(c: char): boolean; 
begin 
	IsAddop := c in ['+','-', '|', '~']; 
end; 
{--------------------------------------------------------------} 
 
 
Next, we must include the parsing of the operators in procedure 
Expression: 
 
 
{--------------------------------------------------------------} 
procedure Expression; 
begin 
	SignedTerm; 
	while IsAddop(Look) do 
		case Look of 
			'+': Add; 
			'-': Subtract; 
			'|': _Or; 
			'~': _Xor; 
		end; 
{--------------------------------------------------------------} 
end; 
 
 
(The underscores are needed, of course, because "or" and "xor" are 
reserved words in Turbo Pascal.) 
 
Next, the procedures _Or and _Xor: 
 
 
{--------------------------------------------------------------} 
{ Parse and Translate a Subtraction Operation } 
 
procedure _Or; 
begin 
	Match('|'); 
	Push; 
	Term; 
	PopOr; 
end; 
 
{--------------------------------------------------------------} 
{ Parse and Translate a Subtraction Operation } 
 
procedure _Xor; 
begin 
	Match('~'); 
	Push; 
	Term; 
	PopXor; 
end; 
{--------------------------------------------------------------} 
 
And, finally, the new code generator procedures: 
 
 


{--------------------------------------------------------------} 
{ Or TOS with Primary } 
 
procedure PopOr; 
begin 
	EmitLn('OR (SP)+,D0'); 
end; 
 
{--------------------------------------------------------------} 
{ Exclusive-Or TOS with Primary } 
 
procedure PopXor; 
begin 
	EmitLn('EOR (SP)+,D0'); 
end; 
{--------------------------------------------------------------} 
 
Now, let's test the translator (you might want to change the call 
in Main back to a call to Expression, just to avoid having to type 
"x=" for an assignment every time). 
 
So far, so good.  The parser nicely handles expressions of the 
form: 
 
	x|y~z 
 
Unfortunately, it also does nothing to protect us from mixing 
Boolean and arithmetic algebra.  It will merrily generate code 
for: 
 
	(a+b)*(c~d) 
 
We've talked about this a bit, in the past.  In general the rules 
for what operations are legal or not cannot be enforced by the 
parser itself, because they are not part of the syntax of the 
language, but rather its semantics.  A compiler that doesn't allow 
mixed-mode expressions of this sort must recognize that c and d 
are Boolean variables, rather than numeric ones, and balk at 
multiplying them in the next step. But this "policing" can't be 
done by the parser; it must be handled somewhere between the 
parser and the code generator. We aren't in a position to enforce 
such rules yet, because we haven't got either a way of declaring 
types, or a symbol table to store the types in.  So, for what 
we've got to work with at the moment, the parser is doing 
precisely what it's supposed to do. 
 
Anyway, are we sure that we DON'T want to allow mixed-type 
operations?  We made the decision some time ago (or, at least, I 
did) to adopt the value 0000 as a Boolean "false," and -1, or 
FFFFh, as a Boolean "true."  The nice part about this choice is 
that bitwise operations work exactly the same way as logical ones.  
In other words, when we do an operation on one bit of a logical 
variable, we do it on all of them.  This means that we don't need 
to distinguish between logical and bitwise operations, as is done 
in C with the operators & and &&, and | and ||.  Reducing the 
number of operators by half certainly doesn't seem all bad. 
 
From the point of view of the data in storage, of course, the 
computer and compiler couldn't care less whether the number FFFFh 
represents the logical TRUE, or the numeric -1.  Should we?  I 
sort of think not.  I can think of many examples (though they 
might be frowned upon as "tricky" code) where the ability to mix 
the types might come in handy.  Example, the Dirac delta function, 
which could be coded in one simple line: 
 
	-(x=0) 
 
or the absolute value function (DEFINITELY tricky code!): 
 
	x*(1+2*(x<0)) 
 
Please note, I'm not advocating coding like this as a way of life.  
I'd almost certainly write these functions in more readable form,  
using IFs, just to keep from confusing later maintainers.  Still, 
a moral question arises:  Do we have the right to ENFORCE our 
ideas of good coding practice on the programmer, but writing the 
language so he can't do anything else?  That's what Nicklaus Wirth 
did, in many places in Pascal, and Pascal has been criticized for 
it -- for not being as "forgiving" as C.   
 
An interesting parallel presents itself in the example of the 
Motorola 68000 design.  Though Motorola brags loudly about the 
orthogonality of their instruction set, the fact is that it's far 
from orthogonal.  For example, you can read a variable from its 
address:
 
	MOVE X,D0 (where X is the name of a variable) 
 
but you can't write in the same way.  To write, you must load an 
address register with the address of X.  The same is true for PC-
relative addressing:
 
	MOVE X(PC),DO	(legal) 
	MOVE D0,X(PC)	(illegal) 
 
When you begin asking how such non-orthogonal behavior came about, 
you find that someone in Motorola had some theories about how 
software should be written.  Specifically, in this case, they 
decided that self-modifying code, which you can implement using 
PC-relative writes, is a Bad Thing.  Therefore, they designed the 
processor to prohibit it.  Unfortunately, in the process they also 
prohibited _ALL_ writes of the forms shown above, however benign.  
Note that this was not something done by default.  Extra design 
work had to be done, and extra gates added, to destroy the natural 
orthogonality of the instruction set. 
 
One of the lessons I've learned from life: If you have two 
choices, and can't decide which one to take, sometimes the best 
thing to do is nothing.  Why add extra gates to a processor to 
enforce some stranger's idea of good programming practice?  Leave 
the instructions in, and let the programmers debate what good 
programming practice is.  Similarly, why should we add extra code 
to our parser, to test for and prevent conditions that the user 
might prefer to do, anyway?  I'd rather leave the compiler simple, 
and let the software experts debate whether the practices should 
be used or not. 
 
All of which serves as rationalization for my decision as to how 
to prevent mixed-type arithmetic:  I won't.  For a language 
intended for systems programming, the fewer rules, the better. If 
you don't agree, and want to test for such conditions, we can do 
it once we have a symbol table. 
 
BOOLEAN "AND" 
 
With that bit of philosophy out of the way, we can press on to the 
"and" operator, which goes into procedure Term. By now, you can 
probably do this without me, but here's the code, anyway: 
 
In Scanner, 
 
{--------------------------------------------------------------} 
function IsMulop(c: char): boolean; 
begin 
	IsMulop := c in ['*','/', '&']; 
end; 
{--------------------------------------------------------------} 
 
In Parser, 
 
 
{--------------------------------------------------------------} 
procedure Term; 
begin 
	Factor; 
	while IsMulop(Look) do 
		case Look of 
			'*': Multiply; 
			'/': Divide; 
			'&': _And; 
		end; 
end; 
 
{--------------------------------------------------------------} 
{ Parse and Translate a Boolean And Operation } 
 
procedure _And; 
begin 
	Match('&'); 
	Push; 
	Factor; 
	PopAnd; 
end; 
{--------------------------------------------------------------} 
 
and in CodeGen, 
 
 
{--------------------------------------------------------------} 
{ And Primary with TOS } 
 
procedure PopAnd; 
begin 
	EmitLn('AND (SP)+,D0'); 
end; 
{--------------------------------------------------------------} 
 
Your parser should now be able to process almost any sort of logical 
expression, and (should you be so inclined), mixed-mode expressions as 
well. 
 
Why not "all sorts of logical expressions"?  Because, so far, we haven't 
dealt with the logical "not" operator, and this is where it gets tricky.  
The logical "not" operator seems, at first glance, to be identical in 
its behavior to the unary minus, so my first thought was to let the 
exclusive or operator, '~', double as the unary "not."  That didn't 
work. In my first attempt, procedure SignedTerm simply ate my '~', 
because the character passed the test for an addop, but SignedTerm 
ignores all addops except '-'.  It would have been easy enough to add 
another line to SignedTerm, but that would still not solve the problem, 
because note that Expression only accepts a signed term for the _FIRST_ 
argument.   
 
Mathematically, an expression like: 
 
	-a * -b 
 
makes little or no sense, and the parser should flag it as an error.  
But the same expression, using a logical "not," makes perfect sense: 
 
	not a and not b 
 
In the case of these unary operators, choosing to make them act the same 
way seems an artificial force fit, sacrificing reasonable behavior on 
the altar of implementational ease.  While I'm all for keeping the 
implementation as simple as possible, I don't think we should do so at 
the expense of reasonableness.  Patching like this would be missing the 
main point, which is that the logical "not" is simply NOT the same kind 
of animal as the unary minus.  Consider the exclusive or, which is most 
naturally written as:   
 
	a~b ::= (a and not b) or (not a and b) 
 
If we allow the "not" to modify the whole term, the last term in 
parentheses would be interpreted as: 
 
	not(a and b) 
 
which is not the same thing at all.  So it's clear that the logical 
"not" must be thought of as connected to the FACTOR, not the term. 
 
The idea of overloading the '~' operator also makes no sense from a 
mathematical point of view.  The implication of the unary minus is that 
it's equivalent to a subtraction from zero: 
 
	-x <=> 0-x 
 
In fact, in one of my more simple-minded versions of Expression, I 
reacted to a leading addop by simply preloading a zero, then processing 
the operator as though it were a binary operator.  But a "not" is not 
equivalent to an exclusive or with zero ... that would just give back 
the original number.  Instead, it's an exclusive or with FFFFh, or -1. 
 
In short, the seeming parallel between the unary "not" and the unary 
minus falls apart under closer scrutiny. "not" modifies the factor, not 
the term, and it is not related to either the unary minus nor the 
exclusive or.  Therefore, it deserves a symbol to call its own. What 
better symbol than the obvious one, also used by C, the '!' character?  
Using the rules about the way we think the "not" should behave, we 
should be able to code the exclusive or (assuming we'd ever need to), in 
the very natural form: 
 
	a & !b | !a & b 
 
Note that no parentheses are required -- the precedence levels we've 
chosen automatically take care of things. 
 
If you're keeping score on the precedence levels, this definition puts 
the '!' at the top of the heap.  The levels become: 
 
1.	! 
2.	- (unary) 
3.	*, /, & 
4.	+, -, |, ~ 
 
Looking at this list, it's certainly not hard to see why we had trouble 
using '~' as the "not" symbol! 
 
So how do we mechanize the rules?  In the same way as we did with 
SignedTerm, but at the factor level.  We'll define a procedure 
NotFactor: 
 
 
{--------------------------------------------------------------} 
{ Parse and Translate a Factor with Optional "Not" } 
 
procedure NotFactor; 
begin 
	if Look ='!' then begin 
		Match('!'); 
		Factor; 
		Notit; 
		end 
	else 
		Factor; 
end; 
{--------------------------------------------------------------} 
 
 
and call it from all the places where we formerly called Factor, i.e., 
from Term, Multiply, Divide, and _And.  Note the new code generation 
procedure: 
 
 
{--------------------------------------------------------------} 
{ Bitwise Not Primary } 
 
procedure NotIt; 
begin 
	EmitLn('EOR #-1,D0'); 
end; 
 
{--------------------------------------------------------------} 
 
 
Try this now, with a few simple cases. In fact, try that exclusive or 
example, 
 
	a&!b|!a&b 
 
 
You should get the code (without the comments, of course): 
 
 MOVE A(PC),DO    ; load a 
 MOVE D0,-(SP)		; push it 
 MOVE B(PC),DO		; load b 
 EOR #-1,D0		; not it 
 AND (SP)+,D0		; and with a 
 MOVE D0,-(SP)		; push result 
 MOVE A(PC),DO		; load a 
 EOR #-1,D0		; not it 
 MOVE D0,-(SP)		; push it 
 MOVE B(PC),DO		; load b 
 AND (SP)+,D0		; and with !a 
 OR (SP)+,D0		; or with first term 
 
That's precisely what we'd like to get.  So, at least for both 
arithmetic and logical operators, our new precedence and new, slimmer 
syntax hang together.  Even the peculiar, but legal, expression with 
leading addop: 
 
	~x 
 
makes sense.  SignedTerm ignores the leading '~', as it should, since 
the expression is equivalent to: 
 
	0~x, 
 
which is equal to x. 
 
When we look at the BNF we've created, we find that our boolean algebra 
now adds only one extra line: 
 
 
<not_factor> 	::= [!] <factor> 
<factor>      	::= <variable> | <constant> | '(' <expression> ')'	 
<signed_term> 	::= [<addop>] <term> 
<term>        	::= <not_factor> (<mulop> <not_factor>)*	 
<expression>  	::= <signed_term> (<addop> <term>)* 
<assignment>  	::= <variable> '=' <expression> 
 
 
That's a big improvement over earlier efforts.  Will our luck continue 
to hold when we get to relational operators?  We'll find out soon, but 
it will have to wait for the next installment. We're at a good stopping 
place, and I'm anxious to get this installment into your hands.  It's 
already been a year since the release of Installment 15.  I blush to 
admit that all of this current installment has been ready for almost as 
long, with the exception of relational operators.  But the information 
does you no good at all, sitting on my hard disk, and by holding it back 
until the relational operations were done, I've kept it out of your 
hands for that long.  It's time for me to let go of it and get it out 
where you can get value from it. Besides, there are quite a number of 
serious philosophical questions associated with the relational 
operators, as well, and I'd rather save them for a separate installment 
where I can do them justice. 
 
Have fun with the new, leaner arithmetic and logical parsing, and I'll 
see you soon with relationals.