



	       Awk -- A Pattern Scanning and
		    Processing Language
		      (Second Edition)


		       Alfred V. Aho

		     Brian W. Kernighan

		    Peter J. Weinberger

		     Bell Laboratories
	       Murray Hill, New Jersey 07974



			  ABSTRACT

	  Awk is a  programming  language  whose  basic
     operation	is  to	search	a set of files for pat-
     terns, and to perform specified actions upon lines
     or  fields  of  lines  which  contain instances of
     those patterns.  Awk makes certain data  selection
     and transformation operations easy to express; for
     example, the awk program

			length > 72

     prints all input lines  whose  length  exceeds  72
     characters; the program

			NF % 2 == 0

     prints  all  lines  with an even number of fields;
     and the program

		  { $1 = log($1); print }

     replaces the first field of each line by its loga-
     rithm.

	  Awk  patterns  may  include arbitrary boolean
     combinations of regular expressions and  of  rela-
     tional  operators	on  strings,  numbers,	fields,
     variables,  and  array  elements.	  Actions   may
     include the same pattern-matching constructions as
     in patterns, as  well  as	arithmetic  and  string
     expressions  and  assignments, if-else, while, for
     statements, and multiple output streams.

	  This report contains a user's guide,	a  dis-
     cussion  of  the design and implementation of awk,
     and some timing statistics.



September 1, 1978

1.  Introduction


     Awk is a programming language  designed  to  make	many

common	information  retrieval	and  text manipulation tasks

easy to state and to perform.


     The basic operation of awk is to scan a  set  of  input

lines in order, searching for lines which match any of a set

of patterns which the user has specified.  For each pattern,

an action can be specified; this action will be performed on

each line that matches the pattern.


     Readers familiar with the UNIX program grep will
recognize the approach, although in awk the patterns may
be more general than in grep, and the actions allowed are
more involved than merely printing the matching line.
For example, the awk program

   {print $3, $2}

prints the third and second columns of a table in  that

order.	The program


   $2 ~ /A|B|C/


prints all input lines with an A, B, or C in the second

field.	The program


   $1 != prev	  { print; prev = $1 }


prints all lines in which the first field is  different

from the previous first field.


1.1.  Usage


     The command


   awk	program  [files]


executes  the  awk commands in the string program on the set

of named files, or on the standard input  if  there  are  no

files.	 The  statements can also be placed in a file pfile,

and executed by the command


   awk	-f pfile  [files]
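
     When the program is given on the command line, it is
usually enclosed in single quotes to keep the shell from
interpreting characters such as $ and blanks.  As an
illustrative sketch (the file name here is hypothetical),
the command

   awk '{ print $2, $1 }'  inventory

prints the first two fields of each line of the file
inventory in reverse order; the same program could equally
well be placed in a file and run with the -f form.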



1.2.  Program Structure


     An awk program is a sequence of statements of the form:


	pattern   { action }

	pattern   { action }

	...


Each  line  of input is matched against each of the patterns

in turn.  For each pattern that matches, the associated  ac-

tion  is  executed.  When all the patterns have been tested,

the next line is fetched and the matching starts over.


     Either the pattern or the action may be left  out,  but

not both.  If there is no action for a pattern, the matching

line is simply copied to the output.   (Thus  a  line  which

matches  several patterns can be printed several times.)  If

there is no pattern for an action, then the action  is	per-

formed	for  every input line.	A line which matches no pat-

tern is ignored.


     Since patterns and actions are both  optional,  actions

must  be  enclosed  in	braces to distinguish them from pat-

terns.
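
     As an illustrative sketch (not one of the examples in
the original text), the two-statement program

   /error/           # pattern only: print matching lines
   { print $1 }      # action only: done for every line

prints in full every line containing ``error'' (a pattern
with no action) and prints the first field of every line
(an action with no pattern); a line containing ``error''
therefore produces two lines of output.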


1.3.  Records and Fields


     Awk input is divided into ``records'' terminated  by  a

record	separator.   The  default record separator is a new-

line, so by default awk processes its  input  a  line  at  a

time.  The number of the current record is available in a

variable named NR.


     Each input record is  considered  to  be  divided	into

``fields.''  Fields are normally separated by white space --

blanks or tabs -- but  the  input  field  separator  may  be

changed,  as described below.  Fields are referred to as $1,

$2, and so forth, where $1 is the first field, and $0 is the

whole  input record itself.  Fields may be assigned to.  The

number of fields in the current record	is  available  in  a

variable named NF.


     The  variables  FS  and RS refer to the input field and

record separators; they may be changed at any  time  to  any

single	character.   The  optional command-line argument -Fc

may also be used to set FS to the character c.


     If the record separator is empty, an empty  input	line

is  taken as the record separator, and blanks, tabs and new-

lines are treated as field separators.


     The variable FILENAME contains the name of the  current

input file.
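
     For instance (an illustrative sketch; the file name is
hypothetical), the command

   awk -F: '{ print FILENAME, NR, $1 }' /etc/passwd

sets the field separator to a colon and prints, for each
record, the name of the current input file, the record
number, and the first field.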


1.4.  Printing


     An action may have no pattern, in which case the action

is executed for all lines.  The simplest action is to  print

some  or  all  of  a record; this is accomplished by the awk

command print.	The awk program


   { print }

prints each record, thus copying the input to the output in-

tact.	More  useful is to print a field or fields from each

record.  For instance,


   print $2, $1


prints the first two fields in reverse order.	Items  sepa-

rated by a comma in the print statement will be separated by

the current output field separator when output.   Items  not

separated by commas will be concatenated, so


   print $1 $2


runs the first and second fields together.


     The predefined variables NF and NR can be used; for ex-

ample


   { print NR, NF, $0 }


prints each record preceded by the  record  number  and  the

number of fields.


     Output may be diverted to multiple files; the program


   { print $1 >"foo1"; print $2 >"foo2" }


writes the first field, $1, on the file foo1, and the second

field on file foo2.  The >> notation can also be used:


   print $1 >>"foo"


appends the output to the file foo.  (In each case, the out-

put files are created if necessary.)  The file name can be a

variable or a field as well as a constant; for example,


   print $1 >$2


uses the contents of field 2 as a file name.


     Naturally there is a limit  on  the  number  of  output

files; currently it is 10.


     Similarly, output can be piped into another process (on

UNIX only); for instance,


   print | "mail bwk"


mails the output to bwk.


     The variables OFS and ORS may be  used  to  change  the

current  output field separator and output record separator.

The output record separator is appended to the output of the

print statement.
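
     As a small illustrative sketch (the BEGIN pattern used
to do the setting is described in section 2.1), the program

   BEGIN   { OFS = ":"; ORS = "\n\n" }   # illustrative separators
	   { print $1, $2 }

prints the first two fields separated by a colon and leaves
a blank line after each record, since the output record
separator is a pair of newlines.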


     Awk  also provides the printf statement for output for-

matting:


   printf format expr, expr, ...


formats the expressions in the list according to the  speci-

fication in format and prints them.  For example,


   printf "%8.2f  %10ld\n", $1, $2


prints $1 as a floating point number 8 digits wide, with two

after the decimal point, and $2 as a 10-digit  long  decimal

number,  followed  by  a  newline.  No output separators are

produced automatically; you must add them  yourself,  as  in

this  example.	 The  version of printf is identical to that

used with C (see The C Programming Language, Prentice-Hall, 1978).


2.  Patterns


     A pattern in front of an action acts as a selector that

determines  whether the action is to be executed.  A variety

of expressions may be used as patterns: regular expressions,

arithmetic  relational	expressions,  string-valued  expres-

sions, and arbitrary boolean combinations of these.


2.1.  BEGIN and END


     The special pattern BEGIN matches the beginning of  the

input,	before	the  first  record is read.  The pattern END

matches the end of the input, after the last record has been

processed.  BEGIN and END thus provide a way to gain control

before and after processing, for initialization and  wrapup.


     As  an  example,  the  field  separator can be set to a

colon by


   BEGIN     { FS = ":" }

   ... rest of program ...


Or the input lines may be counted by


   END	{ print NR }

If BEGIN is present, it must be the first pattern; END	must

be the last if used.


2.2.  Regular Expressions


     The  simplest regular expression is a literal string of

characters enclosed in slashes, like


   /smith/


This is actually a complete awk program which will print all

lines  which  contain  any occurrence of the name ``smith''.

If a line contains ``smith'' as part of a  larger  word,  it

will also be printed, as in


   blacksmithing



     Awk  regular expressions include the regular expression

forms found in the UNIX text editor ed (see the UNIX Programmer's Manual)

and  grep  (without back-referencing).	In addition, awk al-

lows parentheses for grouping, |  for  alternatives,  +  for

``one  or  more'', and ? for ``zero or one'', all as in lex.

Character classes may be abbreviated: [a-zA-Z0-9] is the set

of all letters and digits.  As an example, the awk program


   /[Aa]ho|[Ww]einberger|[Kk]ernighan/


will print all lines which contain any of the names ``Aho,''

``Weinberger'' or ``Kernighan,'' whether capitalized or not.


     Regular expressions (with the extensions listed above)

must be enclosed in slashes, just as in ed and sed.   Within

a  regular  expression,  blanks  and  the regular expression

metacharacters are significant.  To turn off the magic mean-

ing  of one of the regular expression characters, precede it

with a backslash.  An example is the pattern


   /\/.*\//


which matches any string of characters enclosed in  slashes.


     One can also specify that any field or variable matches

a regular expression (or does not match it) with the  opera-

tors ~ and !~.	The program


   $1 ~ /[jJ]ohn/


prints	all  lines where the first field matches ``john'' or

``John.''  Notice that this  will  also  match	``Johnson'',

``St.  Johnsbury'',  and  so  on.  To restrict it to exactly

[jJ]ohn, use


   $1 ~ /^[jJ]ohn$/


The caret ^ refers to the beginning of a line or field;  the

dollar sign $ refers to the end.


2.3.  Relational Expressions


     An awk pattern can be a relational expression involving

the usual relational operators <, <=, ==, !=, >=, and >.  An

example is

   $2 > $1 + 100

   $2 > $1 + 100


which selects lines where the second field exceeds the first

field by more than 100.  Similarly,


   NF % 2 == 0


prints lines with an even number of fields.


     In relational tests, if neither operand is  numeric,  a

string comparison is made; otherwise it is numeric.  Thus,


   $1 >= "s"


selects  lines	that begin with an s, t, u, etc.  In the ab-

sence of  any  other  information,  fields  are  treated  as

strings, so the program


   $1 > $2


will perform a string comparison.


2.4.  Combinations of Patterns


     A	pattern  can be any boolean combination of patterns,

using the operators || (or), && (and), and ! (not).  For ex-

ample,


   $1 >= "s" && $1 < "t" && $1 != "smith"


selects  lines	where the first field begins with ``s'', but

is not ``smith''.  && and || guarantee that  their  operands

will be evaluated from left to right; evaluation stops as

soon as the truth or falsehood is determined.


2.5.  Pattern Ranges


     The ``pattern'' that selects an action may also consist

of two patterns separated by a comma, as in


   pat1, pat2	  { ... }


In  this case, the action is performed for each line between

an occurrence of pat1 and the next occurrence of  pat2	(in-

clusive).  For example,


   /start/, /stop/


prints all lines between start and stop, while


   NR == 100, NR == 200 { ... }


does the action for lines 100 through 200 of the input.


3.  Actions


     An awk action is a sequence of action statements termi-

nated by newlines or semicolons.   These  action  statements

can be used to do a variety of bookkeeping and string manip-

ulating tasks.


3.1.  Built-in Functions


     Awk provides  a  ``length''  function  to	compute  the

length	of a string of characters.  This program prints each

record, preceded by its length:

   {print length, $0}


length by itself is a ``pseudo-variable'' which  yields  the

length of the current record; length(argument) is a function

which yields the length of its argument, as in	the  equiva-

lent


   {print length($0), $0}


The argument may be any expression.


     Awk  also	provides the arithmetic functions sqrt, log,

exp, and int, for square root, base  e	logarithm,  exponen-

tial, and integer part of their respective arguments.


     The  name	of  one of these built-in functions, without

argument or parentheses, stands for the value of  the  func-

tion on the whole record.  The program


   length < 10 || length > 20


prints	lines  whose  length is less than 10 or greater than

20.


     The function substr(s, m, n) produces the substring  of

s  that  begins  at  position  m (origin 1) and is at most n

characters long.  If n is omitted, the substring goes to the

end  of  s.  The function index(s1, s2) returns the position

where the string s2 occurs in s1, or zero if it does not.
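
     As an illustrative sketch (``smith'' is just an example
string), the program

   { print substr($1, 1, 3), index($0, "smith") }

prints, for each record, the first three characters of the
first field and the position at which ``smith'' first occurs
in the record (zero if it does not occur).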


     The function sprintf(f, e1, e2, ...) produces the value

of the expressions e1, e2, etc., in the printf format speci-

fied by f.  Thus, for example,


   x = sprintf("%8.2f %10ld", $1, $2)


sets x to the string produced by formatting the values of $1

and $2.


3.2.  Variables, Expressions, and Assignments


     Awk  variables  take  on  numeric	(floating  point) or

string values according to context.  For example, in


   x = 1


x is clearly a number, while in


   x = "smith"


it is clearly a string.  Strings are  converted  to  numbers

and vice versa whenever context demands it.  For instance,


   x = "3" + "4"


assigns 7 to x.  Strings which cannot be interpreted as num-

bers in a numerical context will generally have numeric val-

ue zero, but it is unwise to count on this behavior.


     By  default,  variables (other than built-ins) are ini-

tialized to the null string, which has numerical value zero;

this eliminates the need for most BEGIN sections.  For exam-

ple, the sums of the first two fields can be computed by


	{ s1 += $1; s2 += $2 }

   END	{ print s1, s2 }



     Arithmetic is done internally in floating	point.	 The

arithmetic operators are +, -, *, /, and % (mod).  The C in-

crement ++ and decrement -- operators  are  also  available,

and  so are the assignment operators +=, -=, *=, /=, and %=.

These operators may all be used in expressions.
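
     As an illustrative sketch (assuming the second field is
numeric and that there is at least one input line), the
program

	{ sum += $2; n++ }    # accumulate field 2 and a line count
   END	{ print sum/n }       # average; assumes n > 0

prints the average of the second field over the whole input.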


3.3.  Field Variables


     Fields in awk share essentially all of  the  properties

of variables -- they may be used in arithmetic or string op-

erations, and may be assigned to.  Thus one can replace  the

first field with a sequence number like this:


   { $1 = NR; print }


or accumulate two fields into a third, like this:


   { $1 = $2 + $3; print $0 }


or assign a string to a field:


   { if ($3 > 1000)

	$3 = "too big"

     print

   }


which  replaces  the  third field by ``too big'' when it is,

and in any case prints the record.


     Field references may be numerical expressions, as in

   { print $i, $(i+1), $(i+n) }


Whether a field is deemed numeric or string depends on	con-

text; in ambiguous cases like


   if ($1 == $2) ...


fields are treated as strings.


     Each  input  line is split into fields automatically as

necessary.  It is also possible to  split  any	variable  or

string into fields:


   n = split(s, array, sep)


splits the string s into array[1], ..., array[n].  The

number of elements found is returned.  If the  sep  argument

is provided, it is used as the field separator; otherwise FS

is used as the separator.
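
     For example (an illustrative sketch, assuming the first
field holds a date written like 1978/09/01),

   { n = split($1, d, "/"); print d[3], d[2], d[1] }   # n is set to 3

splits the first field at the slashes and prints the parts
in day-month-year order.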


3.4.  String Concatenation


     Strings may be concatenated.  For example


   length($1 $2 $3)


returns the length of the first three fields.  Or in a print

statement,


   print $1 " is " $2


prints	the two fields separated by `` is ''.  Variables and

numeric expressions may also appear in concatenations.

3.5.  Arrays


     Array elements are not declared; they spring into exis-

tence  by being mentioned.  Subscripts may have any non-null

value, including non-numeric strings.  As an  example  of  a

conventional numeric subscript, the statement


   x[NR] = $0


assigns the current input record to the NR-th element of the

array x.  In fact, it is possible in principle (though	per-

haps  slow)  to  process  the entire input in a random order

with the awk program


	{ x[NR] = $0 }

   END	{ ... program ... }


The first action merely records each input line in the array

x.


     Array  elements  may  be  named  by non-numeric values,

which gives awk a capability  rather  like  the  associative

memory	of Snobol tables.  Suppose the input contains fields

with values like apple, orange, etc.  Then the program


   /apple/   { x["apple"]++ }

   /orange/  { x["orange"]++ }

   END	     { print x["apple"], x["orange"] }


increments counts for the named array elements,  and  prints

them at the end of the input.

3.6.  Flow-of-Control Statements


     Awk  provides  the basic flow-of-control statements if-

else, while, for, and statement grouping with braces, as  in

C.   We  showed  the if statement in section 3.3 without de-

scribing it.  The condition in parentheses is evaluated;  if

it  is	true,  the  statement following the if is done.  The

else part is optional.


     The while statement is exactly like that of C.  For ex-

ample, to print all input fields one per line,


   i = 1

   while (i <= NF) {

	print $i

	++i

   }



     The for statement is also exactly that of C:


   for (i = 1; i <= NF; i++)

	print $i


does the same job as the while statement above.


     There  is	an alternate form of the for statement which

is suited for accessing the elements of an  associative  ar-

ray:


   for (i in array)

	statement

does  statement with i set in turn to each element of array.

The elements are accessed in  an  apparently  random  order.

Chaos will ensue if i is altered, or if any new elements are

accessed during the loop.
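
     As an illustrative sketch (where a ``word'' is simply a
white-space separated field), the program

	{ for (i = 1; i <= NF; i++) count[$i]++ }
   END	{ for (w in count) print w, count[w] }

counts how many times each distinct field value occurs in
the input and prints the totals, in no particular order,
after the last record has been read.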


     The expression in the condition part of an if, while or

for  can  include relational operators like <, <=, >, >=, ==

(``is equal to''), and != (``not equal	to'');	regular  ex-

pression matches with the match operators ~ and !~; the log-

ical operators ||, &&, and !; and of course parentheses  for

grouping.


     The  break  statement  causes an immediate exit from an

enclosing while or for; the continue  statement  causes  the

next iteration to begin.


     The  statement  next  causes awk to skip immediately to

the next record and begin scanning  the  patterns  from  the

top.   The statement exit causes the program to behave as if

the end of the input had occurred.
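
     As an illustrative sketch (the marker STOP in the second
pattern is hypothetical), the program

   /^#/		  { next }     # skip lines beginning with #
   $1 == "STOP"	  { exit }     # stop at the hypothetical end marker
		  { print $2 }

skips lines beginning with #, stops as soon as a line whose
first field is STOP is seen, and prints the second field of
every other line.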


     Comments may be placed in awk programs: they begin with

the character # and end with the end of the line, as in


   print x, y	  # this is a comment



4.  Design


     The  UNIX system already provides several programs that

operate by passing input through a selection mechanism.

Grep,  the first and simplest, merely prints all lines which

match a single specified pattern.  Egrep provides more	gen-

eral patterns, i.e., regular expressions in full generality;

fgrep searches for a set of  keywords  with  a	particularly

fast algorithm.  Sed (see the UNIX Programmer's Manual) provides most of

the editing facilities of the editor ed, applied to a stream

of input.  None of these programs provides numeric capabili-

ties, logical relations, or variables.


     Lex (Lesk's lexical analyzer generator, described in a Bell Laboratories CSTR) provides general regular

expression  recognition capabilities, and, by serving as a C

program generator, is essentially open-ended in its capabil-

ities.	 The  use of lex, however, requires a knowledge of C

programming, and a lex program must be compiled  and  loaded

before	use, which discourages its use for one-shot applica-

tions.


     Awk is an attempt to fill in another part of the matrix

of  possibilities.   It  provides general regular expression

capabilities and an implicit input/output loop.  But it also

provides convenient numeric processing, variables, more gen-

eral selection, and control flow in the  actions.   It	does

not  require  compilation or a knowledge of C.	Finally, awk

provides a convenient way to access fields within lines;  it

is unique in this respect.


     Awk  also	tries  to integrate strings and numbers com-

pletely, by treating all quantities as both string  and  nu-

meric, deciding which representation is appropriate as late

as possible.  In most cases the user can simply  ignore  the

differences.


     Most of the effort in developing awk went into deciding

what awk should or should not do (for instance,  it  doesn't

do  string  substitution)  and what the syntax should be (no

explicit operator for concatenation) rather than on  writing

or  debugging  the  code.   We have tried to make the syntax

powerful but easy to use and well adapted to scanning files.

For  example,  the absence of declarations and implicit ini-

tializations, while probably a bad idea for  a	general-pur-

pose  programming  language, is desirable in a language that

is meant to be used for tiny programs that may even be	com-

posed on the command line.


     In  practice,  awk  usage	seems to fall into two broad

categories.  One is what might be  called  ``report  genera-

tion''	-- processing an input to extract counts, sums, sub-

totals, etc.  This also includes the writing of trivial data

validation programs, such as verifying that a field contains

only numeric information  or  that  certain  delimiters  are

properly  balanced.   The combination of textual and numeric

processing is invaluable here.


     A second area of use is as a data transformer, convert-

ing data from the form produced by one program into that ex-

pected by another.   The  simplest  examples  merely  select

fields, perhaps with rearrangements.

5.  Implementation


     The  actual implementation of awk uses the language de-

velopment tools available on the UNIX operating system.  The

grammar is specified with yacc (described in Johnson's Bell Laboratories CSTR); the lexi-

cal analysis is done by lex; the regular  expression  recog-

nizers are deterministic finite automata constructed direct-

ly from the expressions.  An awk program is translated	into

a parse tree which is then directly executed by a simple in-

terpreter.


     Awk was designed for ease of use rather than processing

speed;	the delayed evaluation of variable types and the ne-

cessity to break input into fields makes high  speed  diffi-

cult  to  achieve in any case.	Nonetheless, the program has

not proven to be unworkably slow.


     Table I below shows the execution (user + system)	time

on  a PDP-11/70 of the UNIX programs wc, grep, egrep, fgrep,

sed, lex, and awk on the following simple tasks:


  1. count the number of lines.


  2. print all lines containing ``doug''.


  3. print  all  lines	containing  ``doug'',	``ken''   or

     ``dmr''.


  4. print the third field of each line.













			   - 22 -


  5. print the third and second fields of each line, in that

     order.


  6. append all  lines	containing  ``doug'',  ``ken'',  and

     ``dmr'' to files ``jdoug'', ``jken'', and ``jdmr'', re-

     spectively.


  7. print each line prefixed by ``line-number : ''.


  8. sum the fourth column of a table.


The program wc merely counts words, lines and characters  in

its  input;  we  have  already mentioned the others.  In all

cases the input was a file containing 10,000 lines as creat-

ed by the command ls -l; each line has the form


   -rw-rw-rw- 1 ava 123 Oct 15 17:05 xxx


The total length of this input is 452,960 characters.  Times

for lex do not include compile or load.


     As might be expected, awk is not as fast  as  the	spe-

cialized  tools wc, sed, or the programs in the grep family,

but is faster than the more general tool lex.  In all cases,

the  tasks  were about as easy to express as awk programs as

programs in these other languages;  tasks  involving  fields

were  considerably  easier to express as awk programs.	Some

of the test programs are shown below, in awk, sed and lex.


                              Task

Program     1      2      3      4      5      6      7      8
----------------------------------------------------------------
  wc       8.6     -      -      -      -      -      -      -
  grep    11.7   13.1     -      -      -      -      -      -
  egrep    6.2   11.5   11.6     -      -      -      -      -
  fgrep    7.7   13.8   16.1     -      -      -      -      -
  sed     10.2   11.6   15.8   29.0   30.5   16.1     -      -
  lex     65.1  150.1  144.2   67.7   70.3  104.0   81.7   92.8
  awk     15.0   25.6   29.9   33.3   38.9   46.4   71.4   31.1
----------------------------------------------------------------

 Table I.  Execution Times of Programs.  (Times are in sec.)



     The programs for some of these jobs are shown below.
The lex programs are generally too long to show.

AWK:

   1.	END  {print NR}

   2.	/doug/

   3.	/ken|doug|dmr/

   4.	{print $3}

   5.	{print $3, $2}

   6.	/ken/	  {print >"jken"}
	/doug/	  {print >"jdoug"}
	/dmr/	  {print >"jdmr"}

   7.	{print NR ": " $0}

   8.	     {sum = sum + $4}
	END  {print sum}

SED:

   1.	$=

   2.	/doug/p

   3.	/doug/p
	/doug/d
	/ken/p
	/ken/d
	/dmr/p
	/dmr/d

   4.	/[^ ]* [ ]*[^ ]* [ ]*\([^ ]*\) .*/s//\1/p

   5.	/[^ ]* [ ]*\([^ ]*\) [ ]*\([^ ]*\) .*/s//\2 \1/p

   6.	/ken/w jken
	/doug/w jdoug
	/dmr/w jdmr

LEX:

   1.	%{
	int i;
	%}
	%%
	\n   i++;
	.    ;
	%%
	yywrap() {
	     printf("%d\n", i);
	}

   2.	%%
	^.*doug.*$     printf("%s\n", yytext);
	.    ;
	\n   ;