Introduction to the PERL Programming Language
                              Eric Brill

INTRODUCTION

Perl is a very nice programming language for language processing.
Once you learn a small bag of tricks, you should be able to develop
programs very rapidly.  Below is a simple Perl program:

=======================================================
#!/usr/local/bin/perl

# this is a comment

print "hello, world\n";
=======================================================

The first line specifies where to find perl.  If your system has perl
elsewhere, you will have to modify this line.  Comments begin with the
symbol #.  The print statement prints ``hello, world''.  The last
character, \n, is a newline.  To run this program, you would
write it to some file, say foo.prl.  Then you would make foo.prl
executable by executing:

chmod +x foo.prl

To run the program, you would just type: ./foo.prl
(or simply foo.prl if the current directory is in your path).


SCALAR VARIABLES


Here is a program that adds two numbers together and prints the sum.

=======================================================
#!/usr/local/bin/perl

$first_num = 10;
$sec_num = 5;
$third_num = $first_num + $sec_num;
print "The sum of ", $first_num, " and ", $sec_num, " is ",
   $third_num,"\n";
=======================================================

In this program $first_num, $sec_num and $third_num are variables.
You do not have to declare variables in perl.  Also, perl variables
are weakly typed.  The first character of a variable indicates its
type.  $ means that the variable is a scalar.  A scalar variable can
be a string, an integer or a real number.  You do not have to specify
which of these it is.  Perl figures it out based on context.  Before a
scalar is assigned a value, it behaves as 0 in a number context or as
the null string in a string context.  Here is another version of the ``hello
world'' program.  (We will drop the #!/usr/local/bin/perl line from
now on.  You still need to include it with your program).

=======================================================
$the_string = "hello world";
print $the_string,"\n";
=======================================================

A period is used for string concatenation.  So yet another version:

=======================================================
$hello = "hello";
$world = "world";
$hello_world = $hello . " " . $world . "\n";
print $hello_world;
=======================================================
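
The context rule for scalars can be seen directly.  The following is a
small sketch, assuming Perl 5; the variable names are made up for
illustration.  The same scalar is used once as a number and once as a
string, and a scalar that was never assigned behaves as 0 or as the
null string.

```perl
# Assumes Perl 5; all variable names here are illustrative.
my $n = "10";                # assigned as a string
my $sum = $n + 5;            # number context: 15
my $text = $n . 5;           # string context: "105"

our $never_set;              # never assigned a value
my $as_number = $never_set + 1;            # treated as 0, so this is 1
my $as_string = "[" . $never_set . "]";    # treated as "", so this is "[]"

print "$sum $text $as_number $as_string\n";   # prints: 15 105 1 []
```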

ARRAYS

Another variable type is the array.  Arrays do not have to be
predeclared, nor does their size have to be specified.  Arrays hold
scalar values.  A variable beginning with the character @ is an array.
Now things get a bit confusing.  @x is the array x.  $x is the scalar
variable x.  These two variables are not in any way related.  To
reference the first item of the array @x, we use $x[0].  This is
because the element in x[0] will be a scalar, and so the first
character indicates that this is a scalar value.  For an array @x, the
special variable $#x indicates the highest index used in the array.
So, $x[$#x] will be the last element in array @x.  Here are two
versions of a program for assigning values to an array.

====================================================
$x[0] = "dog";
$x[1] = 34;
===================================================

===================================================
# $#x is initially -1, as @x does not exist.
$x[++$#x] = "dog";
$x[++$#x] = 34;
===================================================
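
As a further illustration of $#x, here is a short sketch, assuming
Perl 5.  It also uses two built-in functions not discussed above:
scalar(@x) gives the number of elements, and push appends to the end
of an array, which has the same effect as assigning to $x[++$#x].

```perl
# Assumes Perl 5; the array contents are made up for illustration.
my @x = ("dog", 34, "cat");
my $last_index = $#x;          # highest index used: 2
my $count = scalar(@x);        # number of elements: 3
my $last = $x[$#x];            # last element: "cat"

push(@x, "bird");              # same effect as $x[++$#x] = "bird"

print $last_index, " ", $count, " ", $last, " ", $#x, "\n";   # prints: 2 3 cat 3
```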

In perl, strings and arrays are closely related.  In fact, there are
two functions for converting from one to the other.

split takes a string and splits it into an array.  split takes
two arguments.  The first argument is a regular expression specifying
what to split on, the second argument is the string to split.
The code:

=======================================================
$avar = "abcAAdefAAg";
@x = split(/AA/,$avar);
=======================================================

will result in $x[0] holding ``abc'', $x[1] holding ``def'', and 
$x[2] holding ``g''.  split does not alter the contents of $avar.

join takes two arguments.  The first is the character sequence that
will be placed between elements of the array, and the second specifies
the array to be joined.  The reverse operation of the split above is:

=======================================================
$x[0] = "abc";
$x[1] = "def";
$x[2] = "g";
$avar = join("AA",@x);
=======================================================
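
To see that split and join undo each other, the two calls can be
chained.  This is a minimal sketch, assuming Perl 5; the second half
splits a made-up sentence on blanks and rejoins it with a different
separator.

```perl
# Assumes Perl 5; the sample strings are illustrative.
my $avar = "abcAAdefAAg";
my @parts = split(/AA/, $avar);        # ("abc", "def", "g")
my $rebuilt = join("AA", @parts);      # back to the original string

my @words = split(/\s+/, "the boy ate");
my $dashed = join("-", @words);        # "the-boy-ate"

print scalar(@parts), " $rebuilt $dashed\n";   # prints: 3 abcAAdefAAg the-boy-ate
```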

CONTROL STRUCTURES

Perl's control structures are similar to those in C, except that
all blocks must be enclosed in curly braces, even when they contain
only a single statement.

==========================================
$x = 0;
while($x < 5) {
    print $x,"\n";
    ++$x;
}

==========================================

for ($count=0;$count<5;++$count) {
   print $count,"\n";
}

==========================================

if ($x == 2) {
  print "yes\n";
}
else {
  print "no\n";
}

==========================================

if ($x == 1) {
        print "ONE\n";
}
elsif ($x == 2){
        print "TWO\n";
}
else {
        print "OTHER\n";
}
==========================================

For comparing numbers, perl uses the same symbols as C (==, !=, <, >,
<=, >=).  For comparing strings, eq is true if two strings are equal,
and ne is true if two strings are not equal.
So, 

=======================================================
$x = "yes";
$y = "no";
if ($x eq $y) {
        print "1\n";
}
else {
        print "2\n";
}
=======================================================

will print 2.  And

=======================================================
$x = "yes";
$y = "no";
if ($x ne $y) {
        print "1\n";
}
else {
        print "2\n";
}
=======================================================

will print 1.
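
It matters which comparison is used.  The sketch below (assuming
Perl 5, with illustrative variable names) compares the same strings
with == and with eq.  Because == puts its operands in a number
context, two non-numeric strings both count as 0 and compare as
equal, while two different spellings of the same number are == but
not eq.

```perl
# Assumes Perl 5; variable names are illustrative.
my $x = "yes";
my $y = "no";
my $numeric_equal = ($x == $y) ? 1 : 0;   # 1: both strings count as 0
my $string_equal  = ($x eq $y) ? 1 : 0;   # 0: the strings differ

my $p = "10";
my $q = "10.0";
my $same_number = ($p == $q) ? 1 : 0;     # 1: both are the number 10
my $same_string = ($p eq $q) ? 1 : 0;     # 0: different characters

print "$numeric_equal $string_equal $same_number $same_string\n";   # prints: 1 0 1 0
```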

The following program loads an array with some integers and then prints
out the contents of the array:

=================================================================
$x[0] = 10;
$x[1] = 15;
$x[2] = 5;
for ($c=0;$c<=$#x;++$c) {
        print "$x[$c]\n";
}
=================================================================


ASSOCIATIVE ARRAYS (HASH TABLES)

The final variable type is the associative array.  An associative
array is a structure of key, value pairs.  A variable beginning with
the character % is an associative array.  Here are some examples:


$x{"dog"} returns the value (a scalar) associated with the key ``dog''.
$x{"dog"} = "cat"; sets the value associated with ``dog'' to be 
``cat''.

Note that associative arrays  use curly brackets ($x{2}), while arrays
use square brackets ($x[2]).  In other words, $x{2} returns the value
associated with the key 2 in the associative array %x, whereas $x[2]
returns the third element (because index starts at 0) of the array @x.

Before a key is inserted into an associative array, the value
associated with that key is 0 or the empty string.  Here's a program
to count the number of even numbers in [0..10]:

=======================================================
for ($count=0;$count<=10;++$count) {
  if ($count % 2 == 0) # if count mod 2 is 0, then even
    {
       $thearray{"EVEN"}++;
       # this is shorthand for 
       #    $thearray{"EVEN"} = $thearray{"EVEN"} +1;
    }
}
print "There were ",$thearray{"EVEN"}, " even numbers\n";
=======================================================


If we want to print out all of the keys and values in an associative
array, we do the following:

=======================================================
while (($key,$val) = each %thearray) {
        print $key," ",$val,"\n";
}
=======================================================

Note that we use % in front of thearray here because we are talking
about the entire associative array, while we use $ for
$thearray{"EVEN"} because we are talking about a specific value 
in the associative array.
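
The each function returns the pairs in no particular order.  When a
predictable order is wanted, one option is to sort the keys first.
The following is a sketch, assuming Perl 5, that uses the built-in
functions keys and sort; the table contents are made up for
illustration.

```perl
# Assumes Perl 5; the contents of %thearray are illustrative.
my %thearray = ("EVEN", 6, "ODD", 5);
my @lines;
foreach my $key (sort keys %thearray) {      # keys in alphabetical order
    push(@lines, $key . " " . $thearray{$key});
}
print join("\n", @lines), "\n";
# prints:
# EVEN 6
# ODD 5
```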


COMMAND-LINE ARGUMENTS AND FILES

If your program takes arguments, you can refer to those arguments much
as you do in C.  $ARGV[0] refers to the first argument, not the name
of the program as in C.  $ARGV[1] refers to the second argument, and
so on.  Here's a simple program that takes two files as arguments and
tells which file contains more lines.

=======================================================
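open(FILE1,$ARGV[0]) || die "cannot open $ARGV[0]: $!\n";
open(FILE2,$ARGV[1]) || die "cannot open $ARGV[1]: $!\n";
while(<FILE1>) {
        $num_lines_1++;
}
close(FILE1);
while(<FILE2>) {
        $num_lines_2++;
}
close(FILE2);
if ($num_lines_1 > $num_lines_2) {
        print "The first file contained more lines\n";
}
elsif ($num_lines_1 < $num_lines_2) {
        print "The second file contained more lines\n";
}
else {
        print "The two files contained the same number of lines\n";
}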
=======================================================


The line while(<FILE1>) reads lines from FILE1 until it reaches the
end of file.  When a line is read, it is stored in the special
variable $_.  This is a program to print all lines from a file.

=======================================================
open(FILE1,$ARGV[0]) || die "cannot open $ARGV[0]: $!\n";
while(<FILE1>){
        print $_;
}
close(FILE1);
=======================================================

To read from stdin, we do not need to call open:

=======================================================
while(<STDIN>) {
        print $_;
}
=======================================================
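
The line-counting loop can be wrapped up for reuse.  The sketch below
assumes Perl 5; count_lines is a hypothetical helper written for this
example, not a built-in.  It checks whether open succeeded, since a
misspelled file name would otherwise silently give a count of 0.

```perl
# Assumes Perl 5.  count_lines is a hypothetical helper for illustration.
sub count_lines {
    my ($path) = @_;
    open(my $fh, "<", $path) || die "cannot open $path: $!\n";
    my $n = 0;
    while (<$fh>) {      # each line is read into $_
        $n++;
    }
    close($fh);
    return $n;
}

# When a file name is given on the command line, report its line count.
if (@ARGV) {
    print count_lines($ARGV[0]), "\n";
}
```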

REGULAR EXPRESSIONS

\s matches a whitespace character (space, tab or newline)
^ matches the start of a string
$ matches the end of a string
a matches the letter a
a+ matches 1 or more a's
a* matches 0 or more a's
(ab)+ matches 1 or more ab's
[^abc] matches a character that is not a or b or c
[a-z] matches any lower case letter
. matches any character except newline

To test whether a string in $x contains the string ``abc'', we can
use:

if ($x =~ /abc/) { . . . }

To test whether a string begins with ``abc'', 

if ($x =~ /^abc/) { . . . }

To test whether a string begins with a capital letter:

if ($x =~ /^[A-Z]/) { . . . }

To test whether a string does not begin with a lower case letter:

if ($x =~ /^[^a-z]/) { . . . }

In the above example, the first ^ matches the beginning of the string,
while the ^ within the square brackets means ``not''.
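
The four tests above can be tried on a handful of strings.  This is a
small sketch, assuming Perl 5; the sample strings and the one-letter
flags (C, B, U, N) are made up for illustration.

```perl
# Assumes Perl 5; sample strings and flag letters are illustrative.
my @tests = ("abcdef", "xxabc", "Hello", "hello", "9lives");
my @report;
foreach my $s (@tests) {
    my $flags = "";
    if ($s =~ /abc/)      { $flags .= "C"; }   # contains abc
    if ($s =~ /^abc/)     { $flags .= "B"; }   # begins with abc
    if ($s =~ /^[A-Z]/)   { $flags .= "U"; }   # begins with a capital letter
    if ($s =~ /^[^a-z]/)  { $flags .= "N"; }   # does not begin with a lower case letter
    push(@report, $s . ":" . $flags);
}
print join(" ", @report), "\n";
# prints: abcdef:CB xxabc:C Hello:UN hello: 9lives:N
```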

In addition to using regular expressions for testing strings, we can
also use them to change strings.  To do this, we use a command of the 
form:

s/FROM/TO/options

where FROM is the matching regular expression and TO is what to change
this to.  options can either be blank, meaning to only do this to the
first match of FROM in the string, or it can be g, meaning do it
globally.

To change all a's to b's in the string in variable $x:

$x =~ s/a/b/g;

To change the first a to b:

$x =~ s/a/b/;

To change all strings of consecutive a's into one a:

$x =~ s/a+/a/g;

To remove all strings of consecutive a's:

$x =~ s/a+//g;

To remove blanks from the start of a string:

$x =~ s/^\s+//;
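
One subtlety with substitutions like these: a* matches 0 or more
a's, so it also matches the empty string at every position where
there is no a, while a+ requires at least one a.  Likewise, \s+ only
applies to the start of the string when it is anchored with ^.  The
following sketch, assuming Perl 5 and a made-up sample string, shows
the effect.

```perl
# Assumes Perl 5; the sample string is illustrative.
my $orig = "  baaacaa";

my $squeezed = $orig;
$squeezed =~ s/a+/a/g;     # each run of a's becomes one a: "  baca"

my $removed = $orig;
$removed =~ s/a+//g;       # every run of a's is deleted: "  bc"

my $trimmed = $orig;
$trimmed =~ s/^\s+//;      # only the leading blanks go: "baaacaa"

my $star = $orig;
$star =~ s/a*/a/g;         # a* also matches empty strings, so extra a's get inserted

print "[$squeezed] [$removed] [$trimmed] [$star]\n";
```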


SAMPLE PROGRAMS

A loop that reads lines of input until end of file, each line holding
two numbers separated by a space, and prints their sum:

======================================================
while(<STDIN>) { 
        $_ =~ s/^\s+//; # removes spaces at start of line,
                        # since we will split on space
        @nums = split(/\s+/,$_); # we can now easily access the two
                                 # numbers
        $answer = $nums[0] + $nums[1];
        print "THE ANSWER IS: ",$answer,"\n";
}
======================================================

A messier way to do this:

=====================================================
while(<STDIN>) { 
        $_ =~ s/^\s+//; # removes spaces at start of line,
                        # since we will split on space
        $num1 = $_;
        $num2 = $_;  # makes fresh copies of the input line
        $num1 =~ s/\s+[0-9]+\s*$//;  # strips off the second number
        $num2 =~ s/^[0-9]+\s+//;     # strips off the first number
        $answer = $num1 + $num2;
        print "THE ANSWER IS: ",$answer,"\n";
}
=====================================================

Given a text, return a list of words and word counts:

=====================================================
while(<STDIN>) {
        $_ =~ s/^\s+//; # Good idea to always do this.  If the line
                        # starts with blanks, then the first element
                        # of the array after splitting would be null
        @words_in_line = split(/\s+/,$_);
                 # splits the line into an array of words
        for ($count=0;$count<=$#words_in_line;++$count) {
                $word_count{$words_in_line[$count]}++;
         }
}
while(($key,$val) = each %word_count) {
        print "$key $val\n";
}
=====================================================
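
Word counts are often more useful when the most frequent words come
first.  The following is a sketch, assuming Perl 5, that sorts the
keys by count with a comparison block; the sample text is made up,
and breaking ties alphabetically is a choice made for this example.

```perl
# Assumes Perl 5; the sample text is illustrative.
my %word_count;
my $text = "the dog saw the cat and the bird";
foreach my $word (split(/\s+/, $text)) {
    $word_count{$word}++;
}

# Most frequent first; ties are broken alphabetically.
my @ordered = sort { $word_count{$b} <=> $word_count{$a} || $a cmp $b }
              keys %word_count;

foreach my $word (@ordered) {
    print "$word $word_count{$word}\n";
}
```

Here <=> compares numbers and cmp compares strings; both return -1,
0 or 1, which is what sort expects from its comparison block.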


Given a text, return a list of word pairs and their counts:

=====================================================
while(<STDIN>) {
        $_ =~ s/^\s+//; # Good idea to always do this.  If the line
                        # starts with blanks, then the first element
                        # of the array after splitting would be null
        @words_in_line = split(/\s+/,$_);
                 # splits the line into an array of words
        for ($count=0;$count<=$#words_in_line-1;++$count) {
                $word_count{$words_in_line[$count] . " "
                    . $words_in_line[$count+1]}++;
         }
}
while(($key,$val) = each %word_count) {
        print "$key $val\n";
}
=====================================================

A program to calculate the frequency of three-letter endings for
words in a text:

====================================================

while(<STDIN>) {
        $_ =~ s/^\s+//;
        @words = split(/\s+/,$_);
        for ($count=0;$count<=$#words;++$count) {
                @chars = split(//,$words[$count]);
                # we split on nothing, which gives
                # an array of characters.
                if ($#chars > 1) {
                        # make sure there are at least three chars
                        $ending{$chars[$#chars-2] . " " . 
                                $chars[$#chars-1] . " " . 
                                $chars[$#chars]}++;
                 }
         }
}
while (($key,$val) = each %ending) {
        print "$key $val\n";
}

====================================================
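
An alternative to splitting a word into characters is the built-in
substr function, which with a negative offset counts from the end of
the string.  This is a sketch, assuming Perl 5 and a made-up word
list.  Note that the keys here are the endings themselves (such as
``ing''), without the blanks that the program above puts between the
letters.

```perl
# Assumes Perl 5; the word list is illustrative.
my @words = ("running", "sing", "to", "cat");
my %ending;
foreach my $word (@words) {
    if (length($word) >= 3) {              # at least three chars
        $ending{substr($word, -3)}++;      # the last three characters
    }
}
foreach my $key (sort keys %ending) {
    print "$key $ending{$key}\n";
}
# prints:
# cat 1
# ing 2
```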

A program that takes two files and outputs all lines in
the first file where the same line occurs in the same position
in the second file:

=====================================================

open(FILE1,$ARGV[0]);
open(FILE2,$ARGV[1]);
while(<FILE1>) {
        $line_from_2 = <FILE2>;
        if ($_ eq $line_from_2) {
                print $_;
        }
}
close(FILE1);
close(FILE2);

======================================================


Given text labelled with parts of speech, such as

The/det boy/noun ate/verb . . .

strip off the part of speech tags:


======================================================

while(<STDIN>) {
        $_ =~ s/^\s+//;
        @words = split(/\s+/,$_);
        for ($count=0;$count<=$#words;++$count) {
                $word = $words[$count]; # but word has tag on it
                $word =~ s/\/.*$//;
                # this says given a string that starts with
                # a slash and then contains any character sequence
                # until the end of the string, convert it
                # to the null string.  Note that we have to
                # backslash the / character in the regular expression.
                print $word," ";
        }
        print "\n";
}

======================================================

Given the same input, this program strips off the words and
returns the part of speech tags:

======================================================


while(<STDIN>) {
        $_ =~ s/^\s+//;
        @words = split(/\s+/,$_);
        for ($count=0;$count<=$#words;++$count) {
                $word = $words[$count]; # but word has tag on it
                $word =~ s/^.*\///;
                print $word," ";
        }
        print "\n";
}

======================================================
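
The word and the tag can also be pulled apart in a single match by
putting parentheses around the parts of interest; after a successful
match, $1 and $2 hold what the first and second parenthesized groups
matched.  The following is a sketch, assuming Perl 5 and a made-up
input line.  It splits each token at its last slash, which is an
assumption about how a word that itself contains a slash should be
handled.

```perl
# Assumes Perl 5; the input line is illustrative.
my $line = "The/det boy/noun ate/verb and/or/conj";
my (@words, @tags);
foreach my $token (split(/\s+/, $line)) {
    # (.*) is greedy, so the split happens at the LAST slash
    if ($token =~ /^(.*)\/([^\/]*)$/) {
        push(@words, $1);
        push(@tags, $2);
    }
}
print join(" ", @words), "\n";   # prints: The boy ate and/or
print join(" ", @tags), "\n";    # prints: det noun verb conj
```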


Return the length of the longest word in a text:

=====================================================


while(<STDIN>) {
        $_ =~ s/^\s+//;
        @words = split(/\s+/,$_);
        for ($count=0;$count<=$#words;++$count) {
                @chars = split(//,$words[$count]);
                if ($#chars > $maxlength) {
                        $maxlength = $#chars;
                }
        }
}
$maxlength++;
# must add one, since the array index starts with 0
print $maxlength,"\n";

===================================================
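
The built-in length function returns the number of characters in a
string directly, which avoids both the split into characters and the
adjustment by one.  This is a sketch, assuming Perl 5 and a made-up
two-line text; it also remembers which word was the longest.

```perl
# Assumes Perl 5; the sample text is illustrative.
my $text = "a quick brown fox\njumped over";
my $maxlength = 0;
my $longest = "";
foreach my $word (split(/\s+/, $text)) {
    if (length($word) > $maxlength) {
        $maxlength = length($word);
        $longest = $word;
    }
}
print "$maxlength $longest\n";   # prints: 6 jumped
```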

Print out a random number from 1 to 10:

===================================================

srand; # sets the random number generator seed
$num = rand(10);       # a real number from 0 up to (but not including) 10
$num = int($num) + 1;  # an integer from 1 to 10
print "$num\n";

===================================================
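
Since rand(10) returns a real number that is at least 0 and less than
10, int(rand(10)) yields an integer from 0 to 9, and adding 1 shifts
the range to 1 through 10.  The following sketch, assuming Perl 5,
draws a thousand numbers this way and records the smallest and
largest seen; the loop count is arbitrary.

```perl
# Assumes Perl 5; the number of draws (1000) is arbitrary.
my ($min_seen, $max_seen) = (11, 0);
for (my $i = 0; $i < 1000; ++$i) {
    my $num = int(rand(10)) + 1;       # an integer from 1 to 10
    if ($num < $min_seen) { $min_seen = $num; }
    if ($num > $max_seen) { $max_seen = $num; }
}
print "smallest $min_seen, largest $max_seen\n";
```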

Common Mistakes: The following two mistakes probably account for about
half the debugging time of students in the class.  Looking out for
these errors could save you a great deal of time.

(1) Make sure every variable starts with the appropriate type symbol.
For instance, check that you haven't typed something like:
        myvar = 5;  (instead of $myvar = 5) 

(2) Make sure you have spelled all variables correctly.
        $myvar = 5;
        $myvaar++;   # spelling error
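
In perl5 (but not perl4), the line ``use strict;'' makes perl insist
that variables be declared with my before use, which catches the
second kind of mistake when the program is compiled.  The following
is a sketch of this: the misspelled program is compiled inside eval
so that the error can be captured and shown rather than stopping the
script.

```perl
# Assumes Perl 5.  The code inside q{...} is compiled by eval, so the
# error raised by "use strict" can be captured instead of ending the script.
my $result = eval q{
    use strict;
    my $myvar = 5;
    $myvaar++;      # spelling error: rejected because $myvaar was never declared
    1;
};
my $caught = defined($result) ? 0 : 1;   # eval returns undef when the code fails
if ($caught) {
    print "strict caught the error: $@";
}
```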

In general, try to write a number of small programs instead of one
monolithic program to get the job done, and pass the results between
them using Unix pipes.  This will facilitate debugging and code
reusability.

===================================================


If in doubt, look at the manual pages for perl.  In addition to
hard-copy books, there are a number of on-line perl manuals.

Everything discussed so far works for both perl4 and perl5. Perl5 has
some additional very nice features, including references (perl's
version of pointers) and hash tables of hash tables.  You may want to
explore these features.  However, they can easily be mimicked in
perl4.  If you want to hash based on a key of two words, in perl5 you
can say:

$hashtable{$word1}{$word2} = $value;

In perl4, you can still do this by:

$hashtable{$word1 . " " . $word2} = $value;

However, finding all word pairs in the hash table that have ``the'' as
the first word would be much easier in perl5:

while(($key,$val) = each %{$hashtable{"the"}}) {
        print "the" . " " . $key . "\n"; }

In perl4, you could do:

while (($key,$val) = each %hashtable) {
        @temp = split(/\s+/,$key);
        if ($temp[0] eq "the") {
                print "$key\n";
        }
}
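
Putting the perl5 version together, here is a sketch that counts word
pairs in a hash table of hash tables and then lists the words seen
after ``the''.  It requires perl5; the word pairs are made up for
illustration.

```perl
# Requires Perl 5 (nested hash tables); the word pairs are illustrative.
my %hashtable;
foreach my $pair ("the dog", "the cat", "a dog", "the dog") {
    my ($word1, $word2) = split(/ /, $pair);
    $hashtable{$word1}{$word2}++;          # count of word2 following word1
}

my @after_the = sort keys %{$hashtable{"the"}};
foreach my $word2 (@after_the) {
    print "the $word2 $hashtable{the}{$word2}\n";
}
# prints:
# the cat 1
# the dog 2
```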

============================================================================