Introduction to the PERL Programming Language

Eric Brill

INTRODUCTION

PERL is a very nice programming language for language processing. Once you learn a small bag of tricks, you should be able to develop programs very rapidly. Below is a simple PERL program:

=======================================================
#!/usr/local/bin/perl
# this is a comment
print "hello, world\n";
=======================================================

The first line specifies where to find perl. If your system has perl elsewhere, you will have to modify this line. Comments begin with the symbol #. The third line prints "hello, world". The last character, \n, produces a newline.

To run this program, you would write it to some file, say foo.prl. Then you would make foo.prl executable by executing:

    chmod +x foo.prl

To run the program, you would just type:

    foo.prl

(If the current directory is not on your search path, type ./foo.prl instead.)

SCALAR VARIABLES

Here is a program that adds two numbers together and prints the sum.

=======================================================
#!/usr/local/bin/perl
$first_num = 10;
$sec_num = 5;
$third_num = $first_num + $sec_num;
print "The sum of ", $first_num, " and ", $sec_num, " is ", $third_num, "\n";
=======================================================

In this program $first_num, $sec_num and $third_num are variables. You do not have to declare variables in perl. Also, perl variables are weakly typed. The first character of a variable indicates its type. $ means that the variable is a scalar. A scalar variable can be a string, an integer or a real number. You do not have to specify which of these it is. Perl figures it out based on context. Before a scalar is assigned a value, its value is 0 if used in a number context or the null string if used in a string context.

Here is another version of the "hello world" program. (We will drop the #!/usr/local/bin/perl line from now on. You still need to include it in your program.)
=======================================================
$the_string = "hello world";
print $the_string, "\n";
=======================================================

A period is used for string concatenation. So yet another version:

=======================================================
$hello = "hello";
$world = "world";
$hello_world = $hello . " " . $world . "\n";
print $hello_world;
=======================================================

ARRAYS

Another variable type is the array. Arrays do not have to be predeclared, nor does their size have to be specified. Arrays hold scalar values. A variable beginning with the character @ is an array.

Now things get a bit confusing. @x is the array x. $x is the scalar variable x. These two variables are not in any way related. To reference the first item of the array @x, we use $x[0]. This is because the element stored at index 0 is a scalar, and so the first character indicates that this is a scalar value.

For an array @x, the special variable $#x indicates the highest index used in the array. So, $x[$#x] will be the last element in array @x.

Here are two versions of a program for assigning values to an array.

=======================================================
$x[0] = "dog";
$x[1] = 34;
=======================================================

=======================================================
# $#x is initially -1, as @x does not exist.
$x[++$#x] = "dog";
$x[++$#x] = 34;
=======================================================

In perl, strings and arrays are closely related. In fact, there are two functions for converting from one to the other.

split takes a string and splits it into an array. split takes two arguments. The first argument is a regular expression specifying what to split on, the second argument is the string to split.
The code:

=======================================================
$avar = "abcAAdefAAg";
@x = split(/AA/, $avar);
=======================================================

will result in $x[0] holding "abc", $x[1] holding "def", and $x[2] holding "g". split does not alter the contents of $avar.

join takes two arguments. The first is the character sequence that will be placed between elements of the array, and the second specifies the array to be joined. The reverse operation of the split above is:

=======================================================
$x[0] = "abc";
$x[1] = "def";
$x[2] = "g";
$avar = join("AA", @x);
=======================================================

CONTROL STRUCTURES

Perl's control structures are similar to those in C, except that every block must be enclosed in braces, even a block containing a single statement.

==========================================
$x = 0;
while ($x < 5) {
    print $x, "\n";
    ++$x;
}
==========================================
for ($count = 0; $count < 5; ++$count) {
    print $count, "\n";
}
==========================================
if ($x == 2) {
    print "yes\n";
}
else {
    print "no\n";
}
==========================================
if ($x == 1) {
    print "ONE\n";
}
elsif ($x == 2) {
    print "TWO\n";
}
else {
    print "OTHER\n";
}
==========================================

For comparing numbers, perl uses the same symbols as C. For comparing strings, eq is true if two strings are equal, and ne is true if two strings are not equal. So,

=======================================================
$x = "yes";
$y = "no";
if ($x eq $y) {
    print "1\n";
}
else {
    print "2\n";
}
=======================================================

will print 2. And

=======================================================
$x = "yes";
$y = "no";
if ($x ne $y) {
    print "1\n";
}
else {
    print "2\n";
}
=======================================================

will print 1.
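Since a scalar can hold either a number or a string, the choice of comparison operator decides how two values are compared. The short sketch below compares the same pair of values both ways; the variable names ($left, $right, $numeric, $stringwise) are made up for this illustration.

```perl
$left  = "10";
$right = "10.0";

# == compares the two values as numbers: 10 and 10.0 are the same number.
if ($left == $right) {
    $numeric = "same";
}
else {
    $numeric = "different";
}

# eq compares the two values as strings: "10" and "10.0" differ.
if ($left eq $right) {
    $stringwise = "same";
}
else {
    $stringwise = "different";
}

print "compared as numbers: ", $numeric, "\n";
print "compared as strings: ", $stringwise, "\n";
```

The numeric test succeeds while the string test fails, so using == where eq was intended (or the reverse) can silently give the wrong answer.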
The following program loads an array with some integers and then prints out the contents of the array:

=======================================================
$x[0] = 10;
$x[1] = 15;
$x[2] = 5;
for ($c = 0; $c <= $#x; ++$c) {
    print "$x[$c]\n";
}
=======================================================

ASSOCIATIVE ARRAYS (HASH TABLES)

The final variable type is the associative array. An associative array is a structure of key, value pairs. A variable beginning with the character % is an associative array. Here are some examples:

    $x{"dog"}

returns the value (a scalar) associated with the key "dog".

    $x{"dog"} = "cat";

sets the value associated with "dog" to be "cat".

Note that associative arrays use curly brackets ($x{2}), while arrays use square brackets ($x[2]). In other words, $x{2} returns the value associated with the key 2 in the associative array %x, whereas $x[2] returns the third element (because indexing starts at 0) of the array @x.

Before a key is inserted into an associative array, the value associated with that key is 0 or the empty string, depending on context.
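As a minimal sketch of the points above, the following stores values in an array and an associative array that share the name x, then reads from a key that was never inserted. The keys and values are invented for illustration.

```perl
# @x and %x are separate variables, even though they share a name.
$x[2] = "third element of the array";
$x{2} = "value stored under the key 2";
$x{"dog"} = "cat";

print $x[2], "\n";        # square brackets: the array @x
print $x{2}, "\n";        # curly brackets: the associative array %x
print $x{"dog"}, "\n";

# The key "bird" was never inserted, so in a number context
# its value acts as 0.
$missing_plus_one = $x{"bird"} + 1;
print $missing_plus_one, "\n";
```

The two assignments with index 2 do not interfere with each other, which is why the bracket style matters.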
Here's a program to count the number of even numbers in [0..10]:

=======================================================
for ($count = 0; $count <= 10; ++$count) {
    if ($count % 2 == 0) {      # if count mod 2 is 0, then even
        $thearray{"EVEN"}++;    # this is shorthand for
                                # $thearray{"EVEN"} = $thearray{"EVEN"} + 1;
    }
}
print "There were ", $thearray{"EVEN"}, " even numbers\n";
=======================================================

If we want to print out all of the keys and values in an associative array, we do the following:

=======================================================
while (($key, $val) = each %thearray) {
    print $key, " ", $val, "\n";
}
=======================================================

Note that we use % in front of thearray here because we are talking about the entire associative array, while we use $ for $thearray{"EVEN"} because we are talking about a specific value in the associative array.

If your program takes arguments, you can refer to those arguments much as you do in C. Unlike in C, $ARGV[0] refers to the first argument, not the name of the program. $ARGV[1] refers to the second argument, and so on.

Here's a simple program that takes two files as arguments and tells which file contains more lines.

=======================================================
open(FILE1, $ARGV[0]);
open(FILE2, $ARGV[1]);
while (<FILE1>) {
    $num_lines_1++;
}
close(FILE1);
while (<FILE2>) {
    $num_lines_2++;
}
close(FILE2);
if ($num_lines_1 > $num_lines_2) {
    print "The first file contained more lines\n";
}
else {    # also reached when the two counts are equal
    print "The second file contained more lines\n";
}
=======================================================

The line while (<FILE1>) reads lines from FILE1 until it reaches the end of file. When a line is read, it is stored in the special variable $_.

This is a program to print all lines from a file.
=======================================================
open(FILE1, $ARGV[0]);
while (<FILE1>) {
    print $_;
}
close(FILE1);
=======================================================

To read from stdin, we do not need to call open:

=======================================================
while (<STDIN>) {
    print $_;
}
=======================================================

REGULAR EXPRESSIONS

    \s        matches a whitespace character, such as a space or tab
    ^         matches the start of a string
    $         matches the end of a string
    a         matches the letter a
    a+        matches 1 or more a's
    a*        matches 0 or more a's
    (ab)+     matches 1 or more ab's
    [^abc]    matches a character that is not a or b or c
    [a-z]     matches any lower case letter
    .         matches any character (other than a newline)

To test whether a string in $x contains the string "abc", we can use:

    if ($x =~ /abc/) { . . . }

To test whether a string begins with "abc":

    if ($x =~ /^abc/) { . . . }

To test whether a string begins with a capital letter:

    if ($x =~ /^[A-Z]/) { . . . }

To test whether a string does not begin with a lower case letter:

    if ($x =~ /^[^a-z]/) { . . . }

In the above example, the first ^ matches the beginning of the string, while the ^ within the square brackets means "not".

In addition to using regular expressions for testing strings, we can also use them to change strings. To do this, we use a command of the form:

    s/FROM/TO/options

where FROM is the matching regular expression and TO is what to change it to. options can either be blank, meaning to change only the first match of FROM in the string, or it can be g, meaning to change every match (globally).
To change all a's to b's in the string in variable $x:

    $x =~ s/a/b/g;

To change the first a to b:

    $x =~ s/a/b/;

To change all strings of consecutive a's into one a:

    $x =~ s/a+/a/g;

To remove all strings of consecutive a's:

    $x =~ s/a+//g;

To remove blanks from the start of a string:

    $x =~ s/^\s+//;

SAMPLE PROGRAMS

A loop that reads lines of input, each holding two numbers separated by a space, and prints their sum. It runs until the input ends:

=======================================================
while (<STDIN>) {
    $_ =~ s/^\s+//;             # removes spaces at start of line,
                                # since we will split on space
    @nums = split(/\s+/, $_);   # we can now easily access the two
                                # numbers
    $answer = $nums[0] + $nums[1];
    print "THE ANSWER IS: ", $answer, "\n";
}
=======================================================

A messier way to do this:

=======================================================
while (<STDIN>) {
    $_ =~ s/^\s+//;             # removes spaces at start of line
    $num1 = $_;
    $num2 = $_;                 # makes fresh copies of the input line
    $num1 =~ s/\s+[0-9]+$//;    # strips the second number
    $num2 =~ s/^[0-9]+\s+//;    # strips the first number
    $answer = $num1 + $num2;
    print "THE ANSWER IS: ", $answer, "\n";
}
=======================================================

Given a text, return a list of words and word counts:

=======================================================
while (<STDIN>) {
    $_ =~ s/^\s+//;     # Good idea to always do this. If the line
                        # starts with blanks, then the first element
                        # of the array after splitting would be null
    @words_in_line = split(/\s+/, $_);
                        # splits the line into an array of words
    for ($count = 0; $count <= $#words_in_line; ++$count) {
        $word_count{$words_in_line[$count]}++;
    }
}
while (($key, $val) = each %word_count) {
    print "$key $val\n";
}
=======================================================

Given a text, return a list of word pairs and their counts:

=======================================================
while (<STDIN>) {
    $_ =~ s/^\s+//;     # Good idea to always do this. If the line
                        # starts with blanks, then the first element
                        # of the array after splitting would be null
    @words_in_line = split(/\s+/, $_);
                        # splits the line into an array of words
    for ($count = 0; $count <= $#words_in_line - 1; ++$count) {
        $word_count{$words_in_line[$count] . " " . $words_in_line[$count+1]}++;
    }
}
while (($key, $val) = each %word_count) {
    print "$key $val\n";
}
=======================================================

A program to calculate the frequency of three-letter endings for words in a text:

=======================================================
while (<STDIN>) {
    $_ =~ s/^\s+//;
    @words = split(/\s+/, $_);
    for ($count = 0; $count <= $#words; ++$count) {
        @chars = split(//, $words[$count]);
                        # we split on nothing, which gives
                        # an array of characters.
        if ($#chars > 1) {
                        # make sure there are at least three chars
            $ending{$chars[$#chars-2] . " " . $chars[$#chars-1] . " " . $chars[$#chars]}++;
        }
    }
}
while (($key, $val) = each %ending) {
    print "$key $val\n";
}
=======================================================

A program that takes two files and outputs all lines in the first file where the same line occurs in the same position in the second file:

=======================================================
open(FILE1, $ARGV[0]);
open(FILE2, $ARGV[1]);
while (<FILE1>) {
    $line_from_2 = <FILE2>;
    if ($_ eq $line_from_2) {
        print $_;
    }
}
close(FILE1);
close(FILE2);
=======================================================

Given text labelled with parts of speech, such as

    The/det boy/noun ate/verb . . .

strip off the part of speech tags:

=======================================================
while (<STDIN>) {
    $_ =~ s/^\s+//;
    @words = split(/\s+/, $_);
    for ($count = 0; $count <= $#words; ++$count) {
        $word = $words[$count];     # but word has tag on it
        $word =~ s/\/.*$//;
        # this says: take the part of the string that starts
        # with a slash and then contains any character sequence
        # until the end of the string, and convert it to the
        # null string. Note that we have to backslash the /
        # character in the regular expression.
        print $word, " ";
    }
    print "\n";
}
=======================================================

Given the same input, this program strips off the words and returns the part of speech tags:

=======================================================
while (<STDIN>) {
    $_ =~ s/^\s+//;
    @words = split(/\s+/, $_);
    for ($count = 0; $count <= $#words; ++$count) {
        $word = $words[$count];     # but word has tag on it
        $word =~ s/^.*\///;
        print $word, " ";
    }
    print "\n";
}
=======================================================

Return the length of the longest word in a text:

=======================================================
while (<STDIN>) {
    $_ =~ s/^\s+//;
    @words = split(/\s+/, $_);
    for ($count = 0; $count <= $#words; ++$count) {
        @chars = split(//, $words[$count]);
        if ($#chars > $maxlength) {
            $maxlength = $#chars;
        }
    }
}
$maxlength++;   # must add one, since the array index starts with 0
print $maxlength, "\n";
=======================================================

Print out a random number from 1 to 10:

=======================================================
srand;                  # sets the random number generator seed
$num = rand(10);        # a real number from 0 up to (but not including) 10
$num = int($num) + 1;   # an integer from 1 to 10
print "$num\n";
=======================================================

Common Mistakes:

The following two mistakes probably account for about half the debugging time of students in the class. Looking out for these errors could save you a great deal of time.

(1) Make sure every variable starts with the appropriate type symbol. For instance, check that you haven't typed something like:

    myvar = 5;      (instead of $myvar = 5;)

(2) Make sure you have spelled all variables correctly.

    $myvar = 5;
    $myvaar++;      # spelling error

In general, try to write a number of small programs instead of one monolithic program to get the job done, and pass the results between them using Unix pipes. This will facilitate debugging and code reusability.

=======================================================

If in doubt, look at the manual pages for perl. In addition to hard-copy books, there are a number of on-line perl manuals.
Everything discussed so far works for both perl4 and perl5. Perl5 has some additional very nice features, including references (pointers) and hash tables of hash tables. You may want to explore these features. However, they can easily be mimicked in perl4. If you want to hash based on a key of two words, in perl5 you can say:

    $hashtable{$word1}{$word2} = $value;

In perl4, you can still do this by:

    $hashtable{$word1 . " " . $word2} = $value;

However, finding all word pairs in the hash table that have "the" as the first word would be much easier in perl5:

    while (($key, $val) = each %{$hashtable{"the"}}) {
        print "the" . " " . $key . "\n";
    }

In perl4, you could do:

    while (($key, $val) = each %hashtable) {
        @temp = split(/\s+/, $key);
        if ($temp[0] eq "the") {
            print "$key\n";
        }
    }

=======================================================
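To see the two approaches side by side, here is a small sketch that runs under perl5 (the nested form is not available in perl4). The sample word pairs and the names %nested, %flat, @from_nested and @from_flat are invented for illustration. It also uses foreach, keys, sort and push, which are standard perl functions not covered in this introduction; sort is there only so that both lists come out in the same order.

```perl
# Hypothetical sample data: three word pairs.
@pairs = ("the dog", "the cat", "a dog");

foreach $pair (@pairs) {
    ($word1, $word2) = split(/\s+/, $pair);
    $nested{$word1}{$word2}++;          # perl5 style: hash table of hash tables
    $flat{$word1 . " " . $word2}++;     # perl4 style: one joined key
}

# perl5 style: go straight to the pairs whose first word is "the".
@from_nested = ();
foreach $word2 (sort keys %{$nested{"the"}}) {
    push(@from_nested, "the " . $word2);
}

# perl4 style: scan every key and keep those whose first word is "the".
@from_flat = ();
foreach $key (sort keys %flat) {
    @temp = split(/\s+/, $key);
    if ($temp[0] eq "the") {
        push(@from_flat, $key);
    }
}

print join(", ", @from_nested), "\n";
print join(", ", @from_flat), "\n";
```

Both loops find the same pairs. The difference is that the joined-key version has to visit every entry in the table, while the nested version visits only the entries stored under "the".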