"I realized at that point that there was a huge ecological niche between the C language and Unix shells. C was good for manipulating complex things - you can call it 'manipulexity.' And the shells were good at whipping up things - what I call 'whipupitude.' But there was this big blank area where neither C nor shell were good, and that's where I aimed Perl."
-- Larry Wall, author of Perl
Overview
In the first part, we talked about some basics and general issues in
Perl - writing a script, hash-bangs, style - as well as a number of specifics,
such as scalars, arrays, hashes, operators, and quoting methods. This month,
we'll take a look at the intrinsic Perl tools that make it so easy to use
from the command line, as well as their equivalents in scripts. We'll also
go a little deeper into quoting methods, and get a bit of a start on regexes
(regular expressions, or REs) - one of the most powerful tools in Perl,
and one that deserves an entire book all its own. [1]
Quote Mechanisms
Most of you will be familiar with the standard quoting mechanisms in Unix: the single and the double quote, which I'd already mentioned in my previous article, have much the same functionality in Perl as they do in the shell. Sometimes, though, escaping all the in-line metacharacters can be a bit painful. Imagine trying to print a string like this:
``/// Don't say "shan't," "can't," or "won't." ///''
Good grief! What can we do with a mess like that?
Well, we could put in a whole bunch of escapes ("\"), but that would be a pain - as well as a case of the LTS ("Leaning Toothpick Syndrome"):
print '\`\`\/\/\/ Don\'t...
<shudder> Obviously not a good answer. For times like these, Perl provides alternate quoting mechanisms:
q//   # Single quotes
qq//  # Double quotes
qx//  # Back quotes, for shell execution
qw//  # Word list - useful for populating arrays
Note also that the delimiter does not have to be '/', but can be almost any character that isn't a letter, a digit, or whitespace (paired brackets such as '()' or '{}' work too). Now our job becomes a bit easier:
print q-``/// Don't say "shan't," "can't," or "won't." ///''-;
Simple, eh? By the way, this is something you would use only inside
a script; the shell interpretation mechanism would make a horrendous mess
of this if you tried it from the command line, especially things like back
quotes and slashes.
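Here's a small sketch of all four operators in one script, just to make the differences concrete. The variable names and sample strings are my own inventions for illustration, and the "qx" line assumes a Unix-like system with an "echo" command available.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $name = 'Perl';

# q// behaves like single quotes: nothing is interpolated.
my $single = q{Don't expand $name here};

# qq// behaves like double quotes: variables are interpolated.
my $double = qq{"Hello," said $name};

# qw// splits its contents on whitespace and returns a list of words.
my @words = qw(shan't can't won't);

# qx// runs a shell command and returns whatever it printed
# (assumes 'echo' exists on this system).
my $output = qx(echo whipupitude);

print "$single\n";
print "$double\n";
print scalar(@words), " words: @words\n";
print $output;
```

Braces and parentheses are used as delimiters here so that the apostrophes and double quotes inside the strings need no escaping at all.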
Perl Invocation
"Hear my plea, O Perl of Great Wisdom!" Oh, never mind; I think that was standard in Perl3, and is now deprecated... :)
The most commonly-used switch in invoking Perl, if you're running it from the command line, is '-e'; this one tells Perl to execute the argument that comes immediately after it. In fact, when you bundle several switches together (as in '-we'), '-e' must be the last one in the bundle, because whatever follows it is taken to be the script!
perl -we 'print "The Gods send thread for the Web begun.\n"'
"-w" is the "warn" switch that I mentioned the last time. It tells you about all the non-fatal errors in your code, including variables that you set but didn't use (invaluable for finding mistyped variable names) as well as many, many other things. You should always - yes, always - use "-w", whether on the command line or in a script.
"-n" is the "non-printing loop" switch, which causes Perl to iterate over the input, one line at a time - somewhat like "awk". If you want to print a given line, you'll need to specify a condition for it:
perl -wne 'print if /holiday/' schedule.txt
Perl will loop through "schedule.txt" and print any line that contains the word "holiday", so you can get depressed about how little time off you actually have.
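For the script-minded, here is a rough sketch of what that "-n" loop looks like when written out by hand. So that it runs on its own, it reads from an in-memory string that stands in for "schedule.txt"; the sample lines are made up. In a real script you would normally write "while (<>)" to read the files named on the command line.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical stand-in for schedule.txt, so the sketch is self-contained.
my $schedule = "Mon: work\nTue: holiday\nWed: work\nThu: holiday party\n";
open my $in, '<', \$schedule or die "Cannot open in-memory file: $!";

my @matches;
# "-n" wraps your code in a loop much like this one: each line is
# read into the default variable, $_, and nothing is printed unless
# your code asks for it.
while (<$in>) {
    push @matches, $_ if /holiday/;
}
close $in;

print @matches;
```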
"-p" is the invocation for a "printing loop", which acts just like "-n" except that it prints every line that it loops over. This is very useful for "sed"-like operations, like modifying a file and writing it back out (we'll discuss 's///', the substitution operator, in just a bit):
perl -wpe 's/holiday/Party time!/' schedule.txt
This will perform the substitution on the first occurrence of the word 'holiday' in any given line (see "perldoc perlre" for discussion of modifiers used with 's///', such as 'g'lobal.)
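To see the "first occurrence only" behavior next to the 'g' modifier, here is a minimal sketch that runs both substitutions on copies of the same lines. The sample lines are invented for the purpose.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample lines standing in for schedule.txt.
my @lines = ("holiday, holiday, holiday\n", "no time off\n");

my (@first_only, @global);
for my $line (@lines) {
    # Copy the line, then substitute only the first match on it.
    (my $once = $line) =~ s/holiday/Party time!/;

    # Same again, but the 'g' modifier replaces every match on the line.
    (my $all = $line) =~ s/holiday/Party time!/g;

    push @first_only, $once;
    push @global,     $all;
}

print "without g: $_" for @first_only;
print "with g:    $_" for @global;
```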
The "-i" switch works well in combination with either of the above, depending on the desired action; it allows you to perform an "in-place" edit, i.e. make the changes in the specified file (optionally performing a backup beforehand) rather than printing them out to the screen. Note that we can't just tack an "i" onto the "wpe" string: it takes an optional argument - the extension to be appended to the backup copy - and the text that follows it is what specifies that extension.
perl -i~ -wpe 's/holiday/Party time!/' schedule.txt
The above line will produce a "schedule.txt" with the modified text
in it, and a "schedule.txt~" that is the original file. "-i" without any
extension overwrites the original file; this is far more convenient than
producing a modified file and renaming it back to the original, but be
sure that your code is correct, or you'll wipe out your original
data!
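Inside a script, one way to get the same effect is the special variable "$^I", which holds the backup extension just as the argument to "-i" does. The sketch below is only an illustration under my own assumptions: it creates a throwaway "schedule.txt" in a temporary directory (via the standard File::Temp module) so that no real data is at risk, and the file contents are made up.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Work in a throwaway directory that is removed when the script exits.
my $dir  = tempdir(CLEANUP => 1);
my $file = "$dir/schedule.txt";    # hypothetical sample file

open my $out, '>', $file or die "Cannot write $file: $!";
print {$out} "Monday: holiday\nTuesday: work\n";
close $out;

{
    # $^I is the in-place edit extension, as with "-i~" on the command line.
    local $^I   = '~';
    local @ARGV = ($file);
    while (<>) {
        s/holiday/Party time!/;
        print;    # while editing in place, this goes to the file
    }
}

# Read both files back to see what happened.
open my $new, '<', $file or die "Cannot read $file: $!";
my $edited = do { local $/; <$new> };
close $new;

open my $old, '<', "$file~" or die "Cannot read $file~: $!";
my $backup = do { local $/; <$old> };
close $old;

print "edited: $edited";
print "backup: $backup";
```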
RegExes, or "Has The Cat Been Walking On My Keyboard Again?"
One of the most powerful tools available in Perl, the regular expression is the way to match almost any imaginable character arrangement. Here (necessarily) I'll cover only the very basics; if you find that you need more information, dig into the "perlre" manpage that comes with Perl. That should keep you busy for a while. :)
REs are used for pattern matching, most commonly with the "m//" (matching) and "s///" (substitution) operators. Note that the delimiters in these, just like in the quoting mechanisms, are not restricted to '/'; in fact, the leading 'm' in the matching operator is required only if a non-default delimiter is used. Otherwise, just the "//" is sufficient.
Here are some of the metacharacters used with REs. Note that there are many more; these are just enough to get us started:
.      Matches any character except the newline
^      Match the beginning of the line
$      Match the end of the line
|      Alternation (match "left|right|up|down|sideways")
*      Match 0 or more times
+      Match 1 or more times
?      Match 0 or 1 times
{n}    Match exactly n times
{n,}   Match at least n times
{n,m}  Match at least n but not more than m times
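A quick way to get a feel for these is to throw a few patterns at a few strings and watch what happens. The sketch below does just that; the helper "matches" and all of the test strings are hypothetical, chosen only to exercise the metacharacters listed above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if $string matches $pattern, 0 otherwise.
sub matches {
    my ($pattern, $string) = @_;
    return $string =~ /$pattern/ ? 1 : 0;
}

# Each entry is a pattern and a string to try it against.
my @checks = (
    [ '^up',             'upward'   ],    # anchored at the start
    [ '^up',             'startup'  ],
    [ 'up$',             'startup'  ],    # anchored at the end
    [ '^b.d$',           'bad'      ],    # '.' needs exactly one character
    [ '^b.d$',           'bd'       ],
    [ 'colou?r',         'color'    ],    # '?' makes the 'u' optional
    [ '^(left|right)$',  'left'     ],    # alternation
    [ '^(left|right)$',  'leftover' ],
    [ '^a{2,3}$',        'aaa'      ],    # between 2 and 3 repetitions
    [ '^a{2,3}$',        'aaaa'     ],
);

for my $check (@checks) {
    my ($pattern, $string) = @$check;
    my $verdict = matches($pattern, $string) ? 'matches' : 'does not match';
    printf "%-18s %-10s %s\n", "/$pattern/", $string, $verdict;
}
```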
As an example, let's say that we have a file with a list of names:
Anne Bonney
Bartholomew Roberts
Charles Bellamy
Diego Grillo
Edward Teach
Francois Gautier
George Watling
Henry Every
Israel Hands
John Derdrake
KuoHsing Yeh
...
and we want to replace the first name with 'Captain'. Obviously, we would go through the file with a printing loop and do a substitution if it matched our criteria:
s/^.+ /Captain /;
The caret ('^') matches at the beginning of the line, the ".+" says "any character, repeated 1 or more times", and the space matches a space. Once we find what we're looking for, we're going to replace it with 'Captain' followed by a space - since the string that we're replacing contains one, we'll need to put it back.
Let's say that we also knew that somewhere in the file, there are a couple of names that contain apostrophes (Francois L'Ollonais), and we wanted to skip them - or anything else that contained 'non-letter' characters. Let's expand the regex a bit:
s/^[A-Z][a-z]* /Captain /;
We've used the "character class" specifiers, "[]", to first match one character between 'A' and 'Z' - note that only one character is matched by this mechanism, a very important distinction! - followed by a one-character match of 'a' through 'z' and an asterisk, which, again, says "zero or more of the preceding character". Keep in mind that this pattern only checks the first name: an apostrophe in the last name, as in "Francois L'Ollonais", will not stop the substitution.
Oops, wait! How about "KuoHsing"? The match would fail on the 'H', since upper-case characters were not included in the specified range. OK, we'll modify the regex:
s/^\w* /Captain /;
The '\w' is a "word character" - once again, it matches only one character - that includes 'A-Z', 'a-z', '0-9', and '_'. It is preferable to [A-Za-z0-9_] because it can take its cue from the system's locale settings (when 'use locale' is in effect) to determine what characters should or should not be part of words - and this is important in languages other than English. As well, '\w' is easier to type than '[A-Za-z0-9_]'.
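To compare the three attempts side by side, here is a small sketch that applies each of them to a few names. "Anne Bonney" and "KuoHsing Yeh" come from the list above; "Jean-Luc Doe" is a made-up name, added only to show what happens when the first name contains a non-letter. The helper "promote" is likewise hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @names = ('Anne Bonney', 'KuoHsing Yeh', 'Jean-Luc Doe');

# The three patterns tried in the text, in order.
my @patterns = (
    [ 'any characters'  => qr/^.+ /          ],
    [ 'capital + lower' => qr/^[A-Z][a-z]* / ],
    [ 'word characters' => qr/^\w* /         ],
);

# Hypothetical helper: returns a copy of $name with the first name
# replaced, or the name unchanged if the pattern does not match.
sub promote {
    my ($regex, $name) = @_;
    (my $copy = $name) =~ s/$regex/Captain /;
    return $copy;
}

for my $pair (@patterns) {
    my ($label, $regex) = @$pair;
    print "$label:\n";
    print '  ', promote($regex, $_), "\n" for @names;
}
```

Names that a pattern refuses to match simply come back unchanged, which is the "skip them" behavior we were after.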
Let's try something a bit different: What if we still wanted to match all the first names, but now, rather than replacing them, we wanted to swap them around with the last names, separate the two with a comma, and precede the last name with the word 'Captain'? With regexes at our command, it's not a problem:
s/^(\w*) (\w*)$/Captain $2, $1/;
Note the parentheses and the "$1" and "$2" variables: the parentheses "capture" the enclosed part of the regex, which we can then refer to via the variables (the first captured piece is $1, the second is $2, and so on.) So, here is the above regex in English:
Starting from the beginning of the line, (begin capture into $1) match any "word character" repeated zero or more times (end capture) and followed by a space, (begin capture into $2) followed by any "word character" repeated zero or more times (end capture) until the end of the line. Return the word 'Captain' followed by a space, which is followed by the value of $2, a comma, a space, and the value of $1.
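As a quick check of that description, here is a minimal sketch that runs the substitution over a few of the names from the list and collects the results. The array names are my own.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @names = ('Anne Bonney', 'Henry Every', 'Israel Hands');

my @swapped;
for my $name (@names) {
    # $1 captures the first name, $2 the last name; the replacement
    # puts them back in the opposite order.
    (my $copy = $name) =~ s/^(\w*) (\w*)$/Captain $2, $1/;
    push @swapped, $copy;
}

print "$_\n" for @swapped;
```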
I'd say that regexes are a very compact way to say all of the above. At times like these, it becomes pretty obvious that Larry Wall is a professional linguist. :)
These are just simple examples of what goes into building a regex. I
must admit to cheating a bit: name-parsing is probably one of the biggest
challenges out there, and I could have spun these examples out as long as
I wanted. Considering that the possibilities include "John deJongh", "Jan M.
van de Geijn", "Kathleen O'Hara-Mears", "Siu Tim Au Yeung", "Nang-Soa-Anee
Bongoj Niratpattanasai", and "Mjölby J. de Wærn" (remember to
use those LOCALE-aware matches, right?), the field is pretty broad and
very odd in spots. (Miss Niratpattanasai, after looking at something like
"John Smith", would probably agree. :)
Here's an important factor to be aware of in the regex mechanism: by default, it does "greedy matching". In other words, given a phrase like
Acciones son amores, no besos ni apachurrones
and a regex like
/A.*es/
it would match the following:
Acciones son amores, no besos ni apachurrones
|___________________________________________|
Hmmm. Everything from the first 'A' (followed by zero or more of any character) to the last 'es'. How can we match just the first instance, then? To counteract the greed, Perl provides a "generosity" modifier - a '?' placed right after a quantifier such as '*', '+', or '?':
/A.*?es/
Acciones son amores, no besos ni apachurrones
|______|
There. Much better. For future reference, remember: if you're breaking
up a string by matching its pieces with a series of regexes, and the last
"chunks" are coming up empty, you've probably got a "greed" problem.
The Default Buffer/Variable
Some of you, especially those who have done some programming in the past, have probably been curious about some of the code constructs above, like
print if /holiday/;
"Print what if what? Where is the variable that we're checking for the match? Shouldn't it be something like 'if $x == /holiday/', the way it is in the shell?"
I'm glad you asked that question. :)
Perl uses an interesting concept, found in a few other languages, of the default buffer - also referred to as the default variable and the default pattern space. Not surprisingly, it's used in the looping constructs - when we use the "-n/-p" syntax in the Perl invocation, it is the variable used to hold the current line - as well as in substitution and matching, and a number of other places. The '$_' variable is the default for all of the above; when a variable is not specified in a place where you'd expect one, '$_' is usually the "culprit." In fact, '$_' is rather difficult to explain - it turns up in so many places that coming up with an algorithm is seemingly impossible - but it is wonderfully easy and intuitive to use, once you get the idea.
Consider the following:
perl -wne 'if ( $_ =~ /Henry/ ) { print $_; }' pirates
If a line in the "pirates" file, above, matches "Henry", it will be printed. Fine; but now, let's play some amateur "Perl Golf" - that's a contest among Perl hackers to see how many (key)strokes can be taken off a piece of code and still leave it functional.
Since we already know that Perl reads each line into '$_', we'll just get rid of all the explicit declarations of it:
perl -wne 'if ( /Henry/ ) { print; }' pirates
Perl "knows" that we're matching against the default variable, and it "knows" that the "print" statement applies to the same thing. Now, we apply a little Perl idiom:
perl -wne 'print if /Henry/' pirates
Isn't that nice? Perl actually allows you to write out your code with
the condition following the action; kinda the way you'd say things in English.
Oh, and we've snipped off the semicolon on the end because we don't need
it: it's a statement separator, and there's no statement following
"/Henry/".
<grin> For those of you playing along at home, try
perl -ne'/Henry/&&print' pirates
It shouldn't be that hard to figure out; the '&&' operator
in Perl works the same way as it does in the shell. Perl Golf is fun to
play, but be careful: it's easy to write code that will work but will
require lots of head-scratching to understand. Don't Do That. I may have
to maintain your code tomorrow... just like you may have to maintain mine.
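If you want to convince yourself that all of these spellings really do the same thing, here is a sketch that runs each one inside a single loop and collects what it would have printed. The array names and the three sample lines (borrowed from the pirate list) are my own choices for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# A few lines standing in for the "pirates" file.
my @pirates = ("Henry Every\n", "Israel Hands\n", "John Derdrake\n");

my (@explicit, @implicit, @modifier, @golfed);
for (@pirates) {                       # each line lands in $_
    if ( $_ =~ /Henry/ ) { push @explicit, $_; }    # everything spelled out
    if ( /Henry/ )       { push @implicit, $_; }    # match against $_ implied
    push @modifier, $_ if /Henry/;                  # condition after the action
    /Henry/ && push @golfed, $_;                    # the '&&' version
}

print "explicit: @explicit";
print "implicit: @implicit";
print "modifier: @modifier";
print "golfed:   @golfed";
```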
In the first example, note the "binding operator", '=~', which checks for a match in the supplied variable. This is what you would use if you were matching against a variable other than "$_". There is also a "negative match" operator, '!~', which returns true if the match fails (the inverse of '=~'.)
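Here is a minimal sketch of both operators working on a named variable rather than on "$_". The variable names and the yes/no strings are just for show.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $pirate = 'Edward Teach';

# '=~' is true when the pattern matches the variable on its left.
my $is_teach = $pirate =~ /Teach/ ? 'yes' : 'no';

# '!~' is true when the pattern does NOT match.
my $not_henry = $pirate !~ /Henry/ ? 'yes' : 'no';

print "Contains 'Teach'?        $is_teach\n";
print "Free of 'Henry'?         $not_henry\n";
```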
Note also that the available modifiers for simple statements, like that
above, include not only the "if", but also "unless", "while", "until",
and "for". All of these, and more, are coming up in Part 3...
Ben Okopnik
perl -we '$perl=0;JsP $perl "perl"; $perl->perl(0)'\
2>&1|perl -ne '{print ((split//)[19,29,20,4,5,1,2,
15,13,14,12,52,5,21,12,52,8,5,14,1,6,37,12,52,75])}'
References:
Relevant Perl man pages (available on any pro-Perl-y configured system):
perl - overview
perlfaq - Perl FAQ
perltoc - doc TOC
perldata - data structures
perlsyn - syntax
perlop - operators/precedence
perlrun - execution
perlfunc - builtin functions
perltrap - traps for the unwary
perlstyle - style guide
"perldoc", "perldoc -q" and "perldoc -f"