PLUS - Grep
For the final post in this series, let’s write a real program in PLUS. Rather than working on a Rosetta Code problem, I’ll use this opportunity to implement a useful program, a version of Unix grep for MTS.
Grep
Grep is a program from Unix to find text in files. The first version was written in around 1974 in PDP-11 assembly language, and is available on pretty much every Unix derived system today. I don’t believe there is an equivalent for MTS - the MTS File Editor does have a rich pattern matching language but is a bit unwieldy to use from the command line.
The requirements for the program are:
- input a pattern to search for, which should support some kind of regular expression to match arbitrary text
- input a file or list of files to search in
- compare the pattern to the contents of the files and print any lines that match.
- follow MTS conventions for calling the program.
Pattern matching
The first problem is how can we match patterns? There is no regular expression library like regex in Unix. Looking through Volume 3, on page 366.3 there is a set of routines that look promising:
Three system subroutines, PATBUILD, PATMATCH, and PATFREE, are available for implementing $FILESTATUS-like pattern-matching capabili- ties from user programs.
PATBUILD will build a pattern from an input string. The input string may be of any length and may specify a file name or a generic string. For example:
File names: “2CYB:data?”, “?.doc”, “2ABC:?”
Generic strings: “Bill R?”, “in the state of ?”
PATMATCH will compare an input string against the pattern built by PATBUILD. The input string may be of any length.
PATFREE will free the storage used to build the pattern.
Although this is more suited for file wildcard expansion than general text search, we’ll use this as a quick way to get the program running.
The PLUS Source Library
Definition
does not mention these routines, but looking through the actual contents of *PLUS.SOURCELIB
there are definitions available. For example for patbuild
:
%Include(Numeric_Types);
type Patbuild_Procedure_Type is
procedure
reference parameters Patstring is value unknown,
Strlen is value Integer,
Patworkaddr is pointer to unknown,
optional reference parameters Flags is value bit(32),
Ccid is value character(4),
Chars is value character(2),
end;
/* Values used in Flags word */
constant Patbuild_Filename is '00000000',
Patbuild_Nonfilename is '00000001',
Patbuild_Ignore_Case is '00000000',
Patbuild_Dont_Ignore_Case is '00000004';
Driving the pattern matcher
Let’s define a data structure to be used by our grep program.
global Grep
/* The pat subroutines need a working area. */
type Pattern_Work_Area_Storage_Type is bit(32);
type Pattern_Work_Area_Type is pointer to Pattern_Work_Area_Storage_Type;
constant Pattern_Work_Area_Size is Byte_Size(Pattern_Work_Area_Storage_Type);
constant Wildcard_Character is "?";
type Pattern_Struct is
record
Work_Area is Pattern_Work_Area_Type,
Error_Message is Varying_String,
Match_Count is Integer
end record;
end global Grep;
And also some procedures, using pass-by-reference for structs and strings to avoid copying.
procedure Grep_Initialize is
procedure
reference parameters
Pattern is Pattern_Struct,
Pattern_Expression is Varying_String
result
Grep_Initialize_Result is Boolean
end;
procedure Grep_Match is
procedure
reference parameters
Pattern is Pattern_Struct,
Query_Text is Varying_String
result
Match_Result is Boolean
end;
procedure Grep_Terminate is
procedure
reference parameter
Pattern is Pattern_Struct
end;
Grep_Initialize
will take our data structure and a pattern expression input by the user. It will fill out the data structure and return a Boolean to show if it worked or not. If it failed, the Error_Message
field in the struct will be set to an appropriate message.
Grep_Match
will apply the pattern to a line of query text and return a Boolean stating whether there was a match or not. It will update the Match_Count
member in the structure so we can keep track of how many results we get.
Finally, Grep_Terminate
will free up the memory used by the system.
Grep_Initialize
in detail
Let’s look at the implementation of Grep_Initialize
.
definition Grep_Initialize;
Pattern.Match_Count := 0;
Pattern.Error_Message := "";
Pattern.Work_Area := Getspace(0, Pattern_Work_Area_Size);
if Length(Pattern_Expression) = 0
then
Pattern.Error_Message := "Missing pattern in PAR field";
return with False;
end if;
/* Add leading and trailing wildcards to match anywhere
on a line. */
variable Match_In_Line is Varying_String;
Match_In_Line := Wildcard_Character || Pattern_Expression ||
Wildcard_Character;
variable Len is Integer;
variable Text_Ptr is pointer to Fixed_String;
Get_String_Text_And_Length(Match_In_Line, Text_Ptr, Len);
variable Return_Code is Return_Code_Type;
Patbuild(Text_Ptr@, Len, Pattern.Work_Area, Pattern_Flags,
return code Return_Code);
/* Any error code greater than this is unrecoveable. */
constant Patbuild_Error_Code_Threshold is 8;
if Return_Code > Patbuild_Error_Code_Threshold
then
Pattern.Error_Message := "Error constructing pattern";
return with False;
end if;
return with True;
end Grep_Initialize;
It first starts by initializing the Pattern
structure we passed in. The work area is a piece of memory needed by the system pattern routines so this is allocated. It then checks the pattern expression provided to see if it is not empty. If OK, it adds the wildcard ?
to the start and end of the expression, so we can match text anywhere on a line.
We are nearly ready to call Patbuild
to initialize the pattern. However, our pattern expression is a Varying_String
and Patbuild
requires a location in memory for the underlying string and a length. To get this, we define a macro Get_String_Text_And_Length
that does some type casting to get access to the members of Varying_String
:
macro Get_String_Text_And_Length
/* Given a Varying_String, get the address of the underlying
Fixed_String and copy the length to an Integer variable.
This is needed for the calls to system subrountines
like Patbuild below. */
parameters are V_String, Text_Ptr, Len;
equate V_String_Struct to V_String as Varying_String_Structure_Type;
equate Text to V_String_Struct.Varying_String_Text as Fixed_String;
Text_Ptr := Address(Text);
Len := Length(V_String);
end macro;
The macro encapsulates the details of accessing these members and allows us to use this in other parts of the program, but avoids the cost of a procedure call as the code will be inlined to the current function.
We can now call Patbuild
using the return code Return_Code
parameter to get the system routine result code. Volume 3 states this can fail under some cases, so we check this code and return an appropriate Boolean.
Grep_Match
and Grep_Terminate
The procedure to do the actual matching is then simple:
definition Grep_Match;
variable Len is Integer;
variable Text_Ptr is pointer to Fixed_String;
Get_String_Text_And_Length(Query_Text, Text_Ptr, Len);
variable Return_Code is Return_Code_Type;
Patmatch(Text_Ptr@, Len, Pattern.Work_Area, return code Return_Code);
if Return_Code = 0
then
Pattern.Match_Count +:= 1;
return with True;
end if;
return with False;
end Grep_Match;
Grep_Terminate
just calls Patfree
.
Calling the Grep
routines
The Main
procedure can now be constructed by
- reading the PAR parameter from the command line to get the pattern expression
- calling
Grep_Initialize
- loop, reading lines from
SCARDS
, matching and printing results - print a final count of matches
definition Main;
variable Msg is pointer to Stream_Type;
variable Pattern is Pattern_Struct;
variable Pattern_Expression is Varying_String;
variable Line is Varying_String;
variable Return_Code is Return_Code_Type;
Msg := Message_Initialize();
Pattern_Expression := Trim(Par);
if not Grep_Initialize(Pattern, Pattern_Expression)
then
Message(Msg, "<v></>", Pattern.Error_Message);
else
/* Read lines from SCARDS and print any that match. */
cycle
Scards_Varying(Line, Return_Code);
if Return_Code > 0
then
exit;
end if;
if Grep_Match(Pattern, Line)
then
Sprint_Varying(Line);
end if ;
end cycle;
Message(Msg, "<i> match(es) found</>", Pattern.Match_Count);
end if;
Grep_Terminate(Pattern);
Message_Terminate(Msg);
end Main;
Compiling and running the program
The program can be compiled with $run *PLUS SCARDS=grep.plus SPUNCH=grep
Here are some examples of running the program.
Matching a wildcard expression:
# $run grep scards=hello.plus par=hello?world
# Execution begins 16:37:36
%Title := "Hello World Program";
Sprint_String("Hello, world!");
2 match(es) found
# Execution terminated 16:37:37 T=0.001
Matching against a large file (around 372kB)
# $run grep scards=*PLUS.SOURCELIB par=patmatch
# Execution begins 16:45:30
PATMATCH 30300
PATMATCH_PROCEDURE_TYPE 30400
/begin Patmatch
%Include(Patmatch_Procedure_Type);
procedure Patmatch is system Patmatch_Procedure_Type
/begin Patmatch_Procedure_Type
type Patmatch_Procedure_Type is
7 match(es) found
# Execution terminated 16:45:31 T=0.232
Matching across several files:
# $run grep scards=hello.plus+grep.plus par=title
# Execution begins 16:47:19
%Title := "Hello World Program";
%Title := "Find text in files";
2 match(es) found
# Execution terminated 16:47:19 T=0.006
Areas for improvement
By using the pattern routines provided by the MTS system, we have produced a basic grep program in 166 lines of code. There are, however, several ways we could improve it.
The program only supports case-insensitive matching. This is a by-product of the way the PAR command line is passed to the program in upper case. It would be nice to search multiple files, such as all files in a catalog, without having to use concatenation. It would also be good to show which file and line was matched.
One way to support this would be to use the PAR command line to pass a file specification, eg PLUS:?
, and use SCARDS
to pass in the text to be searched. The file specification could be expanded to a list of files, and by reading from SCARDS
we could supply a mixed case pattern expression.
Another area to improve would be the pattern matching. As you can see from the run of grep
on a large file, it’s not that fast. We also don’t provide as rich a pattern matching language as POSIX regexes. Looking at the source code for the pattern matching routines, they do a simple scan of the text - which is reasonable for the purpose they were designed for, to match filenames - but for a more complete implementation we’d want to look at writing our own regex library using something like the Boyer-Moore algorithm to improve the search time.
One last crazy idea. MTS has a PLUS compiler that produced PDP-11 object code, *PLUS11
. What if we compiled grep.plus
to run on the PDP_11, bringing grep back home? Unfortunately the pattern system routines like patbuild
aren’t available on the PDP-11, but it’s only a matter of implementation…
Update 28 June 2020: according to the comment from Jeff Ogden, it is possible to get access to the contents of the parameter field without case conversion by using the PARSTR subroutine documented on page 366.1 of Volume 3.
Any other suggestions for improvements? Please post in the comments.
Final thoughts on PLUS
I really enjoyed using PLUS. The language feels like C with a more Pascal-ly syntax, balancing power with expressiveness. The integration with MTS is extensive, and if you want to extend MTS further it would be an ideal choice.
Further information
Full source code for these programs can be found on github.
There’s an interesting post on a FreeBSD mailing list by the original author of GNU grep on why it is so fast.
The MTS File Editor pattern matching language is described from page 159 onwards in Volume 18 of the MTS documentation set.