PLUS - Grep

For the final post in this series, let’s write a real program in PLUS. Rather than working on a Rosetta Code problem, I’ll use this opportunity to implement a useful program, a version of Unix grep for MTS.

Grep

Grep is a program from Unix to find text in files. The first version was written in around 1974 in PDP-11 assembly language, and is available on pretty much every Unix derived system today. I don’t believe there is an equivalent for MTS - the MTS File Editor does have a rich pattern matching language but is a bit unwieldy to use from the command line.

The requirements for the program are:

Pattern matching

The first problem is how can we match patterns? There is no regular expression library like regex in Unix. Looking through Volume 3, on page 366.3 there is a set of routines that look promising:

Three system subroutines, PATBUILD, PATMATCH, and PATFREE, are available for implementing $FILESTATUS-like pattern-matching capabili- ties from user programs.

PATBUILD will build a pattern from an input string. The input string may be of any length and may specify a file name or a generic string. For example:

File names: “2CYB:data?", “?.doc”, “2ABC:?”

Generic strings: “Bill R?", “in the state of ?”

PATMATCH will compare an input string against the pattern built by PATBUILD. The input string may be of any length.

PATFREE will free the storage used to build the pattern.

Although this is more suited for file wildcard expansion than general text search, we’ll use this as a quick way to get the program running.

The PLUS Source Library Definition does not mention these routines, but looking through the actual contents of *PLUS.SOURCELIB there are definitions available. For example for patbuild:

%Include(Numeric_Types);
type Patbuild_Procedure_Type is
     procedure
     reference parameters Patstring is value unknown,
        Strlen is value Integer,
        Patworkaddr is pointer to unknown,
     optional reference parameters Flags is value bit(32),
        Ccid is value character(4),
        Chars is value character(2),
     end;

/* Values used in Flags word */
constant Patbuild_Filename is '00000000',
  Patbuild_Nonfilename is '00000001',
  Patbuild_Ignore_Case is '00000000',
  Patbuild_Dont_Ignore_Case is '00000004';

Driving the pattern matcher

Let’s define a data structure to be used by our grep program.

global Grep
   /* The pat subroutines need a working area. */
   type Pattern_Work_Area_Storage_Type is bit(32);
   type Pattern_Work_Area_Type is pointer to Pattern_Work_Area_Storage_Type;
   constant Pattern_Work_Area_Size is Byte_Size(Pattern_Work_Area_Storage_Type);

   constant Wildcard_Character is "?";

   type Pattern_Struct is
      record
         Work_Area is Pattern_Work_Area_Type,
         Error_Message is Varying_String,
         Match_Count is Integer
      end record;
end global Grep;

And also some procedures, using pass-by-reference for structs and strings to avoid copying.

procedure Grep_Initialize is
   procedure
   reference parameters
      Pattern is Pattern_Struct,
      Pattern_Expression is Varying_String
   result
      Grep_Initialize_Result is Boolean
   end;

procedure Grep_Match is
   procedure
   reference parameters
      Pattern is Pattern_Struct,
      Query_Text is Varying_String
   result
      Match_Result is Boolean
   end;

procedure Grep_Terminate is
   procedure
   reference parameter
      Pattern is Pattern_Struct
   end;

Grep_Initialize will take our data structure and a pattern expression input by the user. It will fill out the data structure and return a Boolean to show if it worked or not. If it failed, the Error_Message field in the struct will be set to an appropriate message.

Grep_Match will apply the pattern to a line of query text and return a Boolean stating whether there was a match or not. It will update the Match_Count member in the structure so we can keep track of how many results we get.

Finally, Grep_Terminate will free up the memory used by the system.

Grep_Initialize in detail

Let’s look at the implementation of Grep_Initialize.

definition Grep_Initialize;
   Pattern.Match_Count := 0;
   Pattern.Error_Message := "";
   Pattern.Work_Area := Getspace(0, Pattern_Work_Area_Size);

   if Length(Pattern_Expression) = 0
   then
      Pattern.Error_Message := "Missing pattern in PAR field";
      return with False;
   end if;

   /* Add leading and trailing wildcards to match anywhere
      on a line. */
   variable Match_In_Line is Varying_String;
   Match_In_Line := Wildcard_Character || Pattern_Expression ||
                    Wildcard_Character;

   variable Len is Integer;
   variable Text_Ptr is pointer to Fixed_String;
   Get_String_Text_And_Length(Match_In_Line, Text_Ptr, Len);

   variable Return_Code is Return_Code_Type;
   Patbuild(Text_Ptr@, Len, Pattern.Work_Area, Pattern_Flags,
            return code Return_Code);

   /* Any error code greater than this is unrecoveable. */
   constant Patbuild_Error_Code_Threshold is 8;
   if Return_Code > Patbuild_Error_Code_Threshold
   then
      Pattern.Error_Message := "Error constructing pattern";
      return with False;
   end if;

   return with True;
end Grep_Initialize;

It first starts by initializing the Pattern structure we passed in. The work area is a piece of memory needed by the system pattern routines so this is allocated. It then checks the pattern expression provided to see if it is not empty. If OK, it adds the wildcard ? to the start and end of the expression, so we can match text anywhere on a line.

We are nearly ready to call Patbuild to initialize the pattern. However, our pattern expression is a Varying_String and Patbuild requires a location in memory for the underlying string and a length. To get this, we define a macro Get_String_Text_And_Length that does some type casting to get access to the members of Varying_String:

macro Get_String_Text_And_Length

   /* Given a Varying_String, get the address of the underlying
      Fixed_String and copy the length to an Integer variable.
      This is needed for the calls to system subrountines
      like Patbuild below. */

   parameters are V_String, Text_Ptr, Len;
   equate V_String_Struct to V_String as Varying_String_Structure_Type;
   equate Text to V_String_Struct.Varying_String_Text as Fixed_String;
   Text_Ptr := Address(Text);
   Len := Length(V_String);
end macro;

The macro encapsulates the details of accessing these members and allows us to use this in other parts of the program, but avoids the cost of a procedure call as the code will be inlined to the current function.

We can now call Patbuild using the return code Return_Code parameter to get the system routine result code. Volume 3 states this can fail under some cases, so we check this code and return an appropriate Boolean.

Grep_Match and Grep_Terminate

The procedure to do the actual matching is then simple:

definition Grep_Match;
   variable Len is Integer;
   variable Text_Ptr is pointer to Fixed_String;
   Get_String_Text_And_Length(Query_Text, Text_Ptr, Len);

   variable Return_Code is Return_Code_Type;
   Patmatch(Text_Ptr@, Len, Pattern.Work_Area, return code Return_Code);

   if Return_Code = 0
   then
      Pattern.Match_Count +:= 1;
      return with True;
   end if;
   return with False;
end Grep_Match;

Grep_Terminate just calls Patfree.

Calling the Grep routines

The Main procedure can now be constructed by

definition Main;
   variable Msg is pointer to Stream_Type;
   variable Pattern is Pattern_Struct;
   variable Pattern_Expression is Varying_String;
   variable Line is Varying_String;
   variable Return_Code is Return_Code_Type;

   Msg := Message_Initialize();

   Pattern_Expression := Trim(Par);
   if not Grep_Initialize(Pattern, Pattern_Expression)
   then
      Message(Msg, "<v></>", Pattern.Error_Message);
   else
      /* Read lines from SCARDS and print any that match. */
      cycle
         Scards_Varying(Line, Return_Code);
         if Return_Code > 0
         then
            exit;
         end if;
         if Grep_Match(Pattern, Line)
         then
            Sprint_Varying(Line);
         end if ;
      end cycle;
      Message(Msg, "<i> match(es) found</>", Pattern.Match_Count);
   end if;

   Grep_Terminate(Pattern);
   Message_Terminate(Msg);
end Main;

Compiling and running the program

The program can be compiled with $run *PLUS SCARDS=grep.plus SPUNCH=grep

Here are some examples of running the program.

Matching a wildcard expression:

 # $run grep scards=hello.plus par=hello?world
 # Execution begins   16:37:36
   %Title := "Hello World Program";
      Sprint_String("Hello, world!");
   2 match(es) found
 # Execution terminated   16:37:37  T=0.001

Matching against a large file (around 372kB)

# $run grep scards=*PLUS.SOURCELIB par=patmatch
# Execution begins   16:45:30
  PATMATCH    30300
  PATMATCH_PROCEDURE_TYPE         30400
  /begin Patmatch
  %Include(Patmatch_Procedure_Type);
  procedure Patmatch is system Patmatch_Procedure_Type
  /begin Patmatch_Procedure_Type
  type Patmatch_Procedure_Type is
  7 match(es) found
# Execution terminated   16:45:31  T=0.232

Matching across several files:

# $run grep scards=hello.plus+grep.plus par=title
# Execution begins   16:47:19
  %Title := "Hello World Program";
  %Title := "Find text in files";
  2 match(es) found
# Execution terminated   16:47:19  T=0.006

Areas for improvement

By using the pattern routines provided by the MTS system, we have produced a basic grep program in 166 lines of code. There are, however, several ways we could improve it.

The program only supports case-insensitive matching. This is a by-product of the way the PAR command line is passed to the program in upper case. It would be nice to search multiple files, such as all files in a catalog, without having to use concatenation. It would also be good to show which file and line was matched.

One way to support this would be to use the PAR command line to pass a file specification, eg PLUS:?, and use SCARDS to pass in the text to be searched. The file specification could be expanded to a list of files, and by reading from SCARDS we could supply a mixed case pattern expression.

Another area to improve would be the pattern matching. As you can see from the run of grep on a large file, it’s not that fast. We also don’t provide as rich a pattern matching language as POSIX regexes. Looking at the source code for the pattern matching routines, they do a simple scan of the text - which is reasonable for the purpose they were designed for, to match filenames - but for a more complete implementation we’d want to look at writing our own regex library using something like the Boyer-Moore algorithm to improve the search time.

One last crazy idea. MTS has a PLUS compiler that produced PDP-11 object code, *PLUS11. What if we compiled grep.plus to run on the PDP_11, bringing grep back home? Unfortunately the pattern system routines like patbuild aren’t available on the PDP-11, but it’s only a matter of implementation…

Any other suggestions for improvements? Please post in the comments.

Final thoughts on PLUS

I really enjoyed using PLUS. The language feels like C with a more Pascal-ly syntax, balancing power with expressiveness. The integration with MTS is extensive, and if you want to extend MTS further it would be an ideal choice.

Further information

Full source code for these programs can be found on github.

There’s an interesting post on a FreeBSD mailing list by the original author of GNU grep on why it is so fast.

The MTS File Editor pattern matching language is described from page 159 onwards in Volume 18 of the MTS documentation set.

Comments

comments powered by Disqus