
Working with UltraGram

    In this article we will examine some common problems that a parser developer can face and the ways to handle them using UltraGram.

 
Brief overview

    To create working parser code with UltraGram you need to do the following:
 
  1. Create a new grammar file
  2. Compile it and remove all design-time errors
  3. Open or create a number of test files and ensure that your grammar works
  4. Generate the parser source code from your grammar targeting one of the following programming languages: C++, C#, VB.NET, Java

    The UltraGram grammar file has the following sections: %pragma, precedence, %tokens and %production. The %pragma section lists various directives that affect the behaviour of the parser; they are generally related to error handling and conflict resolution (described in more detail below). The %tokens section defines the "terminal symbols" ( also known as "token types" ). Each symbol represents a class of syntactically equivalent tokens. You use the symbol in grammar rules to indicate that a token in that class is allowed. Each symbol declaration has the following format:

       'expression' [ identifier, [ 'alias' ]] [ %ignore ] ;

Here:

expression defines the rule by which the token in the input text will be recognized
identifier is a unique name of the symbol
alias is an alias of this symbol
the %ignore directive instructs the parser to skip the token

Examples:

    '[ \t\r\n\f\b]+'                %ignore;
    '[a-z_A-Z0-9][a-z_A-Z0-9]*'     Id,      'Identifier';
    '[0-9]+'                        Integer, 'Int';
    '([0-9]+\.[0-9]*)|(\.[0-9]+)'   Real,    'Real';
    '\"[^\"]*\"'                    String,  'Str';

You may refer to each symbol by name or by its alias enclosed in quotes.
    In the optional precedence section, operator precedence declarations allow you to specify when to shift and when to reduce. Use the %left, %right or %nonassoc declaration to declare a token and specify its precedence and associativity, all at once.
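
For illustration, precedence declarations might look like the sketch below. The semicolon-terminated form mirrors the %left declarations that appear later in this article; the exact operator tokens and where the declarations are placed in the grammar file are assumptions:

    %left  '+' '-';     // '+' and '-' are left-associative, lowest precedence
    %left  '*' '/';     // '*' and '/' bind tighter than '+' and '-'
    %right '=';         // '=' is right-associative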
    In the %production section grammar rules are defined. Each rule has the following format:
 
 
    name : production_items ;

where name is the nonterminal symbol that this rule describes and production_items is an optional list of terminal or nonterminal symbols related to this rule, separated by whitespace. Example:

    exp : exp '-' exp ;

This rule defines that two groupings of type exp, with a '-' token in between, can be combined into a larger grouping of type exp. Multiple rules for the same name can be written separately or can be joined with the vertical-bar character | as follows:

    name: production_items1 | production_items2... ;

Example:

    exp : exp '-' exp | exp '+' exp ;

If production_items in a rule is empty, it means that name can match the empty string. Here is how to define a comma-separated sequence of zero or more exp groupings:

    exp_opt : // empty rule
            | expseq ;
    expseq  : exp
            | expseq ',' exp ;

The entry point ( the main rule ) is defined by the name that follows the %production keyword; there must always be a rule with a name that corresponds to it (see the small sketch below). The UltraGram IDE allows you to compile and extensively test and debug your grammar. You can define a set of test files, run them individually or in batch mode, set breakpoints, render graphical representations of some errors, create a complete DFA graph, and more.
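
For example, continuing the exp_opt / expseq definitions above, the entry point could be declared as follows; the rule name Start is arbitrary and used here only for illustration:

    %production Start
    Start : exp_opt ;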
 
Let’s begin

    Let’s skip the detailed IDE description (just download it and take a look, it’s really simple) and focus on the parsing issues that are common for a general parser developer and the solutions that UltraGram offers. We will do this review through several examples. All examples are very simple and deliberately constructed to highlight common problems a developer may run into. Some knowledge of parsing theory and experience will be helpful.
 
Token-Token conflicts

    Let’s say you plan to create a parser that should deal with a file containing a list of employee names separated by semicolons, like this:

            John Smith ;
            Allan Brown ;
            Bret Voisen ;

A simple grammar for this case will look like the following:

// ---- optional pragma section --------------------------------------
%pragma
intok( on, off ); // allow inline token declarations
// ---- token declaration section ( terminal symbols ) ---------------
%tokens
'[ \n\t\r]+'             wspace, %ignore;
'[a-z_A-Z][a-z_A-Z]*'    FirstName;
'[a-z_A-Z][a-z_A-Z]*'    LastName;
// ---- production section -------------------------------------------
%production Start
Start    : ItemList;
ItemList : Item | ItemList Item ;
Item     : FirstName LastName ';' ;

This grammar works fine for our input file. Now we will complicate things a bit. Let’s say some of the names may contain additional info in front, so the test file might look like this:

            John Smith ;
            Allan Brown ;
            Doctor Bret Voisen ;

Now we have to modify our grammar. The most natural approach will result in the following:

// ---- optional pragma section --------------------------------------
%pragma
intok( on, off ); // allow inline token declarations
// ---- token declaration section ( terminal symbols ) ---------------
%tokens
'[ \n\t\r]+'             wspace, %ignore;
'Doctor'                 Info;
'[a-z_A-Z][a-z_A-Z]*'    FirstName;
'[a-z_A-Z][a-z_A-Z]*'    LastName;
// ---- production section -------------------------------------------
%production Start
Start    : ItemList;
ItemList : Item | ItemList Item ;
Item     : FirstName LastName ';'
         | Info FirstName LastName ';';

The grammar compiles nicely, but when you start to parse the test file it fails. This is the result of a Token-Token conflict. What happens exactly? When the parser reaches some state, it looks for a lookahead token from some set of tokens. If it finds one, it takes the appropriate action. Look at the third line of the input file:

            Doctor Bret Voisen ;

The first word ( "Doctor" ) matches the token definitions for Info and for FirstName. In UltraGram, tokens that are declared last have higher priority (other parsers may handle this differently). Accordingly, the word "Doctor" is treated as a FirstName token, after which come LastName and a semicolon in the rule:

                     Item : FirstName LastName ';' ;

But that is not what we want! Here we expect the rule:

             Item : Info FirstName LastName ';' ;

So, what are our options?
    The first option is to remove the Token-Token conflict by disallowing the declaration of conflicting tokens and raising an error when they are declared. Some parsers do exactly this. Is this correct? Well, in an ideal world - yes. In reality there are many cases when conflicting tokens are declared and will be declared. Other parsers allow any token declarations and do not detect token conflicts at all. As a result, there is a chance that your grammar will compile but will not work against some test files, and you will most likely get no explanation why.
    The second option is to change the priority of the tokens ( we need the token Info to have the highest priority ). If you realize that you have conflicting tokens, you may change the declaration order, for example like this:

'[a-z_A-Z][a-z_A-Z]*'    FirstName;
'[a-z_A-Z][a-z_A-Z]*'    LastName;
'Doctor'                 Info;

This will change the token priorities. But will it help? In our sample - no. You will run into the problem again, but at a different location. Frustrating, isn’t it? Especially when your grammar is much more complicated than this primitive sample. Now let’s see how this can be handled by UltraGram. First of all, UltraGram always detects token conflicts, reports them and may optionally stop at the conflicting locations. There are also several options to deal with the situation. One of them is to enable special processing by using the pragma "dkey( on | off )", which stands for "detect keywords". When this option is turned on, UltraGram analyzes token declarations and filters out declarations that look like keywords in a programming language ( for example, they consist of lower or upper case characters and digits, have a fixed length and contain no "wildcards" ). In our sample there is only one token declaration of this kind:

'Doctor'                 Info;

The grammar with the above-mentioned pragma will look like this:

// ---- optional pragma section --------------------------------------
%pragma
intok( on, off );    // allow inline token declarations
dkey( on );          // detect keywords mode is turned ON
 // ---- token declaration section ( terminal symbols ) --------------
%tokens
'[ \n\t\r]+'             wspace, %ignore;
'Doctor'                 Info;
'[a-z_A-Z][a-z_A-Z]*'    FirstName;
'[a-z_A-Z][a-z_A-Z]*'    LastName;
// ---- production section -------------------------------------------
%production Start
Start    : ItemList;
ItemList : Item | ItemList Item ;
Item     : FirstName LastName ';'
         | Info FirstName LastName ';';

We added only one line: dkey( on ); . Now when parsing starts, tokens are examined and the highest priority is given to the "keyword" tokens. They are searched first. If one of them is found, the appropriate rule is reduced. If none of them is found, the rest of the lookahead set is examined and the appropriate action is taken. The problem is fixed. This is a good, reliable solution for many cases.
    Now let’s examine some more complicated situations. Let’s say we need to parse the following text:

            John Smith ;
            Allan Brown ;
            Doctor Bret Voisen ;
            Engineer Rendy Adams ;

For this case our token Info should be more general. Let’s make it match any word. This will result in the following:

// ---- optional pragma section --------------------------------------
%pragma
intok( on, off );    // allow inline token declarations
 // ---- token declaration section ( terminal symbols ) --------------
%tokens
'[ \n\t\r]+'             wspace, %ignore;
'[a-z_A-Z][a-z_A-Z]*'    Info;
'[a-z_A-Z][a-z_A-Z]*'    FirstName;
'[a-z_A-Z][a-z_A-Z]*'    LastName;
// ---- production section -------------------------------------------
%production Start
Start    : ItemList;
ItemList : Item | ItemList Item ;
Item     : FirstName LastName ';'
         | Info FirstName LastName ';';

Obviously we don’t have "keywords" here, so the "dkey" pragma is useless. UltraGram offers another pragma, "rtc( on | off, <level> )" ( which stands for "resolve token conflicts" ). The value <level> can be in a range from 1 to 10. When this pragma is turned on and conflicting tokens are encountered, a special analysis process kicks in that allows the parser to select the correct token. The reliability of the process can be tuned by the <level> value. This value, to some extent, represents how far into the text the parser will look to verify whether the token is correct or not. This is a more powerful solution than using the dkey pragma, but it requires some extra processing. After this change the pragma section of our grammar will look like:

// ---- optional pragma section --------------------------------------
%pragma
intok( on, off );    // allow inline token declarations
rtc( on, 4 );       // resolve token conflicts mode is turned ON

Now the test file can be parsed successfully.

Shift-Reduce and Reduce-Reduce conflicts

    Let’s go back and see what happens when we try to eliminate conflicting tokens. Let's rewrite our grammar as follows:

// ---- optional pragma section --------------------------------------
%pragma
intok( on, off );    // allow inline token declarations
 // ---- token declaration section ( terminal symbols ) --------------
%tokens
'[ \n\t\r]+'             wspace, %ignore;
'[a-z_A-Z][a-z_A-Z]*'    Identifier;
// ---- production section -------------------------------------------
%production Start
Start     : ItemList;
ItemList  : Item | ItemList Item ;
Item      : FirstName LastName ';'
          | Info FirstName LastName ';';
FirstName : Identifier;
LastName  : Identifier;
Info      : Identifier;

As you can see, we now have only one token, Identifier. Everything else is built on it. Our change looks natural and simple. Compile the grammar ... and you get a Reduce-Reduce conflict!
    Shift-Reduce and Reduce-Reduce conflicts are common. They happen quite often and ideally they should be eliminated. But how? One way is to modify/rewrite your grammar. This can easily be done for our sample, but that is not the goal of this article. Modifying the grammar may help in many cases, but sometimes changing a grammar rule may produce a conflict message reported in a place that seems completely unrelated to your modifications. This can be very confusing. Fortunately, UltraGram gives you a clear graphical representation of the events driving the conflict. Click on the conflict message and you will get:
[Figure: fragment of the DFA graph illustrating the Reduce-Reduce conflict]
This is a small fragment of the DFA graph. Parsing starts at Node 0. From this point it is possible to find two identical paths to Node 5 (consuming the token Identifier). In Node 5, if the next lookahead token is Identifier, we can reduce two rules:

        Info : Identifier ;

and

        FirstName : Identifier ;

So which one should be reduced? We need only one. This is a Reduce-Reduce conflict (Shift-Reduce conflicts happen in a similar way). This graph is very helpful when grammars are more complicated and the chain of steps that leads to the conflict is longer. Anyway, which path should be chosen? And how? What if all your attempts to resolve the conflict by modifying the grammar have failed and the conflict still exists? For this case UltraGram offers a couple of options. The first one is to set the priority of conflicting rules using the %prec directive. This directive is an old one with some history; to keep it short we will skip the details. Just note that if, for example, you modify the grammar in the following way

// ---- optional pragma section --------------------------------------
%pragma
intok( on, off );    // allow inline token declarations
 // ---- token declaration section ( terminal symbols ) --------------
%tokens
'[ \n\t\r]+'             wspace, %ignore;
'[a-z_A-Z][a-z_A-Z]*'    Identifier;
%left Low;
%left High;
// ---- production section -------------------------------------------
%production Start
Start     : ItemList;
ItemList  : Item | ItemList Item ;
Item      : FirstName LastName ';'
          | Info FirstName LastName ';';
FirstName : Identifier %prec Low ;     // has low priority
LastName  : Identifier;
Info      : Identifier %prec High ;   // has high priority

the conflict will be gone. Here we simply informed the parser that the rule Info has higher priority than the rule FirstName. As a result, the conflict is resolved. In some cases this conflict resolution technique works fine. Unfortunately, for our sample it is not good enough: parsing will fail. Let’s take a look at the second solution. It is called GLR. Again, we will skip the theory. In short, the parser will choose a successful path, if any exists, from all the conflicting paths. Let’s look at the grammar:

// ---- optional pragma section --------------------------------------
%pragma
intok( on, off );    // allow inline token declarations
glr( on ) ;          // GLR mode is turned ON
 // ---- token declaration section ( terminal symbols ) --------------
%tokens
'[ \n\t\r]+'             wspace, %ignore;
'[a-z_A-Z][a-z_A-Z]*'    Identifier;
// ---- production section -------------------------------------------
%production Start
Start     : ItemList;
ItemList  : Item | ItemList Item ;
Item      : FirstName LastName ';'
          | Info FirstName LastName ';';
FirstName : Identifier ;
LastName  : Identifier ;
Info      : Identifier ;

GLR mode is enabled by the pragma "glr( on | off )". It is necessary to note that GLR requires additional processing. However, if you are not designing "real-time" software, your files are not very large and computers are fast ( nowadays ), in many cases this is an acceptable solution and you will not notice any significant difference in parsing time. Just give it a try to understand whether it is acceptable for you.

Parsing Algorithms

    It is necessary to mention that GLR can be applied not only to the LALR(1) parsing algorithm, but to the LR(1) algorithm as well; UltraGram supports LR(1) too. This concerns the way parsing tables are calculated. In general, LR(1) tables are more "robust" compared to LALR(1), but the size of the tables may be enormous, so the common practice is to use LALR. The table calculation algorithm can be selected by setting the pragma "alg( lalr | lr1 )". Also keep in mind that in some cases Shift-Reduce and Reduce-Reduce conflicts that exist in LALR(1) do not exist in LR(1). So when you run into this issue, try different techniques to resolve it.
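
For illustration, a pragma section that selects LR(1) table construction and keeps GLR enabled could look like the sketch below; the alg and glr pragma names are the ones mentioned in this article, while this particular combination of parameters is an assumption:

// ---- optional pragma section --------------------------------------
%pragma
intok( on, off );    // allow inline token declarations
alg( lr1 );          // build LR(1) parsing tables instead of LALR(1)
glr( on );           // GLR mode is turned ON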

Parsing unformatted text

    Sometimes it is necessary to parse only some special sections of a file, while you do not care about or do not know the format of the other sections. Let’s look at a sample file first:

            John Smith ;
            1 2 3 4 5 ;
            Doctor Bret Voisen ;
            Engineer Rendy Adams ;
            agent 007 ;

What do "1 2 3 4 5" or "agent 007" mean? They do not correspond to the grammar. Can the file contain other strange entries separated by semicolons? Maybe… So what are the options? One option is to figure out the exact format of the file and create a grammar for it. But sometimes this can be complicated or even impossible to do. UltraGram offers another option – the %text token. Think of it as a token with unknown (or "don’t care") content. It makes things simple. Here is our revised grammar:

// ---- optional pragma section --------------------------------------
%pragma
intok( on, off );    // allow inline token declarations
glr( on ) ;          // GLR mode is turned ON
 // ---- token declaration section ( terminal symbols ) --------------
%tokens
'[ \n\t\r]+'             wspace, %ignore;
'[a-z_A-Z][a-z_A-Z]*'    Identifier;
// ---- production section -------------------------------------------
%production Start
Start     : ItemList;
ItemList  : Item | ItemList Item ;
Item      : FirstName LastName ';'
          | Info FirstName LastName ';'
          | %text ';' ;
FirstName : Identifier ;
LastName  : Identifier ;
Info      : Identifier ;

After this change all unknown entries of the file will be treated as %text tokens and the parser will succeed.
 
Error handling

    Parsed files very often contain various errors. As a result, error handling is quite an important part of the parser core. UltraGram supports two error handling modes:

  • Typo errors mode
  • Generic errors mode

These modes are different and may be used together, complementing each other.

Typographical error handling

    This kind of error is often created during typing and includes duplication, omission, transposition or substitution of characters. To illustrate how such errors can be handled, let’s return to the following grammar:

// ---- optional pragma section --------------------------------------
%pragma
intok( on, off );    // allow inline token declarations
dkey( on );          // detect keywords mode is turned ON
 // ---- token declaration section ( terminal symbols ) --------------
%tokens
'[ \n\t\r]+'             wspace, %ignore;
'Doctor'                 Info;
'[a-z_A-Z][a-z_A-Z]*'    FirstName;
'[a-z_A-Z][a-z_A-Z]*'    LastName;
// ---- production section -------------------------------------------
%production Start
Start    : ItemList;
ItemList : Item | ItemList Item ;
Item     : FirstName LastName ';'
         | Info FirstName LastName ';';

This grammar perfectly parses the following file:

            John Smith ;
            Allan Brown ;
            Doctor Bret Voisen ;

Now what happens if we add some erroneous entries, so that our test file looks, for example, like this:

            John Smith ;
            Allan Brown ;
            Dotcor Bret Voisen ;
            Dotor David Duffy ;
            Dooctor Jerry Duncan ;
            Docto Jeff Springer ;

These are very common types of errors that mainly happen during manual typing. One way of handling them is to create long and complicated token declarations that cover all possible error cases. However, this approach is not practical, since there may be too many combinations to consider. Fortunately, UltraGram has a built-in mechanism to handle this type of error. Look at the following modified grammar:

// ---- optional pragma section --------------------------------------
%pragma
intok( on, off );    // allow inline token declarations
dkey( on );          // detect keywords mode is turned ON
 // ---- token declaration section ( terminal symbols ) --------------
%tokens
'[ \n\t\r]+'             wspace, %ignore;
'[a-zA-Z]'               MyCharSet;
'Doctor'                 Info, %chars MyCharSet;
'[a-z_A-Z][a-z_A-Z]*'    FirstName;
'[a-z_A-Z][a-z_A-Z]*'    LastName;
// ---- production section -------------------------------------------
%production Start
Start    : ItemList;
ItemList : Item | ItemList Item ;
Item     : FirstName LastName ';'
         | Info FirstName LastName ';';

The only difference from the previous grammar is in two lines:

'[a-zA-Z]'               MyCharSet;
'Doctor'                 Info, %chars MyCharSet;

The first line is a token declaration that defines a set of characters that might be accidentally injected into the input text or substituted for a character in it. The next line is the token declaration that has the char set MyCharSet attached to it. This tells the parser to enable error correction on this token and to use the char set to validate all unexpected characters. Note that this also enables error correction for missing characters and swapped characters ( transposition ). With this change our test file with errors will be parsed successfully. One defined char set may be used with multiple token declarations; it is also possible to have multiple char sets. It is also possible to enable this type of error correction for inline tokens ( those that do not have an explicit token declaration ). The last thing to mention is that this type of error correction is applicable only to tokens that have a well defined pattern for which corrections make sense. Token declarations like FirstName or LastName in the sample above cannot have this type of error correction, since it is impossible to predict the correct input.
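
For illustration, following the %chars syntax shown above, a single char set can be shared by several keyword-like tokens; the 'Engineer' / Title declaration below is hypothetical and added only to show the reuse:

'[a-zA-Z]'               MyCharSet;
'Doctor'                 Info,   %chars MyCharSet;   // typo correction enabled via MyCharSet
'Engineer'               Title,  %chars MyCharSet;   // hypothetical token sharing the same char set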

Generic errors

    If typo error handling is not enough and some other errors are expected in the input, the %error token can be used. In the current release of UltraGram the usage and behaviour of this token are similar to those of the %text token. However, there are some differences. First of all, the %error token has the lowest priority and its processing starts only after all other options have failed. The second and, in fact, the major difference is that the %error token does not consider %ignore tokens. This means that with the %error token the parser expects the worst-case scenario ( such as a damaged file ) and will not distinguish between normally formatted text and, say, comments residing in it. The parser will just try to skip one character after another ( assigning them to the %error token ) until parsing of the input can be successfully resumed. This behaviour allows errors to be handled in grammars that already utilize %text tokens (however, this kind of mix is rare and is used only in really complicated cases). Here is a fragment of the above grammar with generic error handling:

Item     : FirstName LastName ';'
         | Info FirstName LastName ';'
         | %error ';' ;


There are other useful features in UltraGram, including a full DFA graph with lookaheads, Unicode support, batch file parsing, and more. Give it a try.
UST Solutions Inc. 2016