Chapter 4
Syntax Analysis
Introduction
• The syntax of programming language constructs can be specified by context-free
grammars or BNF (Backus-Naur Form) notation. Grammars offer significant benefits
for both language designers and compiler writers.
• A grammar gives a precise, yet easy-to-understand, syntactic specification of a
programming language.
• From certain classes of grammars, we can construct automatically an efficient
parser that determines the syntactic structure of a source program.
• The structure imparted to a language by a properly designed grammar is useful for
translating source programs into correct object code and for detecting errors.
• A grammar allows a language to be evolved or developed iteratively, by adding new
constructs to perform new tasks.
4.1.1 The Role of the Parser
• In our compiler model, the parser obtains a string of tokens from the lexical analyzer, as
shown in Fig. 4.1, and verifies that the string of token names can be generated by the
grammar for the source language.
4.1.1 The Role of the Parser
• There are three general types of parsers for grammars: universal, top-down, and
bottom-up.
• Universal parsing methods such as the Cocke-Younger-Kasami algorithm and
Earley's algorithm can parse any grammar. These general methods are, however,
too inefficient to use in production compilers.
• The methods commonly used in compilers can be classified as being either top-
down or bottom-up. As implied by their names, top-down methods build parse trees
from the top (root) to the bottom (leaves), while bottom-up methods start from the
leaves and work their way up to the root. In either case, the input to the parser is
scanned from left to right, one symbol at a time.
• Parsers implemented by hand often use LL grammars. Parsers for the larger class of
LR grammars are usually constructed using automated tools.
4.1.2 Representative Grammars
• Constructs that begin with keywords like while or int are relatively easy to parse,
because the keyword guides the choice of the grammar production that must be
applied to match the input. We therefore concentrate on expressions, which present
more of a challenge, because of the associativity and precedence of operators.
• Associativity and precedence are captured in the following grammar. E represents
expressions consisting of terms separated by + signs, T represents terms consisting
of factors separated by * signs, and F represents factors that can be either
parenthesized expressions or identifiers:
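The grammar described here, expression grammar (4.1), is:

E → E + T | T
T → T * F | F
F → ( E ) | id        (4.1)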
4.1.2 Representative Grammars
• Expression grammar (4.1) belongs to the class of LR grammars that are suitable for
bottom-up parsing. This grammar can be adapted to handle additional operators and
additional levels of precedence. However, it cannot be used for top-down parsing
because it is left recursive. The following non-left-recursive variant of the expression
grammar (4.1) will be used for top-down parsing:
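E → T E'
E' → + T E' | ε
T → F T'
T' → * F T' | ε
F → ( E ) | id        (4.2)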
4.1.3 Syntax Error Handling
The remainder of this section considers the nature of syntactic errors and general
strategies for error recovery. Two of these strategies, panic-mode and phrase-level
recovery, are discussed in more detail below.
• Most programming language specifications do not describe how a compiler should
respond to errors; error handling is left to the compiler designer. Planning the error
handling right from the start can both simplify the structure of a compiler and improve
its handling of errors.
• Common programming errors can occur at many different levels.
• Lexical errors include misspellings of identifiers, keywords, or operators (e.g., the use
of an identifier elipsesize instead of ellipsesize) and missing quotes around text
intended as a string.
4.1.3 Syntax Error Handling
• Syntactic errors include misplaced semicolons or extra or missing braces; that is, "{"
or "}". As another example, in C or Java, the appearance of a case statement without
an enclosing switch is a syntactic error (however, this situation is usually allowed by
the parser and caught later in the processing, as the compiler attempts to generate
code).
• Semantic errors include type mismatches between operators and operands. An
example is a return statement in a Java method with result type void.
• Logical errors can be anything from incorrect reasoning on the part of the
programmer to the use in a C program of the assignment operator = instead of the
comparison operator ==. The program containing = may be well formed; however, it
may not reflect the programmer's intent.
4.1.3 Syntax Error Handling
• The precision of parsing methods allows syntactic errors to be detected very
efficiently. Several parsing methods, such as the LL and LR methods, detect an
error as soon as possible; that is, when the stream of tokens from the lexical
analyzer cannot be parsed further according to the grammar for the language.
• Another reason for emphasizing error recovery during parsing is that many errors
appear syntactic, whatever their cause, and are exposed when parsing cannot
continue. A few semantic errors, such as type mismatches, can also be detected
efficiently; however, accurate detection of semantic and logical errors at compile
time is in general a difficult task.
4.1.3 Syntax Error Handling
The error handler in a parser has goals that are simple to state but challenging to
realize:
1. Report the presence of errors clearly and accurately.
2. Recover from each error quickly enough to detect subsequent errors.
3. Add minimal overhead to the processing of correct programs.
Fortunately, common errors are simple ones, and a relatively straightforward error-
handling mechanism often suffices.
How should an error handler report the presence of an error?
At the very least, it must report the place in the source program where an error is
detected, because there is a good chance that the actual error occurred within the
previous
few tokens. A common strategy is to print the offending line with a pointer to the
position at which an error is detected.
4.1.4 Error-Recovery Strategies
• Once an error is detected, how should the parser recover? Although no strategy
has proven itself universally acceptable, a few methods have broad applicability. The
simplest approach is for the parser to quit with an informative error message when it
detects the first error.
• The balance of this section is devoted to the following recovery strategies:
panic-mode, phrase-level, error-productions, and global-correction.
1. Panic-Mode Recovery
- the parser discards input symbols one at a time until one of a designated set of
synchronizing tokens (e.g., delimiters such as ; , or }) is found,
- often skips a considerable amount of input without checking it for additional errors,
- has the advantage of simplicity,
- is guaranteed not to go into an infinite loop (see the sketch below).
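A minimal sketch of panic-mode recovery in Python, assuming tokens arrive as a list of
strings and the synchronizing set is chosen by the compiler designer (the function name
and its defaults are illustrative, not from the slides):

def panic_mode_recover(tokens, pos, sync_set=frozenset({';', ',', '}'})):
    """Discard input symbols one at a time until a synchronizing token is found.

    Returns the position at which the parser should resume. Since pos only
    ever increases, the loop is guaranteed not to run forever."""
    while pos < len(tokens) and tokens[pos] not in sync_set:
        pos += 1  # skip one symbol without checking it for further errors
    return pos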
4.1.4 Error-Recovery Strategies
2. Phrase-Level Recovery
- a parser may perform local correction on the remaining input
- it may replace a prefix of the remaining input by some string that allows the parser to
continue.
- A typical local correction is to replace a comma by a semicolon, delete an
extraneous semicolon, or insert a missing semicolon.
- The choice of the local correction is left to the compiler designer.
- used in several error-repairing compilers, as it can correct any input string.
- Its major drawback is the difficulty it has in coping with situations in which the actual
error has occurred before the point of detection.
3. Error Productions
- the grammar is augmented with error productions that anticipate common errors.
- A parser constructed from a grammar augmented by these error productions detects
the anticipated errors when an error production is used during parsing.
- The parser can then generate appropriate error diagnostics about the erroneous
construct that has been recognized in the input.
4.1.4 Error-Recovery Strategies
4. Global Correction
- There are algorithms for choosing a minimal sequence of changes
to obtain a globally least-cost correction.
- Given an incorrect input string x and grammar G, these algorithms
will find a parse tree for a related string y
- the number of insertions, deletions, and changes of tokens
required to transform x into y is as small as possible.
- too costly to implement in terms of time and space, so these
techniques are currently only of theoretical interest.
4.2 Context-Free Grammars
• Grammars were introduced in Section 2.2 to systematically describe the syntax
of programming language constructs like expressions and statements. Using
a syntactic variable stmt to denote statements and variable expr to denote
expressions, the production
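stmt → if ( expr ) stmt else stmt        (4.4)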
• specifies the structure of this form of conditional statement. Other productions
then define precisely what an expr is and what else a stmt can be.
4.2.1 The Formal Definition of a Context-Free
Grammar
From Section 2.2, a context-free grammar (grammar for short) consists of terminals,
nonterminals, a start symbol, and productions.
1. Terminals are the basic symbols from which strings are formed. The term "token
name" is a synonym for "terminal", and frequently we will use the word "token" for
terminal when it is clear that we are talking about just the token name. We assume that
the terminals are the first components of the tokens output by the lexical analyzer. In
(4.4), the terminals are
the keywords if and else and the symbols "(" and ")".
2. Nonterminals are syntactic variables that denote sets of strings. In (4.4), stmt and
expr are nonterminals. The sets of strings denoted by nonterminals help define the
language generated by the grammar. Nonterminals impose a hierarchical structure on
the language that is key to syntax analysis and translation.
3. In a grammar, one nonterminal is distinguished as the start symbol, and the set
of strings it denotes is the language generated by the grammar. Conventionally, the
productions for the start symbol are listed first.
4. The productions of a grammar specify the manner in which the terminals and
nonterminals can be combined to form strings. Each production consists of:
(a) A nonterminal called the head or left side of the production; this production
defines some of the strings denoted by the head.
(b) The symbol → (sometimes ::= has been used in place of the arrow).
(c) A body or right side consisting of zero or more terminals and nonterminals. The
components of the body describe one way in which strings of the nonterminal at the
head can be constructed.
4.2.1 The Formal Definition of a Context-Free Grammar
4.2.2 Notational Conventions
• To avoid always having to state that "these are the terminals," "these are the nonterminals,"
and so on, the following notational conventions for grammars will be used throughout the
remainder of this book.
1. These symbols are terminals:
(a) Lowercase letters early in the alphabet, such as a, b, c.
(b) Operator symbols such as +, *, and so on.
(c) Punctuation symbols such as parentheses, comma, and so on.
(d) The digits 0, 1, ..., 9.
(e) Boldface strings such as id or if, each of which represents a single terminal symbol.
2. These symbols are nonterminals:
(a) Uppercase letters early in the alphabet, such as A, B, C.
(b) The letter S, which, when it appears, is usually the start symbol.
(c) Lowercase, italic names such as expr or stmt.
(d) When discussing programming constructs, uppercase letters may be used to
represent nonterminals for the constructs. For example, nonterminals for expressions,
terms, and factors are often represented by E, T, and F, respectively.
4.2.2 Notational Conventions
• 3. Uppercase letters late in the alphabet, such as X, Y, Z, represent grammar
symbols; that is, either nonterminals or terminals.
• 4. Lowercase letters late in the alphabet, chiefly u, v, ..., z, represent (possibly
empty) strings of terminals.
• 5. Lowercase Greek letters, α, β, γ for example, represent (possibly empty)
strings of grammar symbols. Thus, a generic production can be written
as A → α, where A is the head and α the body.
• 6. A set of productions A → α1, A → α2, ..., A → αk with a common head
A (call them A-productions) may be written A → α1 | α2 | ... | αk. Call
α1, α2, ..., αk the alternatives for A.
• 7. Unless stated otherwise, the head of the first production is the start symbol.
4.2.2 Notational Conventions
4.2.3 Derivations
• The construction of a parse tree can be made precise by taking a derivational view,
in which productions are treated as rewriting rules.
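For instance, with the ambiguous expression grammar E → E + E | E * E | - E | ( E ) | id,
the sentence -(id + id) has the leftmost derivation

E ⇒ -E ⇒ -(E) ⇒ -(E + E) ⇒ -(id + E) ⇒ -(id + id)

and the rightmost derivation

E ⇒ -E ⇒ -(E) ⇒ -(E + E) ⇒ -(E + id) ⇒ -(id + id)

(these are the derivations (4.8) and (4.9) referred to in the next subsection).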
4.2.4 Parse Trees and Derivations
• A parse tree is a graphical representation of a derivation that filters out the order in which productions
are applied to replace nonterminals.
• For example, the parse tree for -(id + id) in Fig. 4.3 results from derivation (4.8) as well as
derivation (4.9).
The leaves of a parse tree are labeled by nonterminals or terminals and, read from left to right,
constitute a sentential form, called the yield or frontier of the tree.
4.2.5 Ambiguity
• From Section 2.2.4, a grammar that produces more than one parse tree for some sentence is said to
be ambiguous. Put another way, an ambiguous grammar is one that produces more than one leftmost
derivation or more than one rightmost derivation for the same sentence.
4.2.5 Ambiguity
• For most parsers, it is desirable that the grammar be made unambiguous,
for if it is not, we cannot uniquely determine which parse tree to select for a
sentence. In other cases, it is convenient to use carefully chosen
ambiguous grammars, together with disambiguating rules that "throw
away" undesirable parse trees, leaving only one tree for each sentence.
4.2.7 Context-Free Grammars Versus Regular Expressions
4.2.7 Context-Free Grammars Versus Regular
Expressions
• We can mechanically construct a grammar that recognizes the same
language as a nondeterministic finite automaton (NFA). The grammar
in question was constructed from the NFA in Fig. 3.24 using the
following construction:
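1. For each state i of the NFA, create a nonterminal Ai.
2. If state i has a transition to state j on input a, add the production Ai → a Aj.
If state i goes to state j on input ε, add the production Ai → Aj.
3. If i is an accepting state, add Ai → ε.
4. If i is the start state, make Ai the start symbol of the grammar.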
4.2.7 Context-Free Grammars Versus Regular
Expressions
• Colloquially, we say that "finite automata cannot count,"
meaning that a finite automaton cannot accept a language like
{a^n b^n | n ≥ 1} that would require it to keep count of the
number of a's before it sees the b's. Likewise, "a grammar can
count two items but not three," as we shall see when we consider
non-context-free language constructs in Section 4.3.5.
4.3.1 Lexical Versus Syntactic Analysis
• Everything that can be described by a regular expression can also be described by a grammar. We
may therefore reasonably ask: "Why use regular expressions to define the lexical syntax of a
language?"
There are several reasons.
1. Separating the syntactic structure of a language into lexical and nonlexical parts provides a
convenient way of modularizing the front end of a compiler into two manageable-sized
components.
2. The lexical rules of a language are frequently quite simple, and to describe them we do not
need a notation as powerful as grammars.
3. Regular expressions generally provide a more concise and easier-to-understand notation
for tokens than grammars.
4. More efficient lexical analyzers can be constructed automatically from regular expressions
than from arbitrary grammars.
• There are no firm guidelines as to what to put into the lexical rules, as opposed to the syntactic
rules. Regular expressions are most useful for describing the structure of constructs such as
identifiers, constants, keywords, and white space. Grammars, on the other hand, are most useful for
describing nested structures such as balanced parentheses, matching begin-end's, and
corresponding if-then-else's.
4.3.2 Eliminating Ambiguity
• Sometimes an ambiguous grammar can be rewritten to eliminate the ambiguity. As an example, we shall
eliminate the ambiguity from the following "dangling else" grammar:
• In all programming languages with conditional statements of this form, the first parse tree is preferred.
The general rule is, "Match each else with the closest unmatched then." This disambiguating rule can
theoretically be incorporated directly into a grammar, but in practice it is rarely built into the productions.
4.3.3 Elimination of Left Recursion
• The nonterminal A generates the same strings as before but is no longer left recursive. This
procedure eliminates all left recursion from the A and A' productions, but it does not eliminate left
recursion involving derivations of two or more steps.
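The transformation in question: immediate left recursion of the form A → Aα | β is
replaced by

A → βA'
A' → αA' | ε

For example, E → E + T | T becomes E → T E' together with E' → + T E' | ε, which is
exactly how grammar (4.2) was obtained from grammar (4.1).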
4.3.4 Left Factoring
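• When two or more alternatives for a nonterminal share a common prefix, the choice
between them can be deferred by left factoring: A → αβ1 | αβ2 is rewritten as

A → αA'
A' → β1 | β2

so that a predictive parser need not decide between the alternatives until it has seen
past α.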
Exercises for Section 4.3
Top-Down Parsing
• Top-down parsing processes the input string provided by the lexical analyzer.
• Top-down parsing first creates the root node of the parse tree and then continues
by creating its child nodes, down to the leaves.
• The process of constructing the parse tree starting from the root and going down to the
leaves is top-down parsing.
• A top-down parser is constructed from a grammar that is free from ambiguity and left
recursion.
• Top-down parsers use leftmost derivations to construct a parse tree.
• The grammar must not contain common prefixes (it must be left-factored).
• Top-down parsers that need no backtracking are also called predictive parsers.
Example
Recursive-Descent Parsing
• Recursive descent is a top-down parsing technique that constructs the
parse tree from the top while the input is read from left to right. It uses
a procedure for every nonterminal. This parsing
technique recursively parses the input to build a parse tree, which may or
may not require back-tracking.
• A form of recursive-descent parsing that does not require any back-
tracking is known as predictive parsing (see the sketch below).
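A minimal predictive recursive-descent sketch in Python for the non-left-recursive
expression grammar (4.2), assuming the input is a list of tokens ending in '$' (the
class and token names are illustrative, not from the slides):

# One procedure per nonterminal of:
#   E -> T E'     E' -> + T E' | ε
#   T -> F T'     T' -> * F T' | ε
#   F -> ( E ) | id
class Parser:
    def __init__(self, tokens):
        self.tokens = tokens   # e.g. ['id', '+', 'id', '*', 'id', '$']
        self.pos = 0

    def lookahead(self):
        return self.tokens[self.pos]

    def match(self, t):        # consume one terminal or report an error
        if self.lookahead() != t:
            raise SyntaxError(f"expected {t}, found {self.lookahead()}")
        self.pos += 1

    def E(self):
        self.T(); self.Eprime()

    def Eprime(self):
        if self.lookahead() == '+':        # E' -> + T E'
            self.match('+'); self.T(); self.Eprime()
        # otherwise E' -> ε: consume nothing

    def T(self):
        self.F(); self.Tprime()

    def Tprime(self):
        if self.lookahead() == '*':        # T' -> * F T'
            self.match('*'); self.F(); self.Tprime()

    def F(self):
        if self.lookahead() == '(':        # F -> ( E )
            self.match('('); self.E(); self.match(')')
        else:                              # F -> id
            self.match('id')

    def parse(self):
        self.E(); self.match('$')          # accept only if all input is consumed

Parser(['id', '+', 'id', '*', 'id', '$']).parse() returns normally for a valid
sentence. Because each alternative is selected by a single lookahead token, no
backtracking is ever needed.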
Back-tracking
• General recursive-descent may require backtracking; that is, it may
require repeated scans over the input.
• Top- down parsers start from the root node (start symbol) and match the
input string against the production rules to replace them (if matched).
Example of back-tracking
FIRST and FOLLOW
• The construction of both top-down and bottom-up parsers is supported by two
functions, FIRST and FOLLOW, associated with a grammar G.
• FIRST and FOLLOW are two functions related to a grammar that help us fill in the
entries of a predictive parsing table (the M-table).
Computation of FIRST
• FIRST(α) is defined as the set of
terminal symbols that can appear
first in some string derived from α;
if α can derive the empty string,
ε is also in FIRST(α).
Computation of FOLLOW
• FOLLOW(A) is defined as the set of
terminal symbols that can occur
immediately to the right of A in
some sentential form.
Rules of FOLLOW
• If S is the start symbol, then $ is
in FOLLOW(S).
• If there is a production A → αBβ,
then everything in FIRST(β) except
ε is in FOLLOW(B).
• If there is a production A → αB, or
a production A → αBβ where
FIRST(β) contains ε, then everything
in FOLLOW(A) is in FOLLOW(B).
Rules of FIRST
• A terminal c is in FIRST(α) if and
only if α ⇒* cβ for some string β
of grammar symbols.
• If X → Y1 Y2 ... Yk is a production and
Y1 is a nonterminal that does not
derive ε (i.e., FIRST(Y1) does not
contain ε), then FIRST(X) =
FIRST(Y1 Y2 ... Yk) = FIRST(Y1).
• If X → ε is a production, then add
ε to FIRST(X).
Example
• Calculate the FIRST and FOLLOW functions
for the given grammar:
S → (L) | a
L → SL'
L' → ,SL' | ε
• FIRST functions:
FIRST(S) = { ( , a }
FIRST(L) = FIRST(S) = { ( , a }
FIRST(L') = { ',' , ε }
• FOLLOW functions:
FOLLOW(S) = { $ } ∪ ( FIRST(L') – {ε} ) ∪
FOLLOW(L) ∪ FOLLOW(L') = { $ , ',' , ) }
FOLLOW(L) = { ) }
FOLLOW(L') = FOLLOW(L) = { ) }
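A compact Python sketch of the fixed-point computation of FIRST and FOLLOW, with
the grammar given as a dict from nonterminals to lists of productions and 'eps'
standing for ε (the representation and names are illustrative, not from the slides):

from collections import defaultdict

def first_follow(grammar, start):
    """Terminals are any symbols that are not keys of `grammar`."""
    first, follow = defaultdict(set), defaultdict(set)
    follow[start].add('$')                       # rule: $ is in FOLLOW(start)

    def first_of(seq):                           # FIRST of a string of symbols
        out = set()
        for sym in seq:
            s = first[sym] if sym in grammar else {sym}
            out |= s - {'eps'}
            if 'eps' not in s:
                return out
        out.add('eps')                           # every symbol can derive ε
        return out

    changed = True
    while changed:                               # iterate to a fixed point
        changed = False
        for A, prods in grammar.items():
            for prod in prods:
                f = first_of(prod)               # FIRST rule for A -> prod
                if not f <= first[A]:
                    first[A] |= f; changed = True
                for i, B in enumerate(prod):     # FOLLOW rules
                    if B not in grammar:
                        continue
                    rest = first_of(prod[i + 1:]) if i + 1 < len(prod) else {'eps'}
                    if not rest - {'eps'} <= follow[B]:
                        follow[B] |= rest - {'eps'}; changed = True
                    if 'eps' in rest and not follow[A] <= follow[B]:
                        follow[B] |= follow[A]; changed = True
    return first, follow

For the grammar above, g = {'S': [('(', 'L', ')'), ('a',)], 'L': [('S', "L'")],
"L'": [(',', 'S', "L'"), ('eps',)]} yields first['S'] == {'(', 'a'} and
follow["L'"] == {')'}, matching the hand computation.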
LL(1) Grammars
• A context-free grammar G = (VT, VN, S, P) whose parsing table has no multiple entries is
said to be LL(1).
• The first L stands for scanning the input from left to right,
• The second L stands for producing a leftmost derivation,
• The 1 stands for using one input symbol of lookahead at each step to make parsing
action decision.
• A language is said to be LL(1) if it can be generated by an LL(1) grammar.
The conditions to check first are as follows:
1. The grammar must be free from left recursion.
2. The grammar must not be ambiguous.
Algorithm to construct LL(1) Parsing Table:
Step 1: First check all the important conditions mentioned above and go to
step 2.
Step 2: Calculate First() and Follow() for all non-terminals.
Step 3: For each production A → α (read: A derives α):
Find FIRST(α) and, for each terminal in FIRST(α), make entry A → α in the table.
If FIRST(α) contains ε, then
find FOLLOW(A) and, for each terminal in FOLLOW(A),
make entry A → ε in the table.
If FIRST(α) contains ε and FOLLOW(A) contains $,
then make entry A → ε in the table for $.
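The table-filling step is a direct transcription of Step 3. This Python sketch reuses
the first_follow function from the earlier FIRST/FOLLOW example and the same
illustrative grammar representation:

def ll1_table(grammar, start):
    first, follow = first_follow(grammar, start)

    def first_of(seq):                      # FIRST of a production body
        out = set()
        for sym in seq:
            s = first[sym] if sym in grammar else {sym}
            out |= s - {'eps'}
            if 'eps' not in s:
                return out
        out.add('eps')
        return out

    table = {}
    for A, prods in grammar.items():
        for prod in prods:
            f = first_of(prod)
            for a in f - {'eps'}:           # entry for each terminal in FIRST(α)
                assert (A, a) not in table, "multiple entries: not LL(1)"
                table[(A, a)] = prod
            if 'eps' in f:                  # entries for FOLLOW(A), including $
                for b in follow[A]:
                    assert (A, b) not in table, "multiple entries: not LL(1)"
                    table[(A, b)] = prod
    return table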
Example: Consider the Grammar:
E --> TE'
E' --> +TE' | ε
T --> FT'
T' --> *FT' | ε
F --> id | (E)
Step 1: The grammar satisfies all
properties in step 1.
Step 2: Calculate FIRST() and FOLLOW():

Production       FIRST        FOLLOW
E → TE'          { id, ( }    { $, ) }
E' → +TE' | ε    { +, ε }     { $, ) }
T → FT'          { id, ( }    { +, $, ) }
T' → *FT' | ε    { *, ε }     { +, $, ) }
F → id | (E)     { id, ( }    { *, +, $, ) }
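Step 3 then fills in the LL(1) parsing table (blank entries signal errors):

         id         +            *            (          )         $
E      E → TE'                             E → TE'
E'               E' → +TE'                            E' → ε    E' → ε
T      T → FT'                             T → FT'
T'               T' → ε       T' → *FT'               T' → ε    T' → ε
F      F → id                              F → (E)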
Non recursive Predictive Parsing
• Predictive parsing can be performed using a pushdown stack, avoiding
recursive calls.
• Initially the stack holds just the start symbol of the grammar.
• At each step a symbol X is popped from the stack:
• if X is a terminal symbol, then it is matched with the lookahead and the lookahead
is advanced;
• if X is a nonterminal, then using the lookahead and the parsing table, a production is
chosen and its right-hand side is pushed onto the stack (in reverse order).
• This process goes on until the stack and the input string become empty. It
is useful to have an end-of-stack marker and an end-of-input marker; we denote
them both by $.
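A sketch of the driver loop in Python, using a table of the shape produced by the
ll1_table sketch above (again illustrative, not from the slides):

def predictive_parse(tokens, table, grammar, start):
    """tokens must end with '$'; table maps (nonterminal, terminal) -> body."""
    stack = ['$', start]                 # top of stack is the end of the list
    pos = 0
    while stack:
        X = stack.pop()
        a = tokens[pos]
        if X == 'eps':
            continue                     # ε contributes nothing to the input
        if X not in grammar:             # X is a terminal or the $ marker
            if X != a:
                raise SyntaxError(f"expected {X}, found {a}")
            pos += 1                     # match X and advance the lookahead
        else:
            body = table.get((X, a))
            if body is None:
                raise SyntaxError(f"no production for {X} on lookahead {a}")
            stack.extend(reversed(body)) # push body, leftmost symbol on top
    # reaching here means stack and input were consumed together: accept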
Example : Consider the grammar G
given by:
S → aAa | BAa | ε
A → cA | bA | ε
B → b
       a          b          c          $
S    S → aAa    S → BAa                S → ε
A    A → ε      A → bA     A → cA
B               B → b
Stack     Remaining input    Action
$S        bcba$              choose S → BAa
$aAB      bcba$              choose B → b
$aAb      bcba$              match b
$aA       cba$               choose A → cA
$aAc      cba$               match c
$aA       ba$                choose A → bA
$aAb      ba$                match b
$aA       a$                 choose A → ε
$a        a$                 match a
$         $                  accept
Error Recovery in Predictive Parsing
• The following discussion of error recovery refers to the stack of a table-driven
predictive parser, since the stack makes explicit the terminals and nonterminals
that the parser hopes to match with the remainder of the input.
Panic Mode
• Panic-mode error recovery is based on the idea of skipping symbols
on the input until a token in a selected set of synchronizing
tokens appears. Its effectiveness depends on the choice of the
synchronizing set: the sets should be chosen so that the parser
recovers quickly from errors.
Some heuristics are as follows:
• As a starting point, place all symbols in FOLLOW(A) into the synchronizing set
for nonterminal A. If we skip tokens until an element of FOLLOW(A) is seen
and pop A from the stack, it is likely that parsing can continue.
• It is not enough to use FOLLOW(A) as the synchronizing set for A.
• If we add symbols in FIRST(A) to the synchronizing set for nonterminal A,
then it may be possible to resume parsing according to A if a symbol in
FIRST(A) appears in the input.
• If a nonterminal can generate the empty string, then the production deriving
ε can be used as a default.
• If a terminal on top of the stack cannot be matched, a simple idea is to pop
the terminal, issue a message saying that the terminal was inserted, and
continue parsing. In effect, this approach takes the synchronizing set of a
token to consist of all other tokens.
Phrase-level Recovery:
Phrase-level error recovery is implemented by filling in the blank entries in the
predictive parsing table with pointers to error routines.
These routines may change, insert, or delete symbols on the input and issue
appropriate error messages.
They may also pop from the stack.
Alteration of stack symbols or the pushing of new symbols onto the stack is
questionable for several reasons.
First, the steps carried out by the parser might then not correspond to the
derivation of any word in the language at all.
Second, we must ensure that there is no possibility of an infinite loop.
Checking that any recovery action eventually results in an input symbol being
consumed (or the stack being shortened if the end of the input has been reached) is
a good way to protect against such loops.
4.7 More Powerful LR Parsers
• In this section, we shall extend the previous LR parsing techniques to use one symbol of lookahead on the input.
There are two different methods:
1. The "canonical-LR" or just "LR" method, which makes full use of the lookahead symbol(s). This method
uses a large set of items, called the LR(1) items.
2. The "lookahead-LR" or "LALR" method, which is based on the LR(0) sets of items, and has many fewer
states than typical parsers based on the LR(1) items. By carefully introducing lookaheads into the LR(0)
items, we can handle many more grammars with the LALR method than with the SLR method, and build
parsing tables that are no bigger than the
SLR tables. LALR is the method of choice in most situations.
• After introducing both these methods, we conclude with a discussion of how to compact LR parsing tables for
environments with limited memory.
• 4.7.1 Canonical LR(1) Items
4.7.2 Constructing LR(1) Sets of Items
Assignment
• 4.9 Parser Generators
• Bottom Up Parser
More Related Content

Similar to chapter4 end.pptx

Principles of Compiler Design - Introduction
Principles of Compiler Design - IntroductionPrinciples of Compiler Design - Introduction
Principles of Compiler Design - Introduction
sheelarajasekar205
 
Compiler Design.pptx
Compiler Design.pptxCompiler Design.pptx
Compiler Design.pptx
SouvikRoy149
 
PSEUDOCODE TO SOURCE PROGRAMMING LANGUAGE TRANSLATOR
PSEUDOCODE TO SOURCE PROGRAMMING LANGUAGE TRANSLATORPSEUDOCODE TO SOURCE PROGRAMMING LANGUAGE TRANSLATOR
PSEUDOCODE TO SOURCE PROGRAMMING LANGUAGE TRANSLATOR
ijistjournal
 
Compiler design important questions
Compiler design   important questionsCompiler design   important questions
Compiler design important questions
akila viji
 
SOFTWARE TOOL FOR TRANSLATING PSEUDOCODE TO A PROGRAMMING LANGUAGE
SOFTWARE TOOL FOR TRANSLATING PSEUDOCODE TO A PROGRAMMING LANGUAGESOFTWARE TOOL FOR TRANSLATING PSEUDOCODE TO A PROGRAMMING LANGUAGE
SOFTWARE TOOL FOR TRANSLATING PSEUDOCODE TO A PROGRAMMING LANGUAGE
IJCI JOURNAL
 
SOFTWARE TOOL FOR TRANSLATING PSEUDOCODE TO A PROGRAMMING LANGUAGE
SOFTWARE TOOL FOR TRANSLATING PSEUDOCODE TO A PROGRAMMING LANGUAGESOFTWARE TOOL FOR TRANSLATING PSEUDOCODE TO A PROGRAMMING LANGUAGE
SOFTWARE TOOL FOR TRANSLATING PSEUDOCODE TO A PROGRAMMING LANGUAGE
IJCI JOURNAL
 
Compiler Design Material
Compiler Design MaterialCompiler Design Material
Compiler Design Material
Dr. C.V. Suresh Babu
 
3.2
3.23.2
Error Detection & Recovery.pptx
Error Detection & Recovery.pptxError Detection & Recovery.pptx
Error Detection & Recovery.pptx
MohibKhan79
 
11700220036.pdf
11700220036.pdf11700220036.pdf
11700220036.pdf
SouvikRoy149
 
12000121056_Priyankamanna-3.pdf
12000121056_Priyankamanna-3.pdf12000121056_Priyankamanna-3.pdf
12000121056_Priyankamanna-3.pdf
PriyankaManna20
 
Compier Design_Unit I.ppt
Compier Design_Unit I.pptCompier Design_Unit I.ppt
Compier Design_Unit I.ppt
sivaganesh293
 
Compier Design_Unit I.ppt
Compier Design_Unit I.pptCompier Design_Unit I.ppt
Compier Design_Unit I.ppt
sivaganesh293
 
Pcd question bank
Pcd question bank Pcd question bank
Pcd question bank
Sumathi Gnanasekaran
 
Error detection recovery
Error detection recoveryError detection recovery
Error detection recovery
Tech_MX
 
Compiler an overview
Compiler  an overviewCompiler  an overview
Compiler an overview
amudha arul
 
Phases of Compiler.pptx
Phases of Compiler.pptxPhases of Compiler.pptx
Phases of Compiler.pptx
ssuser3b4934
 
SS UII Lecture 1
SS UII Lecture 1SS UII Lecture 1
SS UII Lecture 1
Avinash Kapse
 
The role of the parser and Error recovery strategies ppt in compiler design
The role of the parser and Error recovery strategies ppt in compiler designThe role of the parser and Error recovery strategies ppt in compiler design
The role of the parser and Error recovery strategies ppt in compiler design
Sadia Akter
 
Phases of Compiler.pptx
Phases of Compiler.pptxPhases of Compiler.pptx
Phases of Compiler.pptx
SuchandaBanerjee6
 

Similar to chapter4 end.pptx (20)

Principles of Compiler Design - Introduction
Principles of Compiler Design - IntroductionPrinciples of Compiler Design - Introduction
Principles of Compiler Design - Introduction
 
Compiler Design.pptx
Compiler Design.pptxCompiler Design.pptx
Compiler Design.pptx
 
PSEUDOCODE TO SOURCE PROGRAMMING LANGUAGE TRANSLATOR
PSEUDOCODE TO SOURCE PROGRAMMING LANGUAGE TRANSLATORPSEUDOCODE TO SOURCE PROGRAMMING LANGUAGE TRANSLATOR
PSEUDOCODE TO SOURCE PROGRAMMING LANGUAGE TRANSLATOR
 
Compiler design important questions
Compiler design   important questionsCompiler design   important questions
Compiler design important questions
 
SOFTWARE TOOL FOR TRANSLATING PSEUDOCODE TO A PROGRAMMING LANGUAGE
SOFTWARE TOOL FOR TRANSLATING PSEUDOCODE TO A PROGRAMMING LANGUAGESOFTWARE TOOL FOR TRANSLATING PSEUDOCODE TO A PROGRAMMING LANGUAGE
SOFTWARE TOOL FOR TRANSLATING PSEUDOCODE TO A PROGRAMMING LANGUAGE
 
SOFTWARE TOOL FOR TRANSLATING PSEUDOCODE TO A PROGRAMMING LANGUAGE
SOFTWARE TOOL FOR TRANSLATING PSEUDOCODE TO A PROGRAMMING LANGUAGESOFTWARE TOOL FOR TRANSLATING PSEUDOCODE TO A PROGRAMMING LANGUAGE
SOFTWARE TOOL FOR TRANSLATING PSEUDOCODE TO A PROGRAMMING LANGUAGE
 
Compiler Design Material
Compiler Design MaterialCompiler Design Material
Compiler Design Material
 
3.2
3.23.2
3.2
 
Error Detection & Recovery.pptx
Error Detection & Recovery.pptxError Detection & Recovery.pptx
Error Detection & Recovery.pptx
 
11700220036.pdf
11700220036.pdf11700220036.pdf
11700220036.pdf
 
12000121056_Priyankamanna-3.pdf
12000121056_Priyankamanna-3.pdf12000121056_Priyankamanna-3.pdf
12000121056_Priyankamanna-3.pdf
 
Compier Design_Unit I.ppt
Compier Design_Unit I.pptCompier Design_Unit I.ppt
Compier Design_Unit I.ppt
 
Compier Design_Unit I.ppt
Compier Design_Unit I.pptCompier Design_Unit I.ppt
Compier Design_Unit I.ppt
 
Pcd question bank
Pcd question bank Pcd question bank
Pcd question bank
 
Error detection recovery
Error detection recoveryError detection recovery
Error detection recovery
 
Compiler an overview
Compiler  an overviewCompiler  an overview
Compiler an overview
 
Phases of Compiler.pptx
Phases of Compiler.pptxPhases of Compiler.pptx
Phases of Compiler.pptx
 
SS UII Lecture 1
SS UII Lecture 1SS UII Lecture 1
SS UII Lecture 1
 
The role of the parser and Error recovery strategies ppt in compiler design
The role of the parser and Error recovery strategies ppt in compiler designThe role of the parser and Error recovery strategies ppt in compiler design
The role of the parser and Error recovery strategies ppt in compiler design
 
Phases of Compiler.pptx
Phases of Compiler.pptxPhases of Compiler.pptx
Phases of Compiler.pptx
 

Recently uploaded

Must Know Postgres Extension for DBA and Developer during Migration
Must Know Postgres Extension for DBA and Developer during MigrationMust Know Postgres Extension for DBA and Developer during Migration
Must Know Postgres Extension for DBA and Developer during Migration
Mydbops
 
Day 2 - Intro to UiPath Studio Fundamentals
Day 2 - Intro to UiPath Studio FundamentalsDay 2 - Intro to UiPath Studio Fundamentals
Day 2 - Intro to UiPath Studio Fundamentals
UiPathCommunity
 
Call Girls Chennai ☎️ +91-7426014248 😍 Chennai Call Girl Beauty Girls Chennai...
Call Girls Chennai ☎️ +91-7426014248 😍 Chennai Call Girl Beauty Girls Chennai...Call Girls Chennai ☎️ +91-7426014248 😍 Chennai Call Girl Beauty Girls Chennai...
Call Girls Chennai ☎️ +91-7426014248 😍 Chennai Call Girl Beauty Girls Chennai...
anilsa9823
 
Call Girls Chandigarh🔥7023059433🔥Agency Profile Escorts in Chandigarh Availab...
Call Girls Chandigarh🔥7023059433🔥Agency Profile Escorts in Chandigarh Availab...Call Girls Chandigarh🔥7023059433🔥Agency Profile Escorts in Chandigarh Availab...
Call Girls Chandigarh🔥7023059433🔥Agency Profile Escorts in Chandigarh Availab...
manji sharman06
 
MongoDB to ScyllaDB: Technical Comparison and the Path to Success
MongoDB to ScyllaDB: Technical Comparison and the Path to SuccessMongoDB to ScyllaDB: Technical Comparison and the Path to Success
MongoDB to ScyllaDB: Technical Comparison and the Path to Success
ScyllaDB
 
Introduction to ThousandEyes AMER Webinar
Introduction  to ThousandEyes AMER WebinarIntroduction  to ThousandEyes AMER Webinar
Introduction to ThousandEyes AMER Webinar
ThousandEyes
 
ScyllaDB Tablets: Rethinking Replication
ScyllaDB Tablets: Rethinking ReplicationScyllaDB Tablets: Rethinking Replication
ScyllaDB Tablets: Rethinking Replication
ScyllaDB
 
Day 4 - Excel Automation and Data Manipulation
Day 4 - Excel Automation and Data ManipulationDay 4 - Excel Automation and Data Manipulation
Day 4 - Excel Automation and Data Manipulation
UiPathCommunity
 
Cyber Recovery Wargame
Cyber Recovery WargameCyber Recovery Wargame
Cyber Recovery Wargame
Databarracks
 
LF Energy Webinar: Carbon Data Specifications: Mechanisms to Improve Data Acc...
LF Energy Webinar: Carbon Data Specifications: Mechanisms to Improve Data Acc...LF Energy Webinar: Carbon Data Specifications: Mechanisms to Improve Data Acc...
LF Energy Webinar: Carbon Data Specifications: Mechanisms to Improve Data Acc...
DanBrown980551
 
Northern Engraving | Modern Metal Trim, Nameplates and Appliance Panels
Northern Engraving | Modern Metal Trim, Nameplates and Appliance PanelsNorthern Engraving | Modern Metal Trim, Nameplates and Appliance Panels
Northern Engraving | Modern Metal Trim, Nameplates and Appliance Panels
Northern Engraving
 
CNSCon 2024 Lightning Talk: Don’t Make Me Impersonate My Identity
CNSCon 2024 Lightning Talk: Don’t Make Me Impersonate My IdentityCNSCon 2024 Lightning Talk: Don’t Make Me Impersonate My Identity
CNSCon 2024 Lightning Talk: Don’t Make Me Impersonate My Identity
Cynthia Thomas
 
Poznań ACE event - 19.06.2024 Team 24 Wrapup slidedeck
Poznań ACE event - 19.06.2024 Team 24 Wrapup slidedeckPoznań ACE event - 19.06.2024 Team 24 Wrapup slidedeck
Poznań ACE event - 19.06.2024 Team 24 Wrapup slidedeck
FilipTomaszewski5
 
ScyllaDB Real-Time Event Processing with CDC
ScyllaDB Real-Time Event Processing with CDCScyllaDB Real-Time Event Processing with CDC
ScyllaDB Real-Time Event Processing with CDC
ScyllaDB
 
Fuxnet [EN] .pdf
Fuxnet [EN]                                   .pdfFuxnet [EN]                                   .pdf
Fuxnet [EN] .pdf
Overkill Security
 
CTO Insights: Steering a High-Stakes Database Migration
CTO Insights: Steering a High-Stakes Database MigrationCTO Insights: Steering a High-Stakes Database Migration
CTO Insights: Steering a High-Stakes Database Migration
ScyllaDB
 
QR Secure: A Hybrid Approach Using Machine Learning and Security Validation F...
QR Secure: A Hybrid Approach Using Machine Learning and Security Validation F...QR Secure: A Hybrid Approach Using Machine Learning and Security Validation F...
QR Secure: A Hybrid Approach Using Machine Learning and Security Validation F...
AlexanderRichford
 
Call Girls Kochi 💯Call Us 🔝 7426014248 🔝 Independent Kochi Escorts Service Av...
Call Girls Kochi 💯Call Us 🔝 7426014248 🔝 Independent Kochi Escorts Service Av...Call Girls Kochi 💯Call Us 🔝 7426014248 🔝 Independent Kochi Escorts Service Av...
Call Girls Kochi 💯Call Us 🔝 7426014248 🔝 Independent Kochi Escorts Service Av...
dipikamodels1
 
MySQL InnoDB Storage Engine: Deep Dive - Mydbops
MySQL InnoDB Storage Engine: Deep Dive - MydbopsMySQL InnoDB Storage Engine: Deep Dive - Mydbops
MySQL InnoDB Storage Engine: Deep Dive - Mydbops
Mydbops
 
Containers & AI - Beauty and the Beast!?!
Containers & AI - Beauty and the Beast!?!Containers & AI - Beauty and the Beast!?!
Containers & AI - Beauty and the Beast!?!
Tobias Schneck
 

Recently uploaded (20)

Must Know Postgres Extension for DBA and Developer during Migration
Must Know Postgres Extension for DBA and Developer during MigrationMust Know Postgres Extension for DBA and Developer during Migration
Must Know Postgres Extension for DBA and Developer during Migration
 
Day 2 - Intro to UiPath Studio Fundamentals
Day 2 - Intro to UiPath Studio FundamentalsDay 2 - Intro to UiPath Studio Fundamentals
Day 2 - Intro to UiPath Studio Fundamentals
 
Call Girls Chennai ☎️ +91-7426014248 😍 Chennai Call Girl Beauty Girls Chennai...
Call Girls Chennai ☎️ +91-7426014248 😍 Chennai Call Girl Beauty Girls Chennai...Call Girls Chennai ☎️ +91-7426014248 😍 Chennai Call Girl Beauty Girls Chennai...
Call Girls Chennai ☎️ +91-7426014248 😍 Chennai Call Girl Beauty Girls Chennai...
 
Call Girls Chandigarh🔥7023059433🔥Agency Profile Escorts in Chandigarh Availab...
Call Girls Chandigarh🔥7023059433🔥Agency Profile Escorts in Chandigarh Availab...Call Girls Chandigarh🔥7023059433🔥Agency Profile Escorts in Chandigarh Availab...
Call Girls Chandigarh🔥7023059433🔥Agency Profile Escorts in Chandigarh Availab...
 
MongoDB to ScyllaDB: Technical Comparison and the Path to Success
MongoDB to ScyllaDB: Technical Comparison and the Path to SuccessMongoDB to ScyllaDB: Technical Comparison and the Path to Success
MongoDB to ScyllaDB: Technical Comparison and the Path to Success
 
Introduction to ThousandEyes AMER Webinar
Introduction  to ThousandEyes AMER WebinarIntroduction  to ThousandEyes AMER Webinar
Introduction to ThousandEyes AMER Webinar
 
ScyllaDB Tablets: Rethinking Replication
ScyllaDB Tablets: Rethinking ReplicationScyllaDB Tablets: Rethinking Replication
ScyllaDB Tablets: Rethinking Replication
 
Day 4 - Excel Automation and Data Manipulation
Day 4 - Excel Automation and Data ManipulationDay 4 - Excel Automation and Data Manipulation
Day 4 - Excel Automation and Data Manipulation
 
Cyber Recovery Wargame
Cyber Recovery WargameCyber Recovery Wargame
Cyber Recovery Wargame
 
LF Energy Webinar: Carbon Data Specifications: Mechanisms to Improve Data Acc...
LF Energy Webinar: Carbon Data Specifications: Mechanisms to Improve Data Acc...LF Energy Webinar: Carbon Data Specifications: Mechanisms to Improve Data Acc...
LF Energy Webinar: Carbon Data Specifications: Mechanisms to Improve Data Acc...
 
Northern Engraving | Modern Metal Trim, Nameplates and Appliance Panels
Northern Engraving | Modern Metal Trim, Nameplates and Appliance PanelsNorthern Engraving | Modern Metal Trim, Nameplates and Appliance Panels
Northern Engraving | Modern Metal Trim, Nameplates and Appliance Panels
 
CNSCon 2024 Lightning Talk: Don’t Make Me Impersonate My Identity
CNSCon 2024 Lightning Talk: Don’t Make Me Impersonate My IdentityCNSCon 2024 Lightning Talk: Don’t Make Me Impersonate My Identity
CNSCon 2024 Lightning Talk: Don’t Make Me Impersonate My Identity
 
Poznań ACE event - 19.06.2024 Team 24 Wrapup slidedeck
Poznań ACE event - 19.06.2024 Team 24 Wrapup slidedeckPoznań ACE event - 19.06.2024 Team 24 Wrapup slidedeck
Poznań ACE event - 19.06.2024 Team 24 Wrapup slidedeck
 
ScyllaDB Real-Time Event Processing with CDC
ScyllaDB Real-Time Event Processing with CDCScyllaDB Real-Time Event Processing with CDC
ScyllaDB Real-Time Event Processing with CDC
 
Fuxnet [EN] .pdf
Fuxnet [EN]                                   .pdfFuxnet [EN]                                   .pdf
Fuxnet [EN] .pdf
 
CTO Insights: Steering a High-Stakes Database Migration
CTO Insights: Steering a High-Stakes Database MigrationCTO Insights: Steering a High-Stakes Database Migration
CTO Insights: Steering a High-Stakes Database Migration
 
QR Secure: A Hybrid Approach Using Machine Learning and Security Validation F...
QR Secure: A Hybrid Approach Using Machine Learning and Security Validation F...QR Secure: A Hybrid Approach Using Machine Learning and Security Validation F...
QR Secure: A Hybrid Approach Using Machine Learning and Security Validation F...
 
Call Girls Kochi 💯Call Us 🔝 7426014248 🔝 Independent Kochi Escorts Service Av...
Call Girls Kochi 💯Call Us 🔝 7426014248 🔝 Independent Kochi Escorts Service Av...Call Girls Kochi 💯Call Us 🔝 7426014248 🔝 Independent Kochi Escorts Service Av...
Call Girls Kochi 💯Call Us 🔝 7426014248 🔝 Independent Kochi Escorts Service Av...
 
MySQL InnoDB Storage Engine: Deep Dive - Mydbops
MySQL InnoDB Storage Engine: Deep Dive - MydbopsMySQL InnoDB Storage Engine: Deep Dive - Mydbops
MySQL InnoDB Storage Engine: Deep Dive - Mydbops
 
Containers & AI - Beauty and the Beast!?!
Containers & AI - Beauty and the Beast!?!Containers & AI - Beauty and the Beast!?!
Containers & AI - Beauty and the Beast!?!
 

chapter4 end.pptx

  • 2. Introduction • The syntax of programming language constructs can be specified by context-free grammars or BNF (Backus-Naur Form) notation, Grammars offer significant benefits for both language designers and compiler writers. • A grammar gives a precise, yet easy-to-understand, syntactic specification of a programming language. • From certain classes of grammars, we can construct automatically an efficient parser that determines the syntactic structure of a source program. • The structure imparted to a language by a properly designed grammar is useful for translating source programs into correct object code and for detecting errors. • A grammar allows a language to be evolved or developed iteratively, by adding new constructs to perform new tasks.
  • 3. 4.1.1 The Role of the Parser • In our compiler model, the parser obtains a string of tokens from the lexical analyzer, as shown in Fig. 4.1, and verifies that the string of token names can be generated by the grammar for the source language.
  • 4. 4.1.1 The Role of the Parser • There are three general types of parsers for grammars: universal, top-down, and bottom-up. • Universal parsing methods such as the Cocke-Younger-Kasami algorithm and Earley's algorithm can parse any grammar. These general methods are, however, too inefficient to use in production compilers. • The methods commonly used in compilers can be classified as being either top- down or bottom-up. As implied by their names, top-down methods build parse trees from the top (root) to the bottom (leaves), while bottom-up methods start from the leaves and work their way up to the root. In either case, the input to the parser is scanned from left to right, one symbol at a time. • Parsers implemented by hand often use LL grammars. Parsers for the larger class of LR grammars are usually constructed using automated tools.
  • 5. 4.1.2 Representative Grammars • Constructs that begin with keywords like while or int, are relatively easy to parse, because the keyword guides the choice of the grammar production that must be applied to match the input. We therefore concentrate on expressions, which present more of challenge, because of the associativity and precedence of operators. • Associativity and precedence are captured in the following grammar. E represents expressions consisting of terms separated by + signs, T represents terms consisting of factors separated by * signs, and F represents factors that can be either parenthesized expressions or identifiers:
  • 6. 4.1.2 Representative Grammars • Expression grammar (4.1)belongs to the class of LR grammars that are suitable for bottom-up parsing. This grammar can be adapted to handle additional operators and additional levels of precedence. However, it cannot be used for top-down parsing because it is left recursive. The following non-left-recursive variant of the expression grammar (4.1) will be used for top-down parsing:
  • 7. 4.1.3 Syntax Error Handling The remainder of this section considers the nature of syntactic errors and general strategies for error recovery. Two of these strategies, called panic-mode and phrase- level recovery. • Most programming language specifications do not describe how a compiler should respond to errors; error handling is left to the compiler designer. Planning the error handling right from the start can both simplify the structure of a compiler and improve its handling of errors. • Common programming errors can occur at many different levels. • Lexical errors include misspellings of identifiers, keywords, or operators e.g., the use of an identifier elipsesize instead of e l l i p s e s i z e - and missing quotes around text intended as a string.
  • 8. 4.1.3 Syntax Error Handling • Syntactic errors include misplaced semicolons or extra or missing braces; that is, '((" or ")." As another example, in C or Java, the appearance of a case statement without an enclosing switch is a syntactic error (however,this situation is usually allowed by the parser and caught later in the processing, as the compiler attempts to generate code). • Semantic errors include type mismatches between operators and operands. An example is a return statement in a Java method with result type void. • Logical errors can be anything from incorrect reasoning on the part of the programmer to the use in a C program of the assignment operator = instead of the comparison operator ==. The program containing = may be well formed; however, it may not reflect the programmer's intent.
  • 9. 4.1.3 Syntax Error Handling • The precision of parsing methods allows syntactic errors to be detected very efficiently. Several parsing methods, such as the LL and LR methods, detect an error as soon as possible; that is, when the stream of tokens from the lexical analyzer cannot be parsed further according to the grammar for the language. • Another reason for emphasizing error recovery during parsing is that many errors appear syntactic, whatever their cause, and are exposed when parsing cannot continue. A few semantic errors, such as type mismatches, can also be detected efficiently; however, accurate detection of semantic and logical errors at compile time is in general a difficult task.
  • 10. 4.1.3 Syntax Error Handling The error handler in a parser has goals that are simple to state but challenging to realize: 1. Report the presence of errors clearly and accurately. 2. Recover from each error quickly enough to detect subsequent errors. 3. Add minimal overhead to the processing of correct programs. Fortunately, common errors are simple ones, and a relatively straightforward error- handling mechanism often suffices. How should an error handler report the presence of an error? At the very least, it must report the place in the source program where an error is detected, because there is a good chance that the actual error occurred within the previous few tokens. A common strategy is to print the offending line with a pointer to the position at which an error is detected.
  • 11. 4.1.4 Error-Recovery Strategies • Once an error is detected, how should the parser recover? Although no strategy has proven itself universally acceptable, a few methods have broad applicability. The simplest approach is for the parser to quit with an informative error message when it detects the first error. • The balance of this section is devoted to the following recovery strategies: panic-mode, phrase-level, error-productions, and global-correction. 1. Panic-Mode Recovery - the parser discards input symbols one at a time until one of a designated set of synchronizing tokens (i.e, delimiters: ; , }) is found. - often skips a considerable amount of input without checking it for additional errors, - advantage of simplicity - guaranteed not to go into an infinite loop.
  • 12. 4.1.4 Error-Recovery Strategies 2. Phrase-Level Recovery - a parser may perform local correction on the remaining input - it may replace a prefix of the remaining input by some string that allows the parser to continue. - A typical local correction is to replace a comma by a semicolon, delete an extraneous semicolon, or insert a missing semicolon. - The choice of the local correction is left to the compiler designer. - used in several error-repairing compilers, as it can correct any input string. - Its major drawback is the difficulty it has in coping with situations in which the actual error has occurred before the point of detection. 3. Error Productions - Anticipating common errors that might be encountered - A parser constructed from a grammar augmented by these error productions detects the anticipated errors when an error production is used during parsing. - The parser can then generate appropriate error diagnostics about the erroneous construct that has been recognized in the input.
  • 13. 4.1.4 Error-Recovery Strategies 4. Global Correction - There are algorithms for choosing a minimal sequence of changes to obtain a globally least-cost correction. - Given an incorrect input string x and grammar G, these algorithms will find a parse tree for a related string y - the number of insertions, deletions, and changes of tokens required to transform x into y is as small as possible. - too costly to implement in terms of time and space, so these techniques are currently only of theoretical interest.
  • 14. 4.2 Context-Free Grammars • Grammars were introduced in Section 2.2 to systematically describe the syntax of programming language constructs like expressions and statements. Using a syntactic variable stmt to denote statements and variable expr to denote expressions, the production • specifies the structure of this form of conditional statement. Other productions then define precisely what an expr is and what else a stmt can be.
  • 15. 4.2.1 The Formal Definition of a Context-Free Grammar From Section 2.2, a context-free grammar (grammar for short) consists of terminals, nonterminals, a start symbol, and productions. 1. Terminals are the basic symbols from which strings are formed. The term "token name" is a synonym for “terminal" and frequently we will use the word "token" for terminal when it is clear that we are talking about just the token name. We assume that the terminals are the first components of the tokens output by the lexical analyzer. In (4.4), the terminals are the keywords if and else and the symbols "(" and ") ." 2. Nonterminals are syntactic variables that denote sets of strings. In (4.4), stmt and expr are nonterminals. The sets of strings denoted by nonterminals help define the language generated by the grammar. Nonterminals impose a hierarchical structure on the language that is key to syntax analysis and translation.
  • 16. 3. In a grammar, one nonterminal is distinguished as the start symbol, and the set of strings it denotes is the language generated by the grammar. Conventionally, the productions for the start symbol are listed first. 4. The productions of a grammar specify the manner in which the terminals and nonterminals can be combined to form strings. Each production consists of: (a) A nonterminal called the head or left side of the production; this production defines some of the strings denoted by the head. (b) The symbol has been used in place of the arrow. (c) A body or right side consisting of zero or more terminals and nonterminals. The components of the body describe one way in which strings of the nonterminal at the head can be constructed.
  • 17. 4.2.1 The Formal Definition of a Context-Free Grammar
  • 18. 4.2.2 Notational Conventions • To avoid always having to state that "these are the terminals," "these are the nontermiaals," and so on, the following notational conventions for grammars will be used throughout the remainder of this book. 1. These symbols are terminals: (a) Lowercase letters early in the alphabet, such as a, b, c. (b) Operator symbols such as +, *, and so on. (c) Punctuation symbols such as parentheses, comma, and so on. (d) The digits 0,1,. .. ,9. (e) Boldface strings such as id or if, each of which represents a single terminal symbol. 2. These symbols are nonterminals: (a) Uppercase letters early in the alphabet, such as A, B, C. (b) The letter S, which, when it appears, is usually the start symbol. (c) Lowercase, italic names such as expr or stmt. (d) When discussing programming constructs, uppercase letters may be used to represent nonterminals for the constructs. For example, nonterminals for expressions, terms, and factors are often represented by E, T, and F, respectively
  • 19. 4.2.2 Notational Conventions • 3. Uppercase letters late in the alphabet, such as X, Y, Z represent grammar symbols; that is, either nonterminals or terminals. • 4. Lowercase letters late in the alphabet, chiefly u, v, ...,z , represent (possibly empty) strings of terminals. • 5. Lowercase Greek letters, a, ,O, y for example, represent (possibly empty) strings of grammar symbols. Thus, a generic production can be written as A + a, where A is the head and a the body. • 6. A set of productions A -+al, A + a2,... , A -+ a k with a common head A (call them A-productions), may be written A + a1 / a s I .. I ak. Call a l , a2,. .. ,a k the alternatives for A. • 7. Unless stated otherwise, the head of the first production is the start symbol.
  • 21. 4.2.3 Derivations • The construction of a parse tree can be made precise by taking a derivational view, in which productions are treated as rewriting rules.
  • 22. 4.2.4 Parse Trees and Derivations • A parse tree is a graphical representation of a derivation that filters out the order in which productions are applied to replace nonterminals. • For example, the parse tree for -(id + id) in Fig. 4.3, results from the derivation (4.8) as well as derivation (4.9). The leaves of a parse tree are labeled by nonterminals or terminals and, read from left to right, constitute a sentential form, called the yield or frontier of the tree.
  • 23.
  • 24. 4.2.5 Ambiguity • From Section 2.2.4, a grammar that produces more than one parse tree for some sentence is said to be ambiguous. Put another way, an ambiguous grammar is one that produces more than one leftmost derivation or more than one rightmost derivation for the same sentence.
  • 25. 4.2.5 Ambiguity • For most parsers, it is desirable that the grammar be made unambiguous, for if it is not, we cannot uniquely determine which parse tree to select for a sentence. In other cases, it is convenient to use carefully chosen ambiguous grammars, together with disambiguating rules that "throw away" undesirable parse trees, leaving only one tree for each sentence.
  • 26. 4.2.7 Context-Free Grammars Versus Regular Expressions
  • 27. 4.2.7 Context-Free Grammars Versus Regular Expressions • We can construct mechanically a grammar to recognize the same language as a nondeterministic finite automaton (NFA). The grammar above was constructed from the NFA in Fig. 3.24 using the following construction:
  • 28.
  • 29. 4.2.7 Context-Free Grammars Versus Regular Expressions • Colloquially, we say that "finite automata cannot count," meaning that a finite automaton cannot accept a language like {a^n b^n I n > 1) that would require it to keep count of the number of a's before it sees the b's. Likewise, "a grammar can count two items but not three," as we shall see when we consider non- context-free language constructs in Section 4.3.5.
  • 30.
  • 31. 4.3.1 Lexical Versus Syntactic Analysis • Everything that can be described by a regular expression can also be described by a grammar. We may therefore reasonably ask: "Why use regular expressions to define the lexical syntax of a language?" There are several reasons. 1. Separating the syntactic structure of a language into lexical and nonlexical parts provides a convenient way of modularizing the front end of a compiler into two manageable-sized components. 2. The lexical rules of a language are frequently quite simple, and to describe them we do not need a notation as powerful as grammars. 3. Regular expressions generally provide a more concise and easier-to-understand notation for tokens than grammars. 4. More efficient lexical analyzers can be constructed automatically from regular expressions than from arbitrary grammars. • There are no firm guidelines as to what to put into the lexical rules, as opposed to the syntactic rules. Regular expressions are most useful for describing the structure of constructs such as identifiers, constants, keywords, and white space. Grammars, on the other hand, are most useful for describing nested structures such as balanced parentheses, matching begin-end's, corresponding if-
  • 32. 4.3.2 Eliminating Ambiguity • Sometimes an ambiguous grammar can be rewritten to eliminate the ambiguity. As an example, we shall eliminate the ambiguity from the following "dangling else" grammar:
  • 33. • In all programming languages with conditional statements of this form, the first parse tree is preferred. The general rule is, "Match each else with the closest unmatched then." This disambiguating rule can theoretically be incorporated directly into a grammar, but in practice it is rarely built into the productions.
  • 34. 4.3.3 Elimination of Left Recursion
  • 35. • The nonterminal A generates the same strings as before but is no longer left recursive. This procedure eliminates all left recursion from the A and A' productions, but it does not eliminate left recursion involving derivations of two or more steps.
  • 38. Top down parsing • Top-down parsing processes the input string provided by a lexical analyzer. • The top-down parsing first creates the root node of the parse tree. And it continues creating its leaf nodes • The process of constructing the parse tree which starts from the root and goes down to the leaf is Top-Down Parsing. • Top-Down Parsers constructs from the Grammar which is free from ambiguity and left recursion. • Top-Down Parsers uses leftmost derivation to construct a parse tree. • It does not allow Grammar With Common Prefixes. • Top-down parsers are also called predictive parsers.
Recursive-Descent Parsing
• Recursive descent is a top-down parsing technique that constructs the parse tree from the top while the input is read from left to right. It uses a procedure for each nonterminal, matching terminals against the input directly. The technique recursively parses the input to build a parse tree, and it may or may not require backtracking.
• A form of recursive-descent parsing that does not require any backtracking is known as predictive parsing.
Backtracking
• General recursive descent may require backtracking; that is, it may require repeated scans over the input.
• Top-down parsers start from the root node (start symbol) and match the input string against the production rules, replacing nonterminals as the rules match.
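• To make this concrete, the following is a minimal Python sketch of a backtrack-free (predictive) recursive-descent recognizer for the expression grammar used later in these slides (E → TE', E' → +TE' | ε, T → FT', T' → *FT' | ε, F → id | (E)); the token encoding, a list of strings, is an assumption of this example:

def parse(tokens):
    """Predictive recursive-descent recognizer for:
    E -> T E'    E' -> + T E' | eps
    T -> F T'    T' -> * F T' | eps
    F -> id | ( E )"""
    pos = 0

    def lookahead():
        return tokens[pos] if pos < len(tokens) else "$"

    def match(t):
        nonlocal pos
        if lookahead() != t:
            raise SyntaxError(f"expected {t!r}, found {lookahead()!r}")
        pos += 1

    def E():
        T(); E_prime()

    def E_prime():
        if lookahead() == "+":        # E' -> + T E'
            match("+"); T(); E_prime()
        # otherwise E' -> eps: consume nothing

    def T():
        F(); T_prime()

    def T_prime():
        if lookahead() == "*":        # T' -> * F T'
            match("*"); F(); T_prime()

    def F():
        if lookahead() == "id":       # F -> id
            match("id")
        else:                         # F -> ( E )
            match("("); E(); match(")")

    E()
    if lookahead() != "$":            # all input must be consumed
        raise SyntaxError("trailing input")
    return True

print(parse(["id", "+", "id", "*", "id"]))   # True

• Note that each procedure decides which production to apply by inspecting only the single lookahead token, which is exactly why no backtracking is needed for this grammar.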
FIRST and FOLLOW
• The construction of both top-down and bottom-up parsers is supported by two functions, FIRST and FOLLOW, associated with a grammar G. FIRST and FOLLOW help us fill in the entries of the predictive parsing table M.
Computation of FIRST
• FIRST(α) is defined as the set of terminal symbols that begin the strings derivable from α; if α can derive the empty string, then ε is also in FIRST(α).
Computation of FOLLOW
• FOLLOW(A) is defined as the set of terminal symbols that can occur immediately to the right of the nonterminal A in some sentential form.
Rules of FOLLOW
• If S is the start symbol, then $ is in FOLLOW(S).
• If there is a production A → αBβ, then everything in FIRST(β) except ε is in FOLLOW(B).
• If there is a production A → αB, or a production A → αBβ where FIRST(β) contains ε, then everything in FOLLOW(A) is in FOLLOW(B).
Rules of FIRST
• A terminal c is in FIRST(α) if and only if α ⇒* cβ for some sequence β of grammar symbols.
• For a production X → Y1 Y2 ... Yk, FIRST(X) contains FIRST(Y1) − {ε}; if Y1 can derive ε, it also contains FIRST(Y2) − {ε}, and so on; if all of the Yi can derive ε, then ε is in FIRST(X).
• If X → ε is a production, then add ε to FIRST(X).
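• These rules translate directly into a fixed-point computation. The following Python sketch is one possible implementation; the grammar encoding (a dict of productions, with [] for an ε-production and "eps" as the ε marker) is an assumption of this example:

def compute_first(grammar):
    """FIRST for every nonterminal, computed to a fixed point.
    grammar maps a nonterminal to a list of right-hand sides;
    each right-hand side is a list of symbols, [] meaning eps."""
    first = {A: set() for A in grammar}

    def first_of(symbols):
        """FIRST of an arbitrary string of grammar symbols."""
        result = set()
        for s in symbols:
            if s not in grammar:            # terminal
                result.add(s)
                return result
            result |= first[s] - {"eps"}
            if "eps" not in first[s]:       # s cannot vanish: stop here
                return result
        result.add("eps")                   # every symbol can derive eps
        return result

    changed = True
    while changed:
        changed = False
        for A, rhss in grammar.items():
            for rhs in rhss:
                new = first_of(rhs)
                if not new <= first[A]:
                    first[A] |= new
                    changed = True
    return first, first_of

def compute_follow(grammar, start, first_of):
    """FOLLOW for every nonterminal, computed to a fixed point."""
    follow = {A: set() for A in grammar}
    follow[start].add("$")                  # rule 1: $ follows the start symbol
    changed = True
    while changed:
        changed = False
        for A, rhss in grammar.items():
            for rhs in rhss:
                for i, B in enumerate(rhs):
                    if B not in grammar:    # FOLLOW is for nonterminals only
                        continue
                    fb = first_of(rhs[i + 1:])
                    new = fb - {"eps"}      # rule 2: FIRST(beta) minus eps
                    if "eps" in fb:         # rule 3: beta can vanish
                        new |= follow[A]
                    if not new <= follow[B]:
                        follow[B] |= new
                        changed = True
    return follow

grammar = {
    "E":  [["T", "E'"]],
    "E'": [["+", "T", "E'"], []],
    "T":  [["F", "T'"]],
    "T'": [["*", "F", "T'"], []],
    "F":  [["id"], ["(", "E", ")"]],
}
first, first_of = compute_first(grammar)
follow = compute_follow(grammar, "E", first_of)
print(first["E"], follow["T'"])   # {'id', '('} and {'+', ')', '$'}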
Example
• Calculate the FIRST and FOLLOW functions for the given grammar:
S → (L) | a
L → SL'
L' → ,SL' | ε
• FIRST functions:
FIRST(S) = { ( , a }
FIRST(L) = FIRST(S) = { ( , a }
FIRST(L') = { , , ε }   (the comma terminal and ε)
• FOLLOW functions:
FOLLOW(S) = { $ } ∪ ( FIRST(L') − {ε} ) ∪ FOLLOW(L) ∪ FOLLOW(L') = { $ , , , ) }
FOLLOW(L) = { ) }
FOLLOW(L') = FOLLOW(L) = { ) }
LL(1) Grammars
• A context-free grammar G = (VT, VN, S, P) whose predictive parsing table has no multiply defined entries is said to be LL(1).
• The first L stands for scanning the input from left to right.
• The second L stands for producing a leftmost derivation.
• The 1 stands for using one input symbol of lookahead at each step to make a parsing-action decision.
• A language is said to be LL(1) if it can be generated by an LL(1) grammar.
• Conditions to check first are as follows:
1. The grammar is free from left recursion.
2. The grammar is not ambiguous.
Algorithm to construct the LL(1) parsing table
Step 1: Check the conditions mentioned above; if they hold, go to step 2.
Step 2: Calculate FIRST() and FOLLOW() for all nonterminals.
Step 3: For each production A → α:
• Find FIRST(α), and for each terminal in FIRST(α), make the entry A → α in the table.
• If FIRST(α) contains ε, then find FOLLOW(A), and for each terminal in FOLLOW(A), make the entry A → ε in the table.
• If FIRST(α) contains ε and FOLLOW(A) contains $, then make the entry A → ε in the table for $.
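• Step 3 is mechanical once FIRST and FOLLOW are known. The following Python sketch builds the table and detects multiply defined entries; the FIRST/FOLLOW values are hard-coded here from the worked example below, and the data layout is an assumption of this sketch:

# FIRST of each production's right-hand side and FOLLOW of each
# nonterminal, for E -> TE', E' -> +TE' | eps, T -> FT',
# T' -> *FT' | eps, F -> id | (E).
FIRST_RHS = {
    ("E",  ("T", "E'")):      {"id", "("},
    ("E'", ("+", "T", "E'")): {"+"},
    ("E'", ()):               {"eps"},
    ("T",  ("F", "T'")):      {"id", "("},
    ("T'", ("*", "F", "T'")): {"*"},
    ("T'", ()):               {"eps"},
    ("F",  ("id",)):          {"id"},
    ("F",  ("(", "E", ")")):  {"("},
}
FOLLOW = {
    "E": {")", "$"}, "E'": {")", "$"},
    "T": {"+", ")", "$"}, "T'": {"+", ")", "$"},
    "F": {"*", "+", ")", "$"},
}

def build_ll1_table(first_rhs, follow):
    table, conflicts = {}, []
    for (A, rhs), first_set in first_rhs.items():
        targets = first_set - {"eps"}          # step 3, first rule
        if "eps" in first_set:                 # step 3, eps rules
            targets = targets | follow[A]      # FOLLOW(A) already holds $ if needed
        for a in targets:
            if (A, a) in table:                # multiply defined entry: not LL(1)
                conflicts.append((A, a))
            table[(A, a)] = rhs
    return table, conflicts

table, conflicts = build_ll1_table(FIRST_RHS, FOLLOW)
print(table[("E'", ")")])    # () -- that is, E' -> eps
print(conflicts)             # [] since this grammar is LL(1)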
Example: Consider the grammar:
E → TE'
E' → +TE' | ε
T → FT'
T' → *FT' | ε
F → id | (E)
Step 1: The grammar satisfies all the properties in step 1.
Step 2: Calculate FIRST() and FOLLOW():
Production        FIRST         FOLLOW
E → TE'           { id, ( }     { $, ) }
E' → +TE' | ε     { +, ε }      { $, ) }
T → FT'           { id, ( }     { +, $, ) }
T' → *FT' | ε     { *, ε }      { +, $, ) }
F → id | (E)      { id, ( }     { *, +, $, ) }
Step 3: The resulting LL(1) parsing table (blank entries are errors):
          id          +            *            (           )           $
E         E → TE'                               E → TE'
E'                    E' → +TE'                             E' → ε      E' → ε
T         T → FT'                               T → FT'
T'                    T' → ε       T' → *FT'                T' → ε      T' → ε
F         F → id                                F → (E)
Nonrecursive Predictive Parsing
• Predictive parsing can be performed using an explicit pushdown stack, avoiding recursive calls.
• Initially the stack holds just the start symbol of the grammar (above the end-of-stack marker).
• At each step a symbol X is popped from the stack:
• if X is a terminal symbol, it is matched against the lookahead, and the lookahead is advanced;
• if X is a nonterminal, then using the lookahead and the parsing table a production is chosen, and its right-hand side is pushed onto the stack (rightmost symbol first, so the leftmost symbol ends up on top).
• This process continues until the stack and the input string are both empty. It is useful to have an end-of-stack marker and an end-of-input marker; we denote both by $.
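• A minimal Python sketch of this driver loop, using the LL(1) table for the expression grammar above; the table and token encodings are assumptions of this example:

# Entries map (nonterminal, lookahead) to the production's right-hand
# side; [] represents an eps-production, absence represents an error.
TABLE = {
    ("E", "id"): ["T", "E'"],  ("E", "("): ["T", "E'"],
    ("E'", "+"): ["+", "T", "E'"], ("E'", ")"): [], ("E'", "$"): [],
    ("T", "id"): ["F", "T'"],  ("T", "("): ["F", "T'"],
    ("T'", "*"): ["*", "F", "T'"], ("T'", "+"): [], ("T'", ")"): [], ("T'", "$"): [],
    ("F", "id"): ["id"],       ("F", "("): ["(", "E", ")"],
}
NONTERMINALS = {"E", "E'", "T", "T'", "F"}

def predictive_parse(tokens, start="E"):
    stack = ["$", start]               # $ marks the bottom of the stack
    tokens = tokens + ["$"]            # $ marks the end of the input
    pos = 0
    while stack:
        X = stack.pop()
        a = tokens[pos]
        if X == "$":                   # end-of-stack marker reached
            return a == "$"            # accept iff input is also exhausted
        if X not in NONTERMINALS:      # terminal: must match the lookahead
            if X != a:
                raise SyntaxError(f"expected {X!r}, found {a!r}")
            pos += 1
        else:                          # nonterminal: consult the table
            rhs = TABLE.get((X, a))
            if rhs is None:
                raise SyntaxError(f"no rule for ({X}, {a})")
            stack.extend(reversed(rhs))  # push rightmost first; eps pushes nothing
    return False

print(predictive_parse(["id", "+", "id", "*", "id"]))  # True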
Example: Consider the grammar G given by:
S → aAa | BAa | ε
A → cA | bA | ε
B → b
The LL(1) parsing table:
        a           b           c           $
S       S → aAa     S → BAa                 S → ε
A       A → ε       A → bA      A → cA
B                   B → b
Parsing the input bcba:
Stack       Remaining input     Action
$S          bcba$               choose S → BAa
$aAB        bcba$               choose B → b
$aAb        bcba$               match b
$aA         cba$                choose A → cA
$aAc        cba$                match c
$aA         ba$                 choose A → bA
$aAb        ba$                 match b
$aA         a$                  choose A → ε
$a          a$                  match a
$           $                   accept
Error Recovery in Predictive Parsing
• This discussion of error recovery refers to the stack of a table-driven predictive parser, since the stack makes explicit the terminals and nonterminals that the parser hopes to match with the remainder of the input.
Panic Mode
• Panic-mode error recovery is based on the idea of skipping symbols on the input until a token in a selected set of synchronizing tokens appears. Its effectiveness depends on the choice of the synchronizing set; the sets should be chosen so that the parser recovers quickly from errors.
Some heuristics are as follows:
• As a starting point, place all symbols in FOLLOW(A) into the synchronizing set for nonterminal A. If we skip tokens until an element of FOLLOW(A) is seen and pop A from the stack, it is likely that parsing can continue.
• It is not enough to use FOLLOW(A) as the synchronizing set for A alone; for example, symbols that begin higher-level constructs, such as keywords that begin statements, may also need to be added to the synchronizing sets of lower-level constructs such as expressions.
• If we add symbols in FIRST(A) to the synchronizing set for nonterminal A, then it may be possible to resume parsing according to A if a symbol in FIRST(A) appears in the input.
• If a nonterminal can generate the empty string, then the production deriving ε can be used as a default.
• If a terminal on top of the stack cannot be matched, a simple idea is to pop the terminal, issue a message saying that the terminal was inserted, and continue parsing. In effect, this approach takes the synchronizing set of a token to consist of all other tokens.
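• As a minimal illustration of the first heuristic, a panic-mode step for a table-driven parser might look like the following Python sketch; the function and variable names are assumptions of this example:

def panic_mode_skip(tokens, pos, synchronizing):
    """Skip input tokens until one in the synchronizing set appears.
    Returns the new input position; the caller then pops the offending
    nonterminal from the stack and resumes parsing."""
    while pos < len(tokens) and tokens[pos] not in synchronizing:
        print(f"error: skipping unexpected token {tokens[pos]!r}")
        pos += 1
    return pos

# Example: recovering for a nonterminal whose FOLLOW set is { ')', '$' }
tokens = ["id", "+", "+", ")", "$"]
print(panic_mode_skip(tokens, 1, {")", "$"}))   # skips the two '+', returns 3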
Phrase-Level Recovery
• Phrase-level error recovery is implemented by filling in the blank entries in the predictive parsing table with pointers to error routines. These routines may change, insert, or delete symbols on the input and issue appropriate error messages. They may also pop from the stack.
• Alteration of stack symbols or the pushing of new symbols onto the stack is questionable for several reasons. First, the steps carried out by the parser might then not correspond to the derivation of any word in the language at all. Second, we must ensure that there is no possibility of an infinite loop. Checking that any recovery action eventually results in an input symbol being consumed (or the stack being shortened if the end of the input has been reached) is a good way to protect against such loops.
4.7 More Powerful LR Parsers
• In this section, we shall extend the previous LR parsing techniques to use one symbol of lookahead on the input. There are two different methods:
1. The "canonical-LR" or just "LR" method, which makes full use of the lookahead symbol(s). This method uses a large set of items, called the LR(1) items.
2. The "lookahead-LR" or "LALR" method, which is based on the LR(0) sets of items and has many fewer states than typical parsers based on the LR(1) items. By carefully introducing lookaheads into the LR(0) items, we can handle many more grammars with the LALR method than with the SLR method, and build parsing tables that are no bigger than the SLR tables. LALR is the method of choice in most situations.
• After introducing both of these methods, we conclude with a discussion of how to compact LR parsing tables for environments with limited memory.
4.7.1 Canonical LR(1) Items
• An LR(1) item has the form [A → α·β, a], where A → αβ is a production and a is a terminal or the end-marker $; the second component a is called the lookahead of the item.
4.7.2 Constructing LR(1) Sets of Items
• The method for building the collection of sets of valid LR(1) items is essentially the same as for LR(0) items, with modified CLOSURE and GOTO functions:
• CLOSURE(I): for each item [A → α·Bβ, a] in I and each production B → γ, add the item [B → ·γ, b] for every terminal b in FIRST(βa); repeat until no more items can be added.
• GOTO(I, X): take all items [A → α·Xβ, a] in I, move the dot past X to obtain [A → αX·β, a], and return the CLOSURE of the resulting set.
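• As a concrete illustration of CLOSURE for LR(1) items, here is a minimal Python sketch using the classic grammar S' → S, S → CC, C → cC | d; the item encoding and the simplified FIRST computation (valid only because this grammar has no ε-productions) are assumptions of this sketch, not part of the original algorithm statement:

GRAMMAR = {
    "S'": [("S",)],
    "S":  [("C", "C")],
    "C":  [("c", "C"), ("d",)],
}
# FIRST sets for this grammar; with no eps-productions they are simple.
FIRST = {"S'": {"c", "d"}, "S": {"c", "d"}, "C": {"c", "d"}}

def first_of(symbols):
    """FIRST of a symbol string (valid for grammars without eps-productions)."""
    for s in symbols:
        return FIRST[s] if s in GRAMMAR else {s}
    return set()                              # empty string: defer to the lookahead

def closure(items):
    """LR(1) CLOSURE over items encoded as (head, rhs, dot, lookahead)."""
    items = set(items)
    changed = True
    while changed:
        changed = False
        for (head, rhs, dot, a) in list(items):
            if dot == len(rhs) or rhs[dot] not in GRAMMAR:
                continue                      # dot at the end or before a terminal
            B = rhs[dot]
            lookaheads = first_of(rhs[dot + 1:]) or {a}   # FIRST(beta a)
            for gamma in GRAMMAR[B]:
                for b in lookaheads:
                    item = (B, gamma, 0, b)
                    if item not in items:
                        items.add(item)
                        changed = True
    return items

# Initial state of the canonical LR(1) collection: CLOSURE({[S' -> .S, $]})
for head, rhs, dot, a in sorted(closure({("S'", ("S",), 0, "$")})):
    print(f"[{head} -> {' '.join(rhs[:dot])}.{' '.join(rhs[dot:])}, {a}]")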
Assignment
• 4.9 Parser Generators
• Bottom-Up Parser