An Introduction to the Compiler Designss

20IT011-COMPILER DESIGN
Text Book :
Alfred V Aho, Monica S Lam, Ravi Sethi and Jeffrey D Ullman, “Compilers - Principles,
Techniques and Tools”, Pearson Education, 2nd Edition, 2018.

COURSE OUTCOME
CO1: Ability to understand the functionalities of the different phases of compiler.
CO2: Ability to understand syntax-directed translation and run-time environment.
CO3: Ability to implement code optimization techniques.
CO4: Ability to apply different parsing algorithms to develop the parsers for a
given grammar.
CO5: Ability to design a scanner and a parser using LEX and YACC tools.

UNIT I
INTRODUCTION TO COMPILERS

UNIT-I SYLLABUS
Structure of a compiler – Lexical Analysis – Role
of Lexical Analyzer -Input Buffering –
Specification of Tokens – Recognition of Tokens-
Lex – Finite Automata-Regular Expressions to
Automata – Minimizing DFA.

STRUCTURE OF A COMPILER
• Definition: A program that accept as input a program text in a certain language
and produces as output a program text in another language (Grune et all)
• Compiler is a translator program that reads a program written in one language -
the source language- and translates it into an equivalent program in another
language-the target language.

Lexical Analysis
• In a compiler, linear analysis is called lexical analysis or scanning.
• It takes source code as input
• It reads one character at a time and convert into lexemes( represent in the
form of tokens)
Example: position = initial + rate * 60
✓ The identifier position.
✓ The assignment symbol =.
✓ The identifier initial.
✓ The plus sign.
✓ The identifier rate.
✓ The multiplication sign.
✓ The number 60.
<Id,Position> <=,> <Id,initial> <+,> <Id,rate><*,> <num,60>

Contd..
• Needs to record each id attribute: keep a symbol table
• Lexical Analysis eliminates white spaces, etc..

Syntax Analysis
• It is called as parsing or Syntax Analysis
• It takes token as input and generate parse tree as output.
• The parser checks that the expression made by tokens is
syntactically correct or not.
• Context free grammars formalize the rules and guide syntax analysis.

Semantic Analysis
• It checks the parse tree follows the rules of language
• It keeps tracks of identifiers, types and expressions
• Output is annotated as tree syntax

Intermediate Code Generator
• After semantic analysis, some compilers generate an explicit
intermediate representation of the source program.
• This representation should be easy to produce and easy to translate
into the target program.

Code Optimization
• It is used to improve the intermediate code
• So the output can run fast and take less space
• It removes unnecessary lines of code and arrange the sequence in
order to speed up the execution.

Code Generation
• The final phase of the compiler is the generation of target code,
consisting of relocatable machine code or assembly code.
• The intermediate instructions are each translated into sequence of
machine instructions that perform the same task.

Symbol table Manager
• An essential function of a compiler is to record the identifiers used in
the source program and collect its information.
• A symbol table is a data structure containing a record for each
identifier with fields for attributes.(such as, its type, its scope, if
procedure names then the number and type of arguments etc.,)
• The data structure allows to find the record for each identifier and
store or retrieve data from that record quickly.
Example: sum=initial + value * 10 1 sum
2 initial
3 value
4 ….

Error Handling and Reporting
• Each phase can encounter errors. After detecting an error, a phase must deal that
error, so that compilation can proceed ,allowing further errors to be detected.
• Lexical phase can detect error where the characters remaining in the input do not
form any token.
• The syntax and semantic phases handle large fraction of errors. The stream of
tokens violates the syntax rules are determined by syntax phase.
• During semantic, the compiler tries to detect constructs that have the right
syntactic structure but no meaning to the operation involved.

Role of lexical Analyzer
• As the first phase of a compiler, the main task of the lexical analyzer is
to read the input characters of the source program, group them into
lexemes, and produce as output a sequence of tokens for each lexeme
in the source program.
• The stream of tokens is sent to the parser for syntax analysis.
• It is common for the lexical analyzer to interact with the symbol table
as well.
• When the lexical analyzer discovers a lexeme constituting an
identifier, it needs to enter that lexeme into the symbol table.

Contd..
• In some cases, information regarding the kind of identifier may be
read from the symbol table by the lexical analyzer to assist it in
determining the proper token it must pass to the parser.

Interaction between the lexical analyzer and the parser
• The interaction is implemented by having the parser call the lexical
analyzer.
• The call, suggested by the getNextToken command, causes the
lexical analyzer to read characters from its input until it can identify
the next lexeme and produce for it the next token, which it returns to
the parser.
• Since the lexical analyzer is the part of the compiler that reads the
source text, it may perform certain other tasks besides identification of
lexemes

Contd..
• One such task is stripping out comments and whitespace (blank,
newline, tab, and perhaps other characters that are used to separate
tokens in the input).
• Another task is correlating error messages generated by the compiler
with the source program.
• lexical analyzers are divided into a cascade of two processes:
• Scanning consists of the simple processes that do not require tokenization of the
input, such as deletion of comments and compaction of consecutive whitespace
characters into one.
• Lexical analysis proper is the more complex portion, where the scanner produces the
sequence of tokens as output.

Lexical Analysis and Parsing
• Simplicity of design – avoid complexity to work together .
• Improving Compiler Efficiency - A separate lexical analyzer allows us to apply
specialized techniques that serve only the lexical task, not the job of parsing. In
addition, specialized buffering techniques for reading input characters can speed
up the compiler significantly.
• Enhancing Compiler Portability -Input-device-specific peculiarities can be
restricted to the lexical analyzer.

Tokens, Patterns, Lexemes
• A token is a pair consisting of a token name and an optional
attribute value.
• A pattern is a description of the form that the lexemes of a token
may take.
• A lexeme is a sequence of characters in the source program that
matches the pattern for a token and is identified by the lexical analyzer
as an instance of that token.

Lexical Errors
• Some errors are out of power of lexical analyzer to recognize:
✓fi(a==f(x))..
• However it may be able to recognize errors like:
✓d=2r
• Such errors are recognized when no pattern for tokens matches a character
sequence

Error Recovery
• Panic Mode: Successive characters are ignored until we reach to a well
formed token
• Delete one character from the remaining input
• Insert a missing character into the remaining input
• Replace a character by another character
• Transpose two adjacent characters

Input Buffering
• Input buffer helps to find the correct lexeme.
• Sometimes lexical analyzer needs to look ahead some symbols to decide about
the token to return
- in C language: we need to look after -,= or < to decide what token to return
- in FORTRAN: DO 5 I=1.25
• We need to introduce a two buffer scheme to handle large look-aheads safely

Buffer Pairs
Two pointers to the input are maintained:
I. Pointer lexemeBegin, marks the beginning of the current lexeme,
whose extent we are attempting to determine.
II. Pointer forward scans ahead until a pattern match is found

Sentinels
• A sentinel is a special character or token used to mark the end of
a sequence.
• In the context of a lexical analyzer, a sentinel can be utilized to
indicate the end of the input source code or the end of a specific
token.

QUIZ
1. ______________ is the heart of the compiler.
2.what will be the next phase of Intermediate code generator?
3. What is the output of Lexical analyzer?
4. Token consist of pairs, what are they?
5. What is the command ,the parser generate to lexical analyzer?
6. What is the drawback in 1-Buffer?
7. What the two pointers used in Buffering technique?
8. What are the Back-end in Phases of compiler?

SPECIFICATION OF TOKENS
• In theory of compilation regular expressions are used to formalize the
specification of tokens.
• The specification of tokens depends on the pattern of the lexeme.
Example:
letter(letter+digit)*
• Each regular expression is a pattern specifying the form of strings.

Contd..
I) Alphabet, Strings, Languages
II) Operations on Language
III) Regular Expression
IV) Regular Definition

Strings and Languages
Symbol:
• A symbol is an abstract entity that we shall not define formally.
Examples:
• Letters: A|B|C…..|Z |a|b….|z
• Digit:0,1,2…9
• Special Characters

Contd..
Alphabet:
• Alphabet is a non-empty finite set of symbols. it is denoted by the
symbol Ɛ(Sigma).
Example:
• Ɛ={0,1} is the binary alphabet which consisting of digits
• Ɛ={a,b,c} is an alphabet which consisting of letters.
• ASCII is an alphabet ;it is used in many softwares.
• Unicode is an alphabet , which includes 1,00,000 characters.

Contd..
String:
• A String is defined as finite sequence of symbols over an alphabet Ɛ .
It is denoted as ‘w’.
• In language theory, the terms “sentence” and “word” are often used as
synonyms for “string”
Examples:
1.Alphabet Ɛ={ a, b }
Strings: w = {a, b, aa, ab, ba, bb, aaa, aab,……..}
2. Ɛ={ 0, 1 }
Strings: w = { 0, 1, 00, 01, 10, 11, 000, 001, 010, 011,100,101,110,111,
0000,....}

Different Notation used in a string
• Length
• Prefix:- any number of leading symbols
• Proper Prefix :- any number of leading symbols without € and
original string
• Suffix:- tailing symbols
• Proper Suffix:- any number of trailing symbols without € and
original string
• Substring

Definition
- A regular expression over Ɛ can be defined as
• ɸ is a regular expression for empty set
• Ɛ is a regular expression for null string {Ɛ}
• If ‘a’ is a symbol in Ɛ then ‘a’ is RE for the set {a}.
• If R and S are two RE then
• UNION of two R+E is a RE
• CONCATENATION of two RS is a RE
• KLEEN CLOSURE of any R* S* is a RE

Identity Rule or Algebric rule

EXAMPLES
1) Design a RE to accept all possible combinations of a’s and b’s over
Ɛ={ a, b}
2) Design a RE which Starts with 1 and end with 0.

An Introduction to the Compiler Designss

Recommended

Recommended

More Related Content

Similar to An Introduction to the Compiler Designss

Similar to An Introduction to the Compiler Designss (20)

Recently uploaded

Recently uploaded (20)

An Introduction to the Compiler Designss