This document provides an introduction to compilers. It discusses how compilers bridge the gap between high-level programming languages that are easier for humans to write in and machine languages that computers can actually execute. It describes the various phases of compilation like lexical analysis, syntax analysis, semantic analysis, code generation, and optimization. It also compares compilers to interpreters and discusses different types of translators like compilers, interpreters, and assemblers.
The document discusses lexical analysis in compilers. It describes how the lexical analyzer reads source code characters and divides them into tokens. Regular expressions are used to specify patterns for token recognition. The lexical analyzer generates a finite state automaton to recognize these patterns. Lexical analysis is the first phase of compilation that separates the input into tokens for the parser.
Topics Covered:
• Linker and types of linkers
• Loaders and types of loaders
• Translator, link-time, and load-time addresses, with examples
• Object modules
• Difference between static and dynamic binding
• Program relocatability
This document discusses various techniques for optimizing computer code, including:
1. Local optimizations that improve performance within basic blocks, such as constant folding, constant propagation, and elimination of redundant computations.
2. Global optimizations that analyze control flow across basic blocks, such as global common-subexpression elimination.
3. Loop optimizations that improve the performance of loops, such as moving loop-invariant computations out of the loop and eliminating induction variables.
4. Machine-dependent optimizations, like peephole optimization, that replace instruction sequences with more efficient alternatives.
The goal of optimizations is to improve speed and efficiency while preserving program meaning and correctness. Optimizations can occur at multiple stages of development and compilation.
Functions allow programmers to break programs into smaller, reusable parts. There are two types of functions in C: library functions and user-defined functions. User-defined functions make programs easier to understand, debug, test and maintain. Functions are declared with a return type and can accept arguments. Functions can call other functions, allowing for modular and structured program design.
The purpose of types:
• To define what the program should do, e.g. read an array of integers and return a double.
• To guarantee that the program is meaningful: that it does not add a string to an integer, and that variables are declared before they are used.
• To document the programmer's intentions: better than comments, which are not checked by the compiler.
• To optimize the use of hardware: reserve the minimal amount of memory, but not more, and use the most appropriate machine instructions.
The role of the parser and error recovery strategies in compiler design:
This document summarizes error recovery strategies used by parsers. It discusses the role of parsers in validating syntax based on grammars and producing parse trees. It then describes several error recovery strategies like panic-mode recovery, phrase-level recovery using local corrections, adding error productions to the grammar, and global correction aiming to make minimal changes to parse invalid inputs.
The document provides an introduction to compiler construction including:
1. The objectives of understanding how to build a compiler, use compiler construction tools, understand assembly code and virtual machines, and define grammars.
2. An overview of compilers and interpreters including the analysis-synthesis model of compilation where analysis determines operations from the source program and synthesis translates those operations into the target program.
3. An outline of the phases of compilation including preprocessing, compiling, assembling, and linking source code into absolute machine code using tools like scanners, parsers, syntax-directed translation, and code generators.
This document provides information about the CS416 Compiler Design course, including the instructor details, prerequisites, textbook, grading breakdown, course outline, and an overview of the major parts and phases of a compiler. The course will cover topics such as lexical analysis, syntax analysis using top-down and bottom-up parsing, semantic analysis using attribute grammars, intermediate code generation, code optimization, and code generation.
Syntax analysis is the second phase of compiler design after lexical analysis. The parser checks if the input string follows the rules and structure of the formal grammar. It builds a parse tree to represent the syntactic structure. If the input string can be derived from the parse tree using the grammar, it is syntactically correct. Otherwise, an error is reported. Parsers use various techniques like panic-mode, phrase-level, and global correction to handle syntax errors and attempt to continue parsing. Context-free grammars are commonly used with productions defining the syntax rules. Derivations show the step-by-step application of productions to generate the input string from the start symbol.
The document discusses run-time environments and how compilers support program execution through run-time environments. It covers:
1) The compiler cooperates with the OS and system software through a run-time environment to implement language abstractions during execution.
2) The run-time environment handles storage layout/allocation, variable access, procedure linkage, parameter passing and interfacing with the OS.
3) Memory is typically divided into code, static storage, heap and stack areas, with the stack and heap growing towards opposite ends of memory dynamically during execution.
This document discusses compiler design and how compilers work. It begins with prerequisites and definitions of compilers and their origins. It then describes the architecture of compilers, including lexical analysis, parsing, semantic analysis, code optimization, and code generation. It explains how compilers translate high-level code into machine-executable code. In conclusions, it summarizes that compilers translate code without changing meaning and aim to make code efficient. References for further reading on compiler design principles are also provided.
Syntax directed translation allows semantic information to be associated with a formal language by attaching attributes to grammar symbols and defining semantic rules. There are several types of attributes including synthesized and inherited. Syntax directed definitions specify attribute values using semantic rules associated with grammar productions. Evaluation of attributes requires determining an order such as a topological sort of a dependency graph. Syntax directed translation schemes embed program fragments called semantic actions within grammar productions. Actions can be placed inside or at the ends of productions. Various parsing strategies like bottom-up can be used to execute the actions at appropriate times during parsing.
The document discusses stack and heap allocation in a program's memory. The stack stores local variables and temporary data, while the heap is used to allocate memory dynamically when the required size is not known in advance; the two typically grow toward each other from opposite ends of memory. Pointers store the addresses of dynamically allocated variables on the heap. Arrays can also be allocated on the heap with new and freed with delete to release memory. Care must be taken to avoid deleting unallocated pointers or accessing deleted memory.
Code produced by straightforward compiling algorithms can be made to run faster or take less space, or both. This improvement is achieved by program transformations that are traditionally called optimizations; compilers that apply code-improving transformations are called optimizing compilers.
There are several mechanisms for inter-process communication (IPC) in UNIX systems, including message queues, shared memory, and semaphores. Message queues allow processes to exchange data by placing messages into a queue that can be accessed by other processes. Shared memory allows processes to communicate by declaring a section of memory that can be accessed simultaneously. Semaphores are used to synchronize processes so they do not access critical sections at the same time.
Compiler construction tools were introduced to aid in the development of compilers. These tools include scanner generators, parser generators, syntax-directed translation engines, and automatic code generators. Scanner generators produce lexical analyzers based on regular expressions to recognize tokens. Parser generators take context-free grammars as input to produce syntax analyzers. Syntax-directed translation engines associate translations with parse trees to generate intermediate code. Automatic code generators take intermediate code as input and output machine language. These tools help automate and simplify the compiler development process.
A parser is a program component that breaks input data into smaller elements according to the rules of a formal grammar. It builds a parse tree representing the syntactic structure of the input based on these grammar rules. There are two main types of parsers: top-down parsers start at the root of the parse tree and work downward, while bottom-up parsers start at the leaves and work upward. Parser generators use attributes like First and Follow to build parsing tables for predictive parsers like LL(1) parsers, which parse input from left to right based on a single lookahead token.
This document provides an overview of parallel programming with OpenMP. It discusses how OpenMP allows users to incrementally parallelize serial C/C++ and Fortran programs by adding compiler directives and library functions. OpenMP is based on the fork-join model where all programs start as a single thread and additional threads are created for parallel regions. Core OpenMP elements include parallel regions, work-sharing constructs like #pragma omp for to parallelize loops, and clauses to control data scoping. The document provides examples of using OpenMP for tasks like matrix-vector multiplication and numerical integration. It also covers scheduling, handling race conditions, and other runtime functions.
The document discusses the role and implementation of a lexical analyzer in compilers. A lexical analyzer is the first phase of a compiler that reads source code characters and generates a sequence of tokens. It groups characters into lexemes and determines the tokens based on patterns. A lexical analyzer may need to perform lookahead to unambiguously determine tokens. It associates attributes with tokens, such as symbol table entries for identifiers. The lexical analyzer and parser interact through a producer-consumer relationship using a token buffer.
To make this comparison we need to first consider the problem that both approaches help us to solve. When programming any system you are essentially dealing with data and the code that changes that data. These two fundamental aspects of programming are handled quite differently in procedural systems compared with object oriented systems, and these differences require different strategies in how we think about writing code.
The document provides an overview of compilers by discussing:
1. Compilers translate source code into executable target code by going through several phases including lexical analysis, syntax analysis, semantic analysis, code optimization, and code generation.
2. An interpreter directly executes source code statement by statement while a compiler produces target code as translation. Compiled code generally runs faster than interpreted code.
3. The phases of a compiler include a front end that analyzes the source code and produces intermediate code, and a back end that optimizes and generates the target code.
The document provides information about regular expressions and finite automata. It discusses how regular expressions are used to describe programming language tokens. It explains how regular expressions map to languages and the basic operations used to build regular expressions like concatenation, alternation, and Kleene closure. The document also discusses deterministic finite automata (DFAs), non-deterministic finite automata (NFAs), and algorithms for converting regular expressions to NFAs and DFAs. It covers minimizing DFAs and using finite automata for lexical analysis in scanners.
This document summarizes key topics in intermediate code generation discussed in Chapter 6, including:
1) Variants of syntax trees like DAGs are introduced to share common subexpressions. Three-address code is also discussed where each instruction has at most three operands.
2) Type checking and type expressions are covered, along with translating expressions and statements to three-address code. Control flow statements like if/else are also translated using techniques like backpatching.
3) Backpatching allows symbolic labels in conditional jumps to be resolved by a later pass that inserts actual addresses, avoiding an extra pass. This and other control flow translation topics are covered.
The document discusses the different phases of a compiler: lexical analysis, syntax analysis, semantic analysis, intermediate code generation, code optimization, and code generation. It explains that a compiler takes source code as input and translates it into an equivalent language. The compiler performs analysis and synthesis in multiple phases, with each phase transforming the representation of the source code. Key activities include generating tokens, building a syntax tree, type checking, generating optimized intermediate code, and finally producing target machine code. Symbol tables are also used to store identifier information as the compiler runs.
The document provides an introduction to compiler design, including:
- A compiler converts a program written in a high-level language into machine code. It can run on a different machine than the target.
- Language processing systems like compilers transform high-level code into a form usable by machines through a series of translations.
- A compiler analyzes source code in two main phases - analysis and synthesis. The analysis phase creates an intermediate representation, and the synthesis phase generates target code from that.
The compiler is software that converts source code written in a high-level language into machine code. It works in two major phases - analysis and synthesis. The analysis phase performs lexical analysis, syntax analysis, and semantic analysis to generate an intermediate representation from the source code. The synthesis phase performs code optimization and code generation to create the target machine code from the intermediate representation. The compiler uses various components like a symbol table, parser, and code generator to perform this translation.
This document provides an overview of the key components and phases of a compiler. It discusses that a compiler translates a program written in a source language into an equivalent program in a target language. The main phases of a compiler are lexical analysis, syntax analysis, semantic analysis, intermediate code generation, code optimization, code generation, and symbol table management. Each phase performs important processing that ultimately results in a program in the target language that is equivalent to the original source program.
Compilation is the translation of a program written in a source language into a semantically equivalent program written in a target language. The compiler also reports to its users the presence of errors in the source program.
The document provides an introduction to compilers, including definitions of key terms like compiler, interpreter, assembler, translator, and phases of compilation like lexical analysis, syntax analysis, semantic analysis, code generation, and optimization. It also discusses compiler types like native compilers, cross compilers, source-to-source compilers, and just-in-time compilers. The phases of a compiler include breaking down a program, generating intermediate code, optimizing, and creating target code.
The document provides an overview of the compilation process and the different phases involved in compiler construction. It can be summarized as follows:
1. A compiler translates a program written in a source language into an equivalent program in a target language. It performs analysis, synthesis and error checking during this translation process.
2. The major phases of a compiler include lexical analysis, syntax analysis, semantic analysis, intermediate code generation, code optimization, code generation and linking. Tools like Lex and Yacc are commonly used to generate lexical and syntax analyzers.
3. Regular expressions are used to specify patterns for tokens during lexical analysis. A lexical analyzer reads the source program and generates a sequence of tokens by matching character sequences to patterns.
The document summarizes the key phases of a compiler:
1. The compiler takes source code as input and goes through several phases including lexical analysis, syntax analysis, semantic analysis, code optimization, and code generation to produce machine code as output.
2. Lexical analysis converts the source code into tokens, syntax analysis checks the grammar and produces a parse tree, and semantic analysis validates meanings.
3. Code optimization improves the intermediate code before code generation translates it into machine instructions.
Pros and cons of C as a compiler language:
A computer system is made of hardware and software. The hardware understands instructions only in the form of electronic charge, that is, binary language. Programs written in a high-level language are therefore fed into a series of tools and OS components to obtain the desired machine language. This is known as a language processing system.
This document provides an introduction to programming concepts such as algorithms, pseudocode, and flowcharts. It defines computer programming as the process of writing code to instruct a computer, and explains that programming languages allow users to communicate instructions to computers. The document outlines different types of computer languages including low-level languages like machine language and assembly language, and high-level languages like procedural, functional, and object-oriented languages. It also discusses specialized languages, translator programs, and program logic design tools for solving problems algorithmically through pseudocode and flowcharts.
This document provides an overview of compilers, including their structure and purpose. It discusses:
- What a compiler is and its main functions of analysis and synthesis.
- The history and need for compilers, from early assembly languages to modern high-level languages.
- The structure of a compiler, including lexical analysis, syntax analysis, semantic analysis, code optimization, and code generation.
- Different types of translators like interpreters, assemblers, and linkers.
- Tools that help in compiler construction like scanner generators, parser generators, and code generators.
We have learnt that any computer system is made of hardware and software. The hardware understands a language that humans cannot easily understand, so we write programs in a high-level language, which is easier for us to understand and remember. These programs are then fed into a series of tools and OS components to get the desired code that can be used by the machine. This is known as a language processing system.
This document provides an overview of compiler design and the phases of a compiler. It discusses how compilers translate programs written in high-level languages into machine-executable code. The main phases of a compiler are lexical analysis, syntax analysis, code generation, and optional optimization phases. Lexical analysis breaks the source code into tokens. Syntax analysis checks for errors and determines the program structure. Code generation translates the program into machine code. Optimization aims to improve efficiency. Interpreters execute programs line-by-line rather than generating machine code.
This document provides an introduction to compilers, including definitions of key terms like translator, compiler, interpreter, and assembler. It describes the main phases of compilation as lexical analysis, syntax analysis, semantic analysis, intermediate code generation, code optimization, and code generation. It also discusses related concepts like the front-end and back-end of a compiler, multi-pass compilation, and different types of compilers.
Introduction to programming languages (basic):
This document provides an introduction to programming topics including algorithms, pseudocode, flowcharts, programming languages, compilers, interpreters, testing, debugging and documentation. It discusses the basic model of computation involving understanding requirements, inputs/outputs, designing program layout and output, selecting techniques, and testing. Algorithms are defined as ordered sequences of operations to solve a problem. Pseudocode and flowcharts are used to represent program logic without real syntax. Programming languages are categorized as low-level (machine code) or high-level, with compilers and interpreters used to translate high-level languages. Testing and debugging involve inputting data to find and fix errors. Documentation records the development process for users.
Compilers, interpreters, linkers, preprocessors, and virtual machines are tools used in the software development process. Compilers translate source code to machine code while interpreters translate each line to an intermediate form before executing. Linkers combine object files into an executable. Preprocessors modify source code based on directives. Virtual machines run isolated software environments like physical computers through encapsulation and hardware independence. They provide benefits such as portability, security, and allowing multiple environments on one machine.
The document discusses compilers and their role in translating high-level programming languages into machine-readable code. It notes that compilers perform several key functions: lexical analysis, syntax analysis, generation of an intermediate representation, optimization of the intermediate code, and finally generation of assembly or machine code. The compiler allows programmers to write code in a high-level language that is easier for humans while still producing efficient low-level code that computers can execute.
This document provides an overview of compiler design, including:
- The history and importance of compilers in translating high-level code to machine-level code.
- The main components of a compiler including the front-end (analysis), back-end (synthesis), and tools used in compiler construction.
- Key phases of compilation like lexical analysis, syntax analysis, semantic analysis, code optimization, and code generation.
- Types of translators like interpreters, assemblers, cross-compilers and their functions.
- Compiler construction tools that help generate scanners, parsers, translation engines, code generators, and data flow analysis.
The document describes the phases of a compiler. It discusses lexical analysis, syntax analysis, semantic analysis, intermediate code generation, code optimization and code generation.
Lexical analysis scans the source code and returns tokens. Syntax analysis builds an abstract syntax tree from tokens using a context-free grammar. Semantic analysis checks for semantic errors and annotates the tree with types. Intermediate code generation converts the syntax tree to an intermediate representation like 3-address code. Code generation outputs machine or assembly code from the intermediate code.
2. Introduction
• In order to reduce the complexity of designing and building computers, they are made to execute relatively simple commands.
• Combining these very simple commands into a program is called programming in machine language.
• Since this is a tedious and error-prone process, most programming is instead done using a high-level programming language.
• This language can be very different from the machine language that the computer can execute, so some means of bridging the gap is required.
• This is where the compiler comes in.
3. Contd.,
• Using a high-level language for programming has a large impact on how fast programs can be developed. The main reasons for this are:
• Compared to machine language, the notation used by programming languages is closer to the way humans think about problems.
• The compiler can spot some obvious programming mistakes.
• Programs written in a high-level language tend to be shorter than equivalent programs written in machine language.
• Another advantage of using a high-level language is that the same program can be compiled to many different machine languages and, hence, run on many different machines.
4. Translator
• A program written in a high-level language is called source code. To convert the source code into machine code, translators are needed.
• A translator takes a program written in a source language as input and converts it into a program in a target language as output.
• It also detects and reports errors during translation.
The roles of a translator are:
• Translating the high-level language program input into an equivalent machine language program.
• Providing diagnostic messages wherever the programmer violates the specification of the high-level language.
5. Different types of translators
Compiler
• A compiler is a translator used to convert programs in a high-level language to a low-level language. It translates the entire program and also reports the errors in the source program encountered during the translation.
Interpreter
• An interpreter is a translator used to convert programs in a high-level language to a low-level language. An interpreter translates line by line and reports an error as soon as it is encountered during the translation process.
• It directly executes the operations specified in the source program on the input given by the user.
• It gives better error diagnostics than a compiler.
6. Assembler
• An assembler is a translator used to translate assembly language code into machine language code.
7. Compiler vs. Interpreter

S.No. | Compiler | Interpreter
1 | Performs the translation of a program as a whole. | Performs statement-by-statement translation.
2 | Execution is faster. | Execution is slower.
3 | Requires more memory, as linking is needed for the generated intermediate object code. | Memory usage is efficient, as no intermediate object code is generated.
4 | Debugging is hard, as error messages are generated only after scanning the entire program. | Stops translation when the first error is met; hence debugging is easy.
5 | Programming languages like C and C++ use compilers. | Programming languages like Python, BASIC, and Ruby use interpreters.
8. Why learn about compilers?
a) It is considered a topic that you should know in order to be "well-cultured" in computer science.
b) A good craftsman should know his tools, and compilers are important tools for programmers and computer scientists.
c) The techniques used for constructing a compiler are useful for other purposes as well.
d) There is a good chance that a programmer or computer scientist will need to write a compiler or interpreter for a domain-specific language.
9. Compilers and Interpreters
• "Compilation": translation of a program written in a source language into a semantically equivalent program written in a target language.
• [Diagram: the source program enters the compiler, which emits the target program and error messages; the target program then maps input to output.]
11. The Analysis-Synthesis Model of Compilation
• There are two parts to compilation:
– Analysis determines the operations implied by the source program, which are recorded in a tree structure.
– Synthesis takes the tree structure and translates the operations therein into the target program.
12. Other Tools that Use the Analysis-Synthesis Model
• Editors (syntax highlighting)
• Pretty printers (e.g. Doxygen)
• Static checkers (e.g. Lint and Splint)
• Interpreters
• Text formatters (e.g. TeX and LaTeX)
• Silicon compilers (e.g. VHDL)
• Query interpreters/compilers (databases)
14. Language Processing System
Preprocessor:
• A preprocessor, generally considered part of the compiler, is a tool that produces input for compilers.
• It deals with macro processing, augmentation, file inclusion, language extension, etc.
Compiler: A compiler translates a high-level language into low-level machine language.
Assembler:
• An assembler translates assembly language programs into machine code.
• The output of an assembler is called an object file, which contains a combination of machine instructions as well as the data required to place these instructions in memory.
15. Contd.,
Loader:
• The loader is a part of the operating system and is responsible for loading executable files into memory and executing them.
• It calculates the size of a program's instructions and data and creates memory space for it.
• Cross-compiler: A compiler that runs on platform A and is capable of generating executable code for platform B is called a cross-compiler.
• Source-to-source compiler: A compiler that takes the source code of one programming language and translates it into the source code of another programming language is called a source-to-source compiler.
16. Example
How a program, using a C compiler, is executed on a host machine:
• The user writes a program in the C language (a high-level language).
• The C compiler compiles the program and translates it to an assembly program (a low-level language).
• An assembler then translates the assembly program into machine code (an object file).
• A linker tool is used to link all the parts of the program together for execution (executable machine code).
• A loader loads all of them into memory, and then the program is executed.
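To make the sequence concrete, here is a minimal sketch: a one-line C program, with the commands a typical GNU toolchain would use for each stage noted in the comments. The command names are an assumption about the host toolchain, not something these slides specify.

/* hello.c - a minimal program to trace through the pipeline above.
 * Assuming a typical GNU toolchain on the host (an assumption):
 *   cc -S hello.c          produces hello.s  (compiler: C -> assembly)
 *   as hello.s -o hello.o                    (assembler: assembly -> object file)
 *   cc hello.o -o hello                      (driver invokes the linker)
 *   ./hello                                  (the OS loader places it in memory and runs it)
 */
#include <stdio.h>

int main(void)
{
    printf("hello\n");
    return 0;
}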
17. The Grouping of Phases
• Compiler front and back ends:
– Front end: analysis (machine independent)
– Back end: synthesis (machine dependent)
• Compiler passes:
– A collection of phases may be done only once (single pass) or multiple times (multi-pass).
• Single pass: usually requires everything to be defined before being used in the source program.
• Multi-pass: the compiler may have to keep the entire program representation in memory.
18. Phases of compiler
• Lexical Analysis
A token is a string of characters, categorized according to the rules as a symbol (e.g. IDENTIFIER, NUMBER, COMMA, etc.).
The process of forming tokens from an input stream of characters is called tokenization, and the lexer categorizes them according to symbol type.
A token can look like anything that is useful for processing an input text stream or text file.
19. Contd.,
• A lexical analyzer generally does nothing with combinations of tokens, a task left for a parser. For example, a typical lexical analyzer recognizes parentheses as tokens but does nothing to ensure that each '(' is matched with a ')'.
• In a compiler, linear analysis is called lexical analysis or scanning. For example, in lexical analysis the characters in the assignment statement position := initial + rate ∗ 60 would be grouped into the following tokens:
1. The identifier position
2. The assignment symbol :=
3. The identifier initial
4. The plus sign +
5. The identifier rate
6. The multiplication sign ∗
7. The number 60
The blanks separating the characters of these tokens would be eliminated.
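To make tokenization concrete, here is a minimal sketch of such a scanner in C. The token names, the fixed input string, and the single-statement scope are assumptions made for this illustration, not part of the original slides.

#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Illustrative sketch only: token categories for the tiny assignment
 * statement discussed above. */
enum token { ID, ASSIGN, PLUS, TIMES, NUMBER, END };

static const char *src;          /* current position in the input   */
static char lexeme[64];          /* text of the last token returned */

/* Return the next token, skipping blanks (the "eliminated" whitespace). */
enum token next_token(void)
{
    while (*src == ' ' || *src == '\t') src++;   /* discard blanks */
    if (*src == '\0') return END;
    if (isalpha((unsigned char)*src)) {          /* identifier: letter (letter|digit)* */
        int n = 0;
        while (isalnum((unsigned char)*src)) lexeme[n++] = *src++;
        lexeme[n] = '\0';
        return ID;
    }
    if (isdigit((unsigned char)*src)) {          /* number: digit+ */
        int n = 0;
        while (isdigit((unsigned char)*src)) lexeme[n++] = *src++;
        lexeme[n] = '\0';
        return NUMBER;
    }
    if (src[0] == ':' && src[1] == '=') { src += 2; strcpy(lexeme, ":="); return ASSIGN; }
    if (*src == '+') { src++; strcpy(lexeme, "+"); return PLUS; }
    if (*src == '*') { src++; strcpy(lexeme, "*"); return TIMES; }
    src++;                                       /* skip anything unrecognized */
    return next_token();
}

int main(void)
{
    src = "position := initial + rate * 60";
    static const char *names[] = { "ID", "ASSIGN", "PLUS", "TIMES", "NUMBER", "END" };
    for (enum token t = next_token(); t != END; t = next_token())
        printf("%-6s  %s\n", names[t], lexeme);  /* e.g. "ID      position" */
    return 0;
}

Running it prints one line per token (ID position, ASSIGN :=, and so on), with the blanks discarded exactly as described above.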
20. Syntax Analysis
• Hierarchical analysis is called parsing or syntax analysis.
• It involves grouping the tokens of the source program into grammatical phrases that are used by the compiler to synthesize output.
• Usually, the grammatical phrases of the source program are represented by a parse tree.
21.
• The hierarchical structure of a program is usually expressed by recursive rules.
• For example, we might have the following rules as part of the definition of expressions:
1. Any identifier is an expression.
2. Any number is an expression.
3. If expression1 and expression2 are expressions, then so are:
expression1 + expression2
expression1 ∗ expression2
(expression1)
• Rules (1) and (2) are non-recursive basic rules, while (3) defines expressions in terms of operators applied to other expressions. Thus, by rule (1), initial and rate are expressions. By rule (2), 60 is an expression, while by rule (3), we can first infer that rate ∗ 60 is an expression and finally that initial + rate ∗ 60 is an expression.
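These three rules map directly onto a recursive-descent recognizer. Below is a minimal C sketch; the function-per-rule structure and the character-level "tokens" are assumptions for the example (a real parser would consume tokens from the lexical analyzer), and the usual precedence of ∗ over + is built in.

#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>

/* Illustrative sketch only. Recognizer for:
 *   expr   -> term ('+' term)*
 *   term   -> factor ('*' factor)*
 *   factor -> identifier | number | '(' expr ')'
 * i.e. rules (1)-(3) above with * binding tighter than +. */
static const char *p;            /* cursor into the input string */

static void expr(void);

static void error(void) { printf("syntax error at '%s'\n", p); exit(1); }

static void factor(void)
{
    if (isalpha((unsigned char)*p)) {            /* rule (1): identifier */
        while (isalnum((unsigned char)*p)) p++;
    } else if (isdigit((unsigned char)*p)) {     /* rule (2): number */
        while (isdigit((unsigned char)*p)) p++;
    } else if (*p == '(') {                      /* rule (3): (expression) */
        p++; expr();
        if (*p != ')') error();
        p++;
    } else error();
}

static void term(void)                           /* rule (3): e1 * e2 */
{
    factor();
    while (*p == '*') { p++; factor(); }
}

static void expr(void)                           /* rule (3): e1 + e2 */
{
    term();
    while (*p == '+') { p++; term(); }
}

int main(void)
{
    p = "initial+rate*60";                       /* no blanks, for simplicity */
    expr();
    if (*p == '\0') printf("valid expression\n");
    else error();
    return 0;
}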
22. Semantic Analysis
• The semantic analysis phase checks the source program for semantic errors and gathers type information for the subsequent code-generation phase.
• It uses the hierarchical structure determined by the syntax-analysis phase to identify the operators and operands of expressions and statements.
• An important component of semantic analysis is type checking.
• Here the compiler checks that each operator has operands that are permitted by the source language specification.
• For example, many programming language definitions require a compiler to report an error every time a real number is used to index an array.
• However, the language specification may permit some operand coercions, for example, when a binary arithmetic operator is applied to an integer and a real. In this case, the compiler may need to convert the integer to a real.
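A minimal sketch of this check in C, assuming a two-type system with the integer-to-real coercion described above; the type names, the function name check_arith, and the printed diagnostic are illustrative assumptions, not part of the slides.

#include <stdio.h>

/* Illustrative sketch only: type-checking a binary arithmetic operator,
 * with the integer-to-real coercion described above. */
typedef enum { T_INT, T_REAL, T_ERROR } type_t;

/* Result type of "left op right"; notes where a conceptual inttoreal()
 * coercion would be inserted when the operand types are mixed. */
type_t check_arith(type_t left, type_t right)
{
    if (left == T_ERROR || right == T_ERROR) return T_ERROR;
    if (left == T_INT && right == T_INT) return T_INT;
    if (left != right)                           /* mixed int/real operands */
        printf("coercion: inttoreal() inserted\n");
    return T_REAL;
}

int main(void)
{
    /* rate * 60 with rate declared real: the integer operand is coerced */
    type_t t = check_arith(T_REAL, T_INT);
    printf("result type: %s\n", t == T_REAL ? "real" : "int");
    return 0;
}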
25.
• The six phases of compilation: lexical analysis, syntax analysis, semantic analysis, intermediate code generation, code optimization, and code generation.
• The six phases are divided into two groups:
1. Front end: depends on the stream of tokens and parse tree.
2. Back end: dependent on the target, independent of the source code.
Symbol-Table Management:
• A symbol table is a data structure containing a record for each identifier, with fields for the attributes of the identifier.
• The data structure allows us to find the record for each identifier quickly and to store or retrieve data from that record quickly.
• A symbol table is a data structure in a compiler used for managing information about variables and their attributes.
26. Error Detection and Reporting
• Each phase can encounter errors.
• However, after detecting an error, a phase must somehow deal with that error so that compilation can proceed, allowing further errors in the source program to be detected.
• A compiler that stops when it finds the first error is not as helpful as it could be.
• The syntax and semantic analysis phases usually handle a large fraction of the errors detectable by the compiler.
• The lexical phase can detect errors where the characters remaining in the input do not form any token of the language.
• Errors where the token stream violates the structure rules (syntax) of the language are determined by the syntax analysis phase.
27. Intermediate Code Generation:
• An intermediate representation of the final machine language code is produced.
• This phase bridges the analysis and synthesis phases of translation.
Code Optimization:
• This optional phase improves the intermediate code so that the output runs faster and takes less space.
Code Generation:
• The last phase of translation is code generation. A number of optimizations to reduce the length of the machine language program are carried out during this phase.
• The output of the code generator is the machine language program for the specified computer.
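For the running example position := initial + rate ∗ 60, the intermediate representation might be the classic three-address code; the temporary names t1-t3 and the inttoreal conversion follow the standard textbook treatment and are supplied here for illustration, not reproduced from these slides:

t1 := inttoreal(60)
t2 := id3 * t1
t3 := id2 + t2
id1 := t3

Each instruction has at most one operator and at most three operands, which is what makes this form convenient for optimization and code generation.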
29. Compiler construction tools
1. Parser generators
2. Scanner generators
3. Syntax-directed translation engines
4. Automatic code generators
5. Data-flow analysis engines
6. Compiler-construction toolkits
Parser Generators
• Input: grammatical description of a programming language.
• Output: syntax analyzers.
• A parser generator takes the grammatical description of a programming language and produces a syntax analyzer.
30. Scanner Generators
• Input: regular-expression description of the tokens of a language.
• Output: lexical analyzers.
• A scanner generator generates lexical analyzers from a regular-expression description of the tokens of a language.
Syntax-directed Translation Engines
• Input: parse tree.
• Output: intermediate code.
• Syntax-directed translation engines produce collections of routines that walk a parse tree and generate intermediate code.
31. Automatic Code Generators
• Input: intermediate language.
• Output: machine language.
• A code generator takes a collection of rules that define the translation of each operation of the intermediate language into the machine language for a target machine.
Data-flow Analysis Engines
• A data-flow analysis engine gathers information about the values transmitted from one part of a program to each of the other parts.
• Data-flow analysis is a key part of code optimization.
Compiler Construction Toolkits
• Compiler construction toolkits provide an integrated set of routines for constructing the various phases of a compiler.
32. Cousins of the Compiler
• Interpreters: discussed in detail in the first lecture.
• Preprocessors: they produce input for the compiler. They perform jobs such as deleting comments, including files, performing macro expansion, etc.
• Assemblers: they are translators for assembly language. Sometimes the compiler will generate assembly language in symbolic form and then hand it over to an assembler.
• Linkers: both compilers and assemblers rely on linkers to collect separately compiled or assembled code from object files into a file that is directly executable.
• Loaders: the loader resolves all relocatable addresses relative to a given base address.
33. Applications of compiler technology
1. Implementation of high-level programming languages
2. Optimizations for computer architectures
• Parallelism
• Memory hierarchies
3. Design of new computer architectures
• RISC
• Specialized architectures
4. Program translations
• Binary translation
• Hardware synthesis
• Database query interpreters
• Compiled simulation
5. Software productivity tools
• Type checking
• Bounds checking
• Memory-management tools
34. Complexity of compiler technology
• A compiler is possibly the most complex system software.
• The complexity arises from the fact that it is required to map a programmer's requirements (in an HLL program) to architectural details.
• It uses algorithms and techniques from a very large number of areas in computer science.
• It translates intricate theory into practice and enables tool building.
35. Compiler Algorithms
Compilers make practical application of:
• Greedy algorithms: register allocation
• Heuristic search: list scheduling
• Graph algorithms: dead-code elimination, register allocation
• Dynamic programming: instruction selection
• Optimization techniques: instruction scheduling
• Finite automata: lexical analysis
• Pushdown automata: parsing
• Fixed-point algorithms: data-flow analysis
• Complex data structures: symbol tables, parse trees, data dependence graphs
• Computer architecture: machine code generation
36. Context-Free Grammars
• Grammars are used to describe the syntax of a programming language. They specify the structure of expressions and statements.
• stmt -> if (expr) then stmt
where stmt denotes a statement and expr denotes an expression.
Types of grammar:
• Type 0 grammar
• Type 1 grammar
• Type 2 grammar
• Type 3 grammar
• Context-free grammar is also called Type 2 grammar.
37. A context-free grammar G is defined by a four-tuple
G = (V, T, P, S)
where,
• G - grammar
• V - set of variables (non-terminals)
• T - set of terminals
• P - set of productions
• S - start symbol
It produces a context-free language (CFL), the collection of strings of terminals derivable from the start symbol of the grammar in multiple steps:
L(G) = { w ∈ T* | S ⇒* w }
where,
• L - language
• G - grammar
• w - input string
• S - start symbol
• T - set of terminals
38. Conventions
Terminals are symbols from which strings are formed:
• Lowercase letters, i.e., a, b, c.
• Operators, i.e., +, -, *.
• Punctuation symbols, i.e., comma, parentheses.
• Digits, i.e., 0, 1, 2, ..., 9.
• Boldface letters, i.e., id, if.
Non-terminals are syntactic variables that denote a set of strings:
• Uppercase letters, i.e., A, B, C.
• Lowercase italic names, i.e., expr, stmt.
39.
• The start symbol is the head of the production stated first in the grammar.
• A production is of the form LHS -> RHS (or) head -> body, where the head contains only one non-terminal and the body contains a collection of terminals and non-terminals.
• (e.g.) Let G be the grammar sketched below.
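The grammar on the original slide is not preserved in this text. A standard illustrative grammar of the same form (an assumption, not the slide's own example) is:

E -> E + T | T
T -> T * F | F
F -> (E) | id

Here V = {E, T, F}, T = {+, *, (, ), id}, P is the set of productions listed, and S = E.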
40. Context-Free Grammars vs. Regular Expressions
• Grammars are more powerful than regular expressions.
• Every construct that can be described by a regular expression can be described by a grammar, but not vice versa.
• Every regular language is a context-free language, but the reverse does not hold.
(e.g.)
• RE = (a | b)*abb (the set of strings ending with abb).
• An equivalent grammar is shown below.
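The grammar on the original slide is not preserved here; the usual textbook grammar for (a | b)*abb (an assumption matching the standard construction) is:

A0 -> a A0 | b A0 | a A1
A1 -> b A2
A2 -> b A3
A3 -> ε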
41. Syntax-directed definition
• A syntax-directed definition specifies the values of attributes by associating semantic rules with the grammar productions.
• It is a context-free grammar with attributes and rules, which are associated with grammar symbols and productions respectively.
• The process of syntax-directed translation is two-fold:
• Construction of the syntax tree.
• Computing the values of attributes at each node by visiting the nodes of the syntax tree.
42. Semantic actions
• Semantic actions are fragments of code which are embedded within production bodies by syntax-directed translation.
• They are usually enclosed within curly braces ({ }).
• An action can occur anywhere in a production, but usually at the end of the production.
• (e.g.) E ---> E1 + T {print '+'}
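Extended to a full scheme, the same idea yields the standard infix-to-postfix translator; this is a textbook example supplied here for illustration, not reproduced from these slides:

E ---> E1 + T { print('+') }
E ---> T
T ---> T1 * F { print('*') }
T ---> F
F ---> digit { print(digit) }

For the input 9 + 5 * 2 the actions fire in the order 9, 5, 2, *, +, printing the postfix form 9 5 2 * +.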
43. Types of translation
L-attributed translation
• It performs translation during parsing itself.
• No need for explicit tree construction.
• L represents "left to right".
S-attributed translation
• It is performed in connection with bottom-up parsing.
• "S" represents "synthesized".
44. Types of attributes
Inherited attributes
• An inherited attribute is defined by a semantic rule associated with the production at the parent of the node.
• Attribute values are computed from the parent of the node, its siblings, and the node itself.
• The non-terminal concerned must be in the body of the production.
Synthesized attributes
• A synthesized attribute is defined by a semantic rule associated with the production at the node.
• Attribute values are computed from the children of the node and the node itself.
• The non-terminal concerned must be in the head of the production.
• Terminals have synthesized attributes, which are the lexical values (denoted lexval) generated by the lexical analyzer.
45. Syntax-directed definition of a simple desk calculator

Production | Semantic rules
L ---> E n | L.val = E.val
E ---> E1 + T | E.val = E1.val + T.val
E ---> T | E.val = T.val
T ---> T1 * F | T.val = T1.val × F.val
T ---> F | T.val = F.val
F ---> (E) | F.val = E.val
F ---> digit | F.val = digit.lexval
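For example, on the input 3 * 5 + 4 n (where n marks the end of the line, as in the production L ---> E n), these rules compute values bottom-up; this trace is added for illustration:

• F.val = 3 and F.val = 5 are copied from digit.lexval.
• T ---> T1 * F gives T.val = 3 × 5 = 15.
• T ---> F gives T.val = 4 for the second operand.
• E ---> E1 + T gives E.val = 15 + 4 = 19.
• L ---> E n gives L.val = 19.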
46. Syntax-directed definition with inherited attributes

Production | Semantic rules
D ---> T L | L.inh = T.type
T ---> int | T.type = integer
T ---> float | T.type = float
L ---> L1, id | L1.inh = L.inh; addType(id.entry, L.inh)
L ---> id | addType(id.entry, L.inh)

• Symbol T is associated with a synthesized attribute type.
• Symbol L is associated with an inherited attribute inh.
47. Types of Syntax-Directed Definitions
S-attributed definitions
• A syntax-directed definition that involves only synthesized attributes is called S-attributed.
• The attribute value for the non-terminal at the head is computed from the attribute values of the symbols in the body of the production.
• The attributes of an S-attributed SDD can be evaluated in bottom-up order over the nodes of the parse tree, i.e., by performing a post-order traversal of the parse tree and evaluating the attributes at a node when the traversal leaves that node for the last time.
48. Example: the desk-calculator definition again; every attribute below is synthesized, so the SDD is S-attributed.

Production | Semantic rules
L ---> E n | L.val = E.val
E ---> E1 + T | E.val = E1.val + T.val
E ---> T | E.val = T.val
T ---> T1 * F | T.val = T1.val × F.val
T ---> F | T.val = F.val
F ---> (E) | F.val = E.val
F ---> digit | F.val = digit.lexval
49. L-attributed Definitions
• A syntax-directed definition in which the edges of the dependency graph for the attributes in a production body can go from left to right, but not from right to left, is called an L-attributed definition.
• Attributes of L-attributed definitions may be either synthesized or inherited.
• If an attribute is inherited, it must be computed from:
• an inherited attribute associated with the production head;
• an inherited or synthesized attribute associated with a symbol located to the left of the attribute being computed; or
• an inherited or synthesized attribute associated with the symbol under consideration, in such a way that no cycles are formed in the dependency graph.
50. Example

Production | Semantic rules
T ---> F T' | T'.inh = F.val
T' ---> * F T1' | T1'.inh = T'.inh × F.val

• In production 1, the inherited attribute T'.inh is computed from the value of F, which is to its left.
• In production 2, the inherited attribute T1'.inh is computed from T'.inh, associated with the production head, and from the value of F, which appears to its left in the production.
• That is, an inherited attribute must be computed either from above or from the left in the SDD.