Syntax Analysis - Compiler Design

What is Syntax Analysis in Compiler Design?

The second phase of the compiler is known as syntax analysis.

The previous chapters showed that tokens are identified by the lexical analyzer. However, the lexical analyzer cannot check the syntax of a given statement because of the limitations of regular expressions: regular expressions cannot check balancing tokens, such as matched parentheses. This phase therefore uses context-free grammar (CFG), which is recognized by push-down automata.

CFG, on the other hand, is a superset of Regular Grammar, as depicted below:

[Figure: Venn diagram showing Regular Grammar as a subset of CFG]

Every regular grammar is also context-free, but some problems exist that are beyond the scope of regular grammars. CFGs are used to describe the syntax of programming languages.

What is Context-Free Grammar (CFG)?

A context-free grammar has four components:

  • A set of non-terminals (V). Non-terminals are syntactic variables that denote sets of strings, and they help define the language generated by the grammar.
  • A set of tokens, known as terminal symbols (Σ). Terminals are the basic symbols from which strings are formed.
  • A set of productions (P). The productions of a grammar specify how the terminals and non-terminals combine to form strings. Each production consists of a non-terminal called the left side of the production, an arrow, and a sequence of tokens and/or non-terminals called the right side of the production.
  • One of the non-terminals is designated as the start symbol (S), from where the production begins.

The strings are derived from the start symbol by repeatedly replacing a non-terminal with the right side of a production for that non-terminal.
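The four components above can be written down directly as a small data structure; a minimal Python sketch (the dictionary layout and the expression grammar used here are illustrative, not a standard representation):

```python
# A CFG represented as plain Python data: the four components V, Σ, P, S.
# The grammar itself (a toy expression grammar) is only for illustration.
grammar = {
    "non_terminals": {"E"},                       # V
    "terminals": {"id", "+", "*"},                # Σ
    # P: left side -> list of right sides, each a tuple of symbols
    "productions": {
        "E": [("E", "+", "E"), ("E", "*", "E"), ("id",)],
    },
    "start_symbol": "E",                          # S
}
```

Every later algorithm in this chapter (derivation, left-recursion removal, FIRST/FOLLOW) can be phrased as operations on a structure like this.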

Example

For instance, consider the palindrome language, which regular expressions cannot describe. That is, L = { w | w = wR } is not a regular language, but it can be described by the following CFG:

G = ( V, Σ, P, S )

Where:

V = { S }, Σ = { 0, 1 }, P = { S → 0S0, S → 1S1, S → 0, S → 1, S → ℇ }, and S is the start symbol.

This grammar describes the palindrome language, with strings such as 1001, 11100111, 00100, 1010101, 11111, etc.
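The grammar can be mirrored by a short recursive recognizer; a sketch assuming the productions S → 0S0 | 1S1 | 0 | 1 | ℇ over the alphabet { 0, 1 }:

```python
# Recursive recognizer that mirrors the palindrome grammar
# S -> 0S0 | 1S1 | 0 | 1 | epsilon.  A sketch, not a general CFG parser.
def derives_palindrome(w: str) -> bool:
    if len(w) <= 1:                      # S -> epsilon, S -> 0, S -> 1
        return set(w) <= {"0", "1"}
    # S -> 0S0 or S -> 1S1: the outer symbols must match,
    # and the inner part must itself be derivable from S
    return w[0] == w[-1] and w[0] in "01" and derives_palindrome(w[1:-1])
```

Each recursive call peels off one production application, so the call chain is exactly a derivation in the grammar.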

What is a Syntax Analyzer in Compiler Design?

The syntax analyzer takes its input from the lexical analyzer as a stream of tokens. The parser analyzes the token stream against the production rules in order to detect errors in the code; the outcome of this phase is a parse tree.

[Figure: Syntax Analyzer]

Thus the parser accomplishes two tasks: parsing the code and generating a parse tree as output.

Parsers are expected to parse the whole code even if some errors exist in the program, and they use error-recovery strategies to do so.

What is a Derivation in Compiler Design?

A derivation is the sequence of production rules used to obtain the input string. During parsing, two decisions are taken for each sentential form of the input:

  • Deciding the non-terminal which is to be replaced.
  • Deciding the production rule by which the non-terminal will be replaced.

To decide which non-terminal to replace with a production rule, two options are available:

Left-most Derivation

In left-most derivation, the sentential form of an input is scanned and replaced from left to right. The sentential form derived by a left-most derivation is called the left-sentential form.

Right-most Derivation

In right-most derivation, the sentential form of an input is scanned and replaced from right to left. The sentential form derived by a right-most derivation is called the right-sentential form.

Example

Production rules:

E → E + E
E → E * E
E → id

Input string: id + id * id

The left-most derivation is:

E → E * E
E → E + E * E
E → id + E * E
E → id + id * E
E → id + id * id

Notice that the left-most non-terminal is always processed first.

The right-most derivation is:

E → E + E
E → E + E * E
E → E + E * id
E → E + id * id
E → id + id * id

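Both kinds of derivation can be replayed mechanically by rewriting the sentential form as a string; a sketch assuming the productions E → E + E | E * E | id (the rewrite helper is illustrative, not a standard API):

```python
# Replaying the two derivations of "id + id * id" by plain string rewriting.
# rewrite() replaces the left-most (or right-most) occurrence of "E" with rhs.
def rewrite(sentential, rhs, leftmost=True):
    if leftmost:
        i = sentential.index("E")
    else:
        i = len(sentential) - 1 - sentential[::-1].index("E")
    return sentential[:i] + rhs + sentential[i + 1:]

left = ["E"]
for rhs in ["E * E", "E + E", "id", "id", "id"]:
    left.append(rewrite(left[-1], rhs, leftmost=True))

right = ["E"]
for rhs in ["E + E", "E * E", "id", "id", "id"]:
    right.append(rewrite(right[-1], rhs, leftmost=False))

print(left[-1], "|", right[-1])   # both derivations reach: id + id * id
```

The two lists record different sequences of sentential forms, yet both end at the same input string, which is exactly why a single string can have several derivations.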
What is a Parse Tree in Compiler Design?

A parse tree is a graphical representation of a derivation. It is convenient for seeing how strings are derived from the start symbol, which becomes the root of the parse tree.

For instance, take the left-most derivation of id + id * id.

The left-most derivation is:

Step 1: E → E * E

Step 2: E → E + E * E

Step 3: E → id + E * E

Step 4: E → id + id * E

Step 5: E → id + id * id

[Figure: the parse tree growing at each step of the derivation]

In a parse tree:

  • All leaf nodes are terminals.
  • All interior nodes are non-terminals.
  • In-order traversal gives the original input string.

The parse tree depicts the associativity and precedence of the operators. The deepest sub-tree is traversed first, so the operator in that sub-tree gets precedence over the operator in the parent nodes.
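The bullet points above can be checked on the parse tree of id + id * id; a minimal sketch in which a node is a tuple (label, children), interior nodes carry non-terminals, and reading the leaves left to right recovers the input:

```python
# Parse tree for "id + id * id": root E -> E * E, left child E -> E + E.
# Leaves (nodes with no children) are terminals.
tree = ("E", [
    ("E", [("E", [("id", [])]), ("+", []), ("E", [("id", [])])]),
    ("*", []),
    ("E", [("id", [])]),
])

def leaves(node):
    """Collect the terminal leaves left to right (the tree's yield)."""
    label, children = node
    if not children:
        return [label]
    out = []
    for child in children:
        out.extend(leaves(child))
    return out

print(" ".join(leaves(tree)))   # -> id + id * id
```

Note that the + sits in the deeper sub-tree here, so this particular tree groups the expression as (id + id) * id; a grammar with proper precedence would instead put * in the deeper sub-tree.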

What is Ambiguity in Compiler Design?

A grammar is said to be ambiguous if it has more than one parse tree (left or right derivation) for at least one string.

Example

For the string id + id – id, two parse trees are generated by the above grammar.

[Figure: the two parse trees for id + id – id]

A language for which every grammar is ambiguous is known as inherently ambiguous. Ambiguous grammars are not suitable for compiler construction. Ambiguity cannot be detected and removed automatically, but it can be removed either by re-writing the whole grammar without ambiguity, or by setting and following associativity and precedence constraints.

What is Associativity in Compiler Design?

If an operand has operators on both of its sides, the side on which the operand is taken is decided by the associativity of those operators. If the operation is left-associative, the operand is taken by the left operator; if the operation is right-associative, the operand is taken by the right operator.

Example

Operations such as addition, multiplication, subtraction, and division are left-associative. If the expression contains:

id op id op id

it will be evaluated as:

(id op id) op id

For instance, (id + id) + id.

Operations like exponentiation are right-associative, i.e., the order of evaluation of the same expression will be:

id op (id op id)

For instance, id ^ (id ^ id).

What is Precedence in Compiler Design?

If a common operand is shared by two different operators, the decision on which operator takes the operand is made by precedence. For instance, 2 + 3 * 4 can have two different parse trees, one corresponding to (2 + 3) * 4 and another corresponding to 2 + (3 * 4). Setting precedence among the operators resolves the choice. Mathematically, * (multiplication) has precedence over + (addition), so the expression 2 + 3 * 4 is always interpreted as 2 + (3 * 4).

These methods decrease the chances of ambiguity in a language and its grammar.
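Precedence and associativity can both be baked into a hand-written evaluator; a minimal sketch (function names are illustrative) in which * is handled at a deeper level than +, so 2 + 3 * 4 comes out as 2 + (3 * 4) = 14:

```python
# A tiny two-level evaluator: * binds tighter than +, both left-associative.
# Tokens are strings; no error handling, purely for illustration.
def evaluate(tokens):
    def parse_expr(pos):                  # lowest precedence: +
        value, pos = parse_term(pos)
        while pos < len(tokens) and tokens[pos] == "+":
            rhs, pos = parse_term(pos + 1)
            value += rhs                  # left-associative: fold as we go
        return value, pos

    def parse_term(pos):                  # higher precedence: *
        value, pos = int(tokens[pos]), pos + 1
        while pos < len(tokens) and tokens[pos] == "*":
            value *= int(tokens[pos + 1])
            pos += 2
        return value, pos

    return parse_expr(0)[0]

print(evaluate(["2", "+", "3", "*", "4"]))   # 14, i.e. 2 + (3 * 4)
```

The nesting of the two functions mirrors the nesting of sub-trees in the parse tree: the deeper function (terms with *) is evaluated before its caller (expressions with +).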

What is Left Recursion in Compiler Design?

A grammar becomes left-recursive if it contains any non-terminal A whose derivation contains A itself as the left-most symbol. Left-recursive grammars are problematic for top-down parsers: when the parser encounters the same non-terminal at the left of its own derivation, it cannot decide when to stop expanding that non-terminal and may go into an infinite loop.

Example:

  • A → Aα | β is an example of immediate left recursion, where A is any non-terminal symbol and α represents a string of terminals and non-terminals.
  • S → Aα | β, A → Sd is an example of indirect left recursion.

[Figure: Left Recursion]

A top-down parser will first parse A, which in turn yields a string consisting of A itself, and the parser may go into a loop forever.

Removal of Left Recursion

The following technique is used to remove immediate left recursion. The production

A → Aα | β

is converted into the following productions:

A → βA'
A' → αA' | ℇ

The strings derived from the grammar are not impacted by this, but the immediate left recursion is removed.

The second method is to use an algorithm that eliminates all direct and indirect left recursion: arrange the non-terminals in some order A1, A2, …, An; for each Ai, replace every production of the form Ai → Aj𝜸 with j < i by Ai → δ1𝜸 | δ2𝜸 | …, where Aj → δ1 | δ2 | … are the current Aj-productions; then eliminate the immediate left recursion among the Ai-productions with the first technique.

Example

The production set

S → Aα | β
A → Sd

after applying the above algorithm becomes

S → Aα | β
A → Aαd | βd

and then, by using the first technique, the left recursion is removed:

A → βdA'
A' → αdA' | ℇ

Now none of the productions has either direct or indirect left recursion.
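The conversion of A → Aα | β into A → βA', A' → αA' | ℇ can be mechanized; a sketch in which productions are tuples of symbols, () stands for ℇ, and the fresh non-terminal is named (arbitrarily) by appending a prime:

```python
# Eliminate immediate left recursion for one non-terminal:
#   A -> A a | b   becomes   A -> b A' ;  A' -> a A' | epsilon.
def eliminate_immediate_left_recursion(nt, productions):
    # split alternatives into the recursive ones (A -> A a) and the rest
    recursive = [p[1:] for p in productions if p and p[0] == nt]   # the "a" parts
    others = [p for p in productions if not p or p[0] != nt]       # the "b" parts
    if not recursive:
        return {nt: productions}          # nothing to do
    fresh = nt + "'"                      # fresh non-terminal A'
    return {
        nt: [b + (fresh,) for b in others],                  # A  -> b A'
        fresh: [a + (fresh,) for a in recursive] + [()],     # A' -> a A' | ()
    }

result = eliminate_immediate_left_recursion("A", [("A", "a"), ("b",)])
# result: {"A": [("b", "A'")], "A'": [("a", "A'"), ()]}
```

The recursion is turned from the left end to the right end of the productions, which is harmless for a top-down parser.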

What is Left Factoring in Compiler Design?

If more than one grammar production rule has a common prefix string, the parser cannot make a choice as to which of the productions it should take to parse the string in hand.

Example

If a top-down parser encounters a production like

A → αβ | α𝜸 | …

it cannot determine which production to follow, as both productions start from the same symbol. To remove this confusion, a technique called left factoring is used.

Left factoring transforms the grammar to make it useful for top-down parsers. In this technique, one production is made for each common prefix, and the rest of the derivation is added by new productions.

Example

The above productions can be written as

A → αA'
A' → β | 𝜸 | …

Now the parser has only one production per prefix, which makes taking decisions easier.
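Left factoring of alternatives with a common prefix can be sketched the same way as left-recursion removal (productions as tuples of symbols, a primed fresh non-terminal; all names are illustrative):

```python
# Left-factor alternatives that share a common prefix:
#   A -> a b | a c   becomes   A -> a A' ;  A' -> b | c.
def common_prefix(a, b):
    i = 0
    while i < len(a) and i < len(b) and a[i] == b[i]:
        i += 1
    return a[:i]

def left_factor(nt, alternatives):
    # longest prefix shared by ALL alternatives (a simplification: real
    # left factoring groups alternatives by prefix)
    prefix = alternatives[0]
    for alt in alternatives[1:]:
        prefix = common_prefix(prefix, alt)
    if not prefix:
        return {nt: alternatives}         # nothing shared: leave as is
    fresh = nt + "'"
    return {
        nt: [prefix + (fresh,)],                          # A  -> prefix A'
        fresh: [alt[len(prefix):] for alt in alternatives],  # () = epsilon
    }

result = left_factor("A", [("a", "b"), ("a", "c")])
# result: {"A": [("a", "A'")], "A'": [("b",), ("c",)]}
```

After the transformation the parser only has to look at the suffixes under A' to decide, which is exactly the decision the original grammar did not allow.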

What are First and Follow Sets in Compiler Design?

An important part of parser table construction is the creation of the FIRST and FOLLOW sets. These sets provide the actual position of any terminal in a derivation and are used while creating the parsing table.

First Set

This set is created to know which terminal symbol is derived in the first position by a non-terminal. For instance,

α → t β

That is, α derives t (a terminal) in the very first position. So t ∈ FIRST(α).

ALGORITHM FOR CALCULATING FIRST SET

The definition of the FIRST(α) set:

  • if α is a terminal, then FIRST(α) = { α }.
  • if α is a non-terminal and α → ℇ is a production, then ℇ ∈ FIRST(α).
  • if α is a non-terminal and α → 𝜸1 𝜸2 𝜸3 … 𝜸n is a production, then t ∈ FIRST(α) if t ∈ FIRST(𝜸i) for some i and ℇ ∈ FIRST(𝜸j) for every j < i.

The FIRST set can be seen as:

FIRST(α) = { t | α →* t β } ∪ { ℇ | α →* ℇ }
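The three rules above can be run to a fixed point; a sketch for a small illustrative grammar, where the empty string "" plays the role of ℇ and () is an ℇ-production:

```python
# Fixed-point computation of FIRST sets.  Symbols are strings; a symbol is a
# non-terminal iff it is a key of the grammar dict.  Toy grammar:
#   E -> T X ;  X -> + T X | epsilon ;  T -> id
GRAMMAR = {
    "E": [("T", "X")],
    "X": [("+", "T", "X"), ()],
    "T": [("id",)],
}

def first_sets(grammar):
    first = {nt: set() for nt in grammar}
    changed = True
    while changed:                        # iterate until nothing new is added
        changed = False
        for nt, prods in grammar.items():
            for prod in prods:
                before = len(first[nt])
                nullable = True
                for sym in prod:
                    if sym in grammar:                    # non-terminal
                        first[nt] |= first[sym] - {""}
                        if "" not in first[sym]:
                            nullable = False
                            break
                    else:                                 # terminal
                        first[nt].add(sym)
                        nullable = False
                        break
                if nullable:              # whole production can derive epsilon
                    first[nt].add("")
                if len(first[nt]) != before:
                    changed = True
    return first
```

On this grammar FIRST(E) = FIRST(T) = { id } and FIRST(X) = { +, ℇ }, matching what the rules give by hand.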

Follow Set

The FOLLOW set collects the terminal symbols that can appear immediately to the right of a non-terminal in the production rules. It concentrates only on the next terminal symbol that can follow the productions of a non-terminal.

ALGORITHM FOR CALCULATING FOLLOW SET:

  • if α is the start symbol, then $ ∈ FOLLOW(α).
  • if α is a non-terminal and has a production α → AB, then everything in FIRST(B) except ℇ is in FOLLOW(A).
  • if α is a non-terminal and has a production α → AB, where B → ℇ, then everything in FOLLOW(α) is in FOLLOW(A).

The FOLLOW set can be seen as: FOLLOW(α) = { t | S →* β α t δ }
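The FOLLOW rules can also be computed to a fixed point; a sketch for a small illustrative grammar whose FIRST sets are supplied by hand, with "" standing for ℇ and $ marking the end of input:

```python
# Fixed-point computation of FOLLOW sets.  Toy grammar:
#   E -> T X ;  X -> + T X | epsilon ;  T -> id
GRAMMAR = {
    "E": [("T", "X")],
    "X": [("+", "T", "X"), ()],
    "T": [("id",)],
}
# FIRST sets for every symbol, written out by hand for this grammar.
FIRST = {"E": {"id"}, "X": {"+", ""}, "T": {"id"}, "+": {"+"}, "id": {"id"}}

def follow_sets(grammar, first, start):
    follow = {nt: set() for nt in grammar}
    follow[start].add("$")                    # rule 1: $ follows the start
    changed = True
    while changed:
        changed = False
        for nt, prods in grammar.items():
            for prod in prods:
                for i, sym in enumerate(prod):
                    if sym not in grammar:    # only non-terminals get FOLLOW
                        continue
                    before = len(follow[sym])
                    rest_nullable = True
                    for nxt in prod[i + 1:]:
                        follow[sym] |= first[nxt] - {""}   # rule 2
                        if "" not in first[nxt]:
                            rest_nullable = False
                            break
                    if rest_nullable:         # rule 3: rest can vanish
                        follow[sym] |= follow[nt]
                    if len(follow[sym]) != before:
                        changed = True
    return follow
```

On this grammar FOLLOW(E) = FOLLOW(X) = { $ } and FOLLOW(T) = { +, $ }: after a T, either another + T follows or the input ends.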

What are the limitations of Syntax Analyzers?

Syntax analyzers receive their inputs, in the form of tokens, from lexical analyzers. The validity of the supplied tokens is entirely the responsibility of the lexical analyzer; the syntax analyzer cannot check such properties itself. Some of the drawbacks associated with syntax analyzers are as follows:

  • The validity of the token cannot be determined.
  • The declaration of the token before it is being used cannot be determined.
  • The initialization of the token before it is being used cannot be determined.
  • The validity of the operations performed on the token type cannot be determined.

All the mentioned drawbacks are addressed by the semantic analyzer.
