Lexical Analyzer
- The first phase of a compiler reads the input characters of the source code and groups them into sequences of characters with a collective meaning, known as tokens.
- The lexical analyzer reads the source program and performs the following tasks:
  - Produces a stream of tokens
  - Ignores white space (blank, new line, tab)
  - Ignores comments, if any
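The three tasks above can be sketched as a single character-level loop. This is only a minimal illustration, not a full lexical analyzer: the function name `split_lexemes` is hypothetical, the `//` comment syntax is an assumption (the notes do not fix one), and the loop returns raw lexemes without attaching token names.

```python
def split_lexemes(source):
    """Group characters into lexemes, skipping white space and comments."""
    result = []
    i = 0
    while i < len(source):
        ch = source[i]
        if ch in " \t\n":                     # ignore blank, tab, new line
            i += 1
        elif source.startswith("//", i):      # assumed comment syntax: // to end of line
            while i < len(source) and source[i] != "\n":
                i += 1
        elif ch.isalnum():                    # a run of letters and digits forms one lexeme
            start = i
            while i < len(source) and source[i].isalnum():
                i += 1
            result.append(source[start:i])
        else:                                 # any other character stands alone
            result.append(ch)
            i += 1
    return result


print(split_lexemes("sum = a1 + 42 // add\n;"))
```

Here the comment and all white space disappear, and only the meaningful groups of characters are handed on.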
Definition of a token:
- A sequence of characters with a logical meaning is known as a token
(or)
- The smallest individual unit of a program is known as a token
Definition of pattern rule:
- A pattern rule is a description of the form that the lexeme of a token may take
Definition of Lexeme:
- A lexeme is a sequence of characters in the source program that matches the pattern for a token
(or)
- The actual representation of a token in the source program
- Each lexeme is categorized under a name, called its token
- The general form of a token is <token-name, attribute-value>, where token-name is an abstract symbol used during the next phase of the compiler (the syntax analyzer) and attribute-value points to an entry in the symbol table
Example:
DO 5 I = 1.12;
- The output would be <DO> <number> <id, I> <assign_op> <number> <semicolon>
- When the lexical analyzer recognizes a token as an identifier (id), it enters the lexeme into the symbol table along with its attributes
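The example above can be reproduced with a short sketch that emits <token-name, attribute-value> pairs and fills a symbol table. It follows the token stream given in these notes. The names `Token`, `KEYWORDS`, `PATTERN`, and `scan` are hypothetical, the keyword set is assumed to contain only DO, and the symbol table is simplified to a list of identifier lexemes whose index serves as the attribute value.

```python
import re
from typing import NamedTuple, Optional


class Token(NamedTuple):
    name: str                   # abstract symbol passed to the syntax analyzer
    attribute: Optional[int]    # index into the symbol table, or None


KEYWORDS = {"DO"}               # assumed keyword set for this example only
PATTERN = re.compile(
    r"\s*(?:(?P<number>\d+(?:\.\d+)?)"
    r"|(?P<word>[A-Za-z][A-Za-z0-9]*)"
    r"|(?P<assign_op>=)"
    r"|(?P<semicolon>;))"
)


def scan(source):
    """Return (tokens, symbol_table) for one line of input."""
    symbol_table = []           # simplified: one entry per distinct identifier lexeme
    tokens = []
    source = source.rstrip()
    pos = 0
    while pos < len(source):
        m = PATTERN.match(source, pos)
        if m is None:
            raise SyntaxError(f"unexpected character {source[pos]!r} at {pos}")
        kind, lexeme = m.lastgroup, m.group(m.lastgroup)
        if kind == "word" and lexeme in KEYWORDS:
            tokens.append(Token(lexeme, None))          # keyword: its name is the lexeme
        elif kind == "word":
            if lexeme not in symbol_table:
                symbol_table.append(lexeme)             # enter the identifier once
            tokens.append(Token("id", symbol_table.index(lexeme)))
        else:
            tokens.append(Token(kind, None))
        pos = m.end()
    return tokens, symbol_table


tokens, table = scan("DO 5 I = 1.12;")
print([t.name for t in tokens])
print(table)
```

Only the identifier I reaches the symbol table; the keyword, numbers, and punctuation carry no table entry in this simplified version.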
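- The lexical analyzer is also known as a scanner
Why the lexical analyzer is also called a scanner
- Lexical analysis is sometimes divided into two processes: scanning and lexical analysis proper
- Scanning covers the simple processes that do not require tokenization of the input, such as deletion of comments and white space
- Lexical analysis proper then produces tokens from the output of the scanner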
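The two-stage split can be pictured as two functions chained together. This is a rough sketch under stated assumptions: the function names `scanner` and `lexical_analysis` are hypothetical, comments are assumed to be written as /* ... */, and the token names are reduced to id, number, and symbol.

```python
import re


def scanner(source):
    """Stage 1: delete comments and compact white space; no tokens are produced."""
    # Assumed comment syntax: /* ... */, possibly spanning several lines.
    without_comments = re.sub(r"/\*.*?\*/", " ", source, flags=re.DOTALL)
    return re.sub(r"\s+", " ", without_comments).strip()


def lexical_analysis(cleaned):
    """Stage 2: produce (token_name, lexeme) pairs from the scanner output."""
    tokens = []
    for lexeme in re.findall(r"[A-Za-z][A-Za-z0-9]*|\d+|[^\sA-Za-z0-9]", cleaned):
        if lexeme[0].isalpha():
            name = "id"
        elif lexeme[0].isdigit():
            name = "number"
        else:
            name = "symbol"
        tokens.append((name, lexeme))
    return tokens


cleaned = scanner("a  /* note */ =\n\t1 ;")
print(cleaned)
print(lexical_analysis(cleaned))
```

Keeping the first stage free of tokenization means it can stay a simple text filter, while the second stage works on already cleaned input.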
Why separate lexical analysis from parsing?
- Simplicity of design
- Compiler efficiency is improved
- Compiler portability is enhanced
Specification of a token
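- Token patterns can be specified using regular expressions; for example, an identifier that starts with a letter followed by letters or digits is described by letter (letter | digit)*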
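As an illustration, a token specification can be written as a small table of named regular expressions. The token names and patterns below are assumptions chosen to match the examples in these notes, not a complete specification, and `classify` is a hypothetical helper that reports which pattern a whole lexeme matches.

```python
import re

# Building blocks (regular definitions); illustrative only.
letter = r"[A-Za-z]"
digit = r"[0-9]"

# Token name -> pattern describing the lexemes of that token.
SPEC = {
    "id": rf"{letter}(?:{letter}|{digit})*",
    "number": rf"{digit}+(?:\.{digit}+)?",
    "assign_op": r"=",
    "semicolon": r";",
}


def classify(lexeme):
    """Return the name of the first pattern the whole lexeme matches, else None."""
    for name, pattern in SPEC.items():
        if re.fullmatch(pattern, lexeme):
            return name
    return None


for lexeme in ["I", "1.12", "=", ";", "1index"]:
    print(lexeme, "->", classify(lexeme))
```

Keywords are not handled here: a lexeme such as DO matches the id pattern, so a real specification would check a reserved-word list as well.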
Identifier
An identifier is a sequence of alphanumeric characters whose first character must be a letter
Rules for valid identifiers
- The name of the identifier should not begin with a digit or any special character, and by the definition above it may contain only letters and digits. For example, 1index, $currency, and amount_count are invalid identifiers, but index1 is a valid one
- There should not be any space in the identifier name. For example, in int total amount, total amount is an invalid identifier
- The name of the identifier must not be a keyword. For example, in int switch, switch is an invalid identifier
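The rules above can be checked with a short function. This is a sketch that follows the definition in these notes (a letter followed by letters or digits); many real languages also allow an underscore, which is deliberately not accepted here. The name `is_valid_identifier` and the keyword set are hypothetical, and the keyword set is only a partial, assumed list.

```python
import re

# Assumed, partial keyword list; a real language defines the full set.
KEYWORDS = {"int", "switch", "if", "else", "while", "return"}

# A letter followed by any number of letters or digits.
IDENTIFIER = re.compile(r"[A-Za-z][A-Za-z0-9]*")


def is_valid_identifier(name):
    """True if name obeys the identifier rules in these notes and is not a keyword."""
    return IDENTIFIER.fullmatch(name) is not None and name not in KEYWORDS


for name in ["index1", "1index", "$currency", "total amount", "switch"]:
    print(repr(name), is_valid_identifier(name))
```

Using fullmatch rather than match matters: it rejects names such as total amount, where only a prefix fits the pattern.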