>>So today we’ll

be doing scanning. Last time we started

introducing scanning. And we reviewed. Basically, most of

last lecture was spent on reviewing the

concepts of a DFA, NFA and a regular expression. Deterministic Finite Automaton, Non-Deterministic

Finite Automaton and regular expressions. And we will be using this in

the — to build a scanner. And the scanner is that

stage in the compiler, in which we recognize

words or tokens. So we don’t recognize sentences, sentences which correspond

to statements. You know, this is the —

this is the human language and this is the programming

language. So words correspond to tokens and sentences correspond

to statements. And recognizing words

is the scanning, while recognizing sentences is

the parsing, which is the part that we haven’t gotten to yet. Okay. And last time we mentioned

that to recognize words, we can use NFAs, DFAs

and regular expressions, because it’s — the language

of all words, all valid words, in a typical programming

language, okay, is a regular language. So it’s a regular language that

includes all the valid words in a typical programming

language. And the words in a typical

programming language are the numbers, you know, the

integers, variable names, the different operators and

the key words of the language. So these are the main parts of

speech in a language: numbers, constants, variable names which

we call identifiers, operators, the different operators —

plus, minus and the key words of the language, key words

like, you know — if, else, why. So the language of all — what

we will be doing in scanning is that we will define a

language for each one of these different

parts of speech. And the language of all

valid words is just going to be the union of all of them. Right? So last time, we started

giving an example by looking into a regular expression

for unsigned integers. So now, when it comes to

programming languages, our sigma’s going to be

the fits of all symbols, all valid symbols in a

programming language, in the programming language that

we are designing a compiler for. So, we will have

all the numbers. We will have all the

alphabetic characters — a, b through z. And we will have

even uppercase — A,B through — so this is little z

and this is big Z. And we have all the symbols that

can appear in a valid program, like, you know, plus,

minus, equal, etc. Okay. So this is our sigma. So this is not the whole

thing, but now you understand that it has all the

symbols that you can type in when you are writing

a program. This is our sigma. Now given this sigma, we

need the regular expression for an unsigned integer. So we need the regular

expression for an unsigned integer. And the first attempt may

be something like this. And by the way, we will

use a shorthand for numbers and alphabetic characters. So for numbers, we will

use the shorthand 0 to 9. We’ll use this. 0 to 9, this means all

the numbers from 0 to 9. And for alphanumeric — for

alphabetic, sorry, we will use a to z. So this is different from

what they’re using in the book. In the book, they’re using

ellipsis instead of the hyphen, and we are merging the — under, we’re are merging little

letters with capital letters. Okay? So this just means all of

them, all alphabetic characters. Now, a regular expression

for an unsigned integer — the first attempt is

something like this. 0 through 9, star. And you can repeat as

many times as you want. Okay. Now, what’s — this is

fine, but this will allow things that we may not want to allow,

like, for example, 0, 0, 0. So this will allow

us to write 0, 0, 0. This will be an acceptable

string that belongs to this language. Right? So you can repeat. But to avoid this, to

avoid leading 0’s –>>You [inaudible]

the first character.>>Well, okay, yeah. So the first character

has to be a non-0, so you can do it like 1 to 9. But then what’s the problem?>>You can’t [inaudible].>>Yeah. So you can

add the 0 by itself. So we can just add it, union 0. So, you can do, union, 0. You either have a 0

or you have — sorry. 1 through 9 — what I

meant is 1 through 9, then followed by

0 through 9, star. So, here, I’m saying

that the first — you either have a 0 or

you have a non-0 followed by any combination of numbers. Okay. So, this one —

this allows leading 0’s. This does not allow. By the way, in this course, our objective is not knowing

what should be allowed and what shouldn’t be allowed. You know, we are giving the

specifications of a language and we are designing

a compiler for that. So who decides, you know,

whether leading 0’s are allowed or not allowed is the

designer of the language. When someone writes the

specs for a language, and based on the specs of this

language, we build a compiler that implements these specs. Okay? So we tell you, you know, whether leading 0’s should be

allowed or not and we ask you to build a compiler

that implements this. Okay. So now, this is an array. Now, for this array,

it can be easy to think of a Non-Deterministic

Finite Automaton that is equivalent to this. Right? For this — this

is our start state. Then we have two possibilities. Right? So, we either start

with 0 and if we start with 0, what will happen? We got an accept state. And if we start with a

non-0, we’ll go to –>>Another accept state.>>Another accept state, but what should you have

in this accept state?>>A loop.>>A loop. So you can have anything

that you want. You can have anything 0 to 9. So this is the equivalent NFA. So, today, we’ll be covering

the Thompson Construction, which is a systematic way of

converting a regular expression to an NFA, but here,

we’re doing it informally. We are doing it just by

inspection, because it’s so obvious and intuitive, to convert this regular

expression into an NFA. By the way, this will

also allow something bad. What will this allow?>>The value’s too large

to fit in an imager?>>Well, okay. So, this is not — this is

definitely not setting any limit of the size of the integer. Okay? But, you know, from

a syntactic point of view, that may not be a big problem. What is the bad thing

that is allowed here?>>It allows an empty symbol.>>Yeah, an empty symbol. So, this will allow you to write

something like x equals nothing. Right? But it’s — this

is not what we want. So, this allows the empty

string, as well, allows… …empty string. So the empty — it’s saying that the empty string is

a valid unsigned integer, which is not true. A valid unsigned integer has to

have at least one number in it. Okay? Now, how about

floating point numbers? So floating point numbers. So this gives us an integer. So this is 0 or 1,

9 followed by 0, 9. So this gives us an integer,

because a floating point number, you know, can be an integer. But it can also have what?>>A decimal point.>>A decimal point. So, let’s say that the

decimal point is optional here. Floating point numbers

with an optional decimal point. Again, in this course,

it’s not our job, you know, to decide whether the

decimal point is optional or not optional. This is given to us. And we build a compiler

that satisfies these specs. So if the decimal

point is optional, then how do we express this? So, let’s say the decimal

point is not optional. So we’ll put the decimal point. Then after the decimal

point, we’ll have what?>>[inaudible] any more numbers.>>Yeah, we can’t have —

now, you may allow the — so we can simplify

things and put 0, 9, star. Now if you put 0, 9, star here,

you are allowing an empty string after the decimal point,

which could be fine. Right? If you write something

like x equals 5 point, this is still, at least

this is meaningful. At least, this has a meaning. So point nothing

is like point 0. So, this allows you to

have nothing, you know, after the decimal point or it

allows you to put as many 0’s as you want, so this

will allow this. It will also allow — But if you do it this way,

this is a regular expression for the decimal point. But here, we’re saying that

the decimal point is what?>>Mandatory.>>Mandatory. So for to make it

optional, what should we do? [ Inaudible Comments ] So, if I want to make the

decimal point optional — [ Inaudible Comments ]>>Use of parentheses?>>You can wrap the whole thing. You can make the whole — you

can make the decimal point and everything after it

— all of it optional.>>How? So how do we express

this using regular expressions, that’s the question. Yes?>>Parentheses, right? Where you put parenthesis before

the decimal and after the star. And then a star afterwards.>>Parenthesis? Like this?>>No, you need –>>So this makes it optional?>>No. You put 0, 1

afterwards, in the top right.>>You can union that

with the empty string.>>Okay. Yeah. That’s exactly the solution. So if we union this

with an empty string, so we can either

have this or nothing. Okay. So we can have

either this or nothing. So in this case and — okay. Is this — is what

I have on the board, does it express what

I want to say here?>>Yeah.>>No, it doesn’t. It’s wrong. It’s not expressing

what I’m trying to say. So, okay. So, this doesn’t

do what we are trying to do, because this here,

this is concatenation. Right? Which has

higher precedence, concatenation or union? Concatenation has higher

precedence than union. So with this, so we’re saying that this is concatenated

with this. So, this is not what we want. So we should use parenthesis to

force the order that we want. So in this case, we’re trying

to say that this is what we have and this is concatenated

with this or this. Right? So we’re saying that

this whole thing, you know — so this is the integer part. This is concatenated

with the decimal point. And we can — you know, the

decimal point is, you know, the decimal point

and whatever comes after the decimal

point is optional. Let’s do variable

names or identifiers. So for an identifier or a

variable name, you know, normally in a C language — in

a C-like language like like C, C++ or Java, we allow, you

know, the variable name. What are the rules for a variable name

in C, C++ and Java?>>It must start with a letter. And then after that, we’re going

to have letters and numbers and you can have an underscore.>>And you can have

an underscore. Yeah, exactly. So how do we express this

using regular expressions? So, with regular expressions,

you have to start with a letter. Right? So, it’s going

to be a through z. Then then what?>>It would be a to

z, 0 to 9, and the –>>A to z. Then –>>It won’t be in

the same bracket.>>Yeah. a to z or 0 to 9. Or underscore. And this, we can repeat it

as many times as we want. Okay? Well, in fact, yeah. In C and C++, you can start

with an underscore, too. So if you want to do this,

you should do — okay. Underscore. But it doesn’t — again, we’re

not discussing, you know, what should be allowed

and what shouldn’t, we’re just implementing the

specs that we are given. But here we are allowing

an identifier to start with an underscore or a letter. Then after that, we can have

as many letters, numbers or underscores as we want. Yeah?>>Are we okay with it ending with many underscores,

just trailing?>>I think it’s allowed. You know, I tried once –>>It is.>>I tried to define a variable as a single underscore

and it was accepted. So I — okay. So I tried once to

do this, integer x — sorry, integer underscore. And then I did underscore

equals 5 or underscore equals

underscore plus underscore. And this — yeah,

this was accepted. So, and again, you

know, for here, we’re not discussing what should

be accepted and what shouldn’t and whether this is

good language design. So we’re not designing

the language. We’re designing the complier

that implements a language that is specified for us. Someone specifies the language

for us and we build a compiler. But I have tried this myself,

because seven years ago, a student asked me

whether, you know, the underscore will be a valid

identifier in C language.>>GCC?>>No, I think I tried it

using the Microsoft compiler.>>Oh, well, there you go.>>Yeah. Okay. All right. So this allows — so the identifier starts with

an underscore, then letters. Then you can have as many

letters and numbers as you want. So with an NFA, the corresponding NFA

here — is well, sorry. You’ll start with this, an

underscore or a to z. And then, this is an accept state, right? And in this accept state, you

have a loop with 0 or a to z or 0 to 9 or underscore. So it can have these. You can loop here. Okay. So this is the

corresponding NFA. Okay?>>Well, one thing I want to

point out, though, is that C and C++ have a finite limit

on the length of identifiers and I believe it’s

32 characters.>>No.>>Are you sure it’s

32 characters? Well, in –>>And if you go beyond that,

it won’t distinguish beyond — anything beyond the

32nd characters.>>That depends on the linker.>>Okay. So, now, this raises

an interesting question. Again, we are interested

in the compiler design, not in the language design. So we don’t care what

the language allows, but the question is, with this

you can write an identifier name that is of an arbitrary length. So, here, we’re not

setting any length on the — any limits on the

length of an identifier. Now, how can we set a limit? So, say, if we want

to set a limit of, you know, five characters. If we want to set a limit of

five characters then, you know, we don’t have to — the

star means infinity. It means, there is no limit. You can repeat as many

times as you want. Right? So the star —

what the star means is that you can repeat this

as many times as you want. So it’s unlimited. So, you definitely need

something other than the star. And what’s other than the star

is — you can just, you know, define how many numbers. You have a notation of

defining how many times this can be repeated. So, for five characters,

it can say, okay, this is a to z. Sorry. And then after this,

you can have — well let’s call this

whole thing, you know, for convenience, let’s

call this thing X. Okay? Just so that we don’t repeat it. So, if this is X, then

we’ll have either X or X repeated twice. Or even, you know, X

repeated zero times, so we can have epsilon. So epsilon or X repeated

once or X repeated twice or X repeated three times

or X repeated four times. So, this sets a limit

of five characters. Right? So the first

character has to be a letter or underscore. What comes after this,

you have up to — you have up to four characters, that can be anything

out of this group. Okay? So, this sets a limit. And it’s tedious. You can — so you can set

any limit you that you want, so you can make it 10

or 20 or 30 or whatever. Okay? All right. Any questions on this? Yeah?>>What do we have to check

for indicator words, like if and while and [inaudible]

in front of them? [inaudible].>>In fact, key words, checking

for key words is trivial. Because direct regular

expression for if, the regular expression’s

going to be if. This is a regular expression. So, checking for key

words is trivial, because this is a valid

regular expression, because it has two

symbols from the alphabet and these two symbols

from the alphabet — by the way, that is

called concatenation, here. Usually we drop the

concatenation, the concatenation

symbol, which is this, just like we drop

the multiplication. In Algebra, we drop

the concatenation in regular expressions. So this is i concatenated

with f. But what’s a regular expression? A regular expression,

in a regular expression, you can have a combination of

the symbols of the alphabet, concatenation, union or star. So, these are the three

operations that you can have in a regular expression —

union, star and concatenation. And any letters from — any

symbols from the alphabet. So, recognizing a

key word is trivial. The regular expression for a

key word is the key word itself. But the regular expression for the entire language

is just the union of them. So, a regular expression for

the whole language is, you know, regular expression for numbers,

union, regular expression for identifiers, union, regular

expression for operators, union, regular expression

for key words. Right? So, it’s just the union. An interesting construct

that we would like to study a regular

expression for is the string — okay, so let’s look at a

regular expression for a string. So, for a string, you may think

that you must start with this. Then, sigma, star — sorry. Sigma, star. Okay. So you can — this is,

you know, your first attempt to write a regular

expression for a string. But, so, you are saying that you

have the quotation, open quote, closed quote and inside of

this, you can have anything. Sigma, star is just anything. Right? But this anything,

you want to exclude what?>>The close.>>The closing, yeah,

closing code. So you can’t have closing code

inside of this, so if you want to exclude it, so how

can you exclude it? How can you exclude

the closing codes? You can exclude it by saying

everything but or the complement of — by allowing here, the

complement of closing code. So, you can say the complement,

using this, the complement of closing code repeated

as many times as you want, then closing code. So, what does this mean, the

complement of closing code? So this is everything or

anything but closing code. Or it’s — this is equivalent to sigma minus the

language of closing code. Okay? So this is — by the way,

and this complement operator — this is not a standard

regular expression. So, this is an extended

regular expression. Strictly speaking, a regular

expression cannot have this complement. So, strictly speaking, a regular

expression can only have union, concatenation and star. But this is an extension. So this is an extension —

we are extending the notation of regular expressions by

allowing the complement — because this is convenient. Now, without this

convenience, we will have to write all the symbols in the

alphabet except this symbol. But this extension allows us

to write it much more easily and much more concisely. Yeah?>>That character you used,

is that a caret or a –>>Yeah, the caret. Yeah. Maybe it’s not clear. But that’s what I meant.