Compilers Lecture 5: Scanning (2): Regular Expressions for Programming Languages

Compilers Lecture 5: Scanning (2): Regular Expressions for Programming Languages


>>So today we’ll
be doing scanning. Last time we started
introducing scanning. And we reviewed. Basically, most of
last lecture was spent on reviewing the
concepts of a DFA, NFA and a regular expression. Deterministic Finite Automaton, Non-Deterministic
Finite Automaton and regular expressions. And we will be using this in
the — to build a scanner. And the scanner is that
stage in the compiler, in which we recognize
words or tokens. So we don’t recognize sentences, sentences which correspond
to statements. You know, this is the —
this is the human language and this is the programming
language. So words correspond to tokens and sentences correspond
to statements. And recognizing words
is the scanning, while recognizing sentences is
the parsing, which is the part that we haven’t gotten to yet. Okay. And last time we mentioned
that to recognize words, we can use NFAs, DFAs
and regular expressions, because it’s — the language
of all words, all valid words, in a typical programming
language, okay, is a regular language. So it’s a regular language that
includes all the valid words in a typical programming
language. And the words in a typical
programming language are the numbers, you know, the
integers, variable names, the different operators and
the key words of the language. So these are the main parts of
speech in a language: numbers, constants, variable names which
we call identifiers, operators, the different operators —
plus, minus and the key words of the language, key words
like, you know — if, else, why. So the language of all — what
we will be doing in scanning is that we will define a
language for each one of these different
parts of speech. And the language of all
valid words is just going to be the union of all of them. Right? So last time, we started
giving an example by looking into a regular expression
for unsigned integers. So now, when it comes to
programming languages, our sigma’s going to be
the fits of all symbols, all valid symbols in a
programming language, in the programming language that
we are designing a compiler for. So, we will have
all the numbers. We will have all the
alphabetic characters — a, b through z. And we will have
even uppercase — A,B through — so this is little z
and this is big Z. And we have all the symbols that
can appear in a valid program, like, you know, plus,
minus, equal, etc. Okay. So this is our sigma. So this is not the whole
thing, but now you understand that it has all the
symbols that you can type in when you are writing
a program. This is our sigma. Now given this sigma, we
need the regular expression for an unsigned integer. So we need the regular
expression for an unsigned integer. And the first attempt may
be something like this. And by the way, we will
use a shorthand for numbers and alphabetic characters. So for numbers, we will
use the shorthand 0 to 9. We’ll use this. 0 to 9, this means all
the numbers from 0 to 9. And for alphanumeric — for
alphabetic, sorry, we will use a to z. So this is different from
what they’re using in the book. In the book, they’re using
ellipsis instead of the hyphen, and we are merging the — under, we’re are merging little
letters with capital letters. Okay? So this just means all of
them, all alphabetic characters. Now, a regular expression
for an unsigned integer — the first attempt is
something like this. 0 through 9, star. And you can repeat as
many times as you want. Okay. Now, what’s — this is
fine, but this will allow things that we may not want to allow,
like, for example, 0, 0, 0. So this will allow
us to write 0, 0, 0. This will be an acceptable
string that belongs to this language. Right? So you can repeat. But to avoid this, to
avoid leading 0’s –>>You [inaudible]
the first character.>>Well, okay, yeah. So the first character
has to be a non-0, so you can do it like 1 to 9. But then what’s the problem?>>You can’t [inaudible].>>Yeah. So you can
add the 0 by itself. So we can just add it, union 0. So, you can do, union, 0. You either have a 0
or you have — sorry. 1 through 9 — what I
meant is 1 through 9, then followed by
0 through 9, star. So, here, I’m saying
that the first — you either have a 0 or
you have a non-0 followed by any combination of numbers. Okay. So, this one —
this allows leading 0’s. This does not allow. By the way, in this course, our objective is not knowing
what should be allowed and what shouldn’t be allowed. You know, we are giving the
specifications of a language and we are designing
a compiler for that. So who decides, you know,
whether leading 0’s are allowed or not allowed is the
designer of the language. When someone writes the
specs for a language, and based on the specs of this
language, we build a compiler that implements these specs. Okay? So we tell you, you know, whether leading 0’s should be
allowed or not and we ask you to build a compiler
that implements this. Okay. So now, this is an array. Now, for this array,
it can be easy to think of a Non-Deterministic
Finite Automaton that is equivalent to this. Right? For this — this
is our start state. Then we have two possibilities. Right? So, we either start
with 0 and if we start with 0, what will happen? We got an accept state. And if we start with a
non-0, we’ll go to –>>Another accept state.>>Another accept state, but what should you have
in this accept state?>>A loop.>>A loop. So you can have anything
that you want. You can have anything 0 to 9. So this is the equivalent NFA. So, today, we’ll be covering
the Thompson Construction, which is a systematic way of
converting a regular expression to an NFA, but here,
we’re doing it informally. We are doing it just by
inspection, because it’s so obvious and intuitive, to convert this regular
expression into an NFA. By the way, this will
also allow something bad. What will this allow?>>The value’s too large
to fit in an imager?>>Well, okay. So, this is not — this is
definitely not setting any limit of the size of the integer. Okay? But, you know, from
a syntactic point of view, that may not be a big problem. What is the bad thing
that is allowed here?>>It allows an empty symbol.>>Yeah, an empty symbol. So, this will allow you to write
something like x equals nothing. Right? But it’s — this
is not what we want. So, this allows the empty
string, as well, allows… …empty string. So the empty — it’s saying that the empty string is
a valid unsigned integer, which is not true. A valid unsigned integer has to
have at least one number in it. Okay? Now, how about
floating point numbers? So floating point numbers. So this gives us an integer. So this is 0 or 1,
9 followed by 0, 9. So this gives us an integer,
because a floating point number, you know, can be an integer. But it can also have what?>>A decimal point.>>A decimal point. So, let’s say that the
decimal point is optional here. Floating point numbers
with an optional decimal point. Again, in this course,
it’s not our job, you know, to decide whether the
decimal point is optional or not optional. This is given to us. And we build a compiler
that satisfies these specs. So if the decimal
point is optional, then how do we express this? So, let’s say the decimal
point is not optional. So we’ll put the decimal point. Then after the decimal
point, we’ll have what?>>[inaudible] any more numbers.>>Yeah, we can’t have —
now, you may allow the — so we can simplify
things and put 0, 9, star. Now if you put 0, 9, star here,
you are allowing an empty string after the decimal point,
which could be fine. Right? If you write something
like x equals 5 point, this is still, at least
this is meaningful. At least, this has a meaning. So point nothing
is like point 0. So, this allows you to
have nothing, you know, after the decimal point or it
allows you to put as many 0’s as you want, so this
will allow this. It will also allow — But if you do it this way,
this is a regular expression for the decimal point. But here, we’re saying that
the decimal point is what?>>Mandatory.>>Mandatory. So for to make it
optional, what should we do? [ Inaudible Comments ] So, if I want to make the
decimal point optional — [ Inaudible Comments ]>>Use of parentheses?>>You can wrap the whole thing. You can make the whole — you
can make the decimal point and everything after it
— all of it optional.>>How? So how do we express
this using regular expressions, that’s the question. Yes?>>Parentheses, right? Where you put parenthesis before
the decimal and after the star. And then a star afterwards.>>Parenthesis? Like this?>>No, you need –>>So this makes it optional?>>No. You put 0, 1
afterwards, in the top right.>>You can union that
with the empty string.>>Okay. Yeah. That’s exactly the solution. So if we union this
with an empty string, so we can either
have this or nothing. Okay. So we can have
either this or nothing. So in this case and — okay. Is this — is what
I have on the board, does it express what
I want to say here?>>Yeah.>>No, it doesn’t. It’s wrong. It’s not expressing
what I’m trying to say. So, okay. So, this doesn’t
do what we are trying to do, because this here,
this is concatenation. Right? Which has
higher precedence, concatenation or union? Concatenation has higher
precedence than union. So with this, so we’re saying that this is concatenated
with this. So, this is not what we want. So we should use parenthesis to
force the order that we want. So in this case, we’re trying
to say that this is what we have and this is concatenated
with this or this. Right? So we’re saying that
this whole thing, you know — so this is the integer part. This is concatenated
with the decimal point. And we can — you know, the
decimal point is, you know, the decimal point
and whatever comes after the decimal
point is optional. Let’s do variable
names or identifiers. So for an identifier or a
variable name, you know, normally in a C language — in
a C-like language like like C, C++ or Java, we allow, you
know, the variable name. What are the rules for a variable name
in C, C++ and Java?>>It must start with a letter. And then after that, we’re going
to have letters and numbers and you can have an underscore.>>And you can have
an underscore. Yeah, exactly. So how do we express this
using regular expressions? So, with regular expressions,
you have to start with a letter. Right? So, it’s going
to be a through z. Then then what?>>It would be a to
z, 0 to 9, and the –>>A to z. Then –>>It won’t be in
the same bracket.>>Yeah. a to z or 0 to 9. Or underscore. And this, we can repeat it
as many times as we want. Okay? Well, in fact, yeah. In C and C++, you can start
with an underscore, too. So if you want to do this,
you should do — okay. Underscore. But it doesn’t — again, we’re
not discussing, you know, what should be allowed
and what shouldn’t, we’re just implementing the
specs that we are given. But here we are allowing
an identifier to start with an underscore or a letter. Then after that, we can have
as many letters, numbers or underscores as we want. Yeah?>>Are we okay with it ending with many underscores,
just trailing?>>I think it’s allowed. You know, I tried once –>>It is.>>I tried to define a variable as a single underscore
and it was accepted. So I — okay. So I tried once to
do this, integer x — sorry, integer underscore. And then I did underscore
equals 5 or underscore equals
underscore plus underscore. And this — yeah,
this was accepted. So, and again, you
know, for here, we’re not discussing what should
be accepted and what shouldn’t and whether this is
good language design. So we’re not designing
the language. We’re designing the complier
that implements a language that is specified for us. Someone specifies the language
for us and we build a compiler. But I have tried this myself,
because seven years ago, a student asked me
whether, you know, the underscore will be a valid
identifier in C language.>>GCC?>>No, I think I tried it
using the Microsoft compiler.>>Oh, well, there you go.>>Yeah. Okay. All right. So this allows — so the identifier starts with
an underscore, then letters. Then you can have as many
letters and numbers as you want. So with an NFA, the corresponding NFA
here — is well, sorry. You’ll start with this, an
underscore or a to z. And then, this is an accept state, right? And in this accept state, you
have a loop with 0 or a to z or 0 to 9 or underscore. So it can have these. You can loop here. Okay. So this is the
corresponding NFA. Okay?>>Well, one thing I want to
point out, though, is that C and C++ have a finite limit
on the length of identifiers and I believe it’s
32 characters.>>No.>>Are you sure it’s
32 characters? Well, in –>>And if you go beyond that,
it won’t distinguish beyond — anything beyond the
32nd characters.>>That depends on the linker.>>Okay. So, now, this raises
an interesting question. Again, we are interested
in the compiler design, not in the language design. So we don’t care what
the language allows, but the question is, with this
you can write an identifier name that is of an arbitrary length. So, here, we’re not
setting any length on the — any limits on the
length of an identifier. Now, how can we set a limit? So, say, if we want
to set a limit of, you know, five characters. If we want to set a limit of
five characters then, you know, we don’t have to — the
star means infinity. It means, there is no limit. You can repeat as many
times as you want. Right? So the star —
what the star means is that you can repeat this
as many times as you want. So it’s unlimited. So, you definitely need
something other than the star. And what’s other than the star
is — you can just, you know, define how many numbers. You have a notation of
defining how many times this can be repeated. So, for five characters,
it can say, okay, this is a to z. Sorry. And then after this,
you can have — well let’s call this
whole thing, you know, for convenience, let’s
call this thing X. Okay? Just so that we don’t repeat it. So, if this is X, then
we’ll have either X or X repeated twice. Or even, you know, X
repeated zero times, so we can have epsilon. So epsilon or X repeated
once or X repeated twice or X repeated three times
or X repeated four times. So, this sets a limit
of five characters. Right? So the first
character has to be a letter or underscore. What comes after this,
you have up to — you have up to four characters, that can be anything
out of this group. Okay? So, this sets a limit. And it’s tedious. You can — so you can set
any limit you that you want, so you can make it 10
or 20 or 30 or whatever. Okay? All right. Any questions on this? Yeah?>>What do we have to check
for indicator words, like if and while and [inaudible]
in front of them? [inaudible].>>In fact, key words, checking
for key words is trivial. Because direct regular
expression for if, the regular expression’s
going to be if. This is a regular expression. So, checking for key
words is trivial, because this is a valid
regular expression, because it has two
symbols from the alphabet and these two symbols
from the alphabet — by the way, that is
called concatenation, here. Usually we drop the
concatenation, the concatenation
symbol, which is this, just like we drop
the multiplication. In Algebra, we drop
the concatenation in regular expressions. So this is i concatenated
with f. But what’s a regular expression? A regular expression,
in a regular expression, you can have a combination of
the symbols of the alphabet, concatenation, union or star. So, these are the three
operations that you can have in a regular expression —
union, star and concatenation. And any letters from — any
symbols from the alphabet. So, recognizing a
key word is trivial. The regular expression for a
key word is the key word itself. But the regular expression for the entire language
is just the union of them. So, a regular expression for
the whole language is, you know, regular expression for numbers,
union, regular expression for identifiers, union, regular
expression for operators, union, regular expression
for key words. Right? So, it’s just the union. An interesting construct
that we would like to study a regular
expression for is the string — okay, so let’s look at a
regular expression for a string. So, for a string, you may think
that you must start with this. Then, sigma, star — sorry. Sigma, star. Okay. So you can — this is,
you know, your first attempt to write a regular
expression for a string. But, so, you are saying that you
have the quotation, open quote, closed quote and inside of
this, you can have anything. Sigma, star is just anything. Right? But this anything,
you want to exclude what?>>The close.>>The closing, yeah,
closing code. So you can’t have closing code
inside of this, so if you want to exclude it, so how
can you exclude it? How can you exclude
the closing codes? You can exclude it by saying
everything but or the complement of — by allowing here, the
complement of closing code. So, you can say the complement,
using this, the complement of closing code repeated
as many times as you want, then closing code. So, what does this mean, the
complement of closing code? So this is everything or
anything but closing code. Or it’s — this is equivalent to sigma minus the
language of closing code. Okay? So this is — by the way,
and this complement operator — this is not a standard
regular expression. So, this is an extended
regular expression. Strictly speaking, a regular
expression cannot have this complement. So, strictly speaking, a regular
expression can only have union, concatenation and star. But this is an extension. So this is an extension —
we are extending the notation of regular expressions by
allowing the complement — because this is convenient. Now, without this
convenience, we will have to write all the symbols in the
alphabet except this symbol. But this extension allows us
to write it much more easily and much more concisely. Yeah?>>That character you used,
is that a caret or a –>>Yeah, the caret. Yeah. Maybe it’s not clear. But that’s what I meant.

Leave a Reply

Your email address will not be published. Required fields are marked *