"Transcript of RSM Episode 6"

RSM project page

00:06 the part that parses our little assembly language
00:11 and builds a virtual machine a set of virtual machine
00:18 instructions so we started out by just filling up a an array in memory with
00:23 just the actual you know instructions here right
00:28 the next thing we did was we implemented a formatter so that we can see the
00:33 output of them let's print this out or sort of the
00:37 the state of it and so to speak what the computer sees and then after that we built an interpreter or evaluator or whatever
00:44 what i call that the the thing that actually like executes
00:50 the instructions in a virtual machine and then we get to a point where it starts
00:55 getting kind of kind of janky to like write
00:58 programs like this and compile them into the program it would be nice if we can write them sort of like this
01:06 like this text it is probably hard to see so let's paste that to here oops
01:12 um and remove that okay so it'd be nice if
01:16 we can write code it kind of looks like this
01:21 and to begin with we'll just gonna explicitly name the arguments but eventually as we were talking about an earlier part we'll be
01:30 able to use any names here and have the the
01:35 assembler allocate reducers for us but we're not going to get to that just now but a little bit later okay so this is the goal to be able to write something like this and then for a little parser to uh
01:47 to parse this and to build uh these instructions for us that we can
01:52 unrun and that way we can give you know we can
01:57 compile this little program and then we can just give it some text
02:02 and it can just kind of run it so yesterday i uh started working on a
02:08 simple parser that's what we'll be working on today so first off a couple of tokens
02:14 actually i should give a little bit overview perhaps of the of the structure
02:19 of these things and so um the first of the goal for the assembler
02:30 one scan scan the input
02:38 for uh tokens just looking into this keyboard
02:43 scan name for tokens the next thing that it does is that it
02:49 parses the semantic of
02:56 those tokens right so that becomes some sort of ast and after that it generates
03:09 it actually it probably like checks it shakes checks that the um
03:16 esd or whatever is valid right like you know if you
03:21 like types and stuff like that uh it won't be very complicated since or some language is very simple and finally it
03:30 generates code right or vm instructions that is and eventually constants and stuff like that okay so that is what the assembler does
03:42 so far i've implemented this part and we can have a look at that and then the next thing we're going to do is uh
03:49 to parse the to assign meaning to these tokens right
03:56 okay [Applause] so first off here is a
04:04 definition of all the different tokens that might occur in the source text
04:09 and here which so this becomes an idiom just off to here you can see it at the end here's him right
04:14 so and here is the um the end of the input stream
04:20 token so we'll we'll see in a minute here while that is useful anyhow
04:27 that's what that is this one says that this token is a comment and comments
04:32 in the in the syntax that i've come up with so far is
04:37 slash kind of like the the line comments of c and then yeah like what is like and then we have a couple of simple
04:48 tokens left parenthesis right parenthesis semicolon
04:53 um equals slash minus there will be more of these here later right like plus and stuff like that then we have uh names and these are
05:01 things that you know these tokens have um so there's some set of bytes or some integer value
05:09 associated with these so i'll label it something that kind of looks like this you know it's some some
05:15 name and a colon uh a register something that looks like this is a capital r with a number um or capital f with a number
05:23 and anything else that doesn't match these things is just some symbolic name
05:28 um and this is how i like to use right parsers
05:33 i'll start with something like like this
05:39 something like this where there is a end of input token
05:43 usually there's not a common token ignore comments but and then um just uh yes the very few basic things
05:51 that i need and that's sort of like a catch-all that's like a symbolic like um
05:58 a name if you wanted or identifier or whatever you want to call it sort of like anything that that doesn't parses
06:05 these things will be treated as a name so uh yes i'm doing what i did there so
06:12 uh the the the symbol token that is what
06:17 most things that got you know that doesn't have specific syntax that is going to become patterns are going to become a symbol
06:26 then we have literal numbers these are sort of like you know you just type something like this by the way if it's not apparent each of
06:34 these tokens have a little example next to them just to to make it easy to remember what
06:41 they what they represent so we want to be able to use you know write
06:45 one two three in our search code right through to mean like the number one two three um
06:51 and uh it's also possible to write a there are three
06:59 uh bases that i currently support it's pretty easy to add other support for other bases but these are the ones that
07:05 i think i'll need or want so base two zero b is the prefix
07:10 ones and zeros base 10 no prefix is decimal
07:18 and base 16 hexadecimal crx or capital or lowercase both of these okay
07:21 um and and all of these
07:26 can accept a a negative sign at the at the beginning and the scanner will treat them
07:35 differently if they do for example you know overflow shaking of the value
07:39 so what's and then we'll follow here it's just keywords so these are things that are that parse the symbols
07:48 and then we'll be like hey wait a second fun is actually like a keyword called
07:53 you know function definition keyword and i 16 right is a
07:59 also keyword i 16 keyword for the type i 16. uh just putting them here makes it like really easy to play around with these things because never ever
08:08 have uh have i put something like this together and it just right on the on the first try so we're definitely gonna have to change this it's nice to set this up to
08:16 make okay so let's let's have a look at the
08:24 the scanner the part that's that scans the input so we go to the bottom here here is uh
08:29 the the main function the entry here that is
08:35 the assembly function assemble it takes a destination output
08:40 array of instructions uh the capacity of that array
08:46 and then it takes the input source as just a a currently it's just a um
08:54 zero terminated like a case c string um but it then the first thing we do is just call the parse function so remember like the there are four stages to this the first one is the scanning the tokens and parse
09:09 the semantic meaning of those tokens and we got to do that yes one go this is not in a more complicated language might
09:15 separate the two um but in this case we just kind of sort of
09:21 mix them together because it's a simple syntax so the first thing we do is just to do that parse and eventually we'll do you
09:29 know analysis to check for things so to do analysis
09:36 and then we'll we'll do cogen code narration
09:42 so the parse function starts out by just allocating a bit of state it allocates
09:45 and all the stack here and press the pointer uh the the parse state we have here
09:52 this is uh data that will uh mostly change as the
09:58 scanner and parser as we're i'm just gonna say parsing as we're parsing the
10:02 input these things are changing here's a cursor to the uh the current slash next
10:08 byte that we're looking at in the input stream this is tells us when the input
10:13 stream actually ends this is the beginning of the current token that we just parsed
10:19 this is the beginning of the current line and from this we can compute the uh
10:23 the column we're gonna have to keep that separate
10:28 um oh well this could be done in two ways this is usually how to do it it just allows for like indentation sensitive syntax and stuff to implement
10:35 it this way uh this is the line number it starts with one and it counts up every time we encounter a
10:43 line feed then we have a the current token and
10:47 that's one of these constants that we looked at up here is one of these that is the the token that we just parsed or scanned
10:58 this is a flag that says are we going to insert a synthetic semicolon token or
11:04 not this is a really neat trick that i've learned from the go compiler
11:09 it allows us to write code that can
11:14 have statements separated either by line breaks or by semicolons
11:19 and you can kind of intermix that in a very deterministic and easy parse way it's
11:25 not like javascript so uh so this is uh i i like this word for
11:32 anything anyhow so this is going to just be set to true depending on the type of
11:37 token that we scan so if we scan someone like a plus token it's going to be false because i expect
11:43 something to come after plus but if you scan something like a number electoral or a name or a symbol it'll be set to true and if we then
11:52 encounter a new line we're going to use you know kind of tell the parser that's like yeah there's
11:59 a semicolon here although it isn't this flag is is specifically for parsing
12:04 numbers this says if the uh the number that we're where he has parsed if that's negative and we have a
12:13 number here for the value of a number that we just parsed
12:17 or our parsing and finally we have just a uh an error sort of message so this is to not know
12:29 it's not it's not a very complicated error handling we need here so this is going to be enough at least for now
12:36 okay so back to the parse function so we set that up uh our par state here
12:42 we set the cursor the input cursor to the the source text the beginning of it we set the end to the source text
12:50 uh to the you know to the first to the address just after the source states
12:56 and in this case remember we are starting out we're not really taking files as inputs just yet when we do we'll add like a you know a size
13:04 argument here but for now we're just taking a c string so this will use string line here
13:10 we set a blind start uh this start of the first line is the
13:15 you know this is start of the input the we initialize line number two to one
13:20 if line numbers were zero initialized we can just drop that since you know
13:25 in c at least c11 then any unknown numbers here would be zero
13:31 and uh here's our main scanning loop so we'll be we'll be adding to this and doing some parsing in
13:36 here and currently there's a function that's defined just above here let me bring that into you it's the log parse state and this
13:49 function up here it's used for is for development purposes
13:53 this is the one that if we let's run i think my computer is starting a little
14:00 bit it looks like linking is taking a while um
14:05 so this function just prints out the the tokens that we're seeing
14:10 so question mark here means that i'm gonna stop somewhere
14:15 oh look at that each
14:19 keyword token okay there we go um so we are scanning a token and we're we
14:27 keep scanning for the next token so we say hey scan the next token
14:33 and unless that token was the tn token so
14:38 this is the end of input stream print it and then scan another one
14:41 right so we just keep going so let me show you
14:51 i'm gonna let's copy this in here so we can have a look at that in our parse um kids upload that
15:01 i was just playing around with some let's do test things out
15:10 before we start this video okay so this is the this is the
15:15 currently built that's its source so first we we scan fun which is a
15:21 keyword so we get keyword function line one column one
15:25 then we get a name sim a symbol whatever um and that's
15:31 factorial right and we get the parentheses here start parentheses and
15:35 parentheses and then we got a keyword i32
15:42 uh and we get another keyword i32 and now we see here now we got our special um
15:48 magical semi colon generated from uh yes the line break
15:53 here next thing we are label
15:57 we don't get a semicolon after the label so here's a condition where
16:02 you know uh you know you can you can just do this it's just as valid
16:06 and next you know we get a register
16:14 so a capital r as i mentioned earlier is a i was debating a little bit about maybe
16:24 she used something else like dollar for registers or whatever but
16:29 if anyone comes here or familiar with other assemblies then dollar usually has
16:34 a different meaning like literal or something like that so you know uh let's keep it simple like
16:40 you know if you really need to call a local like r something use lowercase or
16:45 like any other character started with it's gonna be fine um so we have a register and it names the register and we can see in this case we have some extra information that we printed out here so here we have we've interpreted this number um as an actual number and we checked that we checked the balance of that number
17:05 two and now we can have a look at the the error here so let's say that we we name a register that's too large
17:11 then we run this and it's going to say on line three column five
17:16 we have an invalid register so line three column five right right
17:22 here this isn't a valid register 199 that doesn't that reduce doesn't if exist um same thing if we like do something like this it's also like yeah that's not a number but in that case
17:33 we're not the detail about the error message really we only care about that this is in the value register
17:40 uh and then you know it keeps going here i'm just testing uh just making sure
17:46 that the unicode uh sorry the utf-8 validator is like okay it's not a
17:50 perfect validator but it's it's just something fun that i did last night just to make sure that um it's really
17:56 dumb actually and again it's not it's not super reliable but it's
18:01 okay this thing is like you know eats some utf-8 sequences
18:07 uh when it's scanning a symbol so here's my i'm using my favorite
18:12 um example uh that has a serial joiner here
18:18 somewhere um so it's a uh it's the emoji for like woman and then
18:26 it's a skin tone modifier five and then zero would join her and then we
18:30 have a um let's just start
18:34 so yeah and then we have a um
18:40 rocket and that becomes woman astronaut with skin tone modified five
18:45 woman astronaut uh so that's what we're seeing over here slowly mode here and now we just got return you know and
18:52 and our real function i'm just keeping it small knight is to test things out like this is obviously not like valid syntax but we'll get to that
19:01 the real program that we're gonna parse is is more like this right it has some
19:06 some actually uh meaningful stuff in it
19:12 so uh yeah that's that's what we are now i'm going to do a
19:18 separate video when i'm working on implementing the parser
19:23 so here that's the summary of where we are and how things work