"Transcript of RSM Episode 8"

RSM project page

00:07 hello and welcome back
00:11 so i was up really late last night way past 2 am working on this um posher
00:19 and it does a lot a lot more now it's not ready but it's pretty close
00:25 so where we left off left off in the uh previous part was um we started working
00:30 on the parser with done the scanner and we started working on the parser we could get some some some basic stuff
00:37 parsed and now actually parsing most of the i think actually the complete example function that we have here let's move these parse
00:46 test data to the end here so we can look at it um so here's the here's the input that we're parsing a couple of changes since the last time
00:56 first off comments i'm not just skipping comments that's advanced let's see
01:07 so as we're as we're scanning the input for tokens and tokenizing it
01:13 when we are scanning comments um i am at this point just
01:17 kind of ignoring that comment not producing any um any token for it but then just like re-run like just
01:25 going ahead and parsing the next token right which might be another comment and it keeps going so uh essentially we can we can uh swap
01:33 these two lines here and now this is probably not gonna work but yeah it's gonna complain but now we're getting uh
01:39 comment tokens as well so that allows us to we still you know that allows us to
01:43 in the future do things like using the parser just to
01:47 do pretty printing and formatting and stuff like that so we're going to need to scan the
01:54 comment anyways to know where it ends and it's the same amount of work as producing a token for it so anyhow
02:01 that's kind of neat a couple of other things uh i cleaned up
02:09 the naming so the other day when we were putting this together uh we just used
02:14 some names just to get going since since uh a person like this at
02:19 least this person gets a little bit long in terms of there's a lot of functions a lot of lines of code and stuff like that it's kind of nice to have some sort of naming scheme going so what i've said along for now at least is that a prefix s is for for scanning
02:33 the tokens tokenizing so s advanced advances to the scanning
02:37 state uh s reg means that we're scanning a
02:40 register uh token and so on
02:45 s name is getting a name and then p is for parse functions
02:52 so parse e statements uh there are a couple of things that don't follow this that are just sort of
02:58 just so common and nice that you know they don't have it but generally speaking there's a uh there's there's this sort of like naming thing going on
03:07 so and then we have a couple functions called that starts with make or mk
03:12 for just creating nodes right now there are three versions of these there's make note that just creates whatever no whatever token is currently out there there's one that that creates
03:23 a list so explicitly sets the no matter what the token is explicitly sets it to the type of node
03:30 to be a list and i use the l paren left parentheses for that
03:35 we get a nil node
03:41 this is what we called bad or bad node in the previous
03:45 part in some cases it might be useful to represent nil as well as bad so kind of
03:52 use it for both things and tn is just zero and so you know
03:56 that's that now we're we don't have that going on
04:02 here but i think let's see is it if the params are empty let's try it
04:06 there's some condition oh i guess we have a bug we can fix that
04:11 um there's some condition where i will just do no maybe in this case
04:15 yeah and so we can see here if i toggle this back on so let's look at what we've been parsing here so first off here we're still just printing out as with the previous part the older tokens that we're scanning of the source and there's quite a lot now
04:31 since we're part of the full function you know so we get the the function token and then a name and then parenthesis and so on
04:40 and those we interpret those and assign meaning to those that's what the parser does and so the the little parentheses here
04:46 um denotes sort of like a a logical grouping like it's an s expression kind of thing then the the first thing in in it unless there's a space uh is the
04:57 type of this group so like this tempo group is a function so this is a token
05:04 and then comes a a set of fields so the first field is just the name of the function then we have and so parenthesis we use
05:12 the space after is just a list and then we have a list of parameters
05:17 so we have n and j they have types here
05:22 it also allows us to do things like not naming it and then it we have first
05:29 here a um a named parameter and it's type i32 and then we an unnamed one
05:33 and then we just used type we don't
05:37 uh we don't create this sort of like little tuple of name and type
05:42 and the next thing i'm just getting back to the the nil
05:46 insertion here and then the the next thing that we parse here is the results
05:52 which can be a list of things it can be several results right since in our isa
06:04 input parameters and output results are kind of treated the same way
06:09 at least for now so same thing here really it's the same syntax it's just a parenthesis missing
06:16 uh or absent from the result here it's just to make this not ambiguous or anything like that
06:23 and what happens if a function doesn't return anything right we could just do
06:29 this right kind of like c like what not and say the result type of this function is like nothing i've always found that a
06:35 little bit weird and i understand why why void is needed in ac function i
06:40 guess this is the main function but let's see so like
06:44 you know if we remove this like parsing this and see would be tricky right since
06:49 is this like a type or whatever so void can observe that purpose and other
06:52 purposes but since we have this trailing style
06:57 we don't really need this we can just like leave it out to say that this function doesn't doesn't return any values
07:06 and when the parser gets to this point here right here we will see if if we have anything
07:13 before this either the the line end or the start of
07:18 a block if we have anything here it'll parse it like we just did here right and
07:23 we'll create this list of result types in this case it's just a single one a
07:27 and if nothing is in there we'll insert nil and this is gonna make things like much easier for us later another option here would be to just leave that out right and so
07:37 uh the the tricky thing with doing that is that we would have a variable length
07:43 of this uh function list right so then when we do things later like generate code from this to do analysis on this we would have to every time like go and count the length of the function list to know
07:54 if it has any results um and so this way we just keep it simple for ourselves if the if uh item number two in its list
08:04 here is uh is nil we know that there are no results and if it's not we know that there are results and and all the offsets are the
08:10 same and yeah so that's what we can make use of now anyhow the rest is kind of self-explanatory i think the difference is from the other day
08:27 okay so the next thing i want to talk about is um how this is kind of different from
08:34 usually the purses i'll work on uh in this case we have
08:40 we have this let's see how should we um let's let's grab this and put it into a
08:45 new file uh we'll do this okay this would be good um
08:51 that's the syntax co2 or something like that it'll
08:59 match pretty well okay so in uh and actually our simplex change a
09:03 little bit we now have these credit braces to to tell where the function starts being and yeah so in many cases we might have some sort of um
09:15 you know nested things structures in here right
09:20 so we might have like a block and you know a block and so on right
09:25 and these might be statements or they might not be and uh we might have labels
09:29 or not however in our simple little syntax here
09:34 we've made it we've we've really constrained the the the problem space or
09:38 the syntax here um and we're also at the level where
09:43 we're just dealing with these essentially logical blocks of a function right so this function has three blocks here's the first block here's the second one and here's the
09:57 and if we just leave out this label right we have an implicit first block here if we leave out this label we just have two blocks now right
10:05 and what's kind of neat about this and i was i was going back and forth last night about you know should
10:11 uh should i try to just go straight to some sort of ssa form which makes things
10:15 like uh code generation and radish allocation maybe a little somewhere um
10:21 and i sketch out some things and this is why i have something to it and i sketch out some things um
10:30 and realize that it will probably a bit of overkill maybe one day
10:34 it would be neat to streamline the parser just go straight to some sort of uh ssa form and then ssa
10:40 part we talked about this in a previous part um so
10:50 let's see we have so early on this might actually be in
11:03 the first part yeah i think this is in part one if if you want to go back and and and listen to like the the thinking around this and
11:11 uh and stuff like that anyhow so one of the
11:16 forms we considered for this like assembly language um was ssa style and i
11:22 wrote that up pretty quickly essence you have to deal with with five notes so
11:27 kind of fusing different paths together and stuff like that this is great for machines but kind of awkward for humans
11:34 and he also has a safe form also like you you only assigned to something once
11:37 right um or bind rather
11:42 so this is kind of what has to say the ssa form of the function will look like i guess we can just copy that into our
11:51 um sketch file here okay this is the ssa style of it
11:55 and realized that a middle ground here is
11:59 just to say that a function like whenever we're parsing
12:04 this function the structure of this is not a set of statements but it is a set
12:08 of blocks so a function is not a list of statements is the list of blocks
12:16 and if you leave out that first block i synthesize the first block sort of like an implicit
12:21 first block rather than you know saying that you know you can do
12:27 this or that and that greatly simplify the parser first off i didn't do that any you know i was scratching my head a lot like oh how should i handle this right like this kind of case where you get statements and then you get a block right which is kind of logically this
12:40 um at least that's what i want it to be it's going to be i think it will make things easier later down
12:46 so now this the function now is always um
12:51 so this box let's see i have a oh i have a next here so i sketched out the syntax a little
13:00 bit this i guess is a boring ass bnf um ish but
13:07 the the point here is having some sort of like formal
13:10 expression of what the syntax is supposed to to be is i think going to be helpful at least it was for me last night but then again i was kind of tired at the end so i don't know if it's useful anymore but i keep this so my screen obviously it's
13:24 much bigger than what i'm uh recording here so i'm keeping this in a different part of my screen that's outside of the recording but uh as a little reference to myself
13:34 and so what we have here is uh the function definition you know we've looked at that the function body here starts with the curly braces ends with curly braces and then it has a block and this is kind of what i was talking about so a function is not
13:50 a function's body is not like a list of like just some statements it's more
13:54 structured than that it's a list of blocks and a block
13:59 is either anonymous or um
14:04 or not so anonymous block will be is the first block there's no other block can be anonymous or really i think at this point might
14:11 actually remove this and simplify this so i'll show you in a minute what i ended up doing and a label block right is as we as we
14:20 looked at is what we have here um a block that we call something like b0
14:24 right b1 and a block itself contains
14:31 some name excuse me um
14:35 some name i haven't come up with a good name here just call it like statement one
14:42 it's a block statement maybe lock statement since it's like not any
14:49 statement uh since you know anything could be a statement like a function efficient um it's a block statement it's at least one
14:57 of them since like a block just two labels following each other i think should be valid but if you want to make the valid we can just do that because it's uh zero more this is one or more of
15:06 these it's and it's either an operation or an
15:10 assignment and i think that's it actually
15:14 i think that's it uh an assignment will look something like this uh register three is you know
15:24 add register one and register
15:34 or it can be a local name once we implement locals
15:39 right you can do know x
15:42 y y and z
15:47 and the operation is something that doesn't have the equal
15:50 sign in between and that is specifically a
15:54 name so it would not be registered so something like uh
16:00 branch if zero it's that let's just make sure i'm not
16:13 so it's really simple uh and then the
16:17 arguments you know it's an example right it's like
16:24 it's it's what we see here you know i needed to write a check okay
16:27 so these are arguments right it's either a register or a name local like y here um
16:35 or a literal and a literal
16:43 uh [Music]
16:49 should define that so there are uh it's yes it looked
16:56 it lit so it's like a something funny
17:03 um so it's either a decimal one let's see so we either have
17:17 and then zero two nine
17:22 a two f right
18:00 we have a decimal literal and hex petrol so the uh the builder girl starts with the b
18:14 one more of zero and and the decimal literal does not begin
18:27 with any of those does that it actually says here tonight
18:34 yeah i know this is not like some valid vnf or whatever but it's not for a
18:38 machine um okay i think that's it and then maybe
18:43 one day we'll have string literals but no no it's just integrals maybe floating
18:48 pointless drops at some point so i think that's this is pretty much the the simplex
18:57 let's move the outside of the screen okay so the implicit block so let's go back here
19:04 and yes try it out so now we have a name block here block zero right so we're talking about a function is
19:11 strictly a list of blocks not just random statements so what happens if we do this right like
19:16 how do we handle this you're going to see that we still get b zero here there's still a block here
19:23 so we have here's a block b zero here's a block one and here we have the final block
19:29 right and the correspond to this um and i think it's kind of i think it ended up being catapult again
19:42 so we have a function here for parsing and this let me just show you so this is
19:50 triggered when we find a fun token right so f um token
19:55 and it does kind of what you would expect right it uh it
20:01 it grabs a name as the last one you know and it reports an error if it doesn't
20:06 get a name and it reads the parameters that are enclosed in parentheses
20:13 and then it reads an optional list of results
20:18 and we reuse the same function here as the parameters because they have the same format and then comes the body
20:27 and so we begin with we we always want a and you know we what we could do here actually we could do if
20:33 um so we can do this if we wanted to allow
20:47 us function definitions without a body so this allows us to do this
20:51 um so now we're gonna get a let's see oh
20:56 did i make a mistake here oh no i didn't we're just printing this out in a weird way um well actually let me silence these
21:04 token token printing things
21:09 i think we have what is that advanced oh yeah we have this macro
21:15 so let's just uncomment that since every top level thing we that's when we print out that's why we didn't see them so now we get a uh now we allow
21:23 sort of forward declarations if you will or like
21:26 external declarations whatever you want to call them sort of a a declaration of
21:31 a function existing it's only its signature but not its implementation
21:35 if you wanted to um let's say you could put
21:40 source code in multiple files it might be useful to do this some some popular irs like have this
21:46 like llvms ir for example does this rather than using includes or anything like that you just kind of you just kind of punch in the signature of a function that's defined somewhere else um and then you can use that somewhere
21:59 like this and the uh it's little parser can then
22:04 tell you that like oh you're you're giving it the right or the wrong arguments
22:10 so as he says that if we wanted to support support that prefix
22:25 see here's the function and here's the implicit here's how that
22:29 the implicit first block thing so the first thing we do is we just make sure that we get a um left left brace
22:36 select the oh the opening left brace here but we don't consume it just yet
22:41 so the first thing we do is to create a node uh to represent an implicit block zero
22:48 and we might use that we might not and then we also allocate the body node uh which is going to be a list node which is it's just the same as just sets the token remember um
23:01 and why we do it here and then we do advance is because the implementation of
23:08 our uh make node function here takes the the uh
23:13 you know creates a node structure for us and it sets it to the current token
23:17 in our case that's less interesting and then it sets its source position to the
23:22 current source position from our scan state and this is kind of like the
23:26 the the key reason we're doing things this way so jumping back
23:31 that means that the source position of our implicitly zero and body will be
23:36 you know this the start of this which semantically i think makes sense right
23:41 so that like when we think about you know where in the source is the is the
23:44 body where does it start to begin it's here
23:49 this is the body right rather than like this or some some abstract notion of
23:54 things and when it comes to the block similarly right if there's an implicit block where does that implicit block start does it start here or like at this
24:03 token probably not right i think it kind of starts up here in a place of block
24:07 so um that's what's going on so both of these will get the source position set to this
24:15 and our block zero will will be used down here right so the next thing we do so then we just consume that
24:21 uh brace and if we immediately get a
24:27 uh uh right hand side brace then we're just done that's just an empty function right so if it looks like
24:33 this then we just get a function without any
24:47 and so if you do have a body so if the next token is not the um
24:54 the ending brace there then we see is does the function start with a label
25:00 if it does not we essentially just insert this block zero and we call this
25:05 function that the um go look at it let's see what is this called uh i think this actually is like pre-page label yeah there we go
25:15 so here we have our parslet that is triggered for um the the label
25:22 and so whenever we encounter a label then
25:28 this is called we read the label's name and then we just trim off the trailing colon here with a needle and label name
25:34 and then what we do we use this uh breakout function here this assist
25:38 function we that we're going to use down here an implicit block
25:42 which essentially is reads a list of operations and assignments if you remember what we have in our
25:51 syntax here so we have a a block
25:56 right oh i guess we call it block a block statement yeah so reads monomer
26:00 block statements let's just lift that in here i keep clicking on this let me do this
26:15 and we actually define this as zero or more so maybe we should update our syntax like that
26:22 okay so this just reads
26:26 zero more block statements that is correct yeah it reads zero more um so pretty straightforward what this is doing and it appends it to
26:37 to this node that we pass in here right so it it calls this like list the pen
26:41 function to just add things that it's parsing so it just keeps adding things until it uh it encounters either a
26:49 right hand brace the end of the input stream which that would be an error later on or
26:55 another label which would be a start of another block so just keeps going until
27:00 any of those conditions are true and and then we use it so we we make use
27:05 of that down here right so we synthesize the the starting label here and we have
27:09 a constant for the first block which is b0 uh oops and no matter if we did have an explicit
27:20 opening block or if we synthesize spawn
27:24 beyond this point it doesn't matter anymore now we're just parsing
27:28 um more blocks right which might be
27:33 no more we might be done it might just be a single block or if at this point we have another
27:38 label scanned kind of line up for us
27:43 by the scanner then we'll just do the same thing we'll
27:47 just call our block function that we looked at for
27:51 scanning a block and we're done at this point the body now has a list of
27:58 blocks uh another detail is that now since we
28:07 have this implicit block name users are going to be or we're going to be able to reference that so we're going
28:13 to be able to do things like do we have b0 here sometimes somewhere
28:18 i guess we don't but like let's say that we had b zero in a reference trade like you're
28:23 going to be able to reference this block zero here as the implicit block which
28:29 i think for simplicity's sake we should just allow this otherwise it's going to be a special case to like remember if
28:34 it's like entered or not anyhow since we can do this we also cannot allow
28:40 uh any any block that is not the first
28:43 block to be called block zero because like if we now do this right like this
28:49 is ambiguous which block doesn't mention this one or like the implicit block zero here right um i'm sure that there are fancy ways to
28:57 solve this like with you know orientated name up here and then you know we can we
29:01 try to be clever but i think in practice it will be just bad
29:07 practice to allow a label to be called b0
29:13 um yes as a convention so ideas decided to disallow that so that's what what this little code does so if uh if the
29:20 block we is parsed here is the very first block in the function so it's the
29:25 um sorry if it is not the first right so if
29:29 the if the list already has at least one item so head is not null uh
29:38 let's go oh my keyword so if this is not null it's clear
29:42 and the the name of the block that we just
29:46 parsed is the b0 name and i broke this out into
29:51 a constant so i could shift it later and see the reference uh then that's an error i'm
29:57 gonna get an error message it's gonna call this block zero this is gonna complain it's gonna say uh on line nine here a block name bcr
30:07 must be the first block so i thought it was kind of neat
30:22 so i think this is it for for this part
30:28 what is next is to i need to do a little bit of cleanup this code is a little bit it's a little bit inconsistent a little bit messy for example
30:38 there is this uh looking at this syntax again right
30:42 there is a concept of an operation an assignment with arguments the parser
30:46 right now does not is not structured as nicely it's looser
30:50 than this and so i'm gonna clean that up and just
30:55 straighten out the parser to part exactly that and nothing else and issue an error if it gets anything else
31:02 and after that point i think what we will do it will be moving on to uh code
31:05 generation which will be kind of exciting we're getting one step closer down to our second big milestone of having a a
31:13 ready-made little sort of interpretive program that we can just toss these little source files on zoom
31:20 so we'll be generating these uh virtual machine instructions
31:25 at that point cool i hope you enjoyed this one um
31:32 you know chime in on twitter or reach out some other way if you're if there any questions or anything like that and
31:37 i hope to catch you in the next part