programming/projects/muscript · riki's house

hey! looks like you have JavaScript disabled.
I respect that decision, but you may find the experience of browsing the treehouse… not great.
for example, links to branches may not work properly. I cannot do anything about this; it's due to how the <details> element works.
(a <details> will not expand itself automatically to reveal the linked element to you.)
I did my best to at least keep the site readable in this state, but you can only do so much with plain HTML and CSS.

Pinky promise this website does not contain any malicious code such as trackers or cryptocurrency miners.
if you don't believe me, you're free to inspect the source yourself! all the scripts are written lovingly in vanilla JS (not minified!) by yours truly ❤️

and if this box is annoying, feel free to block it with uBlock Origin or something. I have no way of remembering you closed it, and don't wanna add a database to this website. simplicity rules!

repo
2023-09-10
my UnrealScript compiler
2023-09-10
part of the Stitchkit project, which aims to build a set of Hat in Time modding tools that are a joy to use
2023-09-10
- the name “MuScript” is actually a reference to Mustache Girl
  2023-09-10
architecture
2023-09-10
- MuScript uses a query-based architecture similar to rustc
  2023-09-10
  - the classic pass-based compiler architecture has the compiler drive itself by first parsing all files, then analyzing them, then emitting bytecode - in passes
    2023-09-10
  - a query-based architecture works by driving the compiler by asking it questions
    2023-09-10
    the most interesting question to us all being “do you know the bytecode of this package?”
    2023-09-10
    which then triggers another question “do you know the classes in this package?”
    2023-09-10
    which then triggers another question “do you know the contents (variables, functions, structs, enums, states, …) of those classes?”
    2023-09-10
    which then triggers another question “do you know all the variables in class ComfortZone?”
    2023-09-10
    which then triggers another question “do you know the ID of the type Int?”
    2023-09-10
    “yeah, it’s 4.”
    2023-09-10
    “there’s this variable HuggingQuota with ID 427.”
    2023-09-10
    which then triggers another question “do you know all the functions in class ComfortZone?”
    2023-09-10
    which then triggers another question “do you know the bytecode of function Hug?”
    2023-09-10
    which then triggers another question “do you know the ID of the class Person?”
    2023-09-10
    “yes indeed, this class exists and has ID 42.”
    2023-09-10
    which then triggers another question “do you know the bytecode of this function?”
    2023-09-10
    …you get the idea.
    2023-09-10
    “alright, here’s the bytecode for Hug.”
    2023-09-10
    “that’s all.”
    2023-09-10
ideas to try out
2023-09-10
- I jot down various silly ideas for MuScript in the future here
  2023-09-10
- parallelization with an event loop
  2023-09-10
  - the thing with a pass-based architecture is that with enough locking, it may be easy to parallelize.
    2023-09-10
    I can imagine parallelization existing on many levels here.
    2023-09-10
    if you have a language like Zig where every line can be tokenized independently, you can spawn a task per line.
    2023-09-10
    you could analyze all the type declarations in parallel. though with dependencies it gets hairy.
    2023-09-10
    you could analyze the contents of classes in parallel.
    2023-09-10
    thing is, this stuff gets pretty hairy when you get to think about dependencies between all these different stages.
    2023-09-10
    hence why we haven’t seen very many compilers that would adopt this architecture; most of them just sort of do their thing on one thread, and expect to parallelize by spawning more processes of cc or rustc or what have you.
    2023-09-10
  - where with a pass-based architecture the problem is dependencies between independent stages you’re trying to parallelize, with a query-based architecture like MuScript’s it gets even harder, because the entire compiler is effectively a dependency _{_hell} machine.
    2023-09-10
    one query depends on 10 subqueries, which all depend on their own subqueries, and so on and so forth.
    2023-09-10
    but there’s technically nothing holding us back from executing certain queries in parallel.
    2023-09-10
    in fact, imagine if we executed all queries in parallel.
    2023-09-10
  - enter: the async event loop compiler
    2023-09-10
    there is a central event loop that distributes tasks to be done to multiple threads
    2023-09-10
    every query function is async
    2023-09-10
    meaning it suspends execution of the current function, writes back “need to compute the type ID of Person because that’s not yet available” into a concurrent set (very important that it’s a set) - let’s call this set the “TODO set”
    2023-09-10
    on the next iteration, the event loop spawns a task for each element of the TODO set, and the tasks compute all the questions asked
    2023-09-10
    because we’re using a set, the computation is never duplicated; remember that if an answer has already been memoized, it does not spawn a task and instead returns the answer immediately
    2023-09-10
    though this may be hard to do with Rust because, as far as I know, there is no way to suspend a function conditionally? (needs research.)
    2023-09-10
    once there are no more tasks in the queue, we’re done compiling
    2023-09-10
- parsing less
  2023-09-20
  - I measured the compiler’s performance yesterday and most time is actually spent on parsing, out of all things
    2023-09-20
    going along with the philosophy of be lazy, we should probably be parsing less things then
    2023-09-20
  - I don’t think we need to parse method bodies if we’re not emitting IR
    2023-09-20
    basically, out of this code: unrealscript function Hug(Hat_Player OtherPlayer) { PlayAnimation('Hugging'); // or something, idk UE3 } parse only this: unrealscript function Hug(Hat_Player OtherPlayer) { /* token blob: PlayAnimation ( 'Hugging' ) ; */ } omitting the entire method body and treating it as an opaque blob of tokens until we need to emit IR
    2023-09-20
  - I don’t think we need to parse the entire class if we only care about its superclass
    2023-09-20
    basically, out of this code: unrealscript class lqGoatBoy extends Hat_Player;
    
    defaultproperties { Model = SkeletalMesh’lqFluffyZone.SkGoatBoy’; // etc } only parse the following: class lqGoatBoy extends HatPlayer;
    
    /* parser stops here, rest of text is ignored until needed */ and then only parse the rest if any class items are requested
    2023-09-20
    real case: getting the superclasses of Hat_Player takes a really long time because it’s big (Hat_Player.uc itself is around 8000 lines of code, and it has many superclasses which are also pretty big)
    2023-09-20
ideas I tried out
2023-10-20
- lexing first
  2023-09-20
  - something that MuScript did not use to do is have a separate tokenization stage
    2023-09-20
    this is because UnrealScript has some fairly idiosyncratic syntax which requires us to treat some things in braces {} as strings, such as cpptext
    2023-09-20
    cpptext { // this has to be parsed as C++ code, which has some differing lexing rules template <typename T> void Hug(T& Whomst) { DoHug(T::StaticClass(), &Whomst); } }
    2023-09-20
    but C++ is similar enough to UnrealScript that we are able to get away with lexing it using the main UnrealScript lexer
    2023-09-20
    we even lex variable metadata var int Something <ToolTip=bah>; using the lexer, storing invalid characters and errors as some InvalidCharacter token kind or something
    2023-09-20
    and that’s without emitting diagnostics - let the parser handle those instead
    2023-09-20
    one place where the current approach of the lexer eagerly emitting diagnostics fails is the case of <ToolTip=3D location>, where 3D is parsed as a number literal with an invalid suffix and thus errors out
    2023-09-20
  - implementing this taught me one important lesson: context switching is expensive
    2023-10-20
    having the lexer as a separate pass made the parsing 2x faster, speeding up the compiler pretty much two-fold (because that’s where the compiler was spending most of its time)
    2023-10-20
    my suspicion as to why this was slow is that the code for parsing, preprocessing, and reading tokens was scattered across memory - also with lots of branches that needed to be checked for each token requested by the parser
    2023-10-20
    I think also having token data in one contiguous block of memory also helped, though isn’t as efficient as it could be yet.
    2023-10-20
    the current data structure as of writing this is rust struct Token { kind: TokenKind, source_range: Range<usize>, }
    
    struct TokenArena { tokens: Vec<Token>, } (with some irrelevant things omitted - things like source files are not relevant for token streams themselves)
    2023-10-20
    I don’t know if I’ll ever optimize this to be even more efficient than it already is, but source ranges are mostly irrelevant to the high level task of matching tokens, so maybe arranging the storage like rust struct Tokens { kinds: Vec<TokenKind>, source_ranges: Vec<Range<usize>>, } could help
    2023-10-20
    another thing that could help is changing the usize source ranges to u32, but I don’t love the idea because it’ll make it even harder to support large files - not that we necessarily will ever support them, but it’s something to consider
    2023-10-20
insanium
2023-09-12
- section for incredibly wack ideas that will never work but fill my inner bit twiddle magician with joy
  2023-09-12
- also see Stitchkit’s insanium
  2023-09-12