programming/projects/muscript

  • repo

    2023-09-10
  • my UnrealScript compiler

    2023-09-10
  • part of the Stitchkit project, which aims to build a set of Hat in Time modding tools that are a joy to use

    2023-09-10
    • the name “MuScript” is actually a reference to Mustache Girl

      2023-09-10
  • architecture

    2023-09-10
    • MuScript uses a query-based architecture similar to rustc

      2023-09-10
      • the classic pass-based compiler architecture has the compiler drive itself by first parsing all files, then analyzing them, then emitting bytecode - in passes

        2023-09-10
      • a query-based architecture works by driving the compiler by asking it questions

        2023-09-10
        • the most interesting question to us all being “do you know the bytecode of this package?”

          2023-09-10
          • which then triggers another question “do you know the classes in this package?”

            2023-09-10
            • which then triggers another question “do you know the contents (variables, functions, structs, enums, states, …) of those classes?”

              2023-09-10
              • which then triggers another question “do you know all the variables in class ComfortZone?”

                2023-09-10
                • which then triggers another question “do you know the ID of the type Int?”

                  2023-09-10
                  • “yeah, it’s 4.”

                    2023-09-10
                • “there’s this variable HuggingQuota with ID 427.”

                  2023-09-10
              • which then triggers another question “do you know all the functions in class ComfortZone?”

                2023-09-10
                • which then triggers another question “do you know the bytecode of function Hug?”

                  2023-09-10
                  • which then triggers another question “do you know the ID of the class Person?”

                    2023-09-10
                    • “yes indeed, this class exists and has ID 42.”

                      2023-09-10
                  • which then triggers another question “do you know the bytecode of this function?”

                    2023-09-10
                    • …you get the idea.

                      2023-09-10
                  • “alright, here’s the bytecode for Hug.”

                    2023-09-10
                • “that’s all.”

                  2023-09-10
  • ideas to try out

    2023-09-10
    • I jot down various silly ideas for MuScript in the future here

      2023-09-10
    • parallelization with an event loop

      2023-09-10
      • the thing with a pass-based architecture is that with enough locking, it may be easy to parallelize.

        2023-09-10
        • I can imagine parallelization existing on many levels here.

          2023-09-10
          • if you have a language like Zig where every line can be tokenized independently, you can spawn a task per line.

            2023-09-10
          • you could analyze all the type declarations in parallel. though with dependencies it gets hairy.

            2023-09-10
          • you could analyze the contents of classes in parallel.

            2023-09-10
        • thing is, this stuff gets pretty hairy when you get to think about dependencies between all these different stages.

          2023-09-10
          • hence why we haven’t seen very many compilers that would adopt this architecture; most of them just sort of do their thing on one thread, and expect to parallelize by spawning more processes of cc or rustc or what have you.

            2023-09-10
      • where with a pass-based architecture the problem is dependencies between independent stages you’re trying to parallelize, with a query-based architecture like MuScript’s it gets even harder, because the entire compiler is effectively a dependency hell machine.

        2023-09-10
        • one query depends on 10 subqueries, which all depend on their own subqueries, and so on and so forth.

          2023-09-10
          • but there’s technically nothing holding us back from executing certain queries in parallel.

            2023-09-10
            • in fact, imagine if we executed all queries in parallel.

              2023-09-10
      • enter: the async event loop compiler

        2023-09-10
        • there is a central event loop that distributes tasks to be done to multiple threads

          2023-09-10
        • every query function is async

          2023-09-10
          • meaning it suspends execution of the current function, writes back “need to compute the type ID of Person because that’s not yet available” into a concurrent set (very important that it’s a set) - let’s call this set the “TODO set”

            2023-09-10
        • on the next iteration, the event loop spawns a task for each element of the TODO set, and the tasks compute all the questions asked

          2023-09-10
          • because we’re using a set, the computation is never duplicated; remember that if an answer has already been memoized, it does not spawn a task and instead returns the answer immediately

            2023-09-10
            • though this may be hard to do with Rust because, as far as I know, there is no way to suspend a function conditionally? (needs research.)

              2023-09-10
        • once there are no more tasks in the queue, we’re done compiling

          2023-09-10
    • parsing less

      2023-09-20
      • I measured the compiler’s performance yesterday and most time is actually spent on parsing, out of all things

        2023-09-20
        • going along with the philosophy of be lazy, we should probably be parsing less things then

          2023-09-20
      • I don’t think we need to parse method bodies if we’re not emitting IR

        2023-09-20
        • basically, out of this code: unrealscript function Hug(Hat_Player OtherPlayer) { PlayAnimation('Hugging'); // or something, idk UE3 } parse only this: unrealscript function Hug(Hat_Player OtherPlayer) { /* token blob: PlayAnimation ( 'Hugging' ) ; */ } omitting the entire method body and treating it as an opaque blob of tokens until we need to emit IR

          2023-09-20
      • I don’t think we need to parse the entire class if we only care about its superclass

        2023-09-20
        • basically, out of this code: unrealscript class lqGoatBoy extends Hat_Player;

          defaultproperties { Model = SkeletalMesh’lqFluffyZone.SkGoatBoy’; // etc } only parse the following: class lqGoatBoy extends HatPlayer;

          /* parser stops here, rest of text is ignored until needed */ and then only parse the rest if any class items are requested

          2023-09-20
        • real case: getting the superclasses of Hat_Player takes a really long time because it’s big (Hat_Player.uc itself is around 8000 lines of code, and it has many superclasses which are also pretty big)

          2023-09-20
  • ideas I tried out

    2023-10-20
    • done lexing first

      2023-09-20
      • something that MuScript did not use to do is have a separate tokenization stage

        2023-09-20
        • this is because UnrealScript has some fairly idiosyncratic syntax which requires us to treat some things in braces {} as strings, such as cpptext

          2023-09-20
          • cpptext
            {
                // this has to be parsed as C++ code, which has some differing lexing rules
                template <typename T>
                void Hug(T& Whomst)
                {
                    DoHug(T::StaticClass(), &Whomst);
                }
            }
            
            2023-09-20
        • but C++ is similar enough to UnrealScript that we are able to get away with lexing it using the main UnrealScript lexer

          2023-09-20
        • we even lex variable metadata var int Something <ToolTip=bah>; using the lexer, storing invalid characters and errors as some InvalidCharacter token kind or something

          2023-09-20
          • and that’s without emitting diagnostics - let the parser handle those instead

            2023-09-20
            • one place where the current approach of the lexer eagerly emitting diagnostics fails is the case of <ToolTip=3D location>, where 3D is parsed as a number literal with an invalid suffix and thus errors out

              2023-09-20
      • implementing this taught me one important lesson: context switching is expensive

        2023-10-20
        • having the lexer as a separate pass made the parsing 2x faster, speeding up the compiler pretty much two-fold (because that’s where the compiler was spending most of its time)

          2023-10-20
          • my suspicion as to why this was slow is that the code for parsing, preprocessing, and reading tokens was scattered across memory - also with lots of branches that needed to be checked for each token requested by the parser

            2023-10-20
        • I think also having token data in one contiguous block of memory also helped, though isn’t as efficient as it could be yet.

          2023-10-20
          • the current data structure as of writing this is rust struct Token { kind: TokenKind, source_range: Range<usize>, }

            struct TokenArena { tokens: Vec<Token>, } (with some irrelevant things omitted - things like source files are not relevant for token streams themselves)

            2023-10-20
            • I don’t know if I’ll ever optimize this to be even more efficient than it already is, but source ranges are mostly irrelevant to the high level task of matching tokens, so maybe arranging the storage like rust struct Tokens { kinds: Vec<TokenKind>, source_ranges: Vec<Range<usize>>, } could help

              2023-10-20
              • another thing that could help is changing the usize source ranges to u32, but I don’t love the idea because it’ll make it even harder to support large files - not that we necessarily will ever support them, but it’s something to consider

                2023-10-20
  • insanium

    2023-09-12
    • section for incredibly wack ideas that will never work but fill my inner bit twiddle magician with joy

      2023-09-12
    • also see Stitchkit’s insanium

      2023-09-12