composable virtual file systems

  • you know what I hate?

    2024-11-25
    • State.

      2024-11-25
  • guess what file systems are: giant balls of state controlled fully by imperative code!

    2024-11-25
    • how the heck are we supposed to write reliable code in an environment where we can’t even be sure everything is in the state we expect it to be in?

      2024-11-25
  • having to handle state desynchronizations results in a massive amount of complexity and space for bugs. and unfortunately… there’s really nothing you can do about this, because the whole point of file systems is to enable persistent storage. oh

    2024-11-25
  • so I’ve been thinking if maybe there was some way to make file systems more declarative.

    2024-11-25
    • after all, functional programming exists, and it enables you to think about data and transformations, rather than sending commands to the computer.

      2024-11-25
    • systemd exists, NixOS also exists… both enable you to specify what state your system should reach, rather than how it should be done.

      2024-11-25
      • well, right. actually, systemd does have an imperative element to it, because you have to tell it what command should be executed to get your process spinning. I’d argue that is still way better than a shell script, because it can restart failing services selectively without any extra work on your part.

        2024-11-25
  • my idea was basically to enable the programmer to specify an expected file system structure declaratively, like so:

    root = dir {
        crates = dir "crates",
        readme = file "README.md",
    }
    

    and then you’d be able to access the readme file via root.readme, rather than by the usual fopen-fwrite-fclose interface.

    2024-11-25
    • but the more I thought about the idea in this way, the less sense it made.

      2024-11-25
    • sure it may be fine for asserting the expected structure of files in the file system, but what if some of the files don’t exist? what about directories where you explicitly don’t know about the file structure?

      2024-11-25
    • basically, this idea was weirdly bidirectional in a way that my little cat brain couldn’t process. so that never went anywhere.

      2024-11-25
  • a few weeks ago however, I had a different revelation: what if instead of interfacing with the underlying file system, we… build one? like… you know. a virtual file system?

    2024-11-25
    • I was familiar with the idea of creating an API exposing a virtual file system through LÖVE, which exposes bindings to PhysicsFS.

      2024-11-25
      • I know of at least one game that makes use of PhysicsFS outside of LÖVE, but I had never used it myself in a project.

        2024-11-25
      • looking at the API exposed by LÖVE though, it’s pretty simple—you get a (global) file system, to which you can add mount points, which represent physical directories on your computer.

        2024-11-25
  • this idea seemed incredibly cool, though going with my functional zen I just needed to make it composable—so my end goal was to have a file system made out of pieces you can fit together like LEGO bricks!

    2024-11-25
  • a fractal of files

    2024-11-25
    • I started designing. I knew I at least needed the ability to enumerate and read files, so I needed at least these two functions:

      trait Dir {
          /// List all entries under the given path.
          fn dir(&self, path: &VPath) -> Vec<VPathBuf>;
      
          /// Return the byte content of the entry at the given path, or `None` if the path does not
          /// contain any content.
          fn content(&self, path: &VPath) -> Option<Vec<u8>>;
      }
      

      this alone already gave me an insane amount of insight!

      2024-11-25
      • first of all, from the perspective of the program, do we really need to differentiate between directories and files?

        2024-11-25
        • from a GUI design standpoint, they’re a useful abstraction for us humans: the more concrete an object, the easier it is to grasp. files and folders seem a lot easier to grasp than abstract entries which may represent documents, folders, or both. a program, though, couldn’t care less.

          2024-11-25
        • compare this code for walking the file system:

          fn walk_dir_rec(dir: &dyn Dir, path: &VPath, f: &mut impl FnMut(&VPath)) {
              for entry in dir.dir(path) {
                  f(&entry);
                  // `f` is borrowed rather than moved, so it can be reused on every iteration.
                  walk_dir_rec(dir, &entry, f);
              }
          }
          
          fn process_all_png_files(dir: &dyn Dir) {
              walk_dir_rec(dir, VPath::ROOT, &mut |path: &VPath| {
                  if path.extension() == Some("png") {
                      if let Some(content) = dir.content(path) {
                          // do stuff with the file
                      }
                  }
              });
          }
          
          2024-11-25
        • to this code, which has to differentiate between files and directories, because calling dir on a file or content on a directory is an error:

          fn walk_dir_rec(dir: &dyn Dir, path: &VPath, f: &mut impl FnMut(&DirEntry)) {
              // in this variant, `dir.dir` has to return entries tagged with their kind.
              for entry in dir.dir(path) {
                  f(&entry);
                  if entry.kind == DirEntryKind::Dir {
                      walk_dir_rec(dir, &entry.path, f);
                  }
              }
          }
          
          fn process_all_png_files(dir: &dyn Dir) {
              walk_dir_rec(dir, VPath::ROOT, &mut |entry: &DirEntry| {
                  if entry.path.extension() == Some("png")
                      && entry.kind == DirEntryKind::File
                  {
                      if let Some(content) = dir.content(&entry.path) {
                          // do stuff with the file
                      }
                  }
              });
          }
          

          to me, the logic seems a lot simpler in the former case, separating the concerns of walking the directory in walk_dir_rec from the concerns of reading the files in process_all_png_files!

          2024-11-25
        • this does not automatically mean it’s a good idea to design an operating system around this, but it’s interesting to think about the properties that emerge from removing the separation.

          2024-11-25
          • it may not even be the greatest idea to interface with the physical file system in this way, if the communication has to be bidirectional—since real world file systems separate files from directories, think about what happens if your program tries to write content to an entry which already has a dir.

            2024-11-25
            • in that manner, this is a leaky abstraction.

              2024-11-25
      • second… this looks a lot like resource forks! so imagine that you can add even more metadata to file system entries, by adding more methods to this trait.

        2024-11-25
        • feels like an incredibly useful way to propagate auxiliary data through the program!

          2024-11-25
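      • to make the “entries can be both” idea concrete, here is a minimal sketch. it is not the treehouse’s actual code: `&str` stands in for `VPath`, and `FlatDir`, `walk` and `sample` are hypothetical names made up for the example. the `notes` entry has content of its own and children at the same time, and the walker never has to ask which kind of entry it is looking at.

        ```rust
        use std::collections::BTreeMap;

        /// Simplified stand-in for the post's `Dir` trait; `&str` replaces `VPath`.
        trait Dir {
            fn dir(&self, path: &str) -> Vec<String>;
            fn content(&self, path: &str) -> Option<Vec<u8>>;
        }

        /// Hypothetical flat store: every key is a full path, "" is the root.
        /// For brevity, only entries that have content are stored.
        struct FlatDir(BTreeMap<String, Vec<u8>>);

        impl Dir for FlatDir {
            fn dir(&self, path: &str) -> Vec<String> {
                // Direct children are keys of the form "<path>/<name>" with no further '/'.
                self.0
                    .keys()
                    .filter(|key| {
                        let rest = if path.is_empty() {
                            Some(key.as_str())
                        } else {
                            key.strip_prefix(path).and_then(|r| r.strip_prefix('/'))
                        };
                        matches!(rest, Some(name) if !name.is_empty() && !name.contains('/'))
                    })
                    .cloned()
                    .collect()
            }

            fn content(&self, path: &str) -> Option<Vec<u8>> {
                self.0.get(path).cloned()
            }
        }

        /// Visit every entry below `path`, recursing into all of them unconditionally.
        fn walk(dir: &dyn Dir, path: &str, f: &mut impl FnMut(&str)) {
            for entry in dir.dir(path) {
                f(&entry);
                walk(dir, &entry, f);
            }
        }

        /// Demo data: `notes` is a "file" and a "folder" at once.
        fn sample() -> FlatDir {
            let mut store = BTreeMap::new();
            store.insert("notes".to_string(), b"index of my notes".to_vec());
            store.insert("notes/todo".to_string(), b"water the plants".to_vec());
            FlatDir(store)
        }

        fn main() {
            let dir = sample();
            let mut visited = Vec::new();
            walk(&dir, "", &mut |path: &str| visited.push(path.to_string()));
            println!("{visited:?}");
        }
        ```

        a real implementation would also need a way to list entries that have children but no content, which this sketch skips.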
    • with these two functions, the ability to join paths, and remove their prefixes, this is enough to start building interesting things.

      2024-11-25
      • since we’d like our file system to be composable, we’ll need a composition operator first. I’m naming mine MemDir, because it represents an in-memory dir with entries. I’ll spare you the implementation details, but it acts more or less like a hash map:

        let mut dir = MemDir::new();
        dir.add(VPath::new("README.txt"), readme_txt);
        dir.add(VPath::new("src"), src);
        
        2024-11-25
      • and now for the opposite of the composition operator: we’ll need an operator that decomposes Dirs into smaller ones. the name of this one I’ll take from the command line—Cd, meaning change directory:

        let src = Cd::new(dir, VPath::new("src"));
        
        2024-11-25
      • and voilà, the file system is composable!

        2024-11-25
    • interestingly enough, assuming your virtual paths cannot represent parent directories, this already forms the foundation of a capability-based file system.

      2024-11-25
      • if you want to give a program access to your src directory, but nothing else, give it that Cd from above, and it cannot access anything outside of it.

        2024-11-25
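    • since the implementation details were skipped above, here is a rough sketch of how the two operators could work, under simplifying assumptions: `&str` replaces `VPath`, entries are shared through `Arc<dyn Dir>`, and `Blob`, `join` and `sample_root` are hypothetical helpers. the real MemDir and Cd may well look different. the interesting parts are the path joining in Cd and the prefix handling in both directions.

      ```rust
      use std::collections::BTreeMap;
      use std::sync::Arc;

      /// Simplified stand-in for the post's `Dir`: `&str` replaces `VPath`, "" is the root.
      trait Dir {
          fn dir(&self, path: &str) -> Vec<String>;
          fn content(&self, path: &str) -> Option<Vec<u8>>;
      }

      /// Join two relative paths, treating "" as the root.
      fn join(a: &str, b: &str) -> String {
          match (a.is_empty(), b.is_empty()) {
              (true, _) => b.to_string(),
              (_, true) => a.to_string(),
              _ => format!("{a}/{b}"),
          }
      }

      /// Hypothetical leaf entry that only has content.
      struct Blob(Vec<u8>);

      impl Dir for Blob {
          fn dir(&self, _path: &str) -> Vec<String> {
              Vec::new()
          }
          fn content(&self, path: &str) -> Option<Vec<u8>> {
              path.is_empty().then(|| self.0.clone())
          }
      }

      /// Composition: mounts other `Dir`s under single-segment names.
      #[derive(Default)]
      struct MemDir {
          mounts: BTreeMap<String, Arc<dyn Dir>>,
      }

      impl MemDir {
          fn add(&mut self, name: &str, dir: Arc<dyn Dir>) {
              self.mounts.insert(name.to_string(), dir);
          }
      }

      impl Dir for MemDir {
          fn dir(&self, path: &str) -> Vec<String> {
              if path.is_empty() {
                  return self.mounts.keys().cloned().collect();
              }
              let (name, rest) = path.split_once('/').unwrap_or((path, ""));
              match self.mounts.get(name) {
                  // Re-prefix the child's entries so they stay valid paths from our root.
                  Some(dir) => dir.dir(rest).iter().map(|e| join(name, e)).collect(),
                  None => Vec::new(),
              }
          }
          fn content(&self, path: &str) -> Option<Vec<u8>> {
              let (name, rest) = path.split_once('/').unwrap_or((path, ""));
              self.mounts.get(name)?.content(rest)
          }
      }

      /// Decomposition: exposes the subtree of `parent` rooted at a non-empty `prefix`.
      struct Cd {
          parent: Arc<dyn Dir>,
          prefix: String,
      }

      impl Dir for Cd {
          fn dir(&self, path: &str) -> Vec<String> {
              self.parent
                  .dir(&join(&self.prefix, path))
                  .iter()
                  // Strip the prefix again so callers never see the parent's paths.
                  .filter_map(|e| e.strip_prefix(&self.prefix)?.strip_prefix('/').map(str::to_string))
                  .collect()
          }
          fn content(&self, path: &str) -> Option<Vec<u8>> {
              self.parent.content(&join(&self.prefix, path))
          }
      }

      fn sample_root() -> Arc<dyn Dir> {
          let mut src = MemDir::default();
          src.add("main.rs", Arc::new(Blob(b"fn main() {}".to_vec())));
          let mut root = MemDir::default();
          root.add("README.txt", Arc::new(Blob(b"hello".to_vec())));
          root.add("src", Arc::new(src));
          Arc::new(root)
      }

      fn main() {
          let src_only = Cd { parent: sample_root(), prefix: "src".to_string() };
          println!("{:?}", src_only.dir(""));
          // ".." is just an unknown name here, so this cannot escape `src`.
          println!("{:?}", src_only.content("../README.txt"));
      }
      ```

      nothing in the sketch normalizes paths, which is what makes the capability property hold: a `..` segment is only ever looked up as a literal entry name.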
  • get real

    2024-11-25
    • having a design in mind, I thought it would be interesting to integrate it into a real project. the treehouse seemed like the right thing to test it out on, since it’s effectively a compiler—it transforms a set of source directories into a target directory.

      2024-11-25
      • and I have to say: so far, I’m liking it!

        2024-11-25
    • I ended up needing a few more resource forks to implement all the existing functionality.

      pub trait Dir: Debug {
          /// List all entries under the provided path.
          fn dir(&self, path: &VPath) -> Vec<DirEntry>;
      
          /// Return the byte content of the entry at the given path.
          fn content(&self, path: &VPath) -> Option<Vec<u8>>;
      
          /// Get a string signifying the current version of the provided path's content.
          /// If the content changes, the version must also change.
          ///
          /// Returns None if there is no content or no version string is available.
          fn content_version(&self, path: &VPath) -> Option<String>;
      
          /// Returns the size of the image at the given path, or `None` if the entry is not an image
          /// (or its size cannot be known.)
          fn image_size(&self, _path: &VPath) -> Option<ImageSize> {
              None
          }
      
          /// Returns a path relative to `config.site` indicating where the file will be available
          /// once served.
          ///
          /// May return `None` if the file is not served.
          fn anchor(&self, _path: &VPath) -> Option<VPathBuf> {
              None
          }
      
          /// If a file can be written persistently, returns an [`EditPath`] representing the file in
          /// persistent storage.
          ///
          /// An edit path can then be made into an [`Edit`].
          fn edit_path(&self, _path: &VPath) -> Option<EditPath> {
              None
          }
      }
      
      2024-11-25
      • content_version and anchor are both used to assemble URLs out of Dir entries. I have a function url which, given a root URL, returns a URL with a ?v= parameter for cache busting.

        pub fn url(site: &str, dir: &dyn Dir, path: &VPath) -> Option<String> {
            let anchor = dir.anchor(path)?;
            if let Some(version) = dir.content_version(path) {
                Some(format!("{}/{anchor}?v={version}", site))
            } else {
                Some(format!("{}/{anchor}", site))
            }
        }
        
        2024-11-25
      • image_size is used to automatically determine the size of images at build time. that way I can add width="" height="" attributes to all <img> tags, preventing layout shift.

        2024-11-25
        • technically, as of writing this, not all images have this. notably, ones I paste into the markup, like this one:

          goofy close up screenshot of Hat Kid staring down into the camera

          if you refresh the page with this branch open, you will notice the layout shift.

          unless I’ve already fixed it as of the time you’re reading this, in which case… partying

          2024-11-25
    • one notable piece of functionality that is currently missing is version history. to be honest, I’m still figuring that one out in my head; I have the feeling it’s not exactly going to be simple, but it should end up being a lot more principled than whatever this was.

      2024-11-25
  • Radio Edit (radio edit)

    2024-11-25
    • but wait riki! what’s that edit_path do?

      2024-11-25
    • one notable thing about this virtual file system is that it doesn’t allow writing to the virtual files.

      2024-11-25
      • I mean, think about it. it just doesn’t make sense! we have an immutable & reference, and allowing the program to edit files as it’s compiling could wreak some real havoc…!

        2024-11-25
        • and not only that, there’s also the question of useless edits, edits that tweak in-memory files. those don’t persist across restarts, so they don’t really make much sense, do they?

          2024-11-25
    • it may seem pointless to want to have the treehouse—again, a compiler of sorts—capable of editing source files, but there is one legit use case: I have a fix command, which fills in all the missing branch IDs in a file, optionally saving it in place.

      2024-11-25
      • there’s also a fix-all command, which does this to all files.

        2024-11-25
    • so I needed to devise an API that would let me write to files after the command is done running. that’s what the edit_path is for.

      2024-11-25
    • edit_path returns an EditPath, which represents a location somewhere in persistent storage. having an EditPath, you can construct an Edit.

      /// Represents a pending edit operation that can be written to persistent storage later.
      #[derive(Debug, Clone, PartialEq, Eq)]
      pub enum Edit {
          /// An edit that doesn't do anything.
          NoOp,
      
          /// Write the given string to a file.
          Write(EditPath, String),
      
          /// Execute a sequence of edits in order.
          Seq(Vec<Edit>),
          /// Execute the provided edits in parallel.
          All(Vec<Edit>),
      
          /// Makes an edit dry.
          ///
          /// A dry edit only logs what operations would be performed, does not perform the I/O.
          Dry(Box<Edit>),
      }
      

      Edits take many shapes and forms, but the most important one for us is Write: it allows you to write a file to the disk.

      the other ones are for composing Edits together into larger ones.

      2024-11-25
      • Seq can be used to implement transactions, where all edits have to succeed for the parent edit to be considered successful.

        2024-11-25
        • I use this for writing backups. if writing a backup fails, we wouldn’t want to overwrite the original file, because we wouldn’t be able to restore it!

          2024-11-25
      • All can be used to aggregate independent edits together and execute them in parallel.

        2024-11-25
      • Dry can be used to implement a --dry-run command, only printing an edit instead of applying it.

        2024-11-25
      • NoOp can be used when you need to produce an Edit, but don’t actually want to perform any operations.

        2024-11-25
        • this runs contrary to my opinion on None enums, for one reason: would you rather have to handle Option<Edit> everywhere, or just assume whatever Edit you’re being passed is valid?

          2024-11-25
          • try replacing recursive Edit references in the example above with Option<Edit>s.

            2024-11-25
          • while Edits are meant to be composed, they aren’t really meant to be inspected outside of the code that applies edits to disk—therefore having a NoOp variant actually improves readability, since Edit is constructed more often than it is deconstructed.

            2024-11-25
          • if this doesn’t yell “you shouldn’t treat anyone’s opinions as dogma—not even your own ones,” I don’t know what will.

            2024-11-25
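      • for illustration, here is one way the code that applies an Edit could look. this is a sketch under assumptions, not the treehouse’s implementation: `EditPath` is reduced to a plain `PathBuf`, `apply` and `apply_inner` are hypothetical names, every write is recorded in a log instead of going through a real logger, and `All` runs its children one after another rather than in parallel.

        ```rust
        use std::{fs, io, path::PathBuf};

        /// Hypothetical stand-in for the post's `EditPath`: just a path on disk.
        #[derive(Debug, Clone, PartialEq, Eq)]
        pub struct EditPath(pub PathBuf);

        #[derive(Debug, Clone, PartialEq, Eq)]
        pub enum Edit {
            NoOp,
            Write(EditPath, String),
            Seq(Vec<Edit>),
            All(Vec<Edit>),
            Dry(Box<Edit>),
        }

        impl Edit {
            /// Apply the edit, appending one line to `log` per write.
            /// `dry` is inherited by everything nested under a `Dry`.
            fn apply_inner(&self, dry: bool, log: &mut Vec<String>) -> io::Result<()> {
                match self {
                    Edit::NoOp => Ok(()),
                    Edit::Write(EditPath(path), content) => {
                        if dry {
                            log.push(format!("would write {} bytes to {}", content.len(), path.display()));
                            Ok(())
                        } else {
                            log.push(format!("write {}", path.display()));
                            fs::write(path, content)
                        }
                    }
                    // Stop at the first failure, so later edits never run.
                    Edit::Seq(edits) => edits.iter().try_for_each(|e| e.apply_inner(dry, log)),
                    // Independent edits: attempt all of them, then report the first error, if any.
                    Edit::All(edits) => {
                        let results: Vec<_> = edits.iter().map(|e| e.apply_inner(dry, log)).collect();
                        results.into_iter().collect()
                    }
                    Edit::Dry(inner) => inner.apply_inner(true, log),
                }
            }

            /// Apply the edit for real and return the log of what happened.
            pub fn apply(&self) -> io::Result<Vec<String>> {
                let mut log = Vec::new();
                self.apply_inner(false, &mut log)?;
                Ok(log)
            }
        }

        fn main() -> io::Result<()> {
            // Backup first, then overwrite; wrapped in `Dry`, so nothing touches the disk.
            let backup = EditPath(PathBuf::from("content/index.tree.bak"));
            let file = EditPath(PathBuf::from("content/index.tree"));
            let edit = Edit::Dry(Box::new(Edit::Seq(vec![
                Edit::Write(backup, "old".into()),
                Edit::Write(file, "new".into()),
            ])));
            for line in edit.apply()? {
                println!("{line}");
            }
            Ok(())
        }
        ```

        note how `NoOp` costs a single match arm here, while none of the code that builds edits has to deal with an `Option`.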
    • although I said before that the fork-based virtual file system is a leaky abstraction when you introduce writing to the physical file system, I don’t think this particular API is susceptible to this. since a Dir can expose EditPaths only for entries that can actually be written (ones with a content), you can disallow writing to directories that way.

      2024-11-25
      • of course then you cannot create directories.

        2024-11-25
      • also, TOCTOU bugs are a thing, but I disregard those as they don’t really fit into a compiler’s threat model.

        2024-11-25
        • as I said in the beginning, I don’t really like mutable state…

          2024-11-25
  • improvise, adapt, overcome

    2024-11-25
    • thanks to the Dir’s inherent composability, it is trivial to build adapters on top of it. I have a few in the treehouse myself.

      2024-11-25
      • TreehouseDir is a file system which renders .tree files into HTML lazily.

        2024-11-25
        • since generating all this HTML is an expensive operation, I have it wrapped in a ContentCache adapter, which caches any successfully generated pages in memory.

          2024-11-25
          • I pre-warm the cache with all pages too, so that I can see any warnings that arise during generation. this is done on multiple threads, because all my Dirs are Send + Sync.

            2024-11-25
        • I also have a Dir called HtmlCanonicalize layered on top of the ContentCache, which removes .html extensions from paths. TreehouseDir exposes canonical paths without an .html, but I have to support it for compatibility with old links.

          2024-11-25
      • Blake3ContentVersionCache is the sole implementor of Dir::content_version. its purpose is to compute content_versions and cache them in memory for each path. as the name suggests, versions are computed using a (truncated) BLAKE3 hash.

        2024-11-25
      • ImageSizeCache is the sole implementor of Dir::image_size. for paths with a supported extension (png, svg, jpg, jpeg, webp), it reads the stored image file’s size and caches it in memory.

        2024-11-25
      • Anchored is the sole implementor of Dir::anchor. it lets me specify that the source file system’s static directory ends up being available under /static on the website.

        2024-11-25
      • Overlay combines a base directory with an overlay directory, first routing requests to the overlay directory, and if those fail, routing them to the base directory. this allows me to overlay a MemDir with the static directory and a robots.txt on top of a TreehouseDir—which together form the compiler’s target directory.

        2024-11-25
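      • to show how little code such adapters need, here is a sketch of two of them on a simplified `Dir` (`&str` instead of `VPath`, only two methods). it is an approximation written for this post, not the treehouse’s source: the field names are guesses, and `Renderer` and `Single` are hypothetical suppliers that exist only to make the example run.

        ```rust
        use std::collections::HashMap;
        use std::sync::atomic::{AtomicUsize, Ordering};
        use std::sync::Mutex;

        /// Simplified stand-in for the post's `Dir`; `Send + Sync` so it can be shared across threads.
        trait Dir: Send + Sync {
            fn dir(&self, path: &str) -> Vec<String>;
            fn content(&self, path: &str) -> Option<Vec<u8>>;
        }

        /// Routes requests to `overlay` first and falls back to `base`.
        struct Overlay<A, B> {
            base: A,
            overlay: B,
        }

        impl<A: Dir, B: Dir> Dir for Overlay<A, B> {
            fn dir(&self, path: &str) -> Vec<String> {
                // Merge both listings, skipping base entries the overlay already lists.
                let mut entries = self.overlay.dir(path);
                for entry in self.base.dir(path) {
                    if !entries.contains(&entry) {
                        entries.push(entry);
                    }
                }
                entries
            }
            fn content(&self, path: &str) -> Option<Vec<u8>> {
                self.overlay.content(path).or_else(|| self.base.content(path))
            }
        }

        /// Remembers successful `content` lookups; misses are not cached.
        struct ContentCache<T> {
            inner: T,
            cache: Mutex<HashMap<String, Vec<u8>>>,
        }

        impl<T: Dir> Dir for ContentCache<T> {
            fn dir(&self, path: &str) -> Vec<String> {
                self.inner.dir(path)
            }
            fn content(&self, path: &str) -> Option<Vec<u8>> {
                if let Some(hit) = self.cache.lock().unwrap().get(path) {
                    return Some(hit.clone());
                }
                // The lock is not held while the (possibly slow) inner lookup runs.
                let content = self.inner.content(path)?;
                self.cache.lock().unwrap().insert(path.to_string(), content.clone());
                Some(content)
            }
        }

        /// Hypothetical expensive supplier that counts how often it renders.
        struct Renderer {
            renders: AtomicUsize,
        }

        impl Dir for Renderer {
            fn dir(&self, path: &str) -> Vec<String> {
                if path.is_empty() { vec!["index".to_string()] } else { Vec::new() }
            }
            fn content(&self, path: &str) -> Option<Vec<u8>> {
                self.renders.fetch_add(1, Ordering::Relaxed);
                (path == "index").then(|| b"<h1>hi</h1>".to_vec())
            }
        }

        /// Hypothetical supplier holding a single named file.
        struct Single(&'static str, &'static [u8]);

        impl Dir for Single {
            fn dir(&self, path: &str) -> Vec<String> {
                if path.is_empty() { vec![self.0.to_string()] } else { Vec::new() }
            }
            fn content(&self, path: &str) -> Option<Vec<u8>> {
                (path == self.0).then(|| self.1.to_vec())
            }
        }

        fn sample_target() -> Overlay<ContentCache<Renderer>, Single> {
            let pages = ContentCache {
                inner: Renderer { renders: AtomicUsize::new(0) },
                cache: Mutex::default(),
            };
            Overlay { base: pages, overlay: Single("robots.txt", b"User-agent: *") }
        }

        fn main() {
            let target = sample_target();
            println!("{:?}", target.dir(""));
            target.content("index");
            target.content("index"); // served from the cache
            println!("renders: {}", target.base.inner.renders.load(Ordering::Relaxed));
        }
        ```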
    • and guess what: the server serves straight from that virtual target directory, too! after all, what’s the point of writing it to disk if you already have everything assembled? It’s all dynamic.™

      2024-11-25
    • for the curious, here’s roughly how the treehouse’s virtual file systems are structured:

      source: ImageSizeCache(Blake3ContentVersionCache(MemDir {
          "treehouse.toml": BufferedFile(..),  // content read at startup
          "static": Anchored(PhysicalDir("static"), "static"),
          "template": PhysicalDir("template"),
          "content": PhysicalDir("content"),
      }))
      
      target: Overlay(
          HtmlCanonicalize(ContentCache(TreehouseDir)),
          MemDir {
              "static": Cd(source, "static"),
              "robots.txt": Cd(source, "static/robots.txt"),
          },
      )
      
      2024-11-25
      • I’m not too fond of that treehouse.toml hack, but this is because PhysicalDir cannot really be attached to singular files right now. I’ll definitely fix that one day…

        2024-11-25
  • a fatal flaw

    2024-11-25
    • there is one flaw I want to point out with the current implementation of Dir: it uses trait methods to add new resource forks.

      2024-11-25
      • any time you add a new resource fork that’s implemented only by one or two Dirs, you have to duplicate a stub across all existing adapters! introducing a new fork therefore means performing up to N edits, where N is the number of implementers of Dir.

        2024-11-25
        • this problem doesn’t exist for Dirs which are not adapters (aka suppliers), but in practice there are far fewer suppliers than adapters.

          2024-11-25
    • one idea I’ve had to fix this was to change the API shape to a single trait method.

      pub trait Dir {
          fn forks(&self, path: &VPath, forks: &mut Forks);
      }
      
      impl Dir for MyDir {
          fn forks(&self, path: &VPath, forks: &mut Forks) {
              forks.insert(|| MyFork);
          }
      }
      
      impl<T: Dir> Dir for AdapterDir<T> {
          fn forks(&self, path: &VPath, forks: &mut Forks) {
              self.inner.forks(path, forks);
      
              forks.insert(|| MyMomsEpicSilverware);
          }
      }
      

      but that hasn’t come to fruition yet, as I have no idea how to make it both efficient and object-safe… I have yet to add profiling to the treehouse, so I don’t want to make risky performance decisions like this at this point.

      2024-11-25
      • dynamic typing has a cost, after all!

        2024-11-25
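      • one possible object-safe shape for this, sketched with `std::any`: the caller asks for exactly one fork type, and `insert` only runs its closure when the types match. everything here is an assumption made for the example (`query`, the `wanted`/`found` fields, the `Content` and `Anchor` fork types, the first-insert-wins rule, `&str` instead of `VPath`), and whether it is fast enough is unmeasured: each query costs a virtual call per layer, a `TypeId` comparison per `insert`, and a `Box` allocation on a hit.

        ```rust
        use std::any::{Any, TypeId};

        /// A request for a single fork type; `Dir`s fill it in if they can provide that type.
        pub struct Forks {
            wanted: TypeId,
            found: Option<Box<dyn Any>>,
        }

        impl Forks {
            /// Runs `make` only if the caller asked for `T` and nothing has provided it yet.
            pub fn insert<T: Any>(&mut self, make: impl FnOnce() -> T) {
                if self.found.is_none() && self.wanted == TypeId::of::<T>() {
                    self.found = Some(Box::new(make()));
                }
            }
        }

        /// No generic methods, so `dyn Dir` is allowed.
        pub trait Dir {
            fn forks(&self, path: &str, forks: &mut Forks);
        }

        /// Ask `dir` for the fork of type `T` at `path`.
        pub fn query<T: Any>(dir: &dyn Dir, path: &str) -> Option<T> {
            let mut forks = Forks { wanted: TypeId::of::<T>(), found: None };
            dir.forks(path, &mut forks);
            forks.found?.downcast::<T>().ok().map(|boxed| *boxed)
        }

        // Hypothetical fork types.
        #[derive(Debug, PartialEq)]
        pub struct Content(pub Vec<u8>);
        #[derive(Debug, PartialEq)]
        pub struct Anchor(pub String);

        /// Hypothetical supplier with a single file.
        struct Hello;

        impl Dir for Hello {
            fn forks(&self, path: &str, forks: &mut Forks) {
                if path == "hello.txt" {
                    forks.insert(|| Content(b"hi".to_vec()));
                }
            }
        }

        /// Toy adapter: adds an `Anchor` fork and passes every other fork through untouched.
        struct Anchored<T> {
            inner: T,
            at: String,
        }

        impl<T: Dir> Dir for Anchored<T> {
            fn forks(&self, path: &str, forks: &mut Forks) {
                self.inner.forks(path, forks);
                forks.insert(|| Anchor(format!("{}/{path}", self.at)));
            }
        }

        fn main() {
            let dir = Anchored { inner: Hello, at: "static".to_string() };
            println!("{:?}", query::<Content>(&dir, "hello.txt"));
            println!("{:?}", query::<Anchor>(&dir, "hello.txt"));
        }
        ```

        the adapter never mentions `Content`, so adding a new fork type would not require touching it.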
    • maybe one day I’ll post an update on that.

      2024-11-25
  • and that basically concludes all this virtual file system shenaniganry! I hope you got something useful out of it.

    2024-11-25
  • I look forward to finding out how this system will fare in the long run. guess I’ll report my progress, I dunno. next year?

    2024-11-25
    • see you then!

      2024-11-25