places, or what is up with `*x` not always meaning the same thing in different contexts

  • I recently got a question from someone telling me they don’t understand why *x does not read from the pointer x when it’s on the left-hand side of an assignment.

    2024-05-18
    • as in this case:

      void example(int *x) {
          int y = *x;  // (1)
          *x = 10;     // (2)
      }
      
    • it seems pretty weird, right? why does it read from the pointer in declaration (1), but not in the assignment (2)?

      • doesn’t *x mean “read the value pointed to by x?”

  • TL;DR: the deal with this example is that *x does not mean “read from the location pointed to by x”, but just “the location pointed to by x”.

    • *x is not a value, it’s a memory location, or place in Rust parlance

      • same thing with x

      • same thing with some_struct.abc or some_struct->abc

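    • to make that concrete, here’s a minimal sketch that uses the very same place *x in three different ways. place_demo is a made-up name for illustration, not something from the original question:

      ```c
      /* Hypothetical helper: one place, three different uses. */
      int place_demo(int *x) {
          int old = *x;     /* place converted to a value: this one reads */
          *x = old + 1;     /* place used as an assignment target: a write, *x is not read */
          int *same = &*x;  /* place used as the operand of &: no read, no write */
          return same == x; /* 1: we got back the pointer we started with */
      }
      ```

      calling place_demo(&v) with v set to 41 leaves v at 42 and returns 1.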
  • but instead of jumping to conclusions, let’s go back to the beginning. let’s think, what is it that makes places different?

    • the main thing is that you can write to them and create references to them. for instance, this doesn’t work:

      void example(void) {
          1 = 2;  // error!
      }
      

      but this does:

      void example(void) {
          int i;
          i = 2;  // fine!
      }
      
    • so really, places are kind of a different type - we can do certain additional operations with them, such as writing!

      • we’ll call this type place(T), where T is any arbitrary type that is stored at the memory location represented by the place(T).

  • place(T) behaves a bit weirdly compared to other types. for starters, it is impossible to write the type down in C code, so we’re always bound to turn place(T) into something else quickly after its creation.

    • for instance, in this example:

      void take_int(int x);
      
      void example(int x) {
          take_int(x /*: place(int) */);
      }
      

      the type of x being passed into the take_int function is place(int), but since that function accepts an int, we convert from place(int) to a regular int.

      • this conversion happens implicitly and involves reading from the place - remember that places represent locations in memory, so they’re a bit like pointers. we have to read from them before we can access the value.

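    • one way to observe that implicit read is to let the callee modify what it received. this is only a sketch: take_int here is a hypothetical variant that returns a value (the one above returns void), and convert_demo is a made-up name:

      ```c
      /* Hypothetical callee: it modifies its own copy of the argument. */
      int take_int(int x) {
          x = x + 1;  /* writes to the callee's own place, not the caller's */
          return x;
      }

      int convert_demo(void) {
          int x = 5;
          /* x is a place(int); it is read here and only the int value is passed on */
          int seen = take_int(x);
          /* the caller's x was only read, so it still holds 5 */
          return seen * 10 + x;
      }
      ```

      convert_demo() returns 65: the callee saw 6 after its own increment, while the caller’s x stayed at 5.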
  • but there are operations in the language that expect a place(T), and therefore do not perform the implicit conversion.

    • we’re able to describe these operations as functions which take in a value of type T and return a value of type U - written down like T -> U.

      • this notation is taken from functional languages like Haskell.

      • the -> operator is right-associative - T -> U -> V is a function which returns a function U -> V, not a function that accepts a T -> U.

    • one of these operations is assignment, which is like a function place(T) -> T -> T.

      it accepts a place(T) to write to, a T to write to that place, and returns the T written. note that in that case no read occurs, since the implicit conversion described before does not apply.

      void example(void) {
          int x = 0;
          x /*: place(int) */ = 1 /*: int */; /*-> int (discarded) */
      }
      
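      • the returned T is what makes chained assignment work. a small sketch (assign_demo is a made-up name):

        ```c
        int assign_demo(void) {
            int a = 0;
            int b = 0;
            /* parsed as a = (b = 7): the inner assignment yields the int 7 */
            a = b = 7;
            /* the result of an assignment can feed any expression that expects an int */
            int c = (a = 3) + 1;
            return a * 100 + b * 10 + c;  /* a == 3, b == 7, c == 4 */
        }
        ```

        assign_demo() returns 374.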
    • another one of these operations is the & reference operator, which is like a function place(T) -> T*.

      it accepts a place(T) and returns a pointer T* that points to the place’s memory location in exchange.

      void example(void) {
          int x = 0;
          int* p = &(x /*: place(int) */);
      }
      
      • and of course its counterpart, the * dereferencing operator, which does not consume a place, but produces one.

        it accepts a T* and produces a place(T) located at the address the pointer holds - it’s the reverse of &x, T* -> place(T).

        void example(int* x) {
            int y = *x;
        }
        
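      • since & and * are each other’s reverse, applying one right after the other should get us back where we started. a quick sketch to check that (roundtrip_demo is a made-up name):

        ```c
        int roundtrip_demo(void) {
            int x = 1;
            int *p = &x;   /* place(int) -> int* */
            int *q = &*p;  /* int* -> place(int) -> int*: the same pointer again */
            *&x = 2;       /* place -> pointer -> place, and then we write to it */
            return (q == p) && (x == 2);
        }
        ```

        roundtrip_demo() returns 1.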
    • another couple of operations that deal with places are the . and [] operators, both of which can be used to refer to subplaces within a larger place.

      • the difference is that . refers to a static subplace known at compile time, while [] may refer to a dynamic one known only at runtime.

      • the . operator takes in a place(T) and returns a place(U) depending on the type of structure field we’re referencing.

        • since there is no type that represents the set of fields of a structure S, we’ll invent a type anyfield(S) which represents that set.

          • the type of a specific field f in the structure S is field(S, f).

        • we’ll also introduce a type fieldtype(F) which is the type stored in the field F.

        • given that, the type of the . operator is place(T) -> F -> place(fieldtype(F)), where F is an anyfield(T).

        • example:

          void example(void) {
              struct S { int x; } s;
          
              s /*: place(struct S) */;
              s /*: place(struct S) */ .x /*: field(struct S, x) */; /*-> place(int) (discarded) */
          }
          
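        • the “static subplace” idea can be made visible with offsetof from <stddef.h>: the place s.y sits at a fixed byte offset from the place s. a sketch, assuming a two-field struct S that is slightly bigger than the one above (subplace_demo is a made-up name):

          ```c
          #include <stddef.h>  /* offsetof */

          /* Assumed for illustration: a struct with two fields. */
          struct S {
              int x;
              int y;
          };

          /* Returns 1 if &s.y is exactly "address of s + offset of y". */
          int subplace_demo(void) {
              struct S s = { 1, 2 };
              char *base = (char *)&s;
              int *via_dot = &s.y;
              int *via_offset = (int *)(base + offsetof(struct S, y));
              return via_dot == via_offset && *via_offset == 2;
          }
          ```

          subplace_demo() returns 1.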
      • the [] operator takes in a T*, a ptrdiff_t to offset the pointer by, and returns a place(T) whose memory location is the offset pointer. the function signature is therefore T* -> ptrdiff_t -> place(T).

        example:

        void example(int* array) {
            int* p = &((array /*: int* */)[123] /*: place(int) */);
        }
        
        • we can actually think of this a[i] operator as syntax sugar for *(a + i).

          • this has a funny consequence where array[0] is equivalent to 0[array] - offsetting a pointer is just addition, and addition is commutative. therefore we can swap the operands to [] and it will work just fine!

            • and no, this doesn’t produce a warning, because no conversion is going on - the 0 is not promoted to a pointer. the C standard defines a[i] as exactly (*((a) + (i))), and pointer addition is allowed with the pointer on either side of the +, so 0 + array is just as valid as array + 0.

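          • a small sketch showing all three spellings landing on the same place (index_demo is a made-up name):

            ```c
            int index_demo(void) {
                int array[4] = { 10, 20, 30, 40 };
                int a = array[2];
                int b = *(array + 2);  /* the desugared form */
                int c = 2[array];      /* *(2 + array): same place, operands swapped */
                return a == 30 && b == 30 && c == 30 && &array[2] == &2[array];
            }
            ```

            index_demo() returns 1.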
  • now I have to confess, I lied to you. there are no places in C.

    • the C standard actually calls this concept “lvalues”, which comes from the fact that they are values which are valid left-hand sides of assignment.

      • however, I don’t like that name since it’s quite esoteric - if you tell a beginner “x is not an lvalue,” they will look at you confused. but if you tell a beginner “x is not a place in memory,” then it’s immediately more clear!

        so I will keep using the Rust name despite the name “lvalues” technically being more “correct” and standards-compliant.

        • I’m putting “correct” in quotes because I don’t believe this is a matter of correctness, just opinion.

  • what’s interesting about place(T) is that it’s actually a real type in C++ - except under a different name: T&.

    • references are basically a way of introducing places into the type system for real, which is nice, but on the other hand having places bindable to names results in some weird holes in the language.

    • to begin with, in C we could assume that referencing any variable T x by its name x would produce a place(T). this is a simple and clear rule to understand.

      • in C++, this is no longer the case - referencing a variable T x by its name x produces a T&, but referencing a variable T& x by its name x produces a T&, not a T& &!

        in layman’s terms, C++ makes it impossible to rebind references to something else. you can’t make r refer to y here:

        int x = 0;
        int y = 1;
        int& r = x;
        r = y; // nope, this is just the same as x = y
        
      • and it’s not like it could’ve been done any better - if we got a T& & instead, we’d be able to reassign a different place to the variable, but then we’d get a type mismatch on something like r = 1

        • because assignment is T& -> T -> T; if our T is int&, the place we assign to is an int& &, and the expected signature is int& & -> int& -> int&, but we’re providing an int, not an int& - and we can’t make a reference out of a value!

      • so we’d need a way of doing T& -> T, but guess what: (almost) this already exists and is called “pointers” and “the unary * operator”.

        • except of course, with pointers the signature is T* -> T&.

    • so by introducing references, C++ was actually made less consistent!

      • I actually kind of wish references were more like they are in Rust - basically just pointers but non-null and guaranteed to be aligned

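      • a rough way to picture the no-rebinding rule in plain C (a mental model only, not a claim about how compilers implement references): treat int& r = x as int *const r = &x, with every later use of r silently rewritten to *r. reference_model_demo is a made-up name:

        ```c
        int reference_model_demo(void) {
            int x = 0;
            int y = 1;
            int *const r = &x;  /* models `int& r = x;` */
            *r = y;             /* models `r = y;`: copies y's value into x */
            /* `r = &y;` would not compile: the pointer itself is const, so no rebinding */
            return (r == &x) && (x == 1) && (y == 1);
        }
        ```

        reference_model_demo() returns 1: r still points at x, and x now holds y’s value.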
  • anyways, as a final boss bonus of this blog post, I’d like to introduce you to the x->y operator (the C one)

    • if you’ve been programming C or C++ for a while, you’ll know that it’s pretty dangerous to just go pointer-walkin’ with the -> operator

      int* third(struct list* first) {
          return &first->next->next->value;
      }
      
      • there’s a pretty high chance that using the third function will cause a crash for you if there are only two elements in the list.

        • if it doesn’t cause a crash, you may have more serious problems to worry about

      • but how does it cause a crash if we’re taking the reference out of that whole -> chain? shouldn’t taking a reference not cause any reads?

    • the secret lies in what the x->y operator really does. basically, it’s just convenience syntax for (*x).y.

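      • a tiny sketch of that equivalence, using a hypothetical struct point and a made-up function name arrow_demo:

        ```c
        /* Assumed for illustration only. */
        struct point {
            int x;
            int y;
        };

        int arrow_demo(void) {
            struct point pt = { 1, 2 };
            struct point *p = &pt;
            p->y = 5;  /* same as (*p).y = 5 */
            /* both spellings name the same place, so their addresses match */
            return (&p->y == &(*p).y) && ((*p).y == 5);
        }
        ```

        arrow_demo() returns 1.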
    • let’s start by dismantling the entire pointer access chain into separate expressions:

      int* third(struct list* first) {
          struct list* second = first->next;
          struct list* third = second->next;
          return &third->value;
      }
      
    • now let’s desugar the -> operator:

      int* third(struct list* first) {
          struct list* second = (*first).next;
          struct list* third = (*second).next;
          return &(*third).value;
      }
      
    • and add some type annotations:

      int* third(struct list* first) {
          struct list* second = (*first).next /*: place(struct list*) */;
          struct list* third = (*second).next /*: place(struct list*) */;
          return &(*third).value;
      }
      
    • and now let’s follow it line by line.

      • struct list* second = (*first).next /*: place(struct list*) */;
        

        first we read the value of the next field from the structure pointed to by first. assuming first is a valid pointer, this shouldn’t fail.

      • struct list* third = (*second).next /*: place(struct list*) */;
        

        but now something bad happens: we don’t know if second, the pointer we just turned into a place, is valid. we offset that place by .next and implicitly read from it, which is bad!

      • at this point there’s no point in analyzing the rest of the function - we’ve hit Undefined Behavior!

    • the conclusion here is that chaining x->y can be really dangerous if you don’t check for the validity of each reference. just doing one hop and a reference - &x->y - is fine, because we never end up reading from the invalid pointer - it’s like doing &x[1]. but two hops is where it gets hairy - in x->y->z, the ->z has to read from x->y to know the pointer to read from.

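      • one possible way to make the walk safe is to check each hop before reading through it. this is only a sketch under two assumptions the post doesn’t state: the layout of struct list, and that the end of the list is marked by a NULL next pointer. third_checked is a made-up name:

        ```c
        #include <stddef.h>  /* NULL */

        /* Assumed layout; the definition of struct list is never shown. */
        struct list {
            int value;
            struct list *next;
        };

        /* Returns a pointer to the third element's value, or NULL if the list is too short. */
        int *third_checked(struct list *first) {
            if (first == NULL) return NULL;
            struct list *second = first->next;  /* reads first->next */
            if (second == NULL) return NULL;
            struct list *third = second->next;  /* reads second->next, now known to be safe */
            if (third == NULL) return NULL;
            return &third->value;               /* no read here, only an address computation */
        }
        ```

        with a three-element list this returns the address of the last element’s value; with anything shorter it returns NULL instead of reading through a pointer that isn’t there.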
  • TODO: in the future I’d like to embed a C compiler here that will desugar all place operations into explicit ones. stay tuned for that!
