This article discusses the internals of the Sahl programming language.

Prelude

This article primarily discusses some implementation details about my programming language sahl like how it implements m:n threads in the vm, some bits about code generation, the development history and future goals.

Design Overview

The Sahl programming language is a statically typed language implemented in Rust and C. Key design elements include static typing with type inference, and a concurrency model supporting m:n threads and channels.

Syntax

Sahl’s syntax draws inspiration from Rust and Kotlin, with primarily imperative features.

Static Typing

Sahl employs static typing with type inference. Notably, types are retained in function prototypes to enhance the inference process. This design choice aims to strike a balance between flexibility and predictability in type resolution.

Code Generation

Three Address Code

Three Address Code generation is exemplified by the following pseudo code snippet:

match expr {
  Expr::Arith {op, lhs, rhs} => {
    let lhs_reg = compile(lhs);
    let rhs_reg = compile(rhs);
    if op == Add {
      if lhs.type == Int && rhs.type == Int {
        code.push(IAdd(lhs_reg, rhs_reg));
      } else {
        code.push(FAdd(lhs_reg, rhs_reg)); 
      }
    }
  }
}

Monomorphisation

The print is implemented for primitives and strings only in the VM and the native runtime. For lists and tuples a loop is generated at the spot of invocation which invokes print for whichever primitive is present. In the initial implementation the type info was used at runtime to print the primitives and objects but that added an overhead which was eliminated later.

SuperInstructions

There are certain sequences on instruction which at the time of codgen gets combined into a single instruction. These include

  1. LoadConstOp which loads a variable and a constant into registers and performs an operation.

  2. LoadConstOpStore which loads a variable and a constant into registers, performs an operation and stores the result in a register.

  3. JmpIfNotCond which jumps to an instruction if a condition is not true. It combines the condition check and jump into a single instruction.

Concurrency Support

This was the main challenging part - m:n threads with channels. Here is a pseudocode which explains the scheduler.

global_scheduler_queue = [];

fn spawn(thread_vm_obj) {
  enqueue(global_scheduler_queue, thread_vm_obj);
}

fn poll() {
  loop {
    thread_vm_obj = dequeue(global_scheduler_queue);
    run(thread_vm_obj)
  }
}

fn run(vm_obj) {
  loop {
    match next_instruction {
      CHAN_READ => {
        if chan_is_empty {
          // Give up control
          push(global_scheduler_queue);
          return;
        }
      }
      CHAN_WRITE => {
        if chan_is_full {
          // Give up control
          push(global_scheduler_queue);
          return;
        } 
      }
    }
  }
}

fn main() {
  for i in 0..n {
    native_thread(poll)
  }
  vm = new VM;
  run(vm); // Main thread
}

The main function of my VM spawns n native threads for polling the global job queue. The poll functions waits to get a job from the job queue. I have used mutex and cond vars for synchronization internally. It is based on the assumption that when a chan is full/empty and has to be the written to or read from the complementary thread which has to consume it or write to it is not being run at the moment. This concurrency model aims to prevent thread starvation, though improvements for fairness are acknowledged. Here also, the initial

Garbage Collector

How to make the gc work across m:n threads? I initially implemented mark and sweep as it was described in crafting interpreters. It was naive as every thread did GC on the objects it had and that lead to live objects being collected as garbage. I rewrote the VM later to use register based model and this time I changed the GC to cheney’s (semispace) collector. Although it still retains the problem of not being able to collect across threads it is faster than mark and sweep.

The current GC It also happens to be precise instead of conservative. Stackmaps are emitted for VM to keep track of live objects in the callframe. Stackmaps are a list 64bit ints wherein i th bit at jth int in the list is 1 i.e (stackmap[j] & 1 << i) if the local at index j * 64 + i is live. It is present as an instruction around an allocation. On encountering it the vm.callframe.stackmap field is updated. When the GC is about begin a worklist is prepared by traversing the callframes and checking their stackmaps. Then it follows classical cheney’s algorithm.

Development History

  1. Prior Iterations:

    1. Initial version

      Initial parser was written in nom(Rust library) and translated into Rust enums/structs, running on a virtual machine.

      I used the knowledge I had from reading craftinginterpreters but made it statically typed. Another change I made was to have separate regs/space in the callframe for locals instead of putting them on the stack. Stack only had temporary values used in computation (non variables).

    2. Introduction of a stack-based virtual machine

      I eventually realised using match for rust enums(which was my compiled code) was slow so I started emitting those enums as bytes and wrote a vm in c to interpret those.

    3. Introduction of an AST to NASM transpiler.

      I wrote a compiler from my ast to nasm assembly. It was functional. More than there being no optimizations, I was calling c functions for basics ops like addition, multiply for floats. asm.rs

    4. A bytecode to NASM compiler in go:

      It read the bytecode and emitted assembly.sahl_aot.go.

    5. Translation from AST to LLVM IR.

      I wrote a transpiler from ast itself to llvm ir using inkwell. native.rs

  2. Current Implementation:

    1. Wrote compiler to three-address IR from AST which was just serialized as bytes.

      I rewrote the virtual machine in a register-based model.

      The primary reason to transition from stack based instructions/vm to register based was speed and the ability to optimised the latter or even convert to SSA based IR and them optimise.

    2. Native compilation to llvm ir using inkwell (rust library) and a runtime which uses Boehm Allocator/GC.

      I chose to do this because the language is entirely statically typed so this would be easy.

Future Development Goals

Sahl’s future development aims at enhancing language features and performance:

  • Implementation of an incremental and precise garbage collector to replace Boehm(native) and Cheney(vm) GCs.

    I can do this with train algorithm or treadmill(Baker)’s algorithm.

    Having a precise GC in native runtime as well would speed it up and incremental nature would also give it an edge. We just need to keep track of where the values are being allocated on the stack to make it precise. Also, llvm’s mem2reg pass would make this tricky.

  • Adoption of safepoints for concurrent garbage collection across m:n threads.

  • Support for interfaces/traits with classes/structs to enhance code reusability.

    I can do this in rust-like traits or java like interfaces. This would either require monomorphisation or runtime type information.

  • Experimentation with a tracing JIT for improved runtime performance.

  • Incorporate theoretical ideas which seem feasible and fit the language’s design.

Updated: