ReLEAse me, pLEAse

You ever get introduced to a new idea, let it put you in a chokehold, and spend a week optimizing ISel instead of implementing class dispatch vectors? Here’s the story of my obsession with the lea instruction while building joosc.

To follow along, you at least need to know what lea (load effective address) does. In Intel syntax, lea dst, [base + k * index + o], where k is any of 1, 2, 4, 8 and o is a constant, sets dst <- base + k * index + o. The + o part or the k * index part can be left out, and I’ll call k “scale” and o “offset” from now on. You can imagine how nice this instruction is when performing complicated arithmetic, or if you’re playing with addresses into an array: lea eax, [ebx + 4*ecx + 12], for instance, does three operations’ worth of work in one shot. Moving on!

During code generation, joosc lowers through three progressively more assembly-like IRs:

1. TIR
2. CFIR
3. AASM

For us, ISel is the CFIR -> AASM translation. Optimizing ISel means picking the best translation. How do we define best? A cost model. A very rough one. I scoured the i386 manual for the approximate clock cycles each instruction takes and had each Instruction type implement a cost() trait method. What’s really important here is that lea apparently only takes two cycles, meaning it’s tied for the cheapest instruction alongside test, cmp, some mov variants, and most binary operations.

impl InstructionCost for Lea {
    fn cost(&self) -> Cost {
        2
    }
}

impl InstructionCost for Bin {
    fn cost(&self) -> Cost {
        match self.op {
            BinOpcodes::And
            | BinOpcodes::Or
            | BinOpcodes::Xor
            | BinOpcodes::Add
            | BinOpcodes::Sub => 2,
            BinOpcodes::IMul => 23, // 9 -- 38, inconsistent
            BinOpcodes::Sar
            | BinOpcodes::Sal => 3,
        }
    }
}
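
Sequences of instructions then get costed as the sum of their parts, which is what the .insts.cost() call you’ll see later relies on. A minimal sketch, assuming Cost is a plain integer alias and that the Instruction enum forwards cost() to whichever variant it wraps:

// Assumed: Cost is a plain integer alias, and the Instruction enum
// forwards cost() to whichever variant it wraps.
impl InstructionCost for Vec<Instruction> {
    fn cost(&self) -> Cost {
        self.iter().map(|inst| inst.cost()).sum()
    }
}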

A quirk of CFIR is that it’s not flat, unlike LLVM IR. It’s flat on the statement level, but expressions can be nested arbitrarily within. This means each statement is an AST of CFIR expressions that has to be pattern-matched on to generate assembly. We call this process tiling, hence the following type definition:

struct Tiling {
    pub insts: Vec<Instruction>,
    pub ret: Register,
}

Each tiling provides a sequence of instructions producing a final value, and an abstract register storing that result value, which gets used by parent nodes in the AST. We implement each tile as a function of the ExprTile type, and throw them all in an array.

type ExprTile = fn(&CFIRExpr, &mut Context) -> Option<Tiling>;
pub const EXPR_TILES: &[ExprTile] = &[basic_const, /* ... */];

// Example: the trivial translation for consts
fn basic_const(expr: &CFIRExpr, ctx: &mut Context) -> Option<Tiling> {
    if let CFIRExpr::Const(const_) = expr {
        let tmp = temp_reg("const", ctx.gen_fresh());
        let ins = vec![Instruction::Move(Move::RegImm(tmp.clone(), const_.val))];
        return Some(Tiling::new(ins, tmp));
    }
    None
}
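
For contrast with what’s coming, a tile for a plain binary operation just recurses into both operands (through the tile_expr entry point we’ll meet later) and joins the results. A sketch: the Move::RegReg variant, the Bin fields, and the TIROp to BinOpcodes conversion here are stand-ins, not joosc’s real definitions.

// Hypothetical tile for a plain binary op: tile each operand, then
// combine their result registers with a single Bin instruction.
fn basic_binop(expr: &CFIRExpr, ctx: &mut Context) -> Option<Tiling> {
    if let CFIRExpr::BinOp(binop) = expr {
        let Tiling { insts: mut ins, ret: lhs } = tile_expr(&*binop.left, ctx);
        let Tiling { insts: rhs_ins, ret: rhs } = tile_expr(&*binop.right, ctx);
        ins.extend(rhs_ins);
        let tmp = temp_reg("binop", ctx.gen_fresh());
        // tmp <- lhs, then tmp <- tmp `op` rhs
        ins.push(Instruction::Move(Move::RegReg(tmp.clone(), lhs)));
        ins.push(Instruction::Bin(Bin {
            op: binop.op.into(), // assumed TIROp -> BinOpcodes conversion
            dst: tmp.clone(),
            src: rhs,
        }));
        return Some(Tiling::new(ins, tmp));
    }
    None
}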

Implementing the lea tile turns out to be a massive headache.

fn binop_lea(expr: &CFIRExpr, ctx: &mut Context) -> Option<Tiling> {
    if let CFIRExpr::BinOp(binop) = expr {
        if let Some(addr) = binop.to_addr() {
            let tmp = temp_reg("binop_lea", ctx.gen_fresh());
            let ins = vec![Instruction::Lea(Lea {
                dst: tmp.clone(),
                src: addr,
            })];
            return Some(Tiling::new(ins, tmp));
        }
    }
    None
}

This doesn’t look too bad, but there’s a lot hiding in that to_addr() call.

/// Returns an `Addr` if possible, matching on
/// the most complex addressing modes first
pub fn to_addr(&self) -> Option<Addr> {
    self.assoc_ast()
        .or_else(|| self.paren_ast())
        .or_else(|| self.weird_ast())
        .or_else(|| self.base_index_scale())
        .or_else(|| self.base_offset())
}
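
The simplest of those matchers is nearly trivial. A sketch, with Addr::base_offset standing in as a hypothetical sibling of the Addr::complex constructor you’ll see shortly:

/// Matches on the parse tree arising from base + offset,
/// the degenerate mode with no index term
fn base_offset(&self) -> Option<Addr> {
    if let TIROp::Add = self.op {
        // constants are canonicalized to the RHS (more on that below)
        if let CFIRExpr::Const(offset) = &*self.right {
            // hypothetical constructor, sibling to Addr::complex
            return Some(Addr::base_offset(&*self.left, offset.val));
        }
    }
    None
}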

I need to painstakingly pattern match on multiple equivalent ASTs because of precedence and associativity rules, like so:

/// Matches on the parse tree:
/// ```md
///               +
///             /   \
///            /     \
///           *       +
///          / \     / \
///    index scale base offset
/// ```
fn weird_ast(...)...

/// Matches on the parse tree:
/// ```md
///             +
///            / \
///           +   offset
///          / \
///      base   *
///            / \
///       index  scale
/// ```
fn assoc_ast(...)...

The functions themselves are awful, right-drifted to all hell and begging for more powerful matching syntax.¹ Here’s one of them:

/// Matches on the parse tree arising from
///    base + (index * scale + offset)
/// ```md
///             +
///            / \
///         base  +
///              / \
///             *   offset
///            / \
///       index  scale
/// ```
fn paren_ast(&self) -> Option<Addr> {
    if let TIROp::Add = self.op {
        // e + (? <> ?)
        if let CFIRExpr::BinOp(rhs) = &*self.right {
            // e + (? + ?)
            if let TIROp::Add = rhs.op {
                // e + (? + offset)
                if let CFIRExpr::Const(offset) = &*rhs.right {
                    // e + ((? <> ?) + offset)
                    if let CFIRExpr::BinOp(mul) = &*rhs.left {
                        // e + ((? * ?) + offset)
                        if let TIROp::Mult = mul.op {
                            // e + ((e2 * scale?) + offset)
                            if let CFIRExpr::Const(scale) = &*mul.right {
                                if COMPLEX_SCALES.contains(&scale.val) {
                                    return Some(Addr::complex(
                                        &*self.left,
                                        &*mul.left,
                                        scale.val,
                                        offset.val,
                                    ));
                                }
                            }
                        }
                    }
                }
            }
        }
    }
    None
}

Notice that these parse trees assume offset is on the right of each +, and that index * scale is the node, not scale * index. That’s because the TIR -> CFIR translation also performs a bit of canonicalization: any commutative binary operation with a Const operand gets that operand moved to the RHS. Without this small change, there’d be even more possibilities and I likely would have given up entirely.
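
The pass itself is tiny, something like this sketch:

// Sketch of the canonicalization pass: for a commutative op with a
// lone Const operand, swap that operand over to the right-hand side.
fn canonicalize(binop: &mut BinOp) {
    let commutative = matches!(binop.op, TIROp::Add | TIROp::Mult /* | Or, Xor, ... */);
    if commutative
        && matches!(&*binop.left, CFIRExpr::Const(_))
        && !matches!(&*binop.right, CFIRExpr::Const(_))
    {
        std::mem::swap(&mut binop.left, &mut binop.right);
    }
}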

Anyway, zooming back out, the idea is that starting from the root node, whenever we have a CFIRExpr, we iterate over this EXPR_TILES array, taking the first tile that matches the current subtree (i.e., returns a Some). Well, not quite. That would have been good enough for the assignment requirements, but it turns out this greedy approach doesn’t always find the lowest-cost overall tiling.²

What could possibly be the solution… hmmmmmm… 🤔🤔🤔🤔🤔🤔… real headscratcher… I wonde—okay yeah it’s a hashmap. It’s always a hashmap. We’ll use dynamic programming to find the best tiling for every subtree, bottom up.

type ExprMemo = (&'static ExprTile, Cost);
type ExprMap = HashMap<CFIRExpr, ExprMemo>;

thread_local!(static EXPR_STORE: RefCell<ExprMap> = 
    RefCell::new(ExprMap::default()));

This thread_local nonsense is just to trick the compiler into letting me use a mutable global. joosc is single-threaded, so it’s okay. With all this in place, the core tiling algorithm is simple and readable:

pub fn tile_expr(ir: &CFIRExpr, ctx: &mut Context) -> Tiling {
    match lookup_cached_expr(ir) {
        Some(memo) => memo.0(ir, ctx),
        None => EXPR_TILES
            .iter()
            .filter_map(|f| Some((f, f(ir, &mut ctx.clone())?.insts.cost())))
            .min_by(|l, r| l.1.cmp(&r.1))
            .and_then(|res| {
                insert_cached_expr(ir, res);
                res.0(ir, ctx)
            }),
    }
    .unwrap_or_else(|| crate::ice!("no tile matched CFIR expr {:?}", ir))
}

Even if you don’t know Rust (or perhaps even any functional language), it’s probably obvious what’s happening. The function composition here might be my single favourite “line” of code in the compiler. It’s just so satisfying.

There’s a teensy twist, though. My hashmap’s a bit quirky.

pub fn lookup_cached_expr(ir: &CFIRExpr) -> Option<ExprMemo> {
    // copy the memo out, since a reference can't outlive the borrow
    EXPR_STORE.with(|c| c.borrow().get(ir).copied())
}

pub fn insert_cached_expr(ir: &CFIRExpr, res: ExprMemo) {
    EXPR_STORE.with(|c| {
        let mut map = c.borrow_mut();
        map.insert(ir.structural_clone(), res);
    })
}

lookup_cached_expr is sane enough, but what’s structural_clone? Well, to hash CFIR expressions we need to impl Hash on CFIRExpr, which we can do by deriving it (along with Eq) after implementing PartialEq. Why can’t we just derive everything? We could, but that’d be pretty awful. Consider that this map’s whole purpose is to cache ISel results. There’s no difference between CONST 5 and CONST 6: they’ll both compile to mov fresh_temp <x>. Am I going to have a map entry for every possible constant? Absolutely not. The keys I want to store in my hashmap are ASTs. If I’ve previously decided on an optimal tiling for Bin(Bin(Const(x), Const(y)), Const(z)), then I want to match on the tree structure, ignoring the values of x, y, z, and even the binary operators.

Since CFIRExpr is an enum, we just impl the traits on each “subclass” and have the enum’s implementation delegate to them.³

impl StructuralClone for CFIRExpr {
    fn structural_clone(&self) -> CFIRExpr {
        match self {
            CFIRExpr::Const(v) => v.structural_clone(),
            CFIRExpr::Temp(v) => v.structural_clone(),
            CFIRExpr::UnaryOp(v) => v.structural_clone(),
            CFIRExpr::BinOp(v) => v.structural_clone(),
            CFIRExpr::Mem(v) => v.structural_clone(),
            CFIRExpr::Name(v) => v.structural_clone(),
        }
    }
}

impl StructuralClone for Temp {
    fn structural_clone(&self) -> CFIRExpr {
        CFIRExpr::Temp(Self::default())
    }
}

impl StructuralClone for Const {
    fn structural_clone(&self) -> CFIRExpr {
        CFIRExpr::Const(self.clone())
    }
}
// ...and so on

The codegen of a Temp is always the same regardless of its name, so it gets cloned with the empty string, but you might notice an inconsistency with what I said above: Const actually clones its value instead of defaulting it. The reason is twofold: copying an i32 isn’t that expensive, and, of course, lea. The scale factor in a complex address of the kind base + scale * index + offset can be 1, 2, 4, or 8, because these are common type sizes. All four of those constants receive the same codegen for an lea, and everything that isn’t 1, 2, 4, or 8 is also treated the same (it won’t even match lea), so what matters for caching is not a constant’s exact value but which side of that partition it lands on.
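
The recursive cases hide in that “...and so on”, but they’re the interesting bit: interior nodes keep their shape and recurse into their children. A sketch of BinOp’s impl, assuming TIROp has a Default and the node is stored inline:

// Guessed recursive case: keep the tree shape, default the opcode
// (equality ignores opcodes anyway), and structurally clone children.
impl StructuralClone for BinOp {
    fn structural_clone(&self) -> CFIRExpr {
        CFIRExpr::BinOp(BinOp {
            op: TIROp::default(), // assumes a Default impl on TIROp
            left: Box::new(self.left.structural_clone()),
            right: Box::new(self.right.structural_clone()),
        })
    }
}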

StructuralClone is only one half of the puzzle; it determines what goes into the hashmap. We’ve yet to implement PartialEq, which determines lookup behaviour. Most of the time, we’ll ignore all data in each struct and essentially define PartialEq as a scuffed, recursive version of comparison on std::any::TypeId.

#[derive(Derivative, Default)]
#[derivative(PartialEq, Eq, Hash)]
pub struct Temp {
    #[derivative(PartialEq = "ignore", Hash = "ignore")]
    pub name: String,
}
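
What that buys us, concretely:

// Names are invisible to equality and hashing, so any two temps
// collapse to the same cache key.
#[test]
fn temps_compare_structurally() {
    let a = Temp { name: "lhs".into() };
    let b = Temp { name: "rhs".into() };
    assert_eq!(a, b);
}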

The derivative crate makes the implementation trivial. Again, Const is a little special. Recall that we basically want to partition the space of i32 values into {1, 2, 4, 8} and everything else.⁴ This is done by throwing in a new field that’s initialized (privately) based on the value, and only this field is checked for equality.

#[derive(Clone, PartialEq, Eq, Hash)]
enum ConstKind {
    ComplexScale,
    Other,
}

#[derive(Derivative, Clone)]
#[derivative(PartialEq, Eq, Hash)]
pub struct Const {
    kind: ConstKind,
    #[derivative(PartialEq = "ignore", Hash = "ignore")]
    val: i32,
}
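
“Initialized (privately) based on the value” amounts to a constructor along these lines (a sketch, reusing COMPLEX_SCALES from earlier):

impl Const {
    // hypothetical constructor: derive `kind` from the value once,
    // so equality and hashing only ever see the partition
    pub fn new(val: i32) -> Self {
        let kind = if COMPLEX_SCALES.contains(&val) {
            ConstKind::ComplexScale
        } else {
            ConstKind::Other
        };
        Const { kind, val }
    }
}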

And we’re done! This does mean we could have two hashmap entries like Bin(*, 2, 3) and Bin(-, 2, 4) that are really the same, but it’s a lot better than every possible Bin(op, x, y) being distinct. Is it cursed? A little, but it’s not like I’m going to be comparing CFIRExprs for equality in any other context anyways. As far as I’m concerned, this is the most sensible definition of PartialEq.

In the end, I don’t think any of this work to optimize the compile time of ISel mattered, because register allocation takes a billion years. Graph colouring my beloved.

  1. Here’s a neat paper discussing stronger pattern matching 

  2. At least, that’s what every compiler course discussing maximal munch tiling says; I’ve yet to think of an example 

  3. Common enum boilerplate like this can be cut down with the ambassador or impl_enum crates 

  4. joosc actually also separates out 0 and 1, because there are special tiles that deal with the additive and multiplicative identity