Clever ARM instruction validation for Chrome Native Client (chromium.org)
57 points by unwiredben on Feb 6, 2011 | 12 comments


Not knowing anything about ARM, I didn't get how this bit is supposed to work:

    We enforce this rule by restricting the sorts of operations that
    programs can use to alter sp. Programs can alter sp by adding or
    subtracting an immediate, or as a side-effect of a load or store:
    ldr  rX,  [sp],  #4        ; loads from stack, then adds 4 to sp

    These are safe because, as we mentioned before, the largest 
    immediate available in a load or store is ±4095.  Even after adding or 
    subtracting 4095, the stack pointer will still be within the sandbox or 
    guard regions.
I get the idea of the guard protecting against the biggest immediate offset, but what stops me doing an SP-updating LDR with a big offset multiple times, pushing SP beyond my "safe" memory segment?

EDIT: I guess I might be taking:

    Any other operation that alters sp must be followed by a guard   
    instruction.
too precisely, and you could just follow every ldr which writes back SP with a BIC too. Maybe I'm missing the point.

EDIT2: Wait, wait, I get it now. Once the stack pointer is in the guard area, the CPU faults if you do another LDR. Don't mind me!
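That "active guard" reading can be sketched as a toy model. Everything here is illustrative: the sandbox base, sandbox size, and guard size are made-up numbers, not the real NaCl layout; only the mechanism (accesses in the guard fault and kill the process) comes from the article.

```python
# Toy model of the stack-pointer sandbox described in the thread.
# SANDBOX_BASE / SANDBOX_SIZE / GUARD_SIZE are hypothetical values.

SANDBOX_BASE = 0x10000
SANDBOX_SIZE = 0x40000          # pretend 256 KiB data sandbox
GUARD_SIZE   = 0x2000           # pretend 8 KiB no-access guard on each side

def access(addr):
    """Model one load/store: succeeds in the sandbox, faults in a guard."""
    if SANDBOX_BASE <= addr < SANDBOX_BASE + SANDBOX_SIZE:
        return "ok"
    if SANDBOX_BASE - GUARD_SIZE <= addr < SANDBOX_BASE + SANDBOX_SIZE + GUARD_SIZE:
        raise MemoryError("fault in guard region -> process killed")
    raise AssertionError("validator must make this unreachable")

# sp starts valid, at the very edge of the sandbox (worst case).
sp = SANDBOX_BASE
access(sp)                      # post-indexed ldr: the access itself is in bounds
sp -= 4095                      # ...then the writeback drifts sp into the guard

# The NEXT sp-relative access faults before sp can drift any further:
try:
    access(sp)
except MemoryError:
    print("faulted in guard; sandbox never escaped")
```

The point of the model: sp can *hold* an out-of-bounds value, but it can't be *used* out of bounds more than a bounded distance before a guard access faults.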


Because you'll fault in the guard zone on one of those memory accesses, and then your app will get killed.

EDIT: No, you don't need to check after a stack memory access either. The farthest you can get out of the sandbox is an access-then-adjust of 4095 bytes, followed by an adjust-then-access of 4095 bytes, which means you access the very tail end (offset 8190) of the guard zone, fault, and die.
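The parent's arithmetic checks out; here it is written down, assuming a hypothetical 8192-byte guard (the smallest power-of-two size that covers the worst case):

```python
MAX_IMM = 4095                       # largest load/store immediate offset on ARM
# Worst case: access-then-adjust drifts sp out by 4095, then an
# adjust-then-access reaches another 4095 beyond that.
farthest_access = MAX_IMM + MAX_IMM  # offset of the "very tail end" access
GUARD_SIZE = 8192                    # hypothetical; only needs to exceed 8190

assert farthest_access == 8190
assert farthest_access < GUARD_SIZE  # so the access still lands in the guard and faults
print(farthest_access)
```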


Thanks, I just realized that ... duh. I'd been imagining the guard regions as a "passive guard", e.g. an area where nothing important lives, rather than as an "active guard", e.g. an area where you fault if you read or write.


I see a lot of interesting techniques here. What I couldn't figure out is how writes to code areas in the sandbox are prevented. I'd guess they mark pages containing code bundles as read-only, but I don't see any specific mention of it.

(The article does mention that the guard pages are set to no read/write/execute)


The trampolines are located in a segment of the address space which is marked as read-only, presumably by the MMU.


The validator also needs to prevent modifications to the program's own code, otherwise it could, say, remove the breakpoint instruction from the start of a data bundle.


Turning every load or store from one instruction into two sounds slow; PNaCl can't arrive soon enough.


It sounds slow, but think about what's going on in the processor. Memory access instructions are far slower than simple arithmetic instructions. As for the conditional nature of the load instructions, I'm sure that the Cortex-A9 branch predictor will make the overhead from that pretty close to negligible. And there's probably something similar on the Cortex-A8, though I haven't checked.

In other words, this is a lot less slow than it sounds.
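For readers without the ARM background: the two-instruction sequence being discussed is (roughly) a `tst` of the address against a mask, followed by a load that is predicated on the result, something like `tst r0, #MASK` then `ldreq r1, [r0]`. A sketch of those semantics in Python, with a made-up mask value (the real mask depends on the actual sandbox layout):

```python
# Rough semantics of the checked load under discussion:
#     tst   r0, #MASK      ; test address bits that must be zero in-sandbox
#     ldreq r1, [r0]       ; load executes only if those bits were all zero
# MASK is hypothetical: it assumes the sandbox occupies the low 1 GiB.

MASK = 0xC0000000

def checked_load(mem, addr):
    if addr & MASK == 0:    # tst: flags set from the masked address bits
        return mem[addr]    # ldreq: condition passes, load executes
    return None             # condition fails: load is squashed, no fault

mem = {0x1000: 42}
print(checked_load(mem, 0x1000))      # in-sandbox address: loads normally
print(checked_load(mem, 0xC0001000))  # high bits set: load is suppressed
```

Because the second instruction is predicated rather than branched-around, a condition failure simply squashes the load; that is the "conditional nature" the parent says costs close to nothing.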


The branch predictor is not involved here since there is no branch. A conditional non-branch instruction is, on most implementations, scheduled identically to the unconditional base instruction. A load/store instruction whose condition passes thus executes identically to an unconditional one. If the condition fails, it schedules as though it had hit L1 cache.

The main problem I see here is significantly increased code size, which will put additional pressure on the L1 I-cache.


Fast code doesn't touch memory that often - if you're doing enough loads/stores that the extra instruction is significant, you're probably bounded by memory latency instead of processor speed anyway.

And the most common "frequent" accesses (stack space) don't require those checks.


Won't iterating over a heap-allocated array result in a load for every element? A lot of good C will keep a lot in L1, but it won't always be in registers or on the stack.


Yes, there you'll be incurring the huge performance penalty of one additional instruction per element in the array.

As I said, not really significant unless your loop is so tight that memory latency is the big limiter anyway.



