Crafting a JSON parser with SIMD: Descent into madness

April 21, 2024

In the first post of this series I mentioned how this function:

(defun unset-rightmost-bit (n)
  (declare (type fixnum n))
  (the fixnum (logand n (- n 1))))

could probably be written as a simple assembly instruction. Let me guide you through how what I thought would take up 20mins ended up being more than I bargained for.

It's just another VOP, right?

So the instruction we are looking for is BLSR which just unsets the rightmost set bit, exactly what we need. Let's just create a VOP to use it, much like we did for BSF:

(in-package :cl-user)

(sb-c:defknown unset-rightmost-bit ((unsigned-byte 64)) (integer 0 64)
    (sb-c:foldable sb-c:flushable sb-c:movable)
  :overwrite-fndb-silently t)

(in-package :sb-vm)

(define-vop (cl-user::unset-rightmost-bit)
  (:policy :fast)
  (:translate cl-user::unset-rightmost-bit)
  (:args (x :scs (any-reg) :target r))
  (:arg-types positive-fixnum)
  (:results (r :scs (unsigned-reg)))
  (:result-types positive-fixnum)
  (:generator 1
    (inst blsr x r)))

(in-package :cl-user)

(defun sth (a)
  (declare (type fixnum a)
           (optimize (speed 3) (safety 0)))
  (unset-rightmost-bit a))

which fails with:

Undefined instruction: BLSR in
 (INST BLSR X R)

Huh!? Grepping through the SBCL repo for BLSR does not match anything... Turns out, the compiler does not know how to emit that instruction. But at this point I'm invested. I don't want to just give up and call it a day, which, in hindsight, would have been the wise thing to do... Let's look at how BSF was defined as an instruction and just imitate that.

Teaching the compiler new words

I've already complained that SIMD and define-vop don't really come with any documentation except for a few sources and examples. Well define-instruction is not something I could really find anything on besides inside the SBCL source code. The plan is to look at some examples and hopefully grab one that looks close, adjust it and have it work.

Looking at existing instructions

Here is how BSF was taught to the compiler:

(define-instruction bsf (segment &prefix prefix dst src)
  (:printer ext-reg-reg/mem-no-width ((op #xBC)))
  (:emitter (emit* segment #xBC prefix dst src)))

For the compiler to be able to emit an assembly instruction, it must know the bytes of machine code that correspond to it. How do we know what bytes to emit? Open the Intel/AMD Developer Manual, find the instructions page and look at the opcode.

Opcode Instruction Op/En 64-bit Mode Compat/Leg Mode Description
0F BC /r BSF r16, r/m16 RM Valid Valid Bit scan forward on r/m16.
0F BC /r BSF r32, r/m32 RM Valid Valid Bit scan forward on r/m32.
REX.W + 0F BC /r BSF r64, r/m64 RM Valid N.E. Bit scan forward on r/m64.

Looking at the table in Intel's manual, the first 2 entries are pretty straight-forward, the instruction corresponds to bytes 0x0F 0xBC, that's the opcode. /r is part of the ModR/M byte. This is emitted after the opcode and encodes information about the instruction operands.

The third one is where things get more complicated. Some instructions require a prefix before the opcode. To use BSF with 64bit integers you need to first emit the REX prefix with the W (width) flag set to 1. I think in this case it would be 0b01001000.

I haven't dived deep enough to be able to explain everything but r16 indicates a 16-bit register as the source and r/m16 encodes the destination, either a 16-bit register or a 16-bit memory address.

This is a lot of information, I know it was for me but here is the recap for machine code instruction format:

[PREFIX] OPCODE MOD_RM

I am no expert and this might be lacking but it was good enough to guide me to a result and it should be enough to follow along.

Time to pull up the same table for BLSR and see if it's making any sense.

Opcode/Instruction Op/En 64/32-bit Mode CPUID Feature Flag Description
VEX.LZ.0F38.W0 F3 /1 BLSR r32, r/m32 VM V/V BMI1 Reset lowest set bit of r/m32, keep all other bits of r/m32 and write result to r32.
VEX.LZ.0F38.W1 F3 /1 BLSR r64, r/m64 VM V/N.E. BMI1 Reset lowest set bit of r/m64, keep all other bits of r/m64 and write result to r64.

This is starting to feel a lot like the difference between what's given for homework and what's on the final test...

VEX Prefix

To my dismay, VEX is far more complicated than REX... VEX was created to support instructions for SSE and AVX (among others), its encoding allows for access to YMM (128-bit) and XMM (256-bit) registers. I did not realize it at first but this is a big deal. It does not only impact the prefix emitted but also the ModR/M byte.

The section for encoding the register in the ModR/M byte is being repurposed for VEX instructions. Notice the /1 in the opcode, this is the value to be placed in the register segment but what does it mean? Apparently, there is a whole set of instructions that has the same prefix and opcode as BLSR and the only way to indicate which one you are going to use is by setting that value. As an example, /2 here would indicate BLMSK instead.

The VEX prefix can be either 2 or 3 bytes long, for BLSR it's going to be 3, you can tell by 0x0F38. If it was just 0x0F then it would be 2. This already give us the first byte of the prefix, 1/3 of the way there! Yeah, right...

First byte is 0xC4, treat it like a constant, a prefix prefix if you will. It all goes downhill from here. This is the second byte: 0b11100010. The first 3 1s are easy to guess, the opcode does not include R, X, or B and their inverse defines the first 3 bytes. The next 5 btyes are the opcode map to be used. This is where the Intel manual starts to disappoint and I found a way forward through the AMD manual where it clearly defines map_select for BLSR as being 2. Actually the AMD manual is great because it almost gives us the third byte too!

Remember W from the REX prefix, it's pretty much the same here, sets the 8th bit in the third byte. vvvv encodes the destination register, but we don't know which register will be used at this time... This is where we jump back into lisp land:

(logior #b10000000
        (ash (lognot (reg-id-num (reg-id dst))) 3))

We can just ask the compiler to find the register-id from the destination provided, invert it and stuff it into place. L (vector length) is 0, in the Intel manual this is indicated by LZ. pp is just 2 more prefix bits. While the Intel manual does not explicitly define them for this instruction it does give a hint in Volume 2 Chapter 3.1.1.2.

While writing this, it all seems a lot simpler, the manuals make sense and hold all the information you need but the first time you go into this it feels very chaotic and incomprehensible. Every little character in those initial tables packs so much information and unless you understand all of them then you cannot move forward.

A new instruction is born

We've already done all the hard work, now it's all about emitting the bytes:

(in-package :sb-x86-64-asm)

(sb-assem::define-instruction blsr (segment dst src)
  (:printer ...)
  (:emitter
   (let ((size (matching-operand-size dst src)))
     (when (not (eq size :qword))
       (error "BLSR instruction only supported for 64-bit operand size")))
   (emit-byte segment #xC4)
   (emit-byte segment #b11100010)
   (emit-byte segment (logior #b10000000
                              (ash (lognot (reg-id-num (reg-id dst))) 3)))
   ;; opcode
   (emit-byte segment #xf3)
   ;; mod-r/m byte
   (emit-byte segment (logior (ash (logior #b11000 1) 3)
                              (reg-encoding src segment)))))

Keep in mind that define-instruction is removed after SBCL compiles itself. So I did what any sensible person would and just copied it over from the source code, along with %def-inst-encoder.

The only thing we haven't covered is the ModR/M byte. Bits 7 and 6 indicate "register adressing" mode. Then bits 5, 4, and 3 are now used to indicate the variant (/1) while what's left encodes the source register. It's been a while since I put this much effort in 7 lines of code, going through such a journey just to hack this together has really expanded my understanding.

Enough peacocking, let's get back to work:

(define-vop (unset-rightmost-bit)
  (:policy :fast)
  (:translate unset-rightmost-bit)
  (:args (x :scs (unsigned-reg) :target r))
  (:arg-types positive-fixnum)
  (:results (r :scs (unsigned-reg)))
  (:result-types positive-fixnum)
  (:generator 1
    (inst blsr x r)))

Do I honestly think that this is going to make a difference in speed? Not really, but I just had to give it a try.

Benchmarks

No benchmarks this time, unset-rightmost-bit was called only when un-escaping characters. My sample JSON did not have any and are probably not that common for this to matter.

What's next?

Since SBCL is lacking this and other similar instructions, I want to attempt and make an actual contribution, adding all of them. But it's going to take some polishing. Did you notice how I casually omitted the :printer part of the instruction? Didn't really do it to save space, just hid it because it's not fully done yet.