Crafting a JSON parser with SIMD: Finding the rightmost set bit

March 22, 2024

One of the more common uses of SIMD in this project is using it for character comparisons. Instead of comparing characters one by one, you load their numerical value into an unsigned-byte array of length 32 (32 * 8 bits = 256 bits which is the size of the AVX SIMD pack) and then compare it (sb-simd-avx2:u8.32=) with the numeric value of another character. What you get back from this comparison is another SIMD pack where values are either 0, for false, or 255, for true.

(sb-simd-avx2:u8.32= my-chunk
                     (sb-simd-avx2:u8.32 (char-code #\")))

Obviously, looping over that vector largely defeats the purpose of usign SIMD. Thankfully, there is a better way, though slightly more convoluted... You can take the result of the previous comparison and turn it into a bitmap using sb-simd-avx2:u8.32-movemask. movemask is a tricky one, it took me a while to understand why the result is exactly what it's supposed to be. Let's look at an example:

(defun why-is-my-bitmap-like-that ()
  (let ((chunk (make-array (list +chunk-length+)
                           :element-type '(unsigned-byte 8)
                           :initial-element 0))
        (pack-of-1s (sb-simd-avx2:u8.32 1)))
    (setf (aref chunk 1) 1)
    (let* ((chunk (sb-simd-avx2:u8.32-aref chunk 0)) ;; convert the array to a SIMD pack
           (pack-one-p (sb-simd-avx2:u8.32= chunk pack-of-1s))
           (bitmap (sb-simd-avx2:u8.32-movemask pack-one-p)))
      (format nil "~b" bitmap))))

(why-is-my-bitmap-like-that) ;; => "10"

I would expect my result to be: "0100 0000 0000 0000 0000 0000 0000 0000". Well, it's "backwards". Not really, because the first item of the array has been moved into the first bit of the bitmap and all leading zeroes have been ommitted. Makes a lot of sense when you think about it like that...

How does this help us? The 1s in the bitmap represent all the characters in the initial string that are of interest, and we can skip over anything that's a 0. To do that, my initial idea was a bithack:

(defun rightmost-bit (n)
  (declare (type fixnum n))
  (the fixnum
    (logand (1+ (lognot n)) n)))

(defun rightmost-bit-index (n)
  "Returns a 0-based index of the location of the rightmost set bit of `n'."
  (the fixnum (truncate (log (rightmost-bit n) 2))))

(defun unset-rightmost-bit (n)
  (declare (type fixnum n))
  (logxor n (rightmost-bit n)))

You take a number n, flip the bits, add 1, use bitwise-and with the original number and you end up isolating the rightmost set bit. Almost there, here comes the ugly part. We need to convert that number into the 0-based index of the set bit in order to use it as an offset and locate the original character in the string we are parsing. There are really 2 ways to do this (that I know of):

  1. use the log function with base 2 (what you see above)
  2. basically brute force it by shifting a bit from index 0 to the left until it equals the original number.

Looking for something built into SBCL did not lead anywhere either. I hate both, for different reasons. It was good enough for a while but at some point rightmost-bit-index popped up in my profiler:

  seconds  |     gc     |    consed   |    calls   |  sec/call  |  name
--------------------------------------------------------------
     0.172 |      0.000 |           0 |  1,579,605 |   0.000000 | RIGHTMOST-BIT-INDEX
     0.125 |      0.000 |           0 |  3,672,540 |   0.000000 | SKIP-TO-NEXT-CHARACTER
     0.125 |      0.000 |           0 |  5,326,737 |   0.000000 | NOT-WHITESPACE-P
     0.094 |      0.016 | 227,813,712 |  1,579,605 |   0.000000 | CHUNK
     0.078 |      0.000 |  44,410,608 |    748,191 |   0.000000 | %PARSE-STRING
     0.078 |      0.000 |  13,146,592 |    261,399 |   0.000000 | %PARSE-ARRAY
     0.078 |      0.000 |           0 |  1,285,566 |   0.000000 | PARSE
     0.078 |      0.000 |  57,041,792 |     90,003 |   0.000001 | %PARSE-OBJECT
     0.031 |      0.000 |           0 |    604,161 |   0.000000 | %PARSE-NUMBER
     0.031 |      0.000 |   6,520,832 |    402,774 |   0.000000 | %PARSE-DECIMAL-SEGMENT
     0.016 |      0.000 |           0 |  1,496,382 |   0.000000 | NEXT-OFFSET
     0.000 |      0.000 |           0 |      1,818 |   0.000000 | %PARSE-NULL
     0.000 |      0.000 |           0 |  1,579,605 |   0.000000 | RIGHTMOST-BIT
--------------------------------------------------------------
     0.906 |      0.016 | 348,933,536 | 18,628,386 |            | Total

Throughout my journey with SIMD my biggest resource was people playing with similar things in C/C++. Most of the time I could look into what they are doing and roughly translate that to CL. That's how I learned the basics of SIMD, documentation for it in SBCL is scarce to say the least.

This time I was less enthusiastic to see their solution. Modern CPUs have a specific set of instructions to do these kinds of operations. If they are cheating then so am I.

Defining new intrinsics

BSF, or bit-scan forward, does almost the same job as rightmost-bit-index only probably faster because it's baked into your CPU and exposed as a single assembly instruction. They differ in that BSF used a 1-based index for the set bit. The issue is that SBCL does not expose it in any way. The only option that remains is to expose it ourselves through a virtual operator.

If SIMD in SBCL feels like a dark room with no light switch, then defining your own virtual operators feels like you are in a dark forest. It's one of those things that you have to learn by reading through the existing examples, and there are many in SBCL's repo but I still felt lost. This is the only real resource that explains some of the things you see used in virtual operators and that's what I blindly followed:

(in-package :cl-user)

(sb-c:defknown bit-scan-forward ((unsigned-byte 64)) (integer 0 64)
    (sb-c:foldable sb-c:flushable sb-c:movable)
  :overwrite-fndb-silently t)

(in-package :sb-vm)

(define-vop (cl-user::bit-scan-forward)
  (:policy :fast)
  (:translate cl-user::bit-scan-forward)
  (:args (x :scs (any-reg) :target r))
  (:arg-types positive-fixnum)
  (:results (r :scs (unsigned-reg)))
  (:result-types positive-fixnum)
  (:generator 1
    (inst bsf x r)
    (inst dec r)))

(in-package :cl-user)

(bit-scan-forward 3) ;; => 0

Here is how the generator is read:

  • apply BSF to x and store it in r (extracting the rightmost-set-bit)
  • decrement r by 1 (converting it to 0-based)

Converting it to 0-based does imply that any value passed to bit-scan-forward must be a positive fixnum, so we have to check in advance.

Downsides

You might be wondering, is there a downside in doing this? Always. I am not going to mention any portability concerns as those went out the window the moment I decided to use SIMD for this. The real downside I noticed is that the compiler will not generate the assembly we provided for bit-scan-forward unless it is absolutely certain that it is being given a fixnum. In any other case it will scream out UNDEFINED FUNCTION. So you do have to be extra careful when calling it.

If you do wrap it in it's own function that ensures that the argument is a fixnum then you do get a small performance hit. The function call overhead here is significant given that we are only executing 2 lines of assembly inside it. If you inline that function you seem to lose the type guarantee/check, a nasty trap!

Benchmarks

Let's look at the difference. First the 2 functions in isolation:

> (time (loop for i from 1 to 100000000 do (rightmost-bit-index i)))
Evaluation took:
  1.284 seconds of real time
  0.375000 seconds of total run time (0.375000 user, 0.000000 system)
  29.21% CPU
  2,713,127,471 processor cycles
  0 bytes consed

> (time (loop for i from 1 to 100000000 do (bit-scan-forward i)))
Evaluation took:
  0.011 seconds of real time
  0.000000 seconds of total run time (0.000000 user, 0.000000 system)
  0.00% CPU
  24,539,824 processor cycles
  0 bytes consed

A nice 11,572% improvement. Not too bad. But the real test is hooking it up in the parser and seeing how that was affected. Here we go!

;; before (rightmost-bit-index)
(time (loop for i from 0 to 100 do (parse *data*)))
Evaluation took:
  17.192 seconds of real time
  1.796875 seconds of total run time (1.703125 user, 0.093750 system)
  [ Real times consist of 2.212 seconds GC time, and 14.980 seconds non-GC time. ]
  [ Run times consist of 0.171 seconds GC time, and 1.626 seconds non-GC time. ]
  10.45% CPU
  36,311,175,076 processor cycles
  5,323,226,416 bytes consed

;; after (bit-scan-forward)
(time (loop for i from 0 to 100 do (parse *data*)))
Evaluation took:
  10.616 seconds of real time
  0.984375 seconds of total run time (0.890625 user, 0.093750 system)
  [ Real times consist of 2.251 seconds GC time, and 8.365 seconds non-GC time. ]
  [ Run times consist of 0.109 seconds GC time, and 0.876 seconds non-GC time. ]
  9.27% CPU
  22,422,362,573 processor cycles
  5,323,339,984 bytes consed

NOTE: *data* here is a 10mb JSON document

Speedup translates quite well! Now all this work just let us get rid of rightmost-bit-index but we are still using rightmost-bit and unset-rightmost-bit, which are not too bad but we can perform the same process with the equivalent BMIs and speed this up even more.

Check the code here.

EDIT: Of course, after I finished writing this it occurred to me to grep the SBCL source code for bsf and found a similar vop named unsigned-word-find-first-bit.