Crafting a JSON parser with SIMD: Slowing down to go faster

March 29, 2024

When I started this project, I chose the AVX instruction set because it was the only one that could process 256-bit vectors. Every other one (excluding AVX-512) is limited to 128 bits. This translates to processing 32 characters per instruction. Sounds way better than 16, right? In theory it should be faster, but calling out to AVX is not free: there is some overhead in creating chunks from the string being parsed in order to build the SIMD pack.
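
To give a rough idea of where that overhead comes from, here is a sketch of the per-chunk copying involved before a pack can be loaded; the names and details are illustrative, not the parser's actual code:

;; Illustrative sketch: the per-chunk work of copying characters out of
;; the string into a byte buffer before a SIMD pack can be loaded from it.
;; Bounds handling omitted; assumes ASCII input.
(defun fill-chunk (string start buffer)
  (declare (type simple-string string)
           (type (simple-array (unsigned-byte 8) (32)) buffer)
           (type fixnum start))
  (loop for i of-type fixnum below 32
        do (setf (aref buffer i)
                 (char-code (char string (+ start i)))))
  buffer)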

Since we are taking the time to create the SIMD pack, we want to get as much information out of it as possible. In some cases, such as parsing strings, it's worth reusing the bitmask calculated from a chunk to look for both characters that need escaping and double quotes. In every other case, though, we only care about the rightmost set bit of the bitmask: it's a pivot point after which we need to look for something completely different, which makes the bitmask pretty much single-use.
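
As an illustration of that reuse, suppose quote-mask and escape-mask are the two integer bitmasks already computed from one chunk's comparisons (the function below and its names are hypothetical, not the parser's API); both masks can then be consulted without touching the chunk again:

;; Illustrative: given two bitmasks computed from a single chunk,
;; decide what to handle next without re-reading the chunk.
(defun next-string-event (quote-mask escape-mask)
  (let ((combined (logior quote-mask escape-mask)))
    (if (zerop combined)
        (values :skip-chunk nil)
        ;; position of the lowest set bit = offset of the next
        ;; interesting character within the chunk
        (let ((offset (1- (integer-length (logand combined (- combined))))))
          (values (if (logbitp offset escape-mask) :escape :quote)
                  offset)))))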

That would make the ideal bitmask look like this:

1000 0000 0000 0000 0000 0000 0000 0000

Skipping as many characters as possible within the chunk while still finding the anchor point. My thinking is that this is very unlikely; in fact, the opposite is more likely to be true. Unless someone creates a horribly formatted JSON document on purpose, the anchor points are not going to be far apart. Even JSON keys are unlikely to be over 10 characters.
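
To make that concrete, the position of the lowest set bit in the bitmask is exactly the number of characters the scan gets to skip, so the ideal mask above skips 31 of the 32. A small helper in plain CL, purely for illustration:

;; Illustrative: how many characters a 32-wide bitmask lets us skip.
(defun characters-skipped (bitmask)
  (if (zerop bitmask)
      32   ; nothing interesting in this chunk, skip it entirely
      (1- (integer-length (logand bitmask (- bitmask))))))

;; The ideal mask above skips 31 characters:
;; (characters-skipped #b10000000000000000000000000000000) => 31
;; A mask from tightly packed JSON skips only a few:
;; (characters-skipped #b1000) => 3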

Let's look at an example: here is the API response from an OpenWeather endpoint:

{
  "coord": {
    "lon": 10.99,
    "lat": 44.34
  },
  "weather": [
    {
      "id": 501,
      "main": "Rain",
      "description": "moderate rain",
      "icon": "10d"
    }
  ],
  "base": "stations",
  "main": {
    "temp": 298.48,
    "feels_like": 298.74,
    "temp_min": 297.56,
    "temp_max": 300.05,
    "pressure": 1015,
    "humidity": 64,
    "sea_level": 1015,
    "grnd_level": 933
  },
  "visibility": 10000,
  "wind": {
    "speed": 0.62,
    "deg": 349,
    "gust": 1.18
  },
  "rain": {
    "1h": 3.16
  },
  "clouds": {
    "all": 100
  },
  "dt": 1661870592,
  "sys": {
    "type": 2,
    "id": 2075663,
    "country": "IT",
    "sunrise": 1661834187,
    "sunset": 1661882248
  },
  "timezone": 7200,
  "id": 3163858,
  "name": "Zocca",
  "cod": 200
}

Pay attention to the anchor points and how close together they are. The closing quote of a key is adjacent to the : and the value begins after a single space. The , is also adjacent to the end of the value, and the next key begins after a linefeed and 2-6 spaces. The data is relatively tightly packed, and this is the human-readable version; it's not uncommon to have the linefeeds and indentation removed to decrease size.

Now let's look at the keys: description is the longest at 11 characters, while the average is much shorter. Even the longest string value would comfortably fit in a 16-character vector. We could look at 100 more examples and collect more accurate statistics, but I feel confident that this is a fair assumption to draw and build on. Paying the cost of moving 32 characters into a SIMD pack seems rather wasteful and unlikely to help. AVX supports 128-bit vector instructions as well, so it's worth a test just to get an idea of the difference it would make.
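
As a quick sanity check of that claim, counting the keys that appear in the example above with plain CL confirms the longest is 11 characters:

;; Quick check over the keys that appear in the example response:
(reduce #'max
        '("coord" "lon" "lat" "weather" "id" "main" "description" "icon"
          "base" "temp" "feels_like" "temp_min" "temp_max" "pressure"
          "humidity" "sea_level" "grnd_level" "visibility" "wind" "speed"
          "deg" "gust" "rain" "1h" "clouds" "all" "dt" "sys" "type"
          "country" "sunrise" "sunset" "timezone" "name" "cod")
        :key #'length)
;; => 11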

Benchmarks

;; chunk size of 32
(time (loop for i from 0 to 100 do (parse *data*)))
Evaluation took:
  10.616 seconds of real time
  0.984375 seconds of total run time (0.890625 user, 0.093750 system)
  [ Real times consist of 2.251 seconds GC time, and 8.365 seconds non-GC time. ]
  [ Run times consist of 0.109 seconds GC time, and 0.876 seconds non-GC time. ]
  9.27% CPU
  22,422,362,573 processor cycles
  5,323,339,984 bytes consed

;; chunk size of 16
(time (loop for i from 0 to 100 do (parse *data*)))
Evaluation took:
  6.362 seconds of real time
  2.015625 seconds of total run time (1.640625 user, 0.375000 system)
  [ Real times consist of 2.084 seconds GC time, and 4.278 seconds non-GC time. ]
  [ Run times consist of 0.625 seconds GC time, and 1.391 seconds non-GC time. ]
  31.69% CPU
  13,438,524,761 processor cycles
  4,875,842,528 bytes consed

A good improvement, not only in time but also in space.

Comparing to standard parsers

More importantly, this change marks the point where the SIMD implementation became faster than other CL parsers. Here are the benchmarks for cl-json and jsown.

(time (loop for i from 0 to 100 do (cl-json:decode-json-from-string *data*)))
Evaluation took:
  29.079 seconds of real time
  9.562500 seconds of total run time (8.625000 user, 0.937500 system)
  [ Real times consist of 3.137 seconds GC time, and 25.942 seconds non-GC time. ]
  [ Run times consist of 0.906 seconds GC time, and 8.657 seconds non-GC time. ]
  32.88% CPU
  61,415,718,331 processor cycles
  25,724,477,840 bytes consed

(time (loop for i from 0 to 100 do (jsown::parse *data*)))
Evaluation took:
  9.593 seconds of real time
  1.750000 seconds of total run time (1.593750 user, 0.156250 system)
  [ Real times consist of 1.497 seconds GC time, and 8.096 seconds non-GC time. ]
  [ Run times consist of 0.296 seconds GC time, and 1.454 seconds non-GC time. ]
  18.24% CPU
  20,261,563,049 processor cycles
  5,149,047,872 bytes consed

To be completely fair, both of these libraries support a lot more features than mine, which could be dragging down their performance. Shout out to jsown, which took a lot of effort to outperform.

Closing thoughts

Using SIMD is not a free ticket to performance. Parsing becomes more complex and there are a lot of details that can drag performance down. It took a lot of iterations to get something that is even slightly better than what's already out there for Lisp.

This improvement has probably been the easiest to implement, code-wise, and it only makes parsing faster when the assumption holds. If JSON object keys were longer than 16 characters or anchor points were further apart, this would actually slow things down quite a bit. The speed-up may translate quite differently depending on the contents, but I think in most cases this is going to be better.

There is still the option of doing things in a hybrid way and utilizing both vector sizes, though I feel the added complexity outweighs the benefits.

What's next?

You might have noticed that out of the 6.3s of real time, about 2s are spent in GC. If allocations were reduced, we might see another speed-up from the GC running less often. So that's where I will be focusing next: either trying to stack-allocate things or pulling some other trick.
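
For reference, the kind of trick I have in mind is SBCL's dynamic-extent declaration, which can let short-lived temporaries live on the stack instead of being consed. A toy sketch, not code from the parser:

;; Toy example: BUFFER is declared dynamic-extent, so SBCL can
;; stack-allocate it instead of consing a fresh vector on every call.
;; Assumes single-byte characters.
(defun sum-char-codes (string)
  (let ((buffer (make-array (length string) :element-type '(unsigned-byte 8))))
    (declare (dynamic-extent buffer))
    (loop for i below (length string)
          do (setf (aref buffer i) (char-code (char string i))))
    (reduce #'+ buffer)))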

What about SSE?

SSE could be a more portable option since it has been around longer. For the sake of completeness, I gave it a try and it performed almost identically to AVX with 128-bit vectors. It does come with some downsides though.

The first is that it has fewer instructions available; even with my limited use of AVX, I ran into this during the conversion when u8.16/= turned out not to be available in SSE. SSE also has fewer registers available, and since we can do the same thing with AVX, I don't see any reason to switch over to it completely.

Here is the commit with SSE being used.