Fast JSON Decoding for Local LLMs with Compressed Finite State Machine

Description

This summary was drafted with mixtral-8x7b-instruct-v0.1.Q5_K_M.gguf

Liangsheng Yin, Ying Sheng, and Lianmin Zheng (LMSYS) present an optimization for constrained decoding of JSON or YAML in local LLMs (large language models). The method uses a compressed finite state machine that can be built from any regular expression, so it can accommodate any JSON or YAML schema expressed that way. It analyzes the finite state machine of the regular expression and compresses singular transition paths, meaning runs of states where only one next transition is valid. Because the output along such a path is fully determined, the decoder can emit multiple tokens in a single step whenever feasible, which significantly accelerates decoding. According to the authors, this makes constrained decoding even faster than normal, unconstrained decoding. They compare their method with existing systems such as guidance + llama.cpp and outlines + vLLM, demonstrating up to a 2x reduction in latency and a 2.5x boost in throughput.
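
The idea of compressing singular transition paths can be illustrated with a minimal sketch. This is not the authors' implementation: it works on characters rather than tokens, the FSM is built by hand for a hypothetical pattern {"name": "[a-z]+"}, and the names build_toy_fsm and follow_singular_path are made up for this example. It only shows why a run of single-choice states can be emitted in one step without calling the model.

```python
from typing import Dict, Tuple

# Toy character-level FSM: state -> {character: next_state}.
# Assumption: a real system would work on tokens, not characters.
FSM = Dict[int, Dict[str, int]]


def build_toy_fsm() -> Tuple[FSM, int]:
    """Hand-build an FSM for the hypothetical pattern {"name": "[a-z]+"}."""
    fsm: FSM = {}
    state = 0
    # The literal prefix forms a chain where each state has one outgoing edge.
    for ch in '{"name": "':
        fsm[state] = {ch: state + 1}
        state += 1
    letters = "abcdefghijklmnopqrstuvwxyz"
    value_start, value_loop = state, state + 1
    # The value needs at least one letter, then any number of further letters.
    fsm[value_start] = {c: value_loop for c in letters}
    fsm[value_loop] = {c: value_loop for c in letters}
    fsm[value_loop]['"'] = value_loop + 1      # closing quote ends the value
    fsm[value_loop + 1] = {"}": value_loop + 2}
    fsm[value_loop + 2] = {}                   # accepting state, no edges
    return fsm, 0


def follow_singular_path(fsm: FSM, state: int) -> Tuple[str, int]:
    """Follow edges while the current state has exactly one choice.

    Returns the deterministic text and the state where a real choice
    (or the end of the pattern) is reached.
    """
    emitted = []
    seen = {state}  # guard against single-edge cycles
    while len(fsm[state]) == 1:
        ch, nxt = next(iter(fsm[state].items()))
        emitted.append(ch)
        state = nxt
        if state in seen:
            break
        seen.add(state)
    return "".join(emitted), state


if __name__ == "__main__":
    fsm, start = build_toy_fsm()
    text, state = follow_singular_path(fsm, start)
    print(repr(text), state)
```

In this toy, the whole literal prefix is produced in one step from the start state, and the model would only be consulted once the FSM reaches the state where several letters are valid. A token-level version also has to handle how the emitted text is tokenized, which this sketch leaves out.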

