According to foreign media, Google Research on Tuesday (24th) released TurboQuant, a training-free compression algorithm that can compress the key-value (KV) cache of large language models (LLMs) to 3 bits without affecting model accuracy. In benchmark tests on Nvidia (NVDA.US) H100 GPUs, the 4-bit variant of TurboQuant computed attention logits up to 8x more efficiently than unquantized 32-bit keys, while reducing KV cache memory by at least 6x.

Memory stocks SanDisk (SNDK.US) and Micron Technology (MU.US) fell 3.5% and 3.4% respectively overnight (25th).
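The article does not describe TurboQuant's actual algorithm. As a rough, generic illustration of why low-bit KV-cache quantization saves memory, the sketch below implements a simple per-row uniform 4-bit quantizer in NumPy; the min-max scheme, function names, and sizes are illustrative assumptions and are not Google's method. With two 4-bit codes packed per byte plus per-row scale/offset metadata, the storage ratio versus fp32 lands in the ballpark of the "at least 6x" figure quoted above.

```python
import numpy as np

def quantize_4bit(x, axis=-1):
    """Per-row uniform 4-bit quantization (generic illustration only;
    NOT the TurboQuant algorithm, whose details are not in the article)."""
    lo = x.min(axis=axis, keepdims=True)
    hi = x.max(axis=axis, keepdims=True)
    scale = (hi - lo) / 15.0                     # 4 bits -> 16 levels (0..15)
    scale = np.where(scale == 0, 1.0, scale)     # guard constant rows
    q = np.clip(np.round((x - lo) / scale), 0, 15).astype(np.uint8)
    return q, scale, lo

def dequantize(q, scale, lo):
    """Reconstruct approximate fp32 values from 4-bit codes."""
    return q.astype(np.float32) * scale + lo

# Toy "key" cache: 128 cached tokens, head dimension 64 (hypothetical sizes).
rng = np.random.default_rng(0)
k = rng.standard_normal((128, 64)).astype(np.float32)

q, scale, lo = quantize_4bit(k)
k_hat = dequantize(q, scale, lo)

# Storage: two 4-bit codes per byte, plus fp32 scale and offset per row.
packed_bytes = q.size / 2 + scale.size * 4 + lo.size * 4
ratio = (k.size * 4) / packed_bytes              # fp32 bytes / quantized bytes
```

For these toy sizes the metadata overhead pulls the ratio below the ideal 8x of a pure 32-bit-to-4-bit conversion, which is consistent with the article reporting "at least 6x" memory reduction rather than a full 8x.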
AASTOCKS Financial News