TENCENT Hunyuan AI Infra Open-Sources Upgraded HPC-Ops Inference Core Operators

Recommend

Positive

Negative

SHORT SELL

STOCK INFO

aacat by AASTOCKS
Open a new trading account to get 1 Intel/Selected Gold ETF share!

TENCENT Hunyuan announced that its HPC-Ops inference operator library has undergone a system-level upgrade, evolving from standalone operators into a comprehensive optimization suite covering the entire inference pipeline, including five key operators.

This upgrade effectively addresses real-world engineering bottlenecks on mainstream inference platforms, such as long-tail latency in Attention, GPU memory transfer overhead, and cross-card communication. Multiple performance metrics significantly outperform existing open-source baselines.

Related NewsCiti: TENCENT (00700.HK) WeChat Mini Programs Smoothly Integrating into AI Ecosystem; Reiterates "Buy"
HPC-Ops is an industrial-grade, high-performance large model inference operator library open-sourced and long maintained by the TENCENT Hunyuan AI Infra team. Key highlights of this upgrade include:

Attention: To tackle computation imbalance and long-tail inference issues caused by mixed short and long requests under real workloads, a runtime dynamic load scheduling solution is adopted. Tests show up to 2.95x acceleration for long-text scenarios and up to 17% improvement in end-to-end QPM.

Router GEMM: To achieve FP32-level high-precision computation through a dual BF16 GEMM combination, balancing inference accuracy and GPU utilization. Precision is significantly superior to conventional BF16/TF32 solutions, with up to 3.22x speedup compared with CuBLAS FP32.

Related News G Sachs: TENCENT (00700.HK) Valuation Has Bottomed, Maintains Buy Rating with TP HKD700
FusedMoE: To establish a full-module MoE pipeline, integrating multi-stage processes while eliminating GPU memory transfer and kernel launch overhead. Compared with mainstream frameworks such as vLLM and SGLang, performance improves by 1.2-1.6x.

Fused AllReduce+Norm: To deeply integrate cross-GPU communication, residual addition, and normalization computation. Compared with mainstream solutions including NCCL and FlashInfer, performance achieves 1.04-1.68x acceleration.

Sampler: To consolidate sampling computation in the decoding stage, originally requiring more than ten operator steps, into two CUDA kernels, significantly reducing scheduling, read-write, and synchronization overhead. Compared with vLLM, speed increases by 4.0-7.5x, and by 1.9-4.7x versus FlashInfer, addressing inference-end bottlenecks.

Related NewsTENCENT (00700.HK) Gains 3% as BofAS Says WeChat AI Agent’s Tangible Progress Supports Rating Re-rating

Auto-translated by AI

View Original Text

This article was automatically translated by AI, the original language version should be considered the authoritative version. AASTOCKS.com Limited does not guarantee its accuracy or completeness and accepts no liability for any damages or losses arising from the use of this translation.

AASTOCKS Financial News

Copyright(C) AASTOCKS.com Limited 2000. All rights reserved.
Disclaimer: AASTOCKS.com Ltd, HKEx Information Services Limited, its holding companies and/or any subsidiaries of such holding companies endeavour to ensure the accuracy and reliability of the Information provided but do not guarantee its accuracy or reliability and accept no liability (whether in tort or contract or otherwise) for any loss or damage arising from any inaccuracies or omissions.