TENCENT Hunyuan AI Infra Open-Sources Upgraded HPC-Ops Inference Core Operators

TENCENT Hunyuan announced that its HPC-Ops inference operator library has undergone a system-level upgrade, evolving from standalone operators into a comprehensive optimization suite covering the entire inference pipeline, including five key operators.

This upgrade addresses real-world engineering bottlenecks on mainstream inference platforms, such as long-tail latency in Attention, GPU memory transfer overhead, and cross-card communication. On multiple performance metrics, the library significantly outperforms existing open-source baselines.

HPC-Ops is an industrial-grade, high-performance operator library for large model inference, open-sourced and maintained over the long term by the TENCENT Hunyuan AI Infra team. Key highlights of this upgrade include:

Attention: Real workloads mix short and long requests, which causes computation imbalance and long-tail latency. The upgrade adopts runtime dynamic load scheduling to address this. Tests show up to 2.95x acceleration in long-text scenarios and up to a 17% improvement in end-to-end QPM (queries per minute).
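The article does not describe how the scheduling works, so the following is only a minimal sketch of the general idea, assuming attention cost grows linearly with the number of cached tokens. The function names, the chunk size, and the greedy least-loaded policy are illustrative assumptions, not the HPC-Ops implementation.

```python
import heapq

def static_makespan(lengths, workers):
    """Assign whole requests round-robin; return the busiest worker's load."""
    load = [0] * workers
    for i, n in enumerate(lengths):
        load[i % workers] += n
    return max(load)

def dynamic_makespan(lengths, workers, chunk):
    """Split each request into chunks of at most `chunk` tokens and give
    every chunk to whichever worker currently has the least work."""
    chunks = []
    for n in lengths:
        full, rest = divmod(n, chunk)
        chunks.extend([chunk] * full)
        if rest:
            chunks.append(rest)
    heap = [0] * workers  # min-heap of per-worker load (hypothetical cost unit: tokens)
    for c in chunks:
        least = heapq.heappop(heap)
        heapq.heappush(heap, least + c)
    return max(heap)

# One long request mixed with several short ones (made-up lengths).
lengths = [32000, 512, 256, 1024, 128, 2048, 512, 256]
print(static_makespan(lengths, 4), dynamic_makespan(lengths, 4, 1024))
```

With static assignment, the worker holding the 32000-token request determines the finish time. With chunked assignment, the busiest worker finishes within one chunk of the average load. A real kernel that splits one request across workers would also need to merge the partial attention results, which this sketch leaves out.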

Router GEMM: Achieves FP32-level precision by combining two BF16 GEMMs, balancing inference accuracy and GPU utilization. Precision is significantly better than conventional BF16/TF32 approaches, with up to a 3.22x speedup over cuBLAS FP32.
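How the two BF16 GEMMs are combined is not spelled out in the article. One common way to recover precision from a low-precision format is to split each input into a BF16 "high" part and a BF16 residual, run the product twice, and add the results. The sketch below illustrates that idea under stated assumptions: the weights are already exactly representable in BF16, only the activations are split, and Python's double-precision sums stand in for hardware accumulators. All names are hypothetical.

```python
import struct

def to_bf16(x):
    """Round a finite float to BF16 by keeping the top 16 bits of its FP32
    encoding, with round-to-nearest-even."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)
    bits &= 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]

def split_bf16(x):
    """Return (hi, lo) where hi is x rounded to BF16 and lo is the BF16
    rounding of what was lost."""
    hi = to_bf16(x)
    return hi, to_bf16(x - hi)

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def logit_single(x, w):
    """One BF16 product: activations rounded once, residual discarded."""
    return dot([to_bf16(v) for v in x], w)

def logit_dual(x, w):
    """Two BF16 products: one on the high parts, one on the residuals."""
    hi, lo = zip(*(split_bf16(v) for v in x))
    return dot(hi, w) + dot(lo, w)

# Made-up activations whose low bits do not fit in BF16; weights fit exactly.
x = [1 + 2**-10, 2 + 2**-9, 0.5 + 2**-11, 4 + 2**-8]
w = [1.0, 0.5, 2.0, 0.25]
reference = dot(x, w)
print(abs(logit_single(x, w) - reference), abs(logit_dual(x, w) - reference))
```

In this hand-picked example the residual captures everything the first rounding dropped, so the two-product result matches the reference exactly. In general a two-term split roughly doubles the number of significand bits carried (from 8 to about 16), which reduces the error but does not remove it.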

FusedMoE: Builds a full-module MoE pipeline that integrates the multi-stage process while eliminating GPU memory transfer and kernel launch overhead. Compared with mainstream frameworks such as vLLM and SGLang, performance improves by 1.2-1.6x.

Fused AllReduce+Norm: Deeply integrates cross-GPU communication, residual addition, and normalization. Compared with mainstream solutions including NCCL and FlashInfer, it achieves 1.04-1.68x acceleration.
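To make the fusion concrete, here is a small simulation that treats each GPU's partial output as a Python list. It assumes the normalization is RMSNorm, which the article does not specify, and the function names are illustrative. The unfused version walks the hidden vector three times; the fused version produces the same result in one pass over the elements plus a final scaling.

```python
import math

def rms_norm(v, gamma, eps=1e-6):
    scale = 1.0 / math.sqrt(sum(t * t for t in v) / len(v) + eps)
    return [t * scale * g for t, g in zip(v, gamma)]

def unfused(partials, residual, gamma):
    """Three separate steps, each reading and writing the full vector."""
    reduced = [sum(col) for col in zip(*partials)]        # all-reduce (sum over ranks)
    hidden = [r + h for r, h in zip(reduced, residual)]   # residual add
    return hidden, rms_norm(hidden, gamma)                # normalization

def fused(partials, residual, gamma, eps=1e-6):
    """One loop that reduces, adds the residual, and accumulates the sum of
    squares needed by the norm, followed by a single scaling step."""
    hidden, sq = [], 0.0
    for i, res in enumerate(residual):
        h = sum(p[i] for p in partials) + res
        hidden.append(h)
        sq += h * h
    scale = 1.0 / math.sqrt(sq / len(hidden) + eps)
    return hidden, [h * scale * g for h, g in zip(hidden, gamma)]

# Two simulated ranks, hidden size 4 (made-up values).
partials = [[1.0, 2.0, 3.0, 4.0], [0.5, 0.5, 0.5, 0.5]]
residual = [1.0, 1.0, 1.0, 1.0]
gamma = [1.0, 1.0, 1.0, 1.0]
print(fused(partials, residual, gamma))
```

On real hardware the benefit comes from avoiding extra round trips to GPU memory and extra kernel launches between the three steps, which a CPU simulation like this cannot measure.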

Sampler: Consolidates sampling computation in the decoding stage, which originally required more than ten operator steps, into two CUDA kernels, significantly reducing scheduling, read-write, and synchronization overhead. It is 4.0-7.5x faster than vLLM and 1.9-4.7x faster than FlashInfer, addressing inference-end bottlenecks.
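The article does not say which steps go into which kernel. As a rough illustration only, the sketch below groups a typical sampling chain (temperature scaling, softmax, top-k, top-p, renormalization, random draw) into two stages. The split point, function names, and parameters are assumptions made for this example.

```python
import math

def filter_probs(logits, temperature, top_k, top_p):
    """Stage 1: temperature, softmax, top-k, top-p, and renormalization in
    one function. Returns (token_id, probability) pairs, most likely first."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]  # subtract max for stability
    total = sum(exps)
    order = sorted(range(len(exps)), key=lambda i: exps[i], reverse=True)[:top_k]
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += exps[i] / total
        if mass >= top_p:  # stop once the nucleus covers top_p
            break
    norm = sum(exps[i] for i in kept)
    return [(i, exps[i] / norm) for i in kept]

def draw(candidates, u):
    """Stage 2: inverse-CDF sampling given a uniform number u in [0, 1)."""
    acc = 0.0
    for token, p in candidates:
        acc += p
        if u < acc:
            return token
    return candidates[-1][0]  # guard against rounding at the upper edge

logits = [2.0, 1.0, 0.5, -1.0]  # made-up values
candidates = filter_probs(logits, temperature=1.0, top_k=3, top_p=0.9)
print(candidates, draw(candidates, 0.7))
```

Passing the random number `u` in explicitly keeps the sketch deterministic and easy to check; a GPU sampler would generate it on the device.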


This article was automatically translated by AI; the original language version should be considered the authoritative version. AASTOCKS.com Limited does not guarantee its accuracy or completeness and accepts no liability for any damages or losses arising from the use of this translation.

AASTOCKS Financial News
