Chinese AI startup Moonshot AI recently released a technical report on its model Kimi, proposing a new architecture called “Attention Residuals” that attempts to rethink the residual design Transformers have long relied on. Shortly after the report came out, Elon Musk commented on social media, “Impressive work from Kimi,” quickly drawing attention to the technique.
The Chinese AI model Kimi extends attention from between tokens to between model layers.
Kimi’s focus this time is a core Transformer mechanism that has rarely been rethought: the residual connection. Since ResNet, most models have simply added each layer’s output straight back into the residual stream, with equal weight. This approach is simple and stable, but once a model becomes very deep, problems arise: information accumulated from the earlier layers piles up, and new signals struggle to take effect, or are even drowned out, making the model harder to train.
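To make the baseline concrete, here is a minimal NumPy sketch of the standard residual stream described above. The toy `layer` function, dimensions, and weights are illustrative assumptions, not Kimi’s actual architecture; the point is only that every layer’s output is added back with a fixed weight of 1.

```python
import numpy as np

rng = np.random.default_rng(0)

def layer(x, W):
    # Toy "layer": linear map + nonlinearity, a stand-in for an attention/MLP block.
    return np.tanh(x @ W)

d, depth = 8, 6
x = rng.normal(size=d)
weights = [rng.normal(scale=0.3, size=(d, d)) for _ in range(depth)]

# Standard residual stream: each layer's output is simply added back,
# so early-layer signal accumulates unchanged as depth grows.
h = x
for W in weights:
    h = h + layer(h, W)
```

With many more layers, the sum over all earlier contributions grows while each new `layer(h, W)` term stays the same size, which is the imbalance the article points to.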
Kimi’s approach extends the attention mechanism from operating “between tokens” to operating “between model layers.” With Attention Residuals, each layer no longer receives a uniform sum of all past layers’ information; instead, it uses attention to select which layers matter most. In other words, the model no longer just keeps accumulating information but actively chooses useful information based on the current input.
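The idea can be sketched as follows: keep a history of layer outputs, and before each layer, mix that history with softmax attention weights instead of a plain sum. This is a minimal sketch of the general idea only; the query projection `Wq`, the scoring rule, and the toy `layer` are assumptions for illustration, not the report’s exact formulation.

```python
import numpy as np

rng = np.random.default_rng(0)
d, depth = 8, 6

def layer(x, W):
    return np.tanh(x @ W)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

weights = [rng.normal(scale=0.3, size=(d, d)) for _ in range(depth)]
Wq = rng.normal(scale=0.3, size=(d, d))  # hypothetical query projection

history = [rng.normal(size=d)]  # h_0: the input embedding
for W in weights:
    # Score every past layer output against a query from the current state,
    # then combine them with softmax weights rather than an unweighted sum.
    q = history[-1] @ Wq
    scores = np.array([q @ h for h in history]) / np.sqrt(d)
    alpha = softmax(scores)
    mixed = sum(a * h for a, h in zip(alpha, history))
    history.append(mixed + layer(mixed, W))
```

Each layer thus “selects” from all earlier layers, but at the cost of keeping every past output in memory, which motivates the block variant below.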
Kimi improves efficiency by roughly 1.25× without increasing inference latency.
However, having every layer attend over all earlier layers would be too costly. Kimi therefore proposes a compromise called Block Attention Residuals: the model is divided into several blocks, the original summation is kept within each block, and attention selects between blocks. This retains the ability to “select information” while sharply reducing memory and compute overhead, and it can be applied directly to existing models.
According to the reported results, the modification adds almost no inference latency (under 2%) on a large model, yet yields roughly a 1.25× efficiency improvement along with gains on multiple benchmarks. This suggests the change is not only theoretically appealing but also practically useful. Attention previously handled the “relationships between words”; Kimi lets the model also weigh “which information should be reused across layers.”
Put simply, the model not only reads data but also learns to go back and retrieve what it has already computed.
The article “Praised by Musk: ‘Impressive!’ What is the secret weapon of China’s AI model Kimi?” first appeared on Chain News ABMedia.