Pressing the Accelerator on AI Agents
Author: Wan Chen
DeepSeek-R1's standout writing style, GPT-4o's Ghibli-style image generation, OpenAI o3's geolocation inference from photos...
These phenomenon-level AI products have trended one after another over the past two months. From them you can clearly see that reinforcement learning can finally generalize and that multimodal models are becoming genuinely usable. It also means that 2025 really is the year Agent applications accelerate into deployment.
The team behind Manus, the recently popular AI Agent, revealed that by the end of last year Claude 3.5 Sonnet had reached the level an Agent needs in long-horizon planning and step-by-step problem solving, which was the precondition for Manus's birth.
Now, with the further maturity of deep thinking models and multimodal model capabilities, there will definitely be more Agents capable of handling complex tasks.
Based on this judgment, on April 17, ByteDance's cloud and AI service platform Volcano Engine launched a stronger model for the enterprise market: the Doubao 1.5 Deep Thinking Model, the first public release of the reasoning model behind ByteDance's Doubao App. Alongside it, Volcano Engine released the Doubao Text-to-Image Model 3.0 and an upgraded version of its Visual Understanding Model.
Regarding the model released this time, the president of Volcano Engine, Tan Dai, believes that "the deep thinking model is the foundation for building Agents. The model must have the ability to think, plan, and reflect, and it must support multi-modal capabilities, just like humans possess vision and hearing, so that Agents can better handle complex tasks."
As AI gains end-to-end autonomous decision-making and execution capabilities and moves into core production processes, Volcano Engine has also prepared the architecture and tools that let Agents operate in both the digital and physical worlds: OS Agent solutions and an AI cloud-native inference suite, helping enterprises build and deploy Agent applications faster and at lower cost.
In Tan Dai's view, developing an Agent is like developing a website or an app: a model API alone cannot solve the whole problem, and many AI cloud-native components are needed on top of the cloud. Cloud native used to be defined by core building blocks such as containers and elasticity; now AI cloud native will have analogous key elements. Through continuous thinking, exploration, and rapid iteration on AI cloud native, Volcano Engine aims to become the best infrastructure choice for the AI era, offering the middleware around models, evaluation, monitoring and observability, data processing, security assurance, and related components such as sandboxes.
01 Doubao Deep Thinking Model: Thinking, Searching, and Seeing Like a Human
Since DeepSeek-R1 launched at the beginning of the year, many consumer applications have integrated the R1 reasoning model, with the notable exception of the Doubao App. The "Deep Thinking" mode that appeared in the Doubao App in early March is instead backed by ByteDance's self-developed Doubao deep thinking model.
Now this reasoning model, the Doubao 1.5 Deep Thinking Model, is officially released and can be tried and invoked on the Volcano Ark platform.
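For a concrete sense of what invoking such a model involves, the sketch below assembles a chat-completions-style request body. The endpoint URL and the model ID are illustrative assumptions, not values confirmed by this article; Volcano Ark's actual API documentation should be consulted for the real names.

```python
import json

# Assumed OpenAI-compatible endpoint; placeholder, verify against Ark docs.
ARK_BASE_URL = "https://ark.cn-beijing.volces.com/api/v3"

def build_deep_thinking_request(question: str,
                                model: str = "doubao-1-5-thinking-pro") -> dict:
    """Build the JSON body for a chat-completions call.
    The model ID here is a placeholder, not a confirmed identifier."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": question},
        ],
        # Reasoning models emit long chains of thought, so streaming
        # the response is usually preferable in practice.
        "stream": True,
    }

body = build_deep_thinking_request("Plan a camping-gear list within budget.")
print(json.dumps(body, ensure_ascii=False)[:60])
```

The body would then be POSTed to the chat-completions route with an API key; only the payload construction is shown here.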
With the web-search mode enabled, Doubao solves problems the way a human would: ponder, search, then keep thinking, until the problem is solved.
This is an example in a shopping scenario, where after providing constraints such as budget and size, Doubao recommends a suitable set of camping gear.
For this task, Doubao first broke down the points to consider and planned the information it would need, then identified what was missing and searched the web. It searched in three rounds: first for prices and performance to confirm the budget and requirements were met; then for the children's separate needs; and finally for the weather, pulling in detailed reviews. It kept thinking while it searched until it had gathered all the context needed to decide, and then gave a well-reasoned answer.
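The plan-search-reflect loop described above can be sketched as a small control loop: plan what information is needed, search for one missing piece per round, and stop once nothing is missing. Both the planner and the search tool below are hard-coded stubs for illustration; a real agent would drive them with the model.

```python
# Minimal sketch of a "think while searching" agent loop.
# plan_needed_info and web_search are stubs, not real tools.

def plan_needed_info(task: str) -> set[str]:
    # A real agent would derive this list with the model.
    return {"prices", "children's needs", "weather"}

def web_search(topic: str) -> str:
    # Stub standing in for a real search tool.
    return f"results about {topic}"

def deep_think_loop(task: str, max_rounds: int = 5) -> dict[str, str]:
    needed = plan_needed_info(task)
    context: dict[str, str] = {}
    for _ in range(max_rounds):
        missing = needed - context.keys()
        if not missing:              # reflection step: do we know enough?
            break
        topic = sorted(missing)[0]   # fill one information gap per round
        context[topic] = web_search(topic)
    return context

ctx = deep_think_loop("recommend camping gear within budget")
print(len(ctx))  # three search rounds, mirroring the example above
```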
In addition to searching and thinking simultaneously, the Doubao deep thinking model also possesses visual reasoning capabilities, similar to humans, capable of thinking not only based on text but also based on the visuals it sees.
Take ordering food as an example. With the May Day holiday approaching, travelers abroad no longer need to photograph a menu and upload it to a translation app: the Doubao Deep Thinking Model can help order directly from the image.
In the example below, the Doubao Deep Thinking Model first performed currency conversion to control the budget, then considered the preferences of the elderly and children, while carefully avoiding dishes they are allergic to, and directly provided a menu plan.
Web search, thinking, reasoning, multimodality: the Doubao 1.5 Deep Thinking Model demonstrates comprehensive reasoning ability and can solve markedly more complex problems.
According to the technical report, the Doubao 1.5 Deep Thinking Model performs strongly on professional reasoning tasks: it scores on par with OpenAI o3-mini-high on the AIME 2024 mathematical reasoning test, and its results on programming-competition and scientific-reasoning benchmarks are close to o1. On general tasks such as creative writing and humanities Q&A, the model also generalizes well, making it suitable for a wider range of use cases.
The Doubao deep thinking model also features low latency. Its technical report shows that the model adopts the MoE architecture, with a total of 200B parameters and only 20B active parameters, achieving performance comparable to top models with a smaller number of parameters. Based on efficient algorithms and high-performance inference systems, the Doubao model API service ensures high concurrency while maintaining latency as low as 20 milliseconds.
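The 20B-active-of-200B-total figure reflects how MoE models work: a router scores all experts, but only the top-k run for each token, so only a fraction of the parameters are active at once. The toy sketch below illustrates top-k routing; the expert count, gate scores, and k are made up for illustration and are not Doubao's real configuration.

```python
import math

def top_k_route(scores: list[float], k: int) -> list[int]:
    """Return the indices of the k highest-scoring experts."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]

num_experts, k = 16, 2            # illustrative, not Doubao's actual config
router_scores = [math.sin(i * 1.7) for i in range(num_experts)]  # fake logits
active = top_k_route(router_scores, k)

# If 200B parameters were split evenly across the experts, activating
# k of them touches only a fraction of the total.
params_per_expert_b = 200 / num_experts
active_params_b = k * params_per_expert_b
print(active, active_params_b)    # only 2 of 16 experts fire per token
```

The same principle at Doubao's reported scale (20B of 200B) means roughly 10% of parameters compute per token, which is how a sparse model matches dense top-tier models at much lower inference cost.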
At the same time, it also has multimodal capabilities, allowing it to apply deep thinking models to a variety of scenarios. For example, it can understand complex enterprise project management flowcharts, quickly locate key information, and respond to customer inquiries strictly according to the flowchart with its powerful instruction-following ability. When analyzing aerial images, it can assess the feasibility of regional development by combining geographic features.
Beyond the reasoning model, the Doubao model family also shipped updates to two other models. For text-to-image, Doubao released the 3.0 upgrade, which delivers better typography, more photorealistic image generation, and 2K high-definition output.
The new model not only better addresses the generation challenges of small characters and long texts but also improves image layout. For example, the two posters generated on the far left, "Manifestation" and "Harvest Plan," have finely generated details and a more natural layout, making them ready for immediate use.
Another upgrade is the Doubao 1.5 visual understanding model. The new version has two key updates: more accurate visual positioning and smarter understanding of videos.
In terms of visual positioning, the Doubao 1.5 visual understanding model supports frame positioning and point positioning for multiple targets, small targets, and general targets. It also supports positioning counting, describing positioning content, and 3D positioning. The enhancement of visual positioning capabilities allows the model to further expand application scenarios, such as inspection scenarios for offline stores, GUI agents, robot training, and autonomous driving training.
The model has also made significant improvements in video understanding capabilities, such as memory ability, summary comprehension ability, speed perception ability, and understanding of long videos. Enterprises can create more interesting commercial applications based on video understanding. For example, in a home setting, we can leverage video understanding capabilities along with vector search to perform semantic searches on home surveillance videos.
For example, in the case below, a cat owner wants to know about the cat's daily activities. Now, by directly searching "What did the kitten do at home today?" it can quickly return semantically relevant video clips for the user to view.
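Semantic search of this kind typically embeds each video clip and the user's query into the same vector space, then ranks clips by cosine similarity. A minimal sketch follows; the hand-made three-dimensional vectors stand in for a real multimodal embedding model, and the file names are invented for illustration.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Pretend clip embeddings along axes (cat-ness, motion, kitchen-ness).
clips = {
    "cat_napping_on_sofa.mp4":   [0.9, 0.1, 0.0],
    "cat_knocking_over_cup.mp4": [0.8, 0.9, 0.3],
    "empty_kitchen.mp4":         [0.0, 0.1, 0.9],
}
# Embedding of the query "What did the kitten do at home today?"
query = [1.0, 0.5, 0.0]

ranked = sorted(clips, key=lambda name: cosine(query, clips[name]), reverse=True)
print(ranked[0])
```

In production the exhaustive scan would be replaced by an approximate nearest-neighbor index, but the ranking principle is the same.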
With reasoning models that have visual understanding and a greater capacity for reasoning, many things that were previously impossible can now be achieved, unlocking more scenarios. For example, cameras with such capabilities are sure to become more popular, and there will also be new development opportunities for AI glasses, AI toys, smart cameras, smart locks, and so on.
02 The Cloud Enters the Agentic AI Era
In recent days, OpenAI researcher Yao Shunyu (a core author of Deep Research and Operator) argued in his essay "The Second Half of AI" that reinforcement learning has finally found a path to generalization: it works not only in narrow domains, as when AlphaGo defeated human players, but can approach human-competitive levels in software engineering, creative writing, IMO-level mathematics, mouse-and-keyboard operation, and many other areas. In that world, grinding for higher scores on ever harder leaderboards becomes easier, but as a way of evaluating progress it is already outdated.
The competition now is the ability to define problems. In other words, what problems does AI need to solve in real life?
In 2025, the answer is the productivity Agent. AI applications are rapidly entering the era of Agentic AI, in which AI can gradually complete whole tasks that demand more expertise and take longer to finish. Against this backdrop, Volcano Engine has built a series of infrastructure pieces that let enterprises "define their own general Agent."
The most important aspect is the model, which is capable of autonomous planning, reflection, end-to-end decision-making, and execution, moving towards the core production processes. At the same time, it also requires multimodal reasoning abilities, allowing it to complete tasks in the real world through the collaboration of ears, mouth, and eyes.
Beyond the model, the Infra technology stack also needs to continuously evolve. For example, as the MoE architecture demonstrates more efficient advantages, it gradually becomes the mainstream architecture for models. Consequently, scheduling adapted to the MoE model requires more complex and flexible cloud computing architectures and tools.
For the enterprise general-purpose Agent scenario, Volcano Engine has launched a better set of architecture and tools, the OS Agent solution, which lets large models operate in the digital and physical worlds. An Agent can, for example, drive a browser to find product pages and compare iPhone prices, or even edit video and add music in Jianying on a remote computer.
Currently, the Volcano Engine OS Agent solution includes the Doubao UI-TARS model along with veFaaS function services, cloud servers, cloud phones, and other products, enabling Agents that operate code, browsers, computers, and mobile phones. The Doubao UI-TARS model integrates screen visual understanding, logical reasoning, interface-element localization, and operation, breaking through the preset-rule limitations of traditional automation tools and providing a model foundation for Agent interaction that is closer to how humans operate.
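A GUI agent of this kind runs a perceive-reason-act loop: look at the screen, decide the next action, execute it, and repeat until the goal is met. The skeleton below illustrates that loop; every function is a hard-coded stub (the screen states and actions are invented), whereas a real system would call the model for each decision and drive an actual browser or device.

```python
# Skeletal perceive-reason-act loop for a GUI agent. All stubs.

def capture_screen(state: dict) -> str:
    return state["screen"]          # stands in for a real screenshot

def decide_action(observation: str, goal: str) -> str:
    # A real agent would query the model; this stub uses a lookup table.
    plan = {
        "home":    "open_browser",
        "browser": "search_product",
        "results": "done",
    }
    return plan[observation]

def execute(action: str, state: dict) -> None:
    # Simulated UI transitions triggered by each action.
    transitions = {"open_browser": "browser", "search_product": "results"}
    state["screen"] = transitions.get(action, state["screen"])

def run_agent(goal: str, max_steps: int = 10) -> list[str]:
    state = {"screen": "home"}
    trace = []
    for _ in range(max_steps):
        action = decide_action(capture_screen(state), goal)
        trace.append(action)
        if action == "done":
            break
        execute(action, state)
    return trace

trace = run_agent("compare iPhone prices")
print(trace)
```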
In the general-purpose Agent scenario, the Volcano Engine enables companies, individuals, or specific fields to define and explore Agents as needed through this OS Agent solution.
For vertical Agents, Volcano Engine will build on its own areas of strength, such as the previously launched AI programming assistant Trae and the data product Data Agent, the latter of which maximizes data processing capability by building a data flywheel.
On the other hand, with the penetration of Agents, there will also be a greater consumption of model inference. In the face of large-scale inference demands, Volcano Engine has specially created the AI Cloud-Native ServingKit inference suite, allowing for faster model deployment and lower inference costs, with GPU consumption reduced by 80% compared to traditional solutions.
In Tan Dai's view, to meet the demands of the AI era, Volcano Engine will keep focusing on three things: continuously improving models to stay competitive; continuously driving down cost and latency while raising throughput; and making products easier to adopt, from developer tools such as Coze and HiAgent to cloud-native components such as OS Agent. With product and technology leadership, market share follows: IDC's report "Analysis of the Market Landscape of Public Cloud Large Model Services in China, 1Q25" showed Volcano Engine ranking first with a 46.4% share.
Last December, the Doubao large model averaged 4 trillion token calls per day. By the end of March this year, that figure exceeded 12.7 trillion, growth of more than 106x in under a year since the model's initial release. As deep thinking models and visual reasoning mature further and AI cloud infrastructure improves, Agents will drive token volumes higher still.