NLP Agent
AI Product Manager, Integrated Solution, knows a bit about software, hardware, AI, blockchain, and VR/AR, but can't code.
2025-03-06 11:46 Shanghai
Last night, I came across Manus, which left me in awe. It has been a long time since a product has awakened my curiosity and sense of wonder.
Seeing this product and its use cases, I felt a sense of familiarity. I will try to explain the technical implementation and principles behind Manus in a way that is easy to understand.
Manus is an Agent product. Its official slogan is "Leave it to Manus": the promise is that it handles tasks automatically, without human intervention or correction, and delivers finished results, whether as animations, visual charts, detailed research PPTs, or other outputs. It is a general AI agent that connects thoughts and actions: it not only thinks but also delivers results. Manus is pitched as excelling at all kinds of tasks in work and life, completing them while you rest.
Manus, derived from the Latin word for "hand," is a general AI agent that can transform your ideas into actions.
I'm sure everyone has heard the term "Agent" countless times. Sometimes it's unavoidable. If someone talks about AI and keeps mentioning "Agent" without knowing much else, I would judge them as a blowhard, all show and no substance.
What is Agent Base?
In Manus' architecture, the most obvious feature is the use of the Agent framework. When it receives a user's task, it doesn't execute it directly or act impulsively like "System 1" (fast, intuitive thinking). Instead, it uses "System 2" (slow, deliberate thinking) to reason and plan. It breaks the user's task into a series of subtasks and writes them down as a step-by-step plan. How many steps are needed? What should each step do? Which tools or information sources should be called to execute each subtask? Finally, it combines the results of the subtasks into an impressive visual output for the user.
Here's an old diagram from 2023:
Each step the Agent takes is deliberate and planned. The model autonomously decomposes, plans, executes, and reflects on tasks, much as a person would. The standard operating procedures (SOPs) that people follow are probably what the model treats as human know-how.
For more details, you can refer to previous articles about Agents.
Agent Base in Manus
As seen in the diagram above, after receiving a user's request, the first step is often decomposition and task planning. In this use case, Manus breaks down the user's demand into dozens of subtasks and completes them step by step. It can call various tools. For example, in understanding the concept of conservation of kinetic energy, it calls the search engine tool. Each subtask is marked with an [X] upon completion.
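The plan-then-execute pattern described above can be sketched in a few lines of Python. This is only an illustration of the idea, not how Manus is actually built: the `Subtask` class, the `fake_search` and `fake_write_file` tools, and the hard-coded plan are all hypothetical stand-ins. In a real Agent, the plan would come from the model and the tools would call real services.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Subtask:
    description: str
    tool: str          # name of the tool to call (hypothetical)
    argument: str      # input passed to the tool
    done: bool = False
    result: str = ""


def fake_search(query: str) -> str:
    # Stand-in for a real search engine call.
    return f"search results for '{query}'"


def fake_write_file(text: str) -> str:
    # Stand-in for a file-writing tool.
    return f"wrote {len(text)} characters"


TOOLS: Dict[str, Callable[[str], str]] = {
    "search": fake_search,
    "write_file": fake_write_file,
}


def run_plan(plan: List[Subtask]) -> None:
    # Execute subtasks in order and tick each one off when it finishes.
    for task in plan:
        task.result = TOOLS[task.tool](task.argument)
        task.done = True


def render_checklist(plan: List[Subtask]) -> str:
    # "[X]" marks a completed subtask, "[ ]" a pending one.
    return "\n".join(
        f"[{'X' if t.done else ' '}] {t.description}" for t in plan
    )


plan = [
    Subtask("Look up conservation of kinetic energy", "search",
            "conservation of kinetic energy"),
    Subtask("Draft the animation script", "write_file",
            "scene 1: two carts collide"),
]
run_plan(plan)
print(render_checklist(plan))
```

Keeping the plan as plain data makes it easy to show progress to the user, which is what the [X] checklist in the Manus interface appears to do.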
The MCP protocol is relatively new. Anthropic released it in November 2024, about a month after it introduced "computer use" alongside the upgraded Claude 3.5 Sonnet. Computer use is an RPA+AI visual tool that manipulates on-screen elements by recognizing page information through multimodal models.
What is the MCP Protocol?
The Model Context Protocol (MCP) is an open standard that helps AI applications, especially large language models (LLMs), connect with external data sources and tools.
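To make "connect with external tools" a little more concrete: MCP messages are JSON-RPC 2.0, and a client asks a server to run a tool with a `tools/call` request. The sketch below builds such a request and answers it with a toy in-process handler. It does not use the official SDK, the `web_search` tool is hypothetical, and the message shapes reflect my reading of the specification, so check the official spec before relying on them.

```python
import json
from typing import Any, Callable, Dict


def make_tool_call(request_id: int, tool_name: str,
                   arguments: Dict[str, Any]) -> str:
    # JSON-RPC 2.0 request asking a server to run one tool.
    return json.dumps({
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    })


def web_search(query: str) -> str:
    # Hypothetical tool; a real server would query a search engine.
    return f"top results for {query}"


TOOLS: Dict[str, Callable[..., str]] = {"web_search": web_search}


def handle_request(raw: str) -> str:
    # Toy in-process "server": dispatch tools/call to a local function.
    request = json.loads(raw)
    if request.get("method") != "tools/call":
        return json.dumps({
            "jsonrpc": "2.0",
            "id": request.get("id"),
            "error": {"code": -32601, "message": "Method not found"},
        })
    params = request["params"]
    text = TOOLS[params["name"]](**params["arguments"])
    return json.dumps({
        "jsonrpc": "2.0",
        "id": request["id"],
        "result": {"content": [{"type": "text", "text": text}]},
    })


request = make_tool_call(1, "web_search", {"query": "conservation of momentum"})
response = json.loads(handle_request(request))
print(response["result"]["content"][0]["text"])
```

The point of the standard is that the model side only needs to know this one message format, while each tool lives behind its own server.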
You might think, "Isn't this just an Agent?" Yes, you are right. Essentially, it is an Agent. However, MCP has higher operational permissions than a regular Agent.
Typically, an Agent runs inside a web page in a browser. It can only call APIs over the network or through locally exposed interfaces, such as Google Search or Yahoo Finance, or local crawlers like Crawl4AI. Such an Agent is confined to your browser. It cannot take control of your computer, obtain administrative privileges, or operate your machine. With MCP, things are different.
MCP usually requires you to package the Agent as a client installed on your computer, similar to how you have Feishu or WeChat clients on your computer. With this client form of Agent empowered by MCP, it can take over your computer, gaining the highest level of access. It can manipulate your command-line tools and browsers. In essence, an Agent client with MCP becomes you.
MCP in Cursor
In AI programming tools like Cursor and Windsurf, MCP has significant applications. For example, when you write a bunch of code in Cursor to create a web product, the usual process is to preview your code in VSCode, open the browser to check it, encounter a bug, copy the error message, return to Cursor, paste the error message, get the reason, and then modify it. You repeat this process until the bug is fixed. It's a cumbersome workflow. However, with Cursor empowered by MCP, you no longer need to do this repetitive work. It can automatically preview your code, identify bugs, modify them, and debug until the product is error-free.
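The run, read the error, fix, and retry loop described above can be sketched as follows. This is a simplified illustration under my own assumptions, not Cursor's implementation: `propose_fix` is a hypothetical stand-in for the model and only knows how to patch one typo, and the "preview" step is reduced to executing a snippet of Python.

```python
import traceback
from typing import Callable, Optional, Tuple


def run_source(source: str) -> Optional[str]:
    # Execute the code and return the error text, or None if it ran cleanly.
    try:
        exec(compile(source, "<generated>", "exec"), {})
        return None
    except Exception:
        return traceback.format_exc()


def propose_fix(source: str, error: str) -> str:
    # Hypothetical stand-in for the model: patches one known typo.
    if "NameError" in error and "totl" in error:
        return source.replace("totl", "total")
    return source


def debug_loop(source: str, fixer: Callable[[str, str], str],
               max_attempts: int = 3) -> Tuple[str, bool, int]:
    # Run, read the error, patch, and retry until clean or out of attempts.
    for attempt in range(1, max_attempts + 1):
        error = run_source(source)
        if error is None:
            return source, True, attempt
        source = fixer(source, error)
    return source, run_source(source) is None, max_attempts


buggy = "total = 1 + 2\nresult = totl * 2\n"
fixed, ok, attempts = debug_loop(buggy, propose_fix)
print(ok, attempts)
```

The `max_attempts` cap matters in practice: without it, a fixer that cannot solve the bug would loop forever.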
MCP in Manus
Let's get back to Manus. In its use case of "teaching animations and demonstrations on the conservation of momentum," I noticed a telling detail.
In the images, we can see that Manus is manipulating the browser, scrolling through the webpage, locating elements, and clicking buttons. It even uses command-line tools to execute commands, create files, and edit files. The right side of the image mentions that it is using a terminal. It is important to note that it is not manipulating your own computer's browser and command-line tools but rather those in Manus' virtual machine. You can simply think of it as having its own computer, where it uses its browser and command-line tools to fulfill your requests.
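The idea of an Agent having "its own computer" can be illustrated with a small sketch: the Agent creates files and runs commands inside a workspace that is separate from the user's files. Note the assumptions here. A temporary directory is only a stand-in for a real virtual machine and provides no actual isolation, and the `Workspace` class and its methods are hypothetical names of my own.

```python
import subprocess
import sys
import tempfile
from pathlib import Path
from typing import List


class Workspace:
    """A scratch directory standing in for the agent's own machine."""

    def __init__(self) -> None:
        self._tmp = tempfile.TemporaryDirectory()
        self.root = Path(self._tmp.name)

    def write_file(self, name: str, text: str) -> None:
        # "Create file" / "edit file" tool: write text into the workspace.
        (self.root / name).write_text(text, encoding="utf-8")

    def run(self, args: List[str], timeout: float = 30.0) -> str:
        # "Terminal" tool: run a command with the workspace as its cwd.
        completed = subprocess.run(
            args, cwd=self.root, capture_output=True, text=True,
            timeout=timeout,
        )
        return completed.stdout + completed.stderr

    def close(self) -> None:
        # Throw the whole workspace away when the task is finished.
        self._tmp.cleanup()


ws = Workspace()
ws.write_file("hello.py", "print('hello from the sandbox')\n")
output = ws.run([sys.executable, "hello.py"])
print(output.strip())
ws.close()
```

Everything the Agent writes lives under `ws.root` and disappears on `close()`, which is the property that matters for the user: nothing on their own machine is touched.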
To summarize, how does Manus automate the manipulation of browsers and command-line tools on its own computer?
"MCP"
To users, Manus appears as a web page in a browser. However, behind the scenes, Manus, as an Agent, is working diligently on its own computer, empowered by the MCP protocol. Through the screen, you might almost see Manus typing code line by line for you.
In my view, Deep Research can be described in one sentence: it is an AI-based writing tool.
What is Deep Research?
Deep Research is essentially a combination of two core components.
The first is Deep Search, which is different from ordinary AI search. During the search process, Deep Search continuously evaluates whether the retrieved content matches the user's needs. If not, it reorganizes the search keywords and continues searching, layer by layer and repeatedly, until it finds the matching content. Of course, we usually set a fixed number of search rounds or token consumption limits to avoid endless searching. Otherwise, not only would the costs skyrocket, but the user experience would also be terrible.
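The Deep Search loop described above, including the two stop conditions, can be sketched like this. It is a minimal illustration under my own assumptions: the `search`, `is_relevant`, and `rewrite_query` callables are hypothetical placeholders for a search engine and model calls, and tokens are approximated by counting words.

```python
from typing import Callable, Dict, List, Tuple


def deep_search(
    question: str,
    search: Callable[[str], List[str]],
    is_relevant: Callable[[str, str], bool],
    rewrite_query: Callable[[str, int], str],
    max_rounds: int = 3,
    token_budget: int = 200,
) -> Tuple[List[str], int]:
    # Search, judge the hits, and rewrite the query until something matches
    # or a limit (rounds or tokens) is reached.
    query, tokens_used = question, 0
    for round_no in range(1, max_rounds + 1):
        hits = search(query)
        tokens_used += sum(len(h.split()) for h in hits)  # crude token count
        matches = [h for h in hits if is_relevant(question, h)]
        if matches or tokens_used >= token_budget:
            return matches, round_no
        query = rewrite_query(question, round_no)
    return [], max_rounds


# Toy corpus and helpers, purely for demonstration.
CORPUS: Dict[str, List[str]] = {
    "manus": ["Manus is the Latin word for hand"],
    "manus agent architecture": ["Manus plans subtasks and then calls tools"],
}


def toy_search(query: str) -> List[str]:
    return CORPUS.get(query.lower(), [])


def toy_relevant(question: str, hit: str) -> bool:
    return "tools" in hit


def toy_rewrite(question: str, round_no: int) -> str:
    return question + " agent architecture"


matches, rounds = deep_search("Manus", toy_search, toy_relevant, toy_rewrite)
print(rounds, matches)
```

In this toy run the first query returns an irrelevant hit, the query is rewritten, and the second round finds a match. With a search that never returns anything useful, the loop gives up after `max_rounds` instead of searching endlessly.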