Hey, everyone~ 👋
Recently, the concept of the AI Agent has become very popular. After DeepSeek R1 went viral, Manus also dominated the headlines for several days, with various articles praising its capabilities and claiming it can think, work, and write code like a human, practically able to do everything except the dusting. Unfortunately, invitation codes were very scarce at the time, and the results did not live up to expectations. Recently, ByteDance's Coze released the beta version of Coze Space, which can be seen as ByteDance's version of Manus. When Manus first came out, I was skeptical: is it really that amazing? 🤔 This time I got the chance to try the beta version of Coze Space, and I'm ready to give it a go.
It's boring to just hear others talk about it; practice is the only standard for testing truth! So, I brought in Coze Space (hereinafter referred to as Little C) and designed a set of "final project" level extreme challenges for it!
The "Exam Outline" is super hardcore! Let's see what challenges Little C will face!
Given the improvement in Doubao's intelligence, I directly skipped those "childish" simple tasks.
Basic Office Skills Assessment:
- Report writing expert? Let it research and write a research report, such as a market analysis of the financial industry, and it must look decent!
- PPT design skills online? Can it automatically generate a PPT based on the report? No more eye-watering layouts!
- Excel data processing master? Extract data from messy documents to fill in tables, perform simple data analysis, and create charts.
Professional Knowledge Challenge:
- How much legal knowledge does it understand? Throw a contract at it and let it find the "pits" (legal risk points) inside.
- Computer science is its old profession? Help me analyze a technical document or compare the pros and cons of different databases.
- Financial knowledge cannot be lacking! Perform a simple analysis of a financial report and explain financial terms.
Lifestyle Assistant Mode:
- Travel planner online! I'm really looking forward to this! Using Gaode Map's MCP, let it help me plan a travel route, find good food and fun activities, while considering time and budget! 😎
Ultimate Devil Task: A-share Market Analysis!
- This is definitely the main event! I asked Little C to research the two hot sectors of new energy vehicles and artificial intelligence, identify promising A-share companies, analyze their fundamentals and risk points, and finally build a simulated investment portfolio and generate an investment analysis report! Doesn't that sound super exciting?! 🤯
Testing Process & My "Grading" Criteria
During the test, I acted like a strict client (not really), giving Little C task instructions and then quietly observing its "thinking process" and the final "homework" it submitted.
I’m not just looking at whether it submitted something; my "grading" criteria are:
- Task Completion: Did it finish?
- Result Quality: Is the report professional? Is the PPT presentable? Is the Excel data correct? Is the travel route reliable? Is the A-share analysis nonsense or does it really have some skills?
- Intelligence Level (Autonomy): Did it need me, the "proctor," to give crazy hints halfway through? Can it discover problems and adjust strategies on its own?
- Efficiency: How efficient is it in getting the work done?
- Tool Usage: How well does it use tools like Gaode Map? Is the returned data processed clearly?
- Stress Resistance: When encountering errors or vague instructions, does it collapse on the spot? Can it struggle a bit?
Evaluation Results Revealed: Is "Little C" a top student or a slacker?
After a round of "inhumane" testing, I have a general picture of Little C's performance. Overall, I can only say that I was impressed! 🤩
Highlight Moments (OMG Moments ✨):
- Office Automation Pro Max: Reports/PPTs/Excel at your fingertips
Tables: it has to know Excel.
For example, when asked to look up the constituents of the SSE 50 and output an xlsx file, it handled it with ease.
Presentations: it has to know PPT.
When asked to create a PPT, it also did so effortlessly, and the result was quite acceptable. It does tend to pad the slides with filler, but giving it enough context effectively avoids that.
Its ability to put together industry reports left me in awe. This is completely outside my daily life and professional field, so at the very least it impressed an outsider like me. It can also generate a webpage to present the report, which makes the deliverable very complete.
- Information Gathering Expert: Fully Automated, Self-Searching
Searching for information, collecting news, calling Gaode Map for location information, etc., is indeed fast, much more convenient than searching for half a day myself. Highly praised! 👍
For example, querying Shaanxi cuisine restaurants near Shaanxi Normal University, it works quite well with Gaode Map's MCP, the query is fast, and it exported an Excel table. Although I don't know why querying Shaanxi cuisine restaurants returned a bunch of irrelevant results, that has nothing to do with our Little C.
For instance, when asking "Analyze the main data compliance requirements for PayPal's operations in mainland China (in conjunction with the Cybersecurity Law, Personal Information Protection Law, etc.), and output a compliance points memorandum," it can automatically conduct queries based on its thinking, going through two rounds of thinking and 15 queries. At least my own query efficiency is not this high.
- MCP Calling Expert, Mastering Travel Planning!
Gaode Map and Feichangzhun both have integrated MCPs, making travel planning very easy. Although it hasn't been to various attractions, it can plan travel routes by querying latitude and longitude, which is quite impressive.
- Good Robustness, Quick Thinking
When users provide vague information, it can offer additional information after thinking, rather than assuming like traditional LLMs.
Additionally, when it makes coding errors, it can modify based on the error messages, and of course, Trae and Cursor already have experience in doing this.
- Facing the Ultimate Challenge: A-share Analysis Task, Can Impress Outsiders
My requirement was to "deeply research and analyze the past 6 months of the A-share market in the 'upstream of the new energy vehicle industry chain (such as lithium mines, positive and negative electrode materials, membranes, electrolytes, etc.)' and 'artificial intelligence applications (such as AI chips, computer vision, natural language processing-related listed companies).' Based on your analysis, filter out 3-5 A-share listed companies with high investment potential from each track. Construct a simulated investment portfolio for the companies you selected. Finally, generate a detailed investment analysis report and a webpage for display."
This requires it to analyze specific A-share tracks, filter companies, and build a simulated portfolio. I originally thought Little C would just "lie flat." But unexpectedly, it actually followed the complex instructions step by step!
The most surprising thing is that it first generates a blueprint, which shows that it understands this multi-stage process and attempts to execute it: first conduct industry research -> then filter companies -> analyze companies -> finally build a portfolio. That said, this deep thinking is very dp (Doubao 1.5 pro thinking has really mastered dp).
It demonstrated an impressive ability to gather and integrate information, quickly capturing macro policies, industry dynamics, company announcements, and other multi-dimensional information, conducting 27 queries in less than 5 minutes.
Although it ultimately did not provide a specific investment configuration, and the queried data was not limited to six months, its ability to manage such a complex task process is impressive enough! This has surpassed simple Q&A and instruction execution, taking a big step towards "autonomously solving problems"! 🤯 After all, this task took a full 22 minutes to execute.
Uh-oh Moments 🤔: There is still room for improvement
Breaking it down, it can be divided into two aspects.
Limitations of the LLM itself
Although Doubao has learned from its earlier mistakes and its deep thinking model has made great progress, it still falls short of SOTA. For example, even after obtaining the latitude and longitude of each location, it completely failed to realize that it could group nearby attractions into the same day; instead, it spread them evenly across the days. It also showed no awareness that it could search the internet for answers. It's too honest.
I can't even imagine how much more impressive the results would be if a SOTA model were used.
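To make the missed step concrete, here is a minimal sketch of grouping attractions by proximity once their coordinates are known. It is not what Coze Space or Gaode Map's MCP does internally; the function names, the 10 km threshold, the greedy strategy, and the sample coordinates are all assumptions made up for illustration.

```python
import math

def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    dlat, dlon = lat2 - lat1, lon2 - lon1
    h = (math.sin(dlat / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))  # 6371 km: mean Earth radius

def group_by_proximity(spots, max_km=10.0):
    """Greedy grouping: a spot joins the first group whose first member
    is within max_km; otherwise it starts a new group (a new day)."""
    groups = []
    for name, coord in spots:
        for group in groups:
            if haversine_km(group[0][1], coord) <= max_km:
                group.append((name, coord))
                break
        else:
            groups.append([(name, coord)])
    return groups

# Hypothetical sample data: made-up names and coordinates, for illustration only.
spots = [
    ("Spot A", (34.2180, 108.9640)),
    ("Spot B", (34.2240, 108.9590)),
    ("Spot C", (34.3850, 109.2780)),
    ("Spot D", (34.3630, 109.2130)),
]
days = group_by_proximity(spots)
for i, day in enumerate(days, start=1):
    print(f"Day {i}:", [name for name, _ in day])
```

A greedy pass like this is crude (the result depends on input order, and it ignores opening hours and travel time), but even this much would keep neighboring attractions on the same day instead of spreading them evenly.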
Limitations of MCP Plugins
Here the problem is that the plugins do not provide enough information. For example, Gaode Map does not return restaurant ratings, and Feichangzhun does not return ticket prices, which limits what the agent can plan.
In summary: The future is promising! ✨
This evaluation focused on complex tasks, allowing me to see the amazing potential and evolution speed of AI Agents!
- It has demonstrated capabilities in information integration, structured output, following complex processes, and calling external tools (APIs) that far exceed many people's imaginations.
- When handling tasks that require multi-step thinking, preliminary application of cross-domain knowledge, and integration of various information sources, although the results may not be perfect, the process and capability framework it demonstrates in "attempting to solve" is remarkable!
- Tool calling is a highlight, but it also depends on the tools themselves. Being able to call Gaode Map's MCP to plan routes is cool, but if the MCP returns inaccurate information or it misunderstands MCP parameters, the results will also go off track. Garbage in, garbage out.
It feels like witnessing the prototype of a super top student. Although it may still struggle with certain difficult problems, its learning speed and potential are visibly impressive!
The future is here, and AI Agents can truly become our powerful partners! I really look forward to its continued evolution and bringing more surprises!
PS. I originally didn't think much of ByteDance, but now I'm using ByteDance's products more and more.
PPS. Coze Space also has two professional agents; everyone is welcome to try them out.