Since December 2023, I have been working as a corporate mentor in collaboration with Professor Junming Liu from USTC on an AI Agent practical project, with about 80 students from across the country participating. Most of them are undergraduates with only basic programming skills, along with some doctoral and master’s students with a foundation in AI.

In December 2023 and January 2024, we held 6 group meetings to cover the basics of AI Agents, how to use the OpenAI API, and the practical projects themselves, and to answer questions that came up along the way. The practical projects include:

  1. Corporate ERP Assistant
  2. Werewolf
  3. Intelligent Data Collection
  4. Mobile Voice Assistant
  5. Meeting Assistant
  6. Friends Reunion
  7. Who's The Undercover

From February 20 to 24, some of the students in this research project gathered in Beijing for a hackathon and presented the interim results of their projects. Participants generally came away struck by the power of large models, surprised that such complex functionality could be achieved with just a few hundred lines of code. Below are some of the project outcomes:

Corporate ERP Assistant

Project Description

ERP (Enterprise Resource Planning) software is a key system for enterprises. Today it is mostly operated through a GUI, where complex operations require many mouse clicks and become very cumbersome. An AI Agent can convert a user's natural-language query into SQL statements, enabling automated queries.

The requirement is to set up a MySQL/PostgreSQL database containing two tables:

  1. Employee table, including employee ID, name, department name, current level, date of joining, and date of leaving (null indicates current employment).
  2. Salary table, including employee ID, pay date, and salary, with one pay record per month.

The system should be able to automatically answer the following questions:

  1. How long does an employee stay in the company on average?
  2. How many employees are there in each department?
  3. Which department has the highest average level of employees?
  4. How many people joined each department this year and last year?
  5. From March 2020 to May 2021, what was the average salary in department A?
  6. Which department had a higher average salary last year, department A or B?
  7. What is the average salary of employees at each level this year?
  8. In the last month, what was the average salary of employees who have been with the company for less than one year, one to two years, and two to three years?
  9. Who are the top 10 employees with the highest salary increase from last year to this year?
  10. Has there been any case of salary arrears, i.e., an employee was in employment for a month but did not receive a salary?

Note that the task is not to manually convert the above queries into SQL statements, but to use a large model to automatically convert users’ queries into SQL statements (the prompt needs to include the definition of the table structure and the description of each field). The AI Agent executes the SQL statement and returns the result, and then the large model outputs the answer based on the returned result.
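A minimal sketch of this flow is shown below. It assumes the two-table schema above, the openai Python SDK (v1), and a reachable MySQL instance; the prompt wording, column names, and connection parameters are illustrative, not the team's actual code.

```python
# Sketch of the text-to-SQL loop described above (illustrative, not the team's code).
import json
import pymysql
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SCHEMA = """
Table employee(id INT PRIMARY KEY, name VARCHAR(50), department VARCHAR(50),
               level INT, join_date DATE, leave_date DATE NULL)  -- NULL = still employed
Table salary(employee_id INT, pay_date DATE, amount DECIMAL(10,2))  -- one record per month
"""

def question_to_sql(question: str) -> str:
    """Ask the model to translate a natural-language question into one SQL statement."""
    resp = client.chat.completions.create(
        model="gpt-4-turbo",
        response_format={"type": "json_object"},
        messages=[
            {"role": "system",
             "content": "You translate user questions about the database below into MySQL.\n"
                        + SCHEMA +
                        '\nThink step by step, then reply as JSON: {"reasoning": "...", "sql": "..."}.'},
            {"role": "user", "content": question},
        ],
    )
    return json.loads(resp.choices[0].message.content)["sql"]

def answer(question: str) -> str:
    sql = question_to_sql(question)
    # Hypothetical connection parameters for illustration only.
    conn = pymysql.connect(host="localhost", user="erp", password="...", database="erp")
    with conn.cursor() as cur:
        cur.execute(sql)
        rows = cur.fetchall()
    # Let the model phrase the final answer from the raw result set.
    resp = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[{"role": "user",
                   "content": f"Question: {question}\nSQL: {sql}\nResult: {rows}\n"
                              "Answer the question in one sentence."}],
    )
    return resp.choices[0].message.content

print(answer("How many employees are there in each department?"))
```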

Outcome Review

The ERP Assistant project team achieved the highest degree of completion, developing a Java-based web interface that supports all the queries above as well as deleting, modifying, and adding records.

Users can perform CRUD operations and display the results in table form by simply describing their requirements in natural language. For example, saying “Pay February 2024’s salary to employees in department A, the same as last month’s salary” will generate and execute the SQL statement.

By accurately describing the database table structure in the prompt and using the chain-of-thought method with GPT-4 Turbo, SQL generation was 100% accurate across the 17 queries demonstrated (each query was tested 3 times, and all results were correct). In addition to the 10 read-only queries, 7 CRUD queries and 3 read-only queries were added:

Additional CRUD queries supported:

  1. Promote Xiao Ming by one level.
  2. Add a new employee named Zhang San, in the technical department, level 3, joining date is today.
  3. Delete the employee named Xiao Ming and all his salary records.
  4. Pay February 2024’s salary to employees in department A, the same as last month’s salary.
  5. Pay a joining bonus to all employees who joined in 2024, the bonus amount is the employee’s level multiplied by 10,000.
  6. Lay off all employees in department A, i.e., set their leaving date to today.
  7. Pay arrears to all employees who were owed salaries, i.e., if an employee was in employment for a month but did not receive a salary, pay them the last salary they received before the arrears.

Additional read-only queries supported:

  1. Query Xiao Ming’s basic information and all salary records, displayed in two tables.
  2. List the basic information and the most recent salary of all employees in each department, each department displayed in one table.
  3. List the annual statistics for each department, including the department’s name, number of employees, lowest level, highest level, and average salary, with each year displayed in one table, and each row representing a department.

Next steps for improvement:

  1. When deleting, modifying, or adding records, add a confirmation step for the set of records to be modified, letting users review the affected records before the statement is submitted to the database, to avoid irreversible damage caused by an incorrect SQL statement.
  2. The database table structure currently set for the project is relatively simple. The plan is to use an open-source ERP system as a blueprint to implement database operations in more complex scenarios.

Werewolf

Project Description

Werewolf is an interesting LARP (Live Action Role-Playing) game. AI Agents can also play various roles in the Werewolf game, allowing AI Agents to play the Werewolf game with humans. Werewolf tests the AI’s reasoning ability and the ability to hide its true identity.

Requirements:

  1. Develop a Werewolf game service that executes the basic rules of the Werewolf game.
  2. At least the roles of judge, werewolf, villager, witch, and seer should be included, and roles like hunter and policeman can also be included if interested.
  3. There should be one human player in the game, with the rest being AI Agents.
  4. Each AI Agent and the human player participating in the game must follow the game rules, with roles assigned randomly, only able to see the information they are supposed to see, and not able to see information they are not supposed to see.
  5. Agents need to have some basic game skills (which can be specified in the prompt), such as werewolves generally should not reveal their identity, werewolves should not kill themselves in most cases, werewolves should learn to hide their identity, and witches and seers should make good use of their abilities.
  6. Agents need the ability to analyze other players’ statements and infer who the werewolf is, not choosing randomly.

Outcome Review

The Werewolf project team also achieved a high completion rate, implementing a complete Werewolf game with a command-line interface, including various roles and game processes. There is one human player, with the rest being AI players.

The game backend ensures that each role can only see the information they are supposed to see during the game process, not seeing information they are not supposed to see.

In the voting phase, asking the model to output only a player number led to shallow reasoning and effectively random votes. To address this, the project team used the chain-of-thought method: first output a piece of analysis, then output the voting result. The speaking phase likewise follows "think before you speak": first output the thought content, then the spoken content, making the speech more organized and making deception and disguise easier to pull off.
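A sketch of what such an "analyze first, then vote" call might look like, assuming the openai Python SDK; the JSON fields and prompt wording are hypothetical:

```python
# Illustrative "think before you vote" call for a Werewolf agent (not the team's exact code).
import json
from openai import OpenAI

client = OpenAI()

def vote(role: str, alive_players: list[int], game_history: str) -> int:
    resp = client.chat.completions.create(
        model="gpt-4-turbo",
        response_format={"type": "json_object"},  # helps enforce valid JSON output
        messages=[
            {"role": "system",
             "content": f"You are a player in a Werewolf game; your role is {role}. "
                        "First analyze the situation, then vote. "
                        'Reply as JSON: {"analysis": "...", "vote": <player id>}. '
                        f"You must vote for one of these players: {alive_players}. "
                        "If unsure, pick one anyway."},
            {"role": "user", "content": "Game information so far:\n" + game_history},
        ],
    )
    return int(json.loads(resp.choices[0].message.content)["vote"])
```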

Initially, the project team provided GPT-4 with the complete game history. After a few rounds, however, information extraction became inefficient: the useful signals were scattered across a large number of uninformative speeches and votes, and some logical connections between speeches were missed. To give the large model higher-density information, the project team used a method similar to MemGPT, where each AI character keeps its own notebook recording:

  1. The content “thought” during the “think before you speak” process;
  2. Changes in the game state each round (voting results, who died);
  3. After all speeches in a round are finished, summarize the speeches of the round, and in subsequent rounds, only look at the summary, not the specific speech content.

When making decisions, only the content in the notebook is provided, which not only saves input tokens but also better simulates the characteristics of human memory, allowing the large model to better extract information from the historical dialogue of Werewolf.
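A minimal sketch of such a per-agent notebook is below; the summarization prompt and data layout are assumptions for illustration, not the team's implementation:

```python
# Sketch of the MemGPT-like per-agent notebook described above (illustrative).
from openai import OpenAI

client = OpenAI()

class Notebook:
    def __init__(self):
        self.entries: list[str] = []

    def add(self, text: str):
        self.entries.append(text)

    def summarize_round(self, round_no: int, speeches: list[str]):
        """Compress one round of speeches; later rounds read only this summary."""
        resp = client.chat.completions.create(
            model="gpt-4-turbo",
            messages=[{"role": "user",
                       "content": f"Summarize round {round_no} of a Werewolf game in 3 sentences, "
                                  "keeping only clues about who might be a werewolf:\n"
                                  + "\n".join(speeches)}],
        )
        self.add(f"Round {round_no} summary: {resp.choices[0].message.content}")

    def render(self) -> str:
        # This string is what gets passed to the model at decision time.
        return "\n".join(self.entries)
```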

Some technical issues encountered during the project:

  1. Sometimes GPT-4 would say, “I am an AI model, I can’t kill.” The solution is to replace “kill” with “banish” or “remove.”
  2. Sometimes GPT-4 would not output player numbers according to the format. The solution is to require output in JSON format and provide a JSON example.
  3. Sometimes GPT-4 would refuse to vote when it couldn’t decide who to vote for. The solution is to list the selectable player IDs and require outputting one, choosing randomly if unsure.
  4. During the speaking phase, GPT-4 talks too much nonsense. The solution is to tell GPT-4 that everyone’s speaking time is very limited, requiring concise speech.

Next steps for improvement:

  1. Currently, the AI players' strategies are too conservative and neutral; when they cannot reach a judgment, the game is not much fun. Different AI players should be given different personalities and game strategies in the prompt, some more aggressive and some more conservative, to make the game more fun.

Intelligent Data Collection

Project Description

Collecting data is a very troublesome task. For example, if you want to collect information about each professor and student in a laboratory, you need to click into each page one by one, identify the corresponding information, and paste it into a table.

This project aims to build an intelligent data collection system that can automatically crawl all pages on a laboratory’s website after the user provides the URL of the laboratory’s homepage, use a large model to analyze the content of each page, and if the page contains information about professors and students, extract the relevant information.

The information extracted by the large model should include:

  1. Name
  2. Photo (if available, download it, noting that not all images on the web are photos of people)
  3. E-mail (if in the format of example at ustc dot edu dot cn, the large model should automatically convert it to the standard format of example@ustc.edu.cn)
  4. Title (e.g., Professor)
  5. Research area (e.g., Data Center Networks)
  6. Introduction

The system should be able to automatically collect information from most laboratory homepages, not a collection system customized for a specific laboratory homepage. Therefore, the large model is needed to automatically extract information from web content.

Data crawling can be done with scrapy, and web page rendering can be done with selenium.
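A small sketch of the rendering and link-collection step, assuming selenium with Chrome and BeautifulSoup (the URL is hypothetical):

```python
# Sketch: render a lab page with selenium and collect its text and candidate links.
from urllib.parse import urljoin
from bs4 import BeautifulSoup
from selenium import webdriver

def fetch_page(url: str) -> tuple[str, list[str]]:
    """Return the visible text of a page and the absolute URLs it links to."""
    driver = webdriver.Chrome()
    try:
        driver.get(url)          # executes JavaScript, unlike a plain HTTP fetch
        html = driver.page_source
    finally:
        driver.quit()
    soup = BeautifulSoup(html, "html.parser")
    text = soup.get_text(separator="\n", strip=True)
    links = [urljoin(url, a["href"]) for a in soup.find_all("a", href=True)]
    return text, links

# Hypothetical lab directory page.
text, links = fetch_page("https://example-lab.ustc.edu.cn/people.html")
```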

Outcome Review

Intelligent data collection was the most chosen project, with over 30 students selecting this project, divided into several groups. Most project teams implemented the function of automatically collecting personal information of professors and students in the laboratory relatively completely.

The project team's pipeline automatically found all links on the webpage, visited each link, converted the page content into text, and called GPT-4 to determine whether it was a teacher's or student's homepage; if so, GPT-4 output fields such as name and e-mail in JSON format, which was then parsed and stored in the database.
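A sketch of that classify-and-extract step; the JSON fields mirror the list in the project description, while the prompt wording and truncation length are assumptions:

```python
# Sketch: decide whether a page is a personal homepage and, if so, extract fields as JSON.
import json
from openai import OpenAI

client = OpenAI()

def extract_profile(page_text: str) -> dict | None:
    resp = client.chat.completions.create(
        model="gpt-4-turbo",
        response_format={"type": "json_object"},
        messages=[{"role": "user",
                   "content": "Below is the text of a web page. If it is the homepage of a "
                              "professor or student, return JSON with keys name, email, title, "
                              "research_area, and introduction (normalize e-mails written like "
                              "'x at ustc dot edu dot cn' to 'x@ustc.edu.cn'). "
                              'Otherwise return {"is_profile": false}.\n\n' + page_text[:12000]}],
    )
    data = json.loads(resp.choices[0].message.content)
    return None if data.get("is_profile") is False else data
```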

The project team used GPT-4V to analyze images on web pages to determine if they are solo photos, and if so, save them.
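A sketch of the photo check, assuming the GPT-4V API accepts base64-encoded images (the model name and prompt are illustrative):

```python
# Sketch: ask GPT-4V whether an image from the page is a solo portrait photo.
import base64
from openai import OpenAI

client = OpenAI()

def is_solo_photo(image_bytes: bytes) -> bool:
    b64 = base64.b64encode(image_bytes).decode()
    resp = client.chat.completions.create(
        model="gpt-4-vision-preview",
        max_tokens=5,
        messages=[{"role": "user", "content": [
            {"type": "text",
             "text": "Is this a portrait photo of a single person? Answer yes or no."},
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
        ]}],
    )
    return resp.choices[0].message.content.strip().lower().startswith("yes")
```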

In tests on 10 typical homepages of Chinese university professors, the system correctly extracted the professors' basic information in every case.

The project team also encountered issues where GPT-4 did not output in the specified format, causing regular expression parsing to fail. The solution was to output in JSON format.

Next steps for improvement:

  1. Currently, it is assumed that all teachers' links appear on the directory page URL provided by the user. In reality, directory pages may be nested: for example, the homepage may list only full professors, while the pages for associate professors and lecturers are reached through the sidebar. Handling this requires recursing into directory pages and using a large model to judge whether a page still belongs to the same department or laboratory, because a click may also lead to the directory page of a different department or laboratory, whose teachers should not be collected.
  2. Sometimes the homepage may have multiple pages, and the research introduction and name are not on the same page. It is worth considering handling this complex situation where multiple pages of a personal homepage are stitched together.
  3. Some people use images to prevent email scraping by crawlers. In this case, GPT-4V can be considered for recognizing images and extracting emails.
  4. There may be multiple photos on the homepage, and it is necessary to select the photo that is more similar to an ID photo.
  5. Every link on the directory page was visited and judged by the large model, which wastes tokens. The link text could first be screened by the large model to decide which links are plausibly teacher homepages, skipping links that are obviously useless.

Mobile Voice Assistant

Project Description

Have you ever imagined being able to operate your phone directly through voice, interacting with your phone even in situations where it’s inconvenient to use both hands?

This direction is called RPA (Robotic Process Automation). Its product form can be a voice assistant app running in the background. Upon receiving a user's command, it loops through the following steps (a sketch follows the list):

  1. Open the specified app, take a screenshot;
  2. Input the screenshot and the current task execution status text into the vision large model, which specifies what the next step should be, or extracts the required content;
  3. Map the operation output by the large model to the buttons in the screenshot, simulating clicks to perform the corresponding operations (this step is quite challenging, as the vision large model will not provide the X/Y coordinates of the buttons, only text answers, so methods like OCR need to be used to find the corresponding buttons);
  4. Continue with the following steps until the task is completed.
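A bare-bones sketch of this loop, assuming an Android phone reachable over adb, pytesseract for OCR, and GPT-4V for decisions; the action format and prompts are hypothetical:

```python
# Sketch of the screenshot -> decide -> locate -> tap loop above (illustrative only).
import base64, io, json, subprocess
import pytesseract
from PIL import Image
from openai import OpenAI

client = OpenAI()

def screenshot(path: str = "screen.png") -> Image.Image:
    # Hypothetical: grab the Android screen over adb.
    with open(path, "wb") as f:
        subprocess.run(["adb", "exec-out", "screencap", "-p"], stdout=f, check=True)
    return Image.open(path)

def next_step(task: str, done: list[str], img: Image.Image) -> dict:
    """Ask the vision model which visible button text to tap next, as JSON."""
    buf = io.BytesIO(); img.save(buf, format="PNG")
    b64 = base64.b64encode(buf.getvalue()).decode()
    resp = client.chat.completions.create(
        model="gpt-4-vision-preview",
        max_tokens=120,
        messages=[{"role": "user", "content": [
            {"type": "text",
             "text": f"Task: {task}\nSteps already done: {done}\n"
                     'Reply as JSON {"action": "tap" or "done", "button_text": "..."} '
                     "giving the visible text of the element to tap next."},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}},
        ]}],
    )
    text = resp.choices[0].message.content
    return json.loads(text[text.find("{"): text.rfind("}") + 1])  # tolerate ``` fences

def tap_text(img: Image.Image, button_text: str) -> None:
    """Locate the button text with OCR and tap its center via adb."""
    boxes = pytesseract.image_to_data(img, output_type=pytesseract.Output.DICT)
    for i, word in enumerate(boxes["text"]):
        if word and word in button_text:
            x = boxes["left"][i] + boxes["width"][i] // 2
            y = boxes["top"][i] + boxes["height"][i] // 2
            subprocess.run(["adb", "shell", "input", "tap", str(x), str(y)], check=True)
            return

def run(task: str, max_steps: int = 10) -> None:
    done: list[str] = []
    for _ in range(max_steps):
        img = screenshot()
        step = next_step(task, done, img)
        if step.get("action") == "done":
            break
        tap_text(img, step["button_text"])
        done.append(step["button_text"])
```

As the text notes, the hard part is the `tap_text` step: the vision model answers in text, so something like OCR has to map that answer back to screen coordinates.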

As an example, the smart assistant needs to complete the function of sharing a recently bought item from Taobao/JD.com/Pinduoduo to a specified WeChat contact:

  1. The user tells the smart assistant the task “Share the most recent order from Taobao to friend A on WeChat”;
  2. The smart assistant opens the Taobao app;
  3. Finds the order list;
  4. Opens the most recent order;
  5. Copies the share text;
  6. Opens the WeChat app;
  7. Finds friend A in the contact list;
  8. Opens friend A’s chat box;
  9. Pastes the share text and clicks send.

Note that this cannot be done the way a keystroke macro tool does, by clicking fixed screen positions; it must use the RPA approach of automatically extracting information from the interface.

Outcome Review

The mobile voice assistant project team encountered some difficulties in implementing the mobile voice assistant, so they turned to implementing a browser voice assistant.

The project team first implemented relatively accurate voice command recognition using the OpenAI whisper API.

The main difficulty was that GPT-4V could not accurately give the coordinates of specified objects. GPT-4V can easily point out whether a specified icon exists in the interface, but it’s unclear about the exact location of the icon. The project team used a method of dividing the screen into four quadrants for binary search. For example, given a desktop screenshot, ask in which quadrant the Chrome browser is, then extract the image of that quadrant, and further ask in which quadrant, thus achieving the effect of binary search. However, this method is inefficient and prone to errors for icons close to the horizontal and vertical center lines of the screen.
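A sketch of that quadrant-based binary search; the model name, prompt, and stopping size are assumptions:

```python
# Sketch: locate an icon by repeatedly asking GPT-4V which quadrant it is in (illustrative).
import base64, io
from PIL import Image
from openai import OpenAI

client = OpenAI()

def ask_quadrant(img: Image.Image, target: str) -> int:
    buf = io.BytesIO(); img.save(buf, format="PNG")
    b64 = base64.b64encode(buf.getvalue()).decode()
    resp = client.chat.completions.create(
        model="gpt-4-vision-preview",
        max_tokens=5,
        messages=[{"role": "user", "content": [
            {"type": "text",
             "text": f"In which quadrant of this screenshot is the {target}? "
                     "Answer with a single digit: 1=top-left, 2=top-right, "
                     "3=bottom-left, 4=bottom-right."},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}},
        ]}],
    )
    reply = resp.choices[0].message.content
    return int(next(ch for ch in reply if ch in "1234"))

def locate(img: Image.Image, target: str, min_size: int = 200) -> tuple[int, int]:
    """Return an approximate (x, y) of the target by recursive quadrant cropping."""
    x0, y0 = 0, 0
    while img.width > min_size and img.height > min_size:
        q = ask_quadrant(img, target)
        w, h = img.width // 2, img.height // 2
        dx, dy = {1: (0, 0), 2: (w, 0), 3: (0, h), 4: (w, h)}[q]
        img = img.crop((dx, dy, dx + w, dy + h))
        x0, y0 = x0 + dx, y0 + dy
    return x0 + img.width // 2, y0 + img.height // 2
```

As the team found, this costs several model calls per click and breaks down for icons near the center lines, since a single wrong quadrant answer loses the target entirely.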

The project team also tried using OCR methods to recognize icons, which works for desktop icons with app names and buttons with text. But many buttons do not have text, so the OCR method does not work.

The project team plans to deploy their own MiniGPT-4, a multimodal large model that can output the coordinates (bounding box) of specified objects, but this has not yet been completed.

Because GPT-4V cannot output precise coordinates, the browser voice assistant the team built instead supports the following:

  1. Users can ask questions about a web page; the large model extracts the requested information from the page and reads it out through voice. This is similar to the intelligent data collection project.
  2. Users can also click links on web pages using natural language. The large model finds the corresponding link in the markdown converted from the web page, and then simulates clicking that link through selenium (see the sketch after this list). Since the browser voice assistant works on markdown converted from the page's HTML, it also cannot recognize buttons that contain only an icon and no text; fortunately, such buttons are less common on web pages than in mobile apps. And because no vision large model is used, the button's background color, the page layout, and animations are invisible to the assistant.
  3. Users can request to enter text in a specified input box and submit a form. The large model finds the corresponding input box from the markdown converted from the web page and enters the text specified by the user.
  4. Users can also ask the browser to search for specific content or enter a specific website. For example, “Help me check the weather in Beijing,” the large model will use Google to search for “Beijing weather,” and Google will directly return the current weather in Beijing, from which the large model can obtain information about Beijing’s weather and broadcast it through voice.
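A sketch of the "click a link by natural language" path mentioned in item 2, assuming html2text for the HTML-to-markdown conversion and selenium for the click (the URL and prompt are illustrative):

```python
# Sketch: pick a link by natural-language description from the page's markdown, then click it.
import html2text
from openai import OpenAI
from selenium import webdriver
from selenium.webdriver.common.by import By

client = OpenAI()
driver = webdriver.Chrome()

def click_by_description(description: str) -> None:
    # Markdown keeps link text and URLs but drops layout, colors, and icon-only buttons.
    markdown = html2text.html2text(driver.page_source)
    resp = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[{"role": "user",
                   "content": f"Page content (markdown):\n{markdown[:12000]}\n\n"
                              f'The user wants to click: "{description}". '
                              "Reply with only the exact visible text of that link."}],
    )
    link_text = resp.choices[0].message.content.strip()
    driver.find_element(By.LINK_TEXT, link_text).click()

driver.get("https://example.com")        # hypothetical page
click_by_description("the About Me page")
```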

Currently, the project team has not implemented planning for complex tasks. For example, for "Help me find out who Li Bojie is," it will only run a Google search for "Li Bojie"; I then have to tell it to "click the first search result," after which it answers questions based on that page's content. Since my personal homepage is just a list of blog articles without an introduction, I still have to tell it to click "About Me" before it can see my introduction.

It also cannot yet handle complex multi-step tasks like "Help me write an email"; for now, a human has to watch the browser and drive it step by step:

  1. Open Gmail.
  2. Click write an email.
  3. Enter test@example.com as the recipient.
  4. Enter Hello World as the email subject.
  5. Enter "Did you eat today?" as the email body.
  6. Click send.

Meeting Assistant

Project Description

Have you ever experienced being cluelessly called out by your boss during a meeting after zoning out for a bit?

We can create a meeting assistant that leverages the capabilities of large models to provide a lot of help during meetings. Including but not limited to:

  1. Transcribing the meeting’s voice content into text in real-time;
  2. Saving each page of the PPT shared during the meeting as a screenshot;
  3. Summarizing the content discussed during the meeting based on the real-time transcription;
  4. Answering questions posed by users based on the real-time transcription.

This way, people attending the meeting can know what was discussed at any time, without worrying about missing key content.

It can be implemented using screen recording, in either mobile or desktop form.

Note: Recognizing PPTs in meeting recordings can be done using the open-source project Video2PDF.

Outcome Review

The meeting assistant project team implemented speech-to-text for both offline video files and real-time microphone recording, along with meeting summarization and question answering, making a relatively complete meeting assistant.

For speech recognition, the project team initially found that the OpenAI whisper model could not handle overly long audio. To cut long audio into small segments, the team monitored the volume: if the volume stayed below a threshold for 0.5 seconds, the sentence was considered finished, and that segment was sent to the OpenAI whisper API for recognition.

The same approach solved speech recognition for real-time microphone recording: microphone input uses the same 0.5-second volume check to detect pauses, and each detected segment is sent to the whisper model, achieving relatively low recognition latency. The OpenAI whisper API does not support streaming recognition, so achieving even lower latency would require a model that does.
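A sketch of that volume-based segmentation for the microphone case; the sample rate, frame size, and silence threshold are assumptions to illustrate the idea:

```python
# Sketch: cut microphone audio when volume stays low for 0.5 s, then send each chunk to Whisper.
import io
import wave
import numpy as np
import pyaudio
from openai import OpenAI

client = OpenAI()
RATE, FRAME = 16000, 1024            # 16 kHz mono, ~64 ms per frame
SILENCE_RMS, SILENCE_SEC = 500, 0.5  # tune per microphone and room

def transcribe(frames: list[bytes]) -> str:
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1); w.setsampwidth(2); w.setframerate(RATE)
        w.writeframes(b"".join(frames))
    return client.audio.transcriptions.create(
        model="whisper-1", file=("chunk.wav", buf.getvalue())).text

pa = pyaudio.PyAudio()
stream = pa.open(format=pyaudio.paInt16, channels=1, rate=RATE,
                 input=True, frames_per_buffer=FRAME)
frames, quiet = [], 0.0
while True:
    data = stream.read(FRAME)
    frames.append(data)
    samples = np.frombuffer(data, dtype=np.int16).astype(np.float64)
    rms = np.sqrt(np.mean(samples ** 2))
    quiet = quiet + FRAME / RATE if rms < SILENCE_RMS else 0.0
    if quiet >= SILENCE_SEC and len(frames) * FRAME / RATE > 1.0:  # skip near-empty chunks
        print(transcribe(frames))
        frames, quiet = [], 0.0
```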

Transcripts of meeting content may contain errors due to a lack of background knowledge, such as misrecognized technical terms and names transcribed inconsistently across segments. Therefore, before showing results to users, the project team ran them through GPT-4, providing earlier transcription results as background knowledge. This corrects most errors in technical terms and keeps names consistent throughout.
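A sketch of that correction pass, with the prompt wording as an assumption:

```python
# Sketch: post-correct a Whisper transcript segment using earlier segments as context.
from openai import OpenAI

client = OpenAI()

def correct(segment: str, context: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[{"role": "user",
                   "content": "Earlier transcript of this meeting (background knowledge):\n"
                              + context +
                              "\n\nFix likely speech-recognition errors in the new segment below "
                              "(technical terms, names consistent with the context); "
                              "change nothing else:\n" + segment}],
    )
    return resp.choices[0].message.content
```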

The project team implemented the function of summarizing content based on real-time transcription and answering user-specified questions.

Especially commendable is that the project team proposed a “focus on specified characters or topics” function, which is to tell the meeting assistant before or during the meeting which people or topics I want to focus on. When the meeting mentions the specified person or topic, a notification will pop up to remind the user that you have been cued.

Next steps for improvement:

  1. Implement the function of recognizing PPTs in meeting recordings and correct voice recognition results based on the content OCR’d from PPTs.
  2. Implement more real-time voice recognition using streaming voice recognition models.
  3. Use speaker recognition models to distinguish different speakers, so that transcripts of offline discussions can be tagged by speaker.
  4. Currently, the focus on specified characters and topics is implemented using keyword matching. In the future, large models can also be used to recognize topics even when they do not match word for word.

Friends Reunion

Project Description

The six main characters in Friends each have distinct personality traits and stories. What kind of stories would unfold if they were to reunite in the show again?

We will leverage the power of large models to bring the six main characters of Friends into a group chat, letting them chat and continue their interesting stories.

Subtitles or plot summaries of Friends can be downloaded from the internet as prompts; the large model then plays the 6 main characters, letting them chat freely, create new plots, or produce derivative works based on a particular episode's plot.

Ideally, users can also join in at any time to interact with them and ask them some questions.

Note: Obviously, not every character should respond to every message, or the number of messages would explode; nor should the characters simply take turns speaking, which would quickly become boring. The system therefore needs to control when each character speaks and when they stay silent, which can be done with prompts to the large model (a sketch follows).
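One way to implement such control is a "director" call that, given the recent chat, decides which character (if any) speaks next; this is only a sketch of the idea, not an implementation from the project:

```python
# Sketch: a "director" prompt deciding which Friends character speaks next (illustrative).
from openai import OpenAI

client = OpenAI()
CHARACTERS = ["Ross", "Rachel", "Monica", "Chandler", "Joey", "Phoebe"]

def who_speaks_next(chat_history: str) -> str | None:
    resp = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[{"role": "user",
                   "content": "Here is a group chat between the six Friends characters:\n"
                              + chat_history +
                              f"\nWhich of {CHARACTERS} would naturally speak next? "
                              "Answer with one name, or NONE if the conversation should "
                              "pause and wait for the user."}],
    )
    name = resp.choices[0].message.content.strip()
    return name if name in CHARACTERS else None
```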

This project was not chosen by any group.

Who’s The Undercover

Project Description

Who’s The Undercover is an interesting board game. AI Agents can also play the Who’s The Undercover game with humans. The game tests reasoning ability and the ability to hide true intentions.

Requirements:

  1. Develop a Who’s The Undercover game service, executing the rules of the Who’s The Undercover game.
  2. Include roles such as the moderator ("God"), civilians (good people), and the undercover.
  3. There is one real person in the game, and the rest are AI Agents.
  4. Each AI Agent and the human participating in the game must follow the game rules, with roles being random, only seeing the information they are supposed to see, and not seeing information they are not supposed to see.
  5. Agents need to possess some basic game skills (which can be specified in the prompt), such as not being too straightforward in descriptions, nor too abstract, not repeating other players’ statements, while providing new information without exposing themselves.
  6. Agents need the ability to analyze other players’ statements and infer who is the undercover, not choosing randomly.

No group chose this topic.

Conclusion

A student who participated in this practical project used GPT-4 to help write a conclusion. I excerpt it below without changing a word:

In this era full of innovation and exploration, the practical topics of the University of Science and Technology of China have opened a window for us: by fully utilizing the capabilities of large models, even undergraduates with weak programming foundations can solve a series of complex technical problems in a short time and achieve remarkable results. This not only demonstrates the infinite potential of AI technology but also opens new paths for future software development and artificial intelligence applications.

The topics covered include enterprise ERP assistants, Werewolf game, intelligent data collection, mobile voice assistants, meeting assistants, and more, each project being a deep exploration of the practical application of large models. By reviewing the outcomes of the projects, we can clearly see how large models simplify tasks that would normally require a significant amount of development work, allowing students with weaker programming backgrounds to easily cope.

In the enterprise ERP assistant project, by allowing the AI Agent to automatically convert users’ natural language queries into SQL statements, database operations are greatly simplified, making complex information queries readily accessible. The Werewolf project demonstrates the potential of AI in simulating complex human interactions, where the AI Agent can not only participate in the game but also exhibit a certain level of reasoning and strategy through a “think before speaking” approach. The intelligent data collection project addresses a major pain point in the data collection field, collecting information efficiently and accurately through automation, significantly improving efficiency.

Each project has lowered the technical threshold to some extent, allowing students with weak foundations to participate in the development of complex projects. This is not only a significant enhancement of the students’ abilities but also a valuable investment in their future careers. The application of large models has shown the wide applicability and strong potential of AI technology in various fields, making us look forward to the future of artificial intelligence with great anticipation.

However, the application of large models also exposes some limitations, such as the need for improvement in real-time performance, accuracy, and adaptability to complex scenarios. For example, in the mobile voice assistant project, accurately recognizing and operating interface elements remains a challenge. These issues remind us that although large models have brought revolutionary simplification to software development, there is still a need for continuous optimization and deepening to better adapt to complex and variable real-world application scenarios.

In the future, with further perfection of large model technology and the improvement of the level of intelligence, we have every reason to believe that AI will display even more astonishing capabilities in more fields. From enterprise operations to daily life, from simple data processing to complex decision analysis, the application of AI will become increasingly widespread, and its impact on society will also become increasingly profound. We look forward to more explorations and practices like the practical topics of the University of Science and Technology of China, which not only provide students with valuable learning opportunities but also contribute to the progress of society as a whole. In this new era driven by data and intelligence, let us move forward together and jointly create a bright future for AI.
