
From Code Analysis to Agent Automation: A Q&A on Copilot Applied Science

Last updated: 2026-05-08

In this Q&A, we explore how an AI researcher automated their intellectual toil using GitHub Copilot, building a system called eval-agents that lets their team analyze coding agent trajectories more efficiently. The original article described a pattern familiar to software engineers, automating repetitive tasks to free time for creative work, and took it a step further by automating the analysis of agent performance data itself. Below, we answer key questions about this process.

What is agent-driven development in the context of Copilot Applied Science?

Agent-driven development refers to using AI agents to automate intellectual work, such as analyzing the thought processes and actions of coding agents during benchmark evaluations. In the Copilot Applied Science team, an AI researcher created a system called eval-agents to automate the repetitive analysis of trajectories—JSON files detailing how agents solve tasks. By offloading this intellectual toil, developers can focus on higher-level insights and creative problem-solving. This approach leverages GitHub Copilot to surface patterns in data, reducing the need to manually read hundreds of thousands of lines of code. The result is a faster development loop for both the individual researcher and the entire team.

From Code Analysis to Agent Automation: A Q&A on Copilot Applied Science
Source: github.blog

What problem sparked the creation of eval-agents?

The project was born from the need to analyze coding agent performance against benchmarks like Terminal-Bench 2.0 or SWE-bench Pro. Each task in an evaluation dataset produces a trajectory: a JSON file, often hundreds of lines long, recording the agent’s thoughts and actions. With dozens of tasks per benchmark and multiple runs a day, the researcher faced hundreds of thousands of lines of output. Reading that volume manually was impossible, so they initially used GitHub Copilot to surface patterns and narrow the reading down to a few hundred lines. The repetitive nature of even that process was frustrating, so the researcher set out to automate the intellectual work itself. This led to the development of eval-agents, a tool for automating trajectory analysis.
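As a concrete illustration, a first step in taming a trajectory file is to condense it into a few numbers. The sketch below assumes a hypothetical schema in which a trajectory is a JSON list of step objects with optional "thought" and "action" keys; the article does not describe the actual evaluation format.

```python
import json
from pathlib import Path

def summarize_trajectory(path):
    """Condense one trajectory JSON into counts a reviewer can scan quickly.

    Assumes a hypothetical schema: the file holds a list of step objects,
    each with optional "thought" and "action" keys.
    """
    steps = json.loads(Path(path).read_text())
    return {
        "steps": len(steps),
        "actions": sum(1 for step in steps if "action" in step),
        "thoughts": sum(1 for step in steps if "thought" in step),
    }
```

Run over a directory of trajectories, a summary like this turns hundreds of thousands of lines into a handful of rows worth reading.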

How does eval-agents automate intellectual toil?

Eval-agents is a set of agents designed to analyze coding agent trajectories automatically. The researcher used GitHub Copilot to build agents that ingest trajectory data, identify patterns, and surface insights without human intervention. For example, instead of manually inspecting each JSON file, an agent can scan all trajectories for common failure modes or efficiency bottlenecks. The system is built on the principle that engineering and science teams work better together, so the agents are easy for peers to share and run. By making coding agents the primary vehicle for contributions, the researcher enabled team members to author new agents for their own analyses. This automates the intellectual toil of pattern recognition, freeing the team to focus on interpreting results and improving agent performance.
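The failure-mode scan described above could be sketched as follows. The pattern names and regular expressions are illustrative assumptions, not the team's actual heuristics, and a real agent would refine them over time.

```python
import re
from collections import Counter
from pathlib import Path

# Hypothetical failure-mode patterns; placeholders, not the team's real rules.
FAILURE_PATTERNS = {
    "timeout": re.compile(r"timed? ?out", re.IGNORECASE),
    "test_failure": re.compile(r"\bFAILED\b"),
    "missing_file": re.compile(r"No such file or directory"),
}

def scan_trajectories(directory):
    """Tally failure modes across every trajectory JSON in a directory."""
    counts = Counter()
    for path in Path(directory).glob("*.json"):
        text = path.read_text()
        for name, pattern in FAILURE_PATTERNS.items():
            if pattern.search(text):
                counts[name] += 1
    return counts
```

Because the scan returns a tally rather than raw text, a reviewer sees "12 timeouts, 3 missing files" instead of re-reading every run.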

What were the design goals for eval-agents?

The researcher set three main goals for the eval-agents project:

  • Make these agents easy to share and use – Leveraging GitHub’s collaborative platform, the agents were packaged so any team member could run them with minimal setup.
  • Make it easy to author new agents – The system was designed with a low barrier to entry, allowing researchers to create custom agents for their specific analyses without deep programming skills.
  • Make coding agents the primary vehicle for contributions – Instead of relying on static scripts, the team would contribute by building and improving agents, fostering a culture of automation and continuous improvement.

These goals align with GitHub’s ethos of collaboration and the researcher’s experience as an OSS maintainer. By focusing on shareability and ease of authoring, eval-agents lowered the friction for the entire team to adopt agent-driven development.
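The second goal, a low barrier to authoring, could be served by something as simple as a registry where a new agent is just a decorated function. This is a hypothetical sketch of that idea; the article does not describe eval-agents' actual interface.

```python
from typing import Callable

# Hypothetical registry: authoring a new analysis agent is just writing
# a function and tagging it with a short name.
AGENTS: dict[str, Callable] = {}

def agent(name):
    """Register an analysis function under a short, shareable name."""
    def register(fn):
        AGENTS[name] = fn
        return fn
    return register

@agent("step-count")
def step_count(trajectory):
    """Example agent: how many steps did the run take?"""
    return len(trajectory)

def run_agent(name, trajectory):
    """Look up a registered agent by name and run it on one trajectory."""
    return AGENTS[name](trajectory)
```

With this shape, sharing an agent means sharing one function, which matches the goal of making contributions easy for non-experts.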

How does GitHub Copilot enable a faster development loop?

GitHub Copilot supports the workflow in two ways: it helps the researcher write the code for the agents themselves, and it assists in analyzing the agents’ outputs. The researcher used Copilot to surface patterns in trajectories, reducing the need to read hundreds of thousands of lines manually. This iterative process, using Copilot to identify insights and then building agents to automate them, created a feedback loop. With eval-agents, the team can now run experiments, get instant summaries, and drill down into specific failures. The speed comes from not having to reinvent analysis scripts each time; agents are reusable and adaptable. This has unlocked a significantly faster development cycle for the researcher and enabled teammates to build solutions tailored to their needs without starting from scratch.
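An "instant summary" of a benchmark run could be as small as the sketch below. It assumes each per-task result carries a hypothetical "task" id and "passed" flag, which is an illustration rather than the real result schema.

```python
def summarize_run(results):
    """Roll per-task results up into one summary, flagging failures to drill into.

    Assumes a hypothetical result schema: each entry is a dict with a
    "task" id and a boolean "passed" flag.
    """
    failed = [r["task"] for r in results if not r.get("passed")]
    return {
        "total": len(results),
        "passed": len(results) - len(failed),
        "drill_down": failed,
    }
```

The "drill_down" list is the hand-off point: it names exactly which trajectories deserve a closer, human (or agent) look.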


What impact did eval-agents have on the Copilot Applied Science team?

The introduction of eval-agents transformed how the team analyzes coding agent performance. Instead of spending hours manually reviewing trajectory files, team members can now run pre-built agents to automatically generate reports on benchmark results. This has democratized the analysis process—even less technical members can contribute insights by using or authoring agents. The researcher noted that the tool enabled peers to build solutions that fit their specific needs, fostering a culture of automation and collaboration. Moreover, the agent-driven approach reduced repetitive work, allowing the team to focus on more creative and strategic tasks, such as improving agent architectures or designing new evaluation metrics. Ultimately, eval-agents made the team more efficient and empowered each member to take ownership of their analytical workflows.

How can other teams apply these lessons from agent-driven development?

Other teams can adopt the principles demonstrated by eval-agents: identify repetitive intellectual tasks, use AI tools like GitHub Copilot to automate pattern recognition, and build shareable components. Key lessons include:

  • Start with a pain point – The researcher automated an analysis process that was manual and time-consuming.
  • Leverage existing tools – GitHub Copilot accelerated both coding and analysis phases.
  • Design for sharing – Making agents easy to use and extend ensures team-wide adoption.
  • Enable contributions – Allow team members to author their own agents to address unique needs.

By following these steps, any engineering or science team can create a faster development loop and reduce intellectual toil. The key is to treat agents as products that evolve through collaboration, much like open-source projects.

For more on this topic, see the original article “Agent-driven development in Copilot Applied Science.”