Tower Research Ventures and GenAI Collective Host Event with the Princeton NLP Group

Large language models (LLMs) are one of the most impactful machine learning breakthroughs in recent years. They have shown promise in automating previously intractable tasks, and one domain in which LLMs excel is software development. This has drawn attention from leading public tech companies and early-stage startups, as well as from the open-source software community. One of the major efforts has come from the Princeton NLP Group with SWE-agent. It can be thought of as an autonomous software engineer that debugs existing codebases by leveraging models such as GPT-4.

On July 9, 2024, Tower Research Ventures and GenAI Collective co-hosted the Princeton NLP Group, which developed both SWE-agent (an autonomous software engineer) and SWE-bench (a benchmark to test agents’ effectiveness), for an evening of talks and Q&A on applications of AI for autonomous software development. Attendees included research scientists, ML engineers, founders and other practitioners.

The talks ranged from comparisons between state-of-the-art (SOTA) open-source and closed-source software engineering agents to calls for agent-specific computing interfaces akin to the UI needs of human software developers.

John Yang discussing agent generation paradigms

Representatives from Princeton NLP demoed SWE-agent, which turns language models (LMs) into agents that can fix bugs and issues in GitHub repositories. They explained its implementation architecture and ended with how practitioners can modify SWE-agent. Upon release in April, their approach was a SOTA open-source system with a ~12% resolution rate, as seen in the figure below. That made SWE-agent competitive with closed-source systems from leading companies, one of which has a ~14% resolution rate. For reference, the best approaches combining RAG with a generalist foundation model cap out at ~4%.

Source: Twitter thread about SWE-agent’s performance.

Another talk covered SWE-bench, a benchmark designed to evaluate LMs’ ability to resolve real-world GitHub issues. Researchers discussed the components of a good benchmark: it is challenging for SOTA models, reflects realistic use cases, and allows straightforward evaluation of solutions. These criteria led them to collect benchmark instances by scraping issue-pull request pairs and applying filters such as whether the pull request contributes tests. The online leaderboard presented on July 9th showed that newer approaches from both companies and research groups now surpass SWE-agent, with resolution rates of ~19%[1].
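As a rough illustration of the filtering and scoring ideas described above, the Python sketch below keeps only issue-pull request pairs whose fix can be checked by tests, then computes a resolution rate as the fraction of instances resolved. The `Candidate` fields, the `keep` filter, and the sample data are all hypothetical and are not SWE-bench's actual schema or pipeline.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """A hypothetical scraped issue / pull-request pair (not SWE-bench's schema)."""
    repo: str
    issue_id: int
    pr_merged: bool          # the pull request was merged
    pr_links_issue: bool     # the pull request references the issue it resolves
    pr_changes_tests: bool   # the pull request adds or modifies test files


def keep(c: Candidate) -> bool:
    """Illustrative filter: keep only pairs whose fix can be checked by tests."""
    return c.pr_merged and c.pr_links_issue and c.pr_changes_tests


def resolution_rate(resolved: dict[int, bool]) -> float:
    """Fraction of benchmark instances that an agent resolved."""
    if not resolved:
        return 0.0
    return sum(resolved.values()) / len(resolved)


# Made-up candidates, for illustration only.
candidates = [
    Candidate("example/repo", 1, True, True, True),
    Candidate("example/repo", 2, True, True, False),   # no tests: dropped
    Candidate("example/repo", 3, False, True, True),   # not merged: dropped
    Candidate("example/repo", 4, True, True, True),
]
benchmark = [c for c in candidates if keep(c)]

# Hypothetical agent results: True if the agent's patch makes the tests pass.
results = {1: True, 4: False}
print(len(benchmark), resolution_rate(results))
```

Tying each instance to tests is what makes evaluation straightforward in this sketch: a candidate patch either makes the associated tests pass or it does not, so no human judgment is needed to score it.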

Kilian Lieret walking through implementation details

We had a great group of engineers attend, many of whom shared ambitious visions for what autonomous software development will look like in the future. Special thanks to the GenAI Collective for co-hosting the event!

If you are building in this space, we’d love to chat. Please reach out to Tower Research Ventures at ventures@tower-research.com!


[1] Rate as of July 9, 2024. Note that the online leaderboard is updated on a weekly basis.

The views expressed herein are solely the views of the author(s), are as of the date they were originally posted, and are not necessarily the views of Tower Research Ventures LLC, or any of its affiliates. They are not intended to provide, and should not be relied upon for, investment advice, nor is any information herein any offer to buy or sell any security or intended as the basis for the purchase or sale of any investment. The information herein has not been and will not be updated or otherwise revised to reflect information that subsequently becomes available, or circumstances existing or changes occurring after the date of preparation. Certain information contained herein is based on published and unpublished sources. The information has not been independently verified by TRV or its representatives, and the accuracy or completeness of such information is not guaranteed. Your linking to or use of any third-party websites is at your own risk. Tower Research Ventures disclaims any responsibility for the products or services offered or the information contained on any third-party websites.