~/today's vibe

16 Opus Agents Built a C Compiler for $20K

By 오늘의 바이브 (Today's Vibe)

GCC Took 37 Years, Claude Took 2 Weeks

[Image: Server racks running in parallel in a data center]

GCC's first release was in 1987. The project, started by Richard Stallman, became what it is today through 37 years of thousands of developers contributing millions of lines of code. Building a C compiler from scratch is one of the most difficult tasks in computer science: you need to build everything—preprocessor, lexer, parser, semantic analyzer, optimizer, code generator, assembler, linker. A C developer with 30 years of experience called this project "something only the top 1% of developers could do."

In February 2026, Nicholas Carlini, a researcher on Anthropic's Safeguards team, challenged this impossibility. A former penetration tester, he used to dig into vulnerabilities in big companies' products. His chosen tool: 16 Claude Opus 4.6 agents. Method: parallel operation for two weeks. Result: a 100,000-line C compiler written in Rust. Name: CCC, Claude's C Compiler. Cost: $20,000. And this compiler successfully compiled Linux kernel 6.9 on three architectures—x86, ARM, RISC-V—and booted it.

On GitHub, it already has over 2,400 stars. The numbers alone are impressive. But the real meaning of this project lies beyond the numbers. Can a compiler built for $20,000 stand alongside a 37-year-old compiler? The answer is more complicated than you think.


How 16 Claudes Work

[Image: Programming workspace with code displayed on screen]

The core of Carlini's system design is "unsupervised parallel work." The 16 Opus 4.6 agents each ran in independent Docker containers. Each agent had its own workspace (/workspace) and exchanged code through a shared git repository (/upstream).

There was no orchestration agent. Without central control, each Claude chose its next task on its own. To prevent task conflicts, they used a lock file approach by creating text files in the current_tasks/ directory. When merge conflicts occurred, Claude resolved them autonomously.
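A minimal sketch of that lock-file scheme, assuming each agent claims a task by atomically creating a marker file. The current_tasks/ directory name comes from the article; the function names and file contents are illustrative.

```python
# Illustrative sketch of lock-file task claiming. Only the directory name
# (current_tasks/) is from the article; everything else is an assumption.
import os

TASK_DIR = "current_tasks"

def try_claim(task_name: str, agent_id: int) -> bool:
    """Atomically claim a task; returns False if another agent holds it."""
    os.makedirs(TASK_DIR, exist_ok=True)
    path = os.path.join(TASK_DIR, f"{task_name}.lock")
    try:
        # O_CREAT | O_EXCL fails if the file already exists, so only one
        # agent can ever win the race to create it.
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as f:
        f.write(f"agent-{agent_id}\n")
    return True

def release(task_name: str) -> None:
    """Delete the marker so another agent can pick the task up."""
    os.remove(os.path.join(TASK_DIR, f"{task_name}.lock"))
```

The atomic-create trick means no central coordinator is needed: the filesystem itself arbitrates races between the 16 containers sharing the repository.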

According to Carlini's explanation, the structure was remarkably simple. "Put Claude in a simple loop to work on the current task until it's perfect, then immediately move to the next task." Human intervention was limited to writing tests and setting work direction. Not a single line of code was written directly. Except for one paragraph in the README, 100% of CCC's code and documentation was written by Claude Opus 4.6.

Beyond core compiler development, agents performed specialized roles. Detecting and refactoring duplicate code, optimizing compiler performance, efficient code generation, critiquing Rust design and improving structure, maintaining documentation—all distributed across agents. The clean room environment was strict. Internet access was blocked during development, with only the Rust standard library available.

The parallelization strategy evolved with the project. Early on, each agent handled different failing tests. When the pass rate reached 99%, the strategy changed: each agent compiled a different real project, such as SQLite, Redis, libjpeg, QuickJS, or Lua. For the massive single task of the Linux kernel, they used GCC as an oracle: they randomly split the file set, compiled some files with GCC and others with CCC, then compared results to track bugs down to the file level.
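The oracle technique can be sketched as a bisection loop. This is a hedged reconstruction: the article only says file sets were split and results compared, so the halving strategy and the `build_and_boot` callback below are illustrative assumptions standing in for the real (expensive) kernel build and boot test.

```python
# Hedged sketch of "GCC as oracle": compile a suspect subset with CCC and
# everything else with the trusted compiler (GCC); if the mixed build still
# works, the bug hides in the other subset. Halving converges on one file.
def find_bad_file(files, build_and_boot):
    """Narrow a miscompilation down to a single file in O(log n) builds.

    build_and_boot(ccc_files) returns True when compiling `ccc_files` with
    CCC (and the rest with GCC) still yields a working kernel.
    """
    suspects = list(files)
    while len(suspects) > 1:
        half = suspects[: len(suspects) // 2]
        if build_and_boot(half):
            # The first half is fine under CCC; the bug is in the rest.
            suspects = suspects[len(suspects) // 2 :]
        else:
            suspects = half
    return suspects[0]
```

With thousands of kernel source files, this kind of differential narrowing takes a few dozen builds instead of thousands.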


200 Million Tokens and 3,982 Commits

The numbers clarify the scale of this project.

Item                             Value
Number of agents                 16 (Opus 4.6)
Total sessions                   ~2,000
Duration                         ~2 weeks
Input tokens                     2 billion
Output tokens                    140 million
API cost                         $20,000
Generated code                   100K lines (96.2% Rust, 3.8% C)
git commits                      3,982
Successfully compiled projects   150+
GCC torture test pass rate       99%

3,982 commits in two weeks means roughly 0.2 commits per minute. If we assume a human developer makes 5-10 commits per day, 16 agents produced code at a speed equivalent to dozens of human team members.
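The back-of-envelope math above checks out (all figures are from the table; the 5-10 commits per day human baseline is the article's own assumption):

```python
# Quick check of the throughput arithmetic. Numbers are from the article;
# the human commits/day range is the article's assumption, not a measurement.
commits = 3982
minutes = 14 * 24 * 60            # two weeks
per_minute = commits / minutes    # ~0.2 commits per minute
assert 0.19 < per_minute < 0.21

per_day = commits / 14            # ~284 commits/day across all 16 agents
human_low, human_high = 5, 10     # assumed human commits per day
equivalent_team = (per_day / human_high, per_day / human_low)
# ~28 to ~57 human developers: "dozens of team members"
```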

[Image: Networked server infrastructure]

The compiler architecture faithfully follows textbook structure. Preprocessor, lexer, parser, semantic analysis, SSA-based intermediate representation, 15 optimization passes, code generator, assembler, linker: every component was built from scratch. There are four target architectures: x86-64, i686, AArch64, and RISC-V 64. The built-in C headers include x86 SIMD intrinsics (SSE through AVX-512, plus AES-NI, FMA, SHA, and BMI2) and ARM NEON intrinsics.
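The stages listed above can be sketched as a driver loop that threads a compilation artifact through each phase in order. The stage names come from the article; the driver itself is purely illustrative (and in Python for brevity, though CCC is written in Rust).

```python
# The textbook pipeline from the article, as an ordered stage list a
# compiler driver might run. The run_pipeline helper is illustrative.
PIPELINE = [
    "preprocess",    # expand #include / #define
    "lex",           # source text -> tokens
    "parse",         # tokens -> AST
    "sema",          # type checking, semantic analysis
    "lower_to_ssa",  # AST -> SSA intermediate representation
    "optimize",      # the 15 optimization passes run here
    "codegen",       # SSA -> target assembly (x86-64, i686, AArch64, RV64)
    "assemble",      # assembly -> object file
    "link",          # object files -> executable
]

def run_pipeline(source, stages):
    """Thread a compilation artifact through each stage in order."""
    artifact = source
    for stage in stages:
        artifact = stage(artifact)
    return artifact
```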

The list of successfully compiled projects is impressive. Linux kernel 6.9, PostgreSQL (all 237 regression tests passed), FFmpeg (7,331 FATE checksum tests passed), QEMU, Redis, CPython, LuaJIT, GNU coreutils, Busybox, libsodium, libpng, jq, libjpeg-turbo, mbedTLS, musl, and even DOOM. It even compiled TCC (Tiny C Compiler). An AI-made compiler compiled another compiler. Not a simple "Hello World" toy, but capable of compiling over 150 large-scale projects used in production environments.


12% Faster Than GCC -O0, 2.76x Slower Than -O2

Performance benchmarks most honestly show CCC's current position. Binaries generated by CCC are 12% faster than GCC with optimizations off (-O0). But compared to GCC's standard optimization (-O2), they're 2.76x slower.

The cause of this gap is clear. CCC generates 3.3x more instructions than GCC. While 15 optimization passes exist, all optimization levels from -O0 to -O3 run the same pipeline. In other words, optimization level differentiation isn't implemented yet.

Interestingly, CCC's generated code has an IPC (Instructions Per Cycle) of 4.89, higher than GCC's 4.13. Because CCC's code is simpler, the CPU can process instructions more efficiently, but the overwhelming total number of instructions means it falls behind in total execution time. Catching up to 37 years of accumulated optimization expertise in two weeks is near impossible.
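A rough sanity check: execution time scales with instruction count divided by IPC, and the article's figures nearly reproduce the measured slowdown.

```python
# Check that the instruction-count and IPC figures are consistent with the
# reported 2.76x slowdown. All input numbers come from the article.
instr_ratio = 3.3        # CCC emits 3.3x more instructions than GCC -O2
ipc_ccc, ipc_gcc = 4.89, 4.13

# time_ccc / time_gcc = (instr_ccc / ipc_ccc) / (instr_gcc / ipc_gcc)
slowdown = instr_ratio * ipc_gcc / ipc_ccc
print(round(slowdown, 2))  # ~2.79, close to the measured 2.76x
```

The small gap between 2.79 and 2.76 is expected, since "3.3x" is itself a rounded figure.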

In terms of accuracy, it achieved 100%. In all test cases, binaries generated by CCC produced functionally identical results to GCC's output. 100% accuracy in a compiler is no small achievement. Compiler bugs are more critical than other software bugs. If the compiler is wrong, everything built on top of it shakes.

However, technical limitations clearly exist. The 16-bit x86 code generator is incomplete, so GCC is called for that stage: CCC's generated 16-bit code exceeds 60KB, far beyond Linux's 32KB limit. Edge cases in _Complex arithmetic, _Atomic qualifiers that are parsed but not tracked in the type system, and only partial support for GNU extensions all remain unsolved. And CCC is Linux-only; it doesn't run on macOS or Windows.


Chris Lattner's Cold Assessment

[Image: Source code displayed on a monitor]

The person who gave the weightiest evaluation of this project was Chris Lattner: creator of LLVM and Clang, designer of the Swift language, with stints at Apple, Google, and Tesla behind him. He published a detailed technical review of CCC.

Lattner acknowledged CCC as "real progress and an industry milestone" while simultaneously drawing a line: "not a true revolution." His assessment was specific: CCC is "a textbook implementation that an excellent undergraduate team might create in the early stages of a project." The SSA IR borrows LLVM-inspired ideas such as GetElementPtr and the basic-block "terminator" concept.

The technical criticism was sharp. The code generator is "toy-level," the optimizer re-parses assembly text instead of operating on the IR, the parser's error recovery is lacking, and rather than parsing system headers, the content the tests required was hard-coded. He identified this last point as "the biggest problem" limiting CCC's versatility.

But the most crucial part of Lattner's review was his insight into AI capabilities themselves. "Implementing known abstractions is different from inventing new ones. I don't see anything novel in this implementation." His conclusion: LLMs are "extremely powerful distribution followers that reproduce decades of accumulated compiler engineering consensus."

In other words, what Claude did was reproduce what's in compiler textbooks—at scale, quickly, and accurately. It didn't devise novel optimization techniques or revolutionary architectures no one thought of. This distinction is critically important. Implementation ability and design ability are different dimensions.

Based on this, Lattner presented three forecasts. First, manual rewrites become AI-native tasks, automating large engineering categories. Second, as implementation cost approaches zero, architectural design ability's value maximizes. Third, engineers' roles move upward toward specification, verification, and design. Not writing code, but deciding what code should exist becomes central.


"You Don't Call Stealing and Mixing Bread 'Making Bread'"

Online reactions split sharply. One of the most upvoted GitHub comments went like this: "You don't call stealing bits of every bread from the supermarket and mixing them 'making bread from scratch.' That's called theft." Pointing to the possibility that all open-source compilers were included in the training data.

The core of this criticism is valid. GCC, Clang, TCC, and countless educational compiler source codes are public. It's hard to believe this code wasn't included in Claude's training data. Whether CCC was made "from scratch" or reassembled existing compiler implementations remains an open question.

A Hacker News user raised this issue more rigorously. "To verify true ability, you need to test with minority languages like J that are barely in the training data." Success in in-distribution tasks doesn't guarantee capability in out-of-distribution tasks.

Cost skepticism also emerged. A contract developer from South Africa claimed they could build an equivalent compiler in 4 weeks for $20,000 at their hourly rate, and their analysis suggested the code actually needed was around 15,000 lines, not 100,000. Another user countered that the "actual cost is much higher than $20,000," pointing to Carlini's time spent managing the AI, the cost of failed attempts, and API pricing currently running at a loss.

But advocacy was equally strong. The comment "100,000 lines of working code for $20,000 is incredibly cheap compared to hiring a human developer team" gained substantial support. Based on data showing inference costs dropping about 90% annually, some forecast that the same project a year later would cost $2,000. Carlini himself evaluated the cost as "a tiny fraction of my solo cost, and incomparable if forming a team."

Compiler expert ndesaulniers from Google's Clang team took a more cautious position. "Can't judge production compiler correctness yet." Passing tests and safely compiling arbitrary code are entirely different dimensions. Indeed, a Hacker News user analyzed that CCC "doesn't properly type-check." It allows dereferencing non-pointers and doesn't error when calling functions with wrong argument counts.

Former Microsoft executive Steve Sinofsky pointed out comparison errors from another angle. GCC didn't take 37 years to "work"—it adapted for 37 years to C language standard evolution and new platform emergence. The comparison of two weeks vs. 37 years itself doesn't hold.


Why Carlini Said "Excited Yet Uncomfortable"

Carlini, who ran the project himself, had an ambivalent reaction. He described it as "the most fun experience recently" while adding, "I never expected this to be possible in early 2026."

What made him uncomfortable wasn't technical limitations but social implications. His statement that "programmers deploying software they haven't personally verified is a realistic concern" reflected professional instinct as a safety team researcher.

Carlini also shared practical lessons from the project. The most important principle: "Write extremely high-quality tests." Test validators must be near-perfect for Claude to solve the right problems. If tests are inaccurate, Claude endlessly optimizes in wrong directions.

The practical tips were concrete. Minimize stdout and store detailed logs in searchable files. Put error tags on the same line so grep can find them. Provide pre-computed aggregate statistics. Create a --fast option that deterministically samples only 1-10% of tests for quick per-agent feedback. Build CI that prevents regressions.
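The deterministic sampling tip can be sketched by hashing each test's name, so every run (and every agent) sees the same subset without maintaining a list. The article only says the sample must be deterministic; the hashing scheme below is an illustrative assumption.

```python
# Sketch of a deterministic "--fast" test sample: hash each test name and
# keep the ones that fall under the percentage cutoff. The scheme is an
# assumption; the article specifies only determinism and a 1-10% rate.
import zlib

def fast_subset(test_names, percent=10):
    """Return a stable `percent`% sample of the test suite."""
    return [
        name for name in test_names
        # crc32 is stable across runs and machines, unlike Python's hash().
        if zlib.crc32(name.encode()) % 100 < percent
    ]
```

A pleasant property of the cutoff scheme is that the 5% sample is always a subset of the 10% sample, so a quick pass and a fuller pass agree on the shared tests.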

And one critical observation. "Claude doesn't perceive time. If left alone, it can run tests for hours without progress." This reveals a fundamental limitation of current AI agents. While capable of writing code, they lack metacognitive ability to efficiently allocate time toward goals.

Another lesson Carlini shared was "design from Claude's perspective." Explicitly output progress, provide fast execution modes. Because Claude can't judge whether test execution is progressing, humans must design judgment criteria in advance. Managing agents is similar to yet fundamentally different from managing junior developers. Juniors ask when stuck, but Claude doesn't ask—it repeats the same attempts.


A World Where Implementation Cost Approaches Zero

This project's real meaning isn't in CCC the compiler itself. Whether a 100,000-line compiler is useful, whether it can replace GCC—these are secondary questions. CCC isn't production-ready at this point. Carlini himself admitted "it's not a direct replacement for actual compilers."

The core is implementation cost collapse. An era has opened where $20,000 and two weeks can build 100,000-line complex systems. Chris Lattner forecast this change will fundamentally alter software engineers' roles. "When implementation cost approaches zero, scarce resources move upward. The ability to decide what systems should exist, how software should evolve, becomes most valuable."

Carlini's previous experimental results show this trajectory. Opus 4.0 couldn't even make a functioning compiler. Opus 4.5 first produced a compiler passing large-scale tests but couldn't compile actual projects. Opus 4.6 succeeded in compiling the Linux kernel. Each model generation dramatically widens the scope of possibilities.

Simon Willison drew deeper questions from this project. "If an AI system trained on decades of public code can reproduce familiar structures, patterns, even specific implementations—where exactly is the boundary between learning and copying?" No consensus exists on this question yet. And the lack of consensus itself speaks to this technology's current state.

CCC didn't prove AI "can write" code. That's already known. What CCC proved is AI "can build complex systems from start to finish." And that cost is a fraction of a human team's.

Carlini summarized this project's implications: "Agent teams show the possibility of autonomously implementing entire complex projects. We're entering a new world and need new strategies to navigate safely." A fitting conclusion for a safety team researcher. Admiring technology's potential while simultaneously guarding against its risks. This perspective will outlast CCC the compiler.


Sources: