A Pelican Rides a Bicycle

On February 23, 2026, a model called DeepSeek V4 Lite leaked through unofficial channels. Internal codename: "sealion-lite." It had been circulating among NDA-bound testers before someone let it out.
In the leaked demos, this model generated a pelican riding a bicycle as SVG code. 42 lines. The same prompt took Gemini 3.1 Pro over five minutes. Claude Opus 4.6 nailed the pelican's beak and pouch but produced a wonky bicycle frame. V4 Lite supposedly beat both on code optimization and logical structure.
The problem: nobody has verified this.
What the "Pelican Benchmark" Actually Tests

"Generate an SVG of a pelican riding a bicycle." Django co-creator Simon Willison coined this prompt. What started as a joke has become a de facto standard for testing AI code generation. There's a GitHub repo (simonw/pelican-bicycle) and a Hugging Face space (victor/pelican-benchmark) comparing multiple models.
The prompt is hard because the model has to understand pelican anatomy (beak, throat pouch, wings), bicycle mechanics (frame, wheels, spokes, pedals), and combine them spatially with precision. No training data exists for "pelican riding a bicycle." The model has to generalize from scratch.
Willison admitted: "I started my benchmark as a joke, but it's actually starting to become a bit useful." Tom Gally later expanded it to 30 creative prompts -- "octopus operating a pipe organ," "starfish driving a bulldozer" -- testing 9 frontier models.
SVG generation matters as a benchmark because generating SVG is writing code. The model has to compute coordinates numerically, keep the XML syntactically valid, and reason about spatial relationships between objects without any visual feedback. According to the 2025 State of JS survey, 69% of front-end developers now use AI-assisted SVG generation tools. This isn't a toy benchmark. It's a proxy for real work.
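To see why this counts as coding, here is a minimal Python sketch (not from any leaked demo; the function name and coordinates are invented for illustration) that emits two bicycle wheels as SVG. Every spoke endpoint is trigonometry the model has to perform implicitly, token by token, with no canvas to check against:

```python
import math

def wheel(cx: float, cy: float, r: float, spokes: int = 8) -> str:
    """One bicycle wheel as SVG: a rim circle plus computed spoke lines."""
    parts = [f'<circle cx="{cx}" cy="{cy}" r="{r}" fill="none" stroke="black"/>']
    for i in range(spokes):
        a = 2 * math.pi * i / spokes  # spoke angle in radians
        x2, y2 = cx + r * math.cos(a), cy + r * math.sin(a)
        parts.append(f'<line x1="{cx}" y1="{cy}" x2="{x2:.1f}" y2="{y2:.1f}" stroke="black"/>')
    return "\n".join(parts)

# Two wheels on a shared baseline: the kind of spatial relationship the
# model must encode as raw numbers inside valid XML.
svg = (f'<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 200 100">\n'
       f'{wheel(50, 60, 30)}\n{wheel(150, 60, 30)}\n</svg>')
print(svg)
```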
On the current SVG leaderboard, Gemini 3 Pro leads with Claude Opus 4.5 (Thinking) close behind. Gemini 3.1 Pro produces high-quality results but takes over five minutes. If V4 Lite's 42-line claim holds up, it beat existing leaders on both speed and efficiency.
What V4 Lite Has and What It Doesn't
The estimated specs: roughly 200 billion parameters and a 1-million-token context window. That's nearly 8x the 128K of the V3 series -- about 500 pages of A4 text in a single interaction.
The "Lite" label exists for a reason. It's missing Engram Conditional Memory, a key feature reserved for the full V4. Engram separates static knowledge storage from dynamic reasoning, offloading syntax and API knowledge to CPU RAM so GPU compute focuses on logic-intensive tasks. It uses O(1) hash-based lookups and reportedly cuts VRAM usage by about 30%. V4 Lite is a pure text model that traded cross-modal visual reasoning for strength in text compression and information retrieval.
What V4 Lite did inherit: two of three core architectural innovations from the V4 family.
First, Manifold-Constrained Hyper-Connections (mHC). DeepSeek founder Liang Wenfeng personally uploaded this paper to arXiv on January 1, 2026. Standard Hyper-Connections suffered signal amplification exceeding 3,000x, which made training diverge at scale. mHC constrains this to 1.6x using the Birkhoff polytope, the set of doubly stochastic matrices, with the constraint enforced via the Sinkhorn-Knopp algorithm. A 4x wider residual stream adds only 6.7% training overhead. On a 27B model, the BBH score jumped from 43.8 to 51.0.
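The Sinkhorn-Knopp step, at least, is standard published math and can be sketched. A doubly stochastic matrix has every row and every column summing to 1, so mixing the residual stream through it redistributes signal without inflating its total, which is the claimed mechanism behind the 1.6x bound. A minimal NumPy version of the projection, assuming nothing about how the mHC paper actually wires it in:

```python
import numpy as np

def sinkhorn_knopp(M: np.ndarray, iters: int = 50) -> np.ndarray:
    """Push a positive matrix toward the Birkhoff polytope (doubly
    stochastic) by alternately normalizing rows and columns."""
    P = M.copy()
    for _ in range(iters):
        P /= P.sum(axis=1, keepdims=True)  # rows sum to 1
        P /= P.sum(axis=0, keepdims=True)  # columns sum to 1
    return P

rng = np.random.default_rng(0)
P = sinkhorn_knopp(rng.random((4, 4)) + 1e-6)
print(P.sum(axis=0))  # ~[1. 1. 1. 1.]
print(P.sum(axis=1))  # ~[1. 1. 1. 1.]
```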
Second, DeepSeek Sparse Attention (DSA). It focuses compute on relevant context rather than processing everything uniformly, cutting computational overhead by roughly 50% compared to standard transformers. The Lightning Indexer handles fast preprocessing for million-token contexts.
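The indexer's details are not public, but the two-stage pattern the descriptions point to is easy to sketch: a cheap scorer prunes the context, then exact attention runs only over the survivors. Everything below (the slicing stand-in for a learned projection, the shapes, the top_k value) is an assumption for illustration:

```python
import numpy as np

def sparse_attention(q, K, V, k_index, top_k=4):
    """Two-stage sparse attention: cheap scoring, then exact softmax
    over only the top-k selected positions.
    Shapes: q (d,), K and V (n, d), k_index (n, d_small)."""
    # Stage 1: coarse relevance scores. A real system would use a learned
    # low-dimensional projection of q; slicing is a crude stand-in here.
    coarse = k_index @ q[: k_index.shape[1]]
    idx = np.argsort(coarse)[-top_k:]  # keep only the top-k positions
    # Stage 2: exact attention over the selected subset.
    scores = K[idx] @ q / np.sqrt(q.shape[0])
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V[idx]

rng = np.random.default_rng(1)
n, d = 1024, 64
out = sparse_attention(rng.standard_normal(d), rng.standard_normal((n, d)),
                       rng.standard_normal((n, d)), rng.standard_normal((n, 16)))
print(out.shape)  # (64,): full attention touched only 4 of 1024 keys
```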
These two features reportedly let V4 Lite maintain over 60% accuracy at 1 million tokens, with relatively stable performance through 200,000 tokens before gradual decline. The claim: it outperformed concurrent Gemini benchmarks on this metric.
Leaked Benchmarks, Deleted Evidence
The performance claims came from multiple sources. A deleted Reddit post, a @bridgemindai tweet, and reports from a well-known leaker called "Legit." None independently verified.
Here's what the leaked numbers look like:
| Benchmark | DeepSeek V4 (claimed) | Claude Opus 4.6 (verified) | Gemini 2.5 Pro (verified) |
|---|---|---|---|
| SWE-bench Verified | >80% | 80.8% | 63.8% |
| HumanEval | 90% | 88% | Not disclosed |
| Xbox controller SVG | 54 lines | Not disclosed | Not disclosed |
| Pelican SVG | 42 lines | Generated (wonky frame) | Generated (5+ min) |
Impressive on paper. But Kilo.ai's analysis landed the key punch: the leaked V4 benchmarks compared against outdated versions of Claude and GPT that had already been surpassed. By late February 2026, Claude Opus 4.6 and newer models had moved past those numbers. Beating last month's models and framing it as beating today's leaders.
It gets worse. Chinese tech outlet 36kr reported that in the same pelican test, DeepSeek's vector output showed "structural chaos and geometric distortion." The model revealed limitations in code generation tasks involving geometric coordinates and spatial relationships. That directly contradicts the polished "42-line success" demo that leaked later.
Two possible explanations: the results came from different versions or testing periods during development, or the successful demo was cherry-picked. Either way, the "beat Gemini in 42 lines" claim takes a credibility hit.
The Meaning and the Trap of 42 Lines

Why does "42 lines" matter? In SVG, line count is a rough but legible measure of optimization. Producing the same visual result with less code means the model skips unnecessary elements, combines paths efficiently, and demonstrates a deep understanding of SVG syntax.
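A concrete, invented example of what "combines paths efficiently" means in practice: the same closed triangle encoded as three separate <line> elements versus a single <path>. The coordinates are arbitrary; the point is the element count:

```python
# Same closed triangle, two encodings.
pts = [(10, 80), (50, 10), (90, 80)]

# Verbose: one <line> element per edge.
verbose = "\n".join(
    f'<line x1="{x1}" y1="{y1}" x2="{x2}" y2="{y2}" stroke="black"/>'
    for (x1, y1), (x2, y2) in zip(pts, pts[1:] + pts[:1])
)

# Compact: a single <path> whose "d" string draws the whole shape.
compact = ('<path d="M' + " L".join(f"{x} {y}" for x, y in pts)
           + ' Z" stroke="black" fill="none"/>')

print(verbose.count("\n") + 1, "elements vs 1 path element")
```

Scale that from a triangle to a full pelican and bicycle, and the gap between 42 lines and 100+ lines is plausible without any loss of fidelity.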
But line count alone doesn't determine quality. SVG is a visual medium. A 42-line pelican that looks worse than a 100-line one is meaningless. Willison's benchmark surfaces recurring problems across all models: bicycle frames remain "wonky," anatomical positioning goes haywire, and mysterious "floating eggs" appear from time to time. Whether V4 Lite solved these issues while also cutting lines is unknowable without seeing the actual code. And that code hasn't been released.
Communeify's analysis offers a useful lens. Testing 9 frontier models against 30 creative prompts, they found that models using MoE architecture had an edge in SVG performance. DeepSeek V4 is a textbook MoE model -- 256 tiny experts with only 2-3 activated per token. Specialized experts for tasks like SVG coordinate calculation may exist within the mixture.
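For readers unfamiliar with the mechanics, here is a minimal routing sketch matching the leaked description (many small experts, only a couple active per token). The gating function, dimensions, and expert form are all assumptions, not DeepSeek's published design:

```python
import numpy as np

def moe_route(x, gate_W, experts, top_k=2):
    """Route one token through the top-k of many small experts.
    x: (d,), gate_W: (n_experts, d), experts: list of (d, d) matrices."""
    logits = gate_W @ x
    idx = np.argsort(logits)[-top_k:]    # select the top-k experts
    gates = np.exp(logits[idx] - logits[idx].max())
    gates /= gates.sum()                 # renormalize their weights
    return sum(g * (experts[i] @ x) for g, i in zip(gates, idx))

rng = np.random.default_rng(2)
d, n_exp = 32, 256
y = moe_route(rng.standard_normal(d), rng.standard_normal((n_exp, d)),
              [rng.standard_normal((d, d)) for _ in range(n_exp)], top_k=2)
print(y.shape)  # (32,): only 2 of 256 experts did any work
```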
Chain-of-Thought reasoning also matters. Models like Qwen3.5 that use step-by-step reasoning produced more accurate geometric output. Whether V4 Lite employs such reasoning is unknown. DeepSeek V3.2 was observed running a short reasoning chain before SVG generation, but whether V4 Lite enhanced or stripped this capability remains unclear.
The SVGenius benchmark (arXiv paper) adds another dimension. Claude 3.7 Sonnet led on understanding tasks (80.25% Easy PQA, 77.78% Easy SQA) and editing (76% bug-fix accuracy), while GPT-4o excelled at generation (20.35 HPS, 19.72 PSS in text-to-SVG). Even within SVG, "understanding" and "generation" are different skills.
The Price Tag Is the Real Weapon
Whether V4 Lite's 42 lines are real or not, the pressure DeepSeek puts on the industry is real. And it's not about performance. It's about price.
The reported training cost: roughly $6 million, compared with GPT-4's estimated $100+ million. The GPU count: about 2,000 versus 10,000+ at Western labs.
The quantized (4-bit) version reportedly runs on a single NVIDIA RTX 5090 (32GB VRAM) for local inference. This tracks with the V4's dual RTX 4090 setup covered in a previous post. The era of trillion-parameter-class models on consumer hardware is arriving.
This pricing structure is more disruptive than any benchmark number. Whether you hit 80% or 85% on SWE-bench, delivering comparable results at 1/40th the cost makes the market's choice obvious. BingX reported that the V4 leaks put "fresh pressure on Nasdaq tech stocks" -- the same pattern as the DeepSeek R1 leak in early 2025.
Counterpoint Research's Wei Sun called the mHC technology a "striking breakthrough." Not for raw performance, but for fundamentally reshaping cost efficiency.
The Unverifiable Model Problem
The biggest issue with V4 Lite isn't technical limitations. It's that nothing can be verified.
Multiple projected launch dates have passed. Mid-February. Lunar New Year (February 17). Late February. As of early March 2026, DeepSeek has made no official announcement. The leaked benchmark numbers rest on deleted posts and unverified tweets. Not a single independent third-party benchmark exists.
This creates a distortion. Experts keep repeating that "independent benchmarking is required," but since the model itself isn't publicly available, verification is physically impossible. Leakers are under NDA. Official channels are silent. Markets and communities react to unverified claims anyway. Nasdaq dips. Tech blogs proliferate. Investment reports circulate. All built on "reportedly" and "according to sources."
Some observers have called this an "organized FUD campaign" -- suggesting that regardless of V4's actual quality, public reaction is being steered toward negative interpretation to protect market stability and investor sentiment. The flip side: the leaks themselves could be deliberate marketing by DeepSeek. The R1 rollout followed a similar playbook.
The full V4 with its trillion-parameter open-source release has been teased. Only then will anyone know whether V4 Lite's 42 lines were real or cherry-picked.
The Number Isn't the Story
DeepSeek V4 Lite's 42 lines make for a compelling headline. Behind that number sit unverified claims, deleted evidence, and contradictory reporting. 36kr saw "structural chaos." Other leakers saw "a clean 42 lines." They can't both be true.
What is certain: DeepSeek's architectural innovations (mHC, Sparse Attention) are published, peer-reviewable work. The 40x price advantage is real. Training on 2,000 GPUs versus 10,000 is real. These realities exert more pressure than any pelican SVG.
The episode also exposes a fundamental limitation of AI benchmarks. Whether it's pelican SVGs, SWE-bench, or HumanEval, no single benchmark captures a model's full capability. Willison himself said the test was "starting to become a bit useful" -- not "definitive." Forty-two lines of SVG, at best, demonstrate code optimization under specific conditions. Not general coding ability. Not reasoning. Certainly not real-world software engineering.
The more interesting question isn't whether the 42 lines are genuine. It's why a single unverified benchmark from unofficial channels creates this much noise. The answer lies in the current state of AI: as frontier models converge in performance, even tiny differences get amplified. The gap between 42 and 54 lines, between 80% and 82%, moves billions in market value. Benchmarks have stopped measuring technology and started functioning as marketing.
When DeepSeek officially ships V4, the real game begins. Until then, it's all rumors.
Sources:
- Technical leaks reveal DeepSeek V4 Lite outperforms Gemini 3.1 - TechBriefly
- DeepSeek V4 Lite Surfaces With Breakthrough SVG Generation Skills - Dataconomy
- DeepSeek V4: Rumors vs Reality for the Next Big Coding Model - Kilo.ai
- DeepSeek V4 Lite Leak Points to Fast, Clean SVG Code - Geeky Gadgets
- After Zhipu and Minimax, DeepSeek Launches a Basic Attack - 36kr
- DeepSeek V4 Benchmark Leaks: What the Numbers Actually Show - HumAI
- Simon Willison - Pelican Riding a Bicycle
- AI Model SVG Generation Benchmark of 9 Top LLMs - Communeify
- Yupp SVG AI Leaderboard