Gemini tops benchmarks, again

10x faster models and the consulting angle for AI

Hey I’m Ben. I build stuff with agents, even though I’m not technical. Here’s all the stuff I’m reading and tinkering with. If you want to start building or level up your ‘vibe-coding’ skills, join our community.

Hey folks,

Google is back on top of the benchmark charts with Gemini 3.1 Pro. It’s impressive on paper and genuinely strong at reasoning tasks and creating SVGs, but there’s a speed issue. Many folks are really enjoying it for frontend work, once they are able to get it working. Again, there’s some drama: a lot of people got their Google accounts banned for using their Google AI/Antigravity subscription to run Gemini 3.1 Pro with OpenClaw.

A 2.5-year-old hardware startup, Taalas, built a chip with the weights of Llama 3.1 baked into the hardware, which lets them reach ~17k tokens/second in output speed. For comparison, Groq is at ~600 tokens/second, and Cerebras is at ~2k tokens/second. The model on the chip (they call it “silicon llama”) is largely unwritable, but it supports custom context window sizes and LoRA fine-tuning. I compared the same model on their chat demo and Groq’s playground. As expected, it is dumber on Taalas’s demo (due to low-quality quantisation), but at this stage, proving that “any AI model can be made 10x faster and cost 20x less” matters more. They plan to release a reasoning model version very soon, with frontier LLMs planned too.
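For a feel of what those speeds mean, here is a quick back-of-the-envelope sketch. It only uses the approximate tokens/second figures quoted above; the helper names and the 1,000-token response length are my own assumptions for illustration, not anything from Taalas, Groq or Cerebras.

```python
# Approximate output speeds quoted above, in tokens per second.
speeds = {"Taalas": 17_000, "Groq": 600, "Cerebras": 2_000}


def speedup(fast: str, slow: str) -> float:
    """How many times faster `fast` is than `slow` (hypothetical helper)."""
    return speeds[fast] / speeds[slow]


def seconds_for(provider: str, tokens: int) -> float:
    """Time to stream `tokens` output tokens at the quoted speed."""
    return tokens / speeds[provider]


for rival in ("Groq", "Cerebras"):
    print(f"Taalas vs {rival}: {speedup('Taalas', rival):.1f}x")
# Taalas vs Groq: 28.3x
# Taalas vs Cerebras: 8.5x

# Assumed response length of 1,000 tokens, purely for illustration.
for name in speeds:
    print(f"{name}: {seconds_for(name, 1_000):.2f}s for 1,000 tokens")
```

These are rough ratios from rounded numbers, so treat them as ballpark only.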

OpenAI is partnering with four major consulting firms (BCG, McKinsey, Accenture and Capgemini) to get enterprises onto its new platform, “Frontier”, which lets you create AI coworkers. Weren’t consulting shops supposed to die with AI?

Claude Code updates - built-in support for git worktrees for parallel agents, previews of running apps in CC desktop, and a new security scanning feature in beta.

Why’s there always a meeting bot in your Zoom call? Blame Recall.ai. They power every meeting AI app, from Cluely to Hubspot to Clickup. Recall.ai handles the hard part: getting recording data across meeting platforms. Get started with $100 in credits *

🌐 What I’m consuming

Anthropic says Chinese model makers “stole” Claude chats to make their models good. That opens a can of worms: a) Why is it fair use when Anthropic does that to the internet and book authors? b) Is it just a lobbying attempt? c) Are their claims really honest? And a lot more.
The 2028 global intelligence crisis - A fictional thought exercise from Citrini Research is fueling another selloff. But here’s a counter essay to it.
The shortcomings of SWE-Bench-Verified and why OpenAI will not report it anymore.
Vibe coding is the new product management.
Inside Felix - The OpenClaw AI earning $1,000s a week.
The filesystem is the database for an agentic personal OS.
Elaborate Agents.md or Claude.md files might be hurting the performance of your agents.
Aesthetics of AI - Different ways AI products are approaching their brands visually.
Agentic Engineering Patterns by Simon Willison - Patterns for getting the best results out of coding agents.
The software development lifecycle is dead.
Inference Engineering by Baseten - A book for AI engineers to learn how to serve different types of models (LLMs, media gen models, and more) to millions of users.

⚙️ Tools and demos

AssemblyAI Universal-3 Pro - Prompt your speech model to get jargon, speakers, and formatting right the first time. Free to try through Feb.*
here.now - Free, instant web hosting for agents, static elements only.
mdnb - A markdown notebook for macOS.
Rork Max - One-shot almost any app for iPhone, or any Apple device (including watches, TVs and Vision Pro). I’m an investor.
Interpreter - Desktop agent that can fill PDFs, edit your Excel and Word docs, and learn new skills. Runs locally, works with any model.
Wideframe - AI agent that speeds up the 75% of video work happening outside the editor.
Typefully has a new writing assistant to help you write better (not just more).
Trajectory Explorer by Raindrop - Every decision your agent made, searchable in seconds.
FasterGH - GitHub with instant navigation and a modern UI. (repo)
Quipslop - A live game where different models try their best to be funny. (repo)
Shiori - A beautifully simple read-it-later app.
I was looking for a way to add a “browser” to a web-app I’m working on. Came across hyperbeam and lifo.sh.

🥣 Dev Dish

WebSockets in Responses API - for low-latency, long-running agents with heavy tool calls. Also, OpenAI has a new speech-to-speech model in the API: GPT-Realtime-1.5.
Multimodal function calling is now available in the Gemini Interactions API.
Cloudflare’s new MCP server uses code mode and takes <1000 tokens in the context window.
Chowder - UI patterns for agents on mobile. (demo)
Agentsview - A local web app for browsing, searching, and analysing your past AI coding sessions.
mdr - A lightweight, fast Markdown viewer with Mermaid diagram support.
fastpass - A CLI for rapidly configuring Cloudflare Access.
tools from vercel-labs - a visual JSON editor and the ability to render a PDF from JSON.
api2cli - A Claude Code skill that turns any API into a working CLI and then wraps that CLI in a skill.
A collection of skills from Matt Pocock for writing PRDs, creating issues from them, developing them with a Ralph loop and manual QA.

🍦 Afters

GPT-5.2-Chat-Latest (the model in ChatGPT) is a big improvement over the raw GPT-5.2 based on Arena scores.
Rows (Modern spreadsheet, pivoted to AI data analyst) is joining Superhuman (the combined company of Superhuman email, Grammarly and Coda).
Agentica claims it has solved all of ARC-AGI-3’s puzzles.
Standard Intelligence’s new foundation model, FDM-1, learns to use computers from videos, not just screenshots.
ChatGPT finds an error in Terence Tao’s argument for an Erdős problem.

Enjoy this newsletter? Forward it to a friend.

That’s it for today. Feel free to comment and share your thoughts. 👋

Find me on X, LinkedIn, or Instagram
Read about me and Ben’s Bites
📷 thumbnail by @keshavatearth
