It’s the end of the year for Radar! We hope all of our readers enjoy the holidays. Here’s one prediction for 2025:
Is this the end of the road for improving LLM performance by scaling either the number of parameters or the training data? No one knows yet. Regardless of the answer, we expect interest to shift toward smaller models. We’ll grudgingly allow the 70B parameter model to qualify as “small,” but we really mean 20B or fewer parameters. These models will prove to be easier for companies developing AI-enabled applications to work with: They won’t cost as much to run and they’ll be simpler to fine-tune for specialized applications. Very few applications will need a fully general language model.
Artificial Intelligence
- The OpenGPT-X project has released its open large language model, Teuken-7B. This model is significant because it supports 24 European languages and is designed to be compliant with European law. It is available on Hugging Face.
- OLMo 2 is a newly released, fully open, small language model that comes in 7B and 13B sizes. Both versions claim the best performance in their size class.
- NVIDIA has announced Fugatto, a new generative text-to-audio model that can create completely new kinds of sounds. They position it as a tool for creators.
- Anthropic has announced the developer preview of its Model Context Protocol. MCP allows Claude Desktop to communicate securely with other resources. The MCP server limits the services that are exposed to Claude, filters Claude’s requests, and prevents data from being exposed over the internet. (A schematic sketch of the idea appears after this list.)
- OpenScholar is an open source language model designed to support scientific research. It’s significantly more accurate than GPT-4o and more economical to run. It uses RAG to access a large database of open-access scientific papers, which helps ensure that citations are accurate. (A minimal RAG sketch appears after this list.)
- Meta has partnered with VSParticle to create new materials from instructions generated by AI. They are focusing on nanoporous materials, which could be catalysts for breaking down CO2 into useful products.
- Perplexity has introduced in-app shopping: Users can search for something, then have Perplexity buy it. It’s the first widely available example of an AI agent that changes the state of the physical world.
- Research has shown that generative AI models have their own distinctive styles, not unlike human writers. Stylistic analysis can trace a text back to the model that generated it.
- Mistral has released Pixtral Large, a 124B parameter multimodal model with benchmark performance on a par with the latest versions of other frontier models.
- Mozilla’s Common Voice project collects speech samples in languages other than Anglo-American English to help developers build voice-enabled applications using other languages and dialects. The project is open source.
- Mechanistic interpretability is a research area that uses AI to examine what’s happening within each layer of a large language model. It provides a path toward AI interpretability: the ability to understand why an AI produces any output that it generates, and possibly to control that output.
- Google’s Pixel phones will be able to monitor phone conversations to detect scams in real time. Processing takes place entirely on the phone. The feature is off by default and can be enabled on a per-call basis. Another new feature detects stalkerware, apps that collect data without the user’s consent or knowledge.
- The Common Corpus dataset for training large language models is now open and available on Hugging Face. The dataset contains over 2T tokens taken from “permissively licensed” sources, and it documents the provenance of every source.
- OpenAI’s newest model, Orion, is an improvement over GPT-4. But is it a significant improvement? Apparently not. This may be the end of the road for improving LLMs by making them larger. (And is Orion GPT-5?)
- FrontierMath is a new AI benchmark that is based on very tough mathematical problems. At this point, no language model scores higher than 2%; the top score belongs to Gemini 1.5 Pro.
- Separating the instruments in a musical performance is tough, but it’s possible. Here’s an AI-free masterpiece of signal processing that attempts to do so. Can we turn a performance back into sheet music?
- Standard Intelligence has released hertz-dev, a new model for real-time voice synthesis. It was trained purely on audio and can participate in unscripted conversations without the use of text.
- Microsoft’s Magentic-One is a generalist agentic system that is capable of performing complex tasks. Magentic-One is open source for researchers and developers. Microsoft has also released AutoGenBench, an open source tool for evaluating the performance of agentic systems.
- ChainForge is a new visual tool for prompt engineering. It can be used to test prompts against multiple models and evaluate the quality of the response.
- AI was used to de-age Tom Hanks and Robin Wright in a new film, allowing the actors to play their characters across a 60-year time span.
- Anthropic has released Claude 3.5 Haiku, a new version of its smallest and fastest model. The company claims that its performance on many benchmarks is superior to Claude 3 Opus, its previous leading model. Anthropic has also significantly increased the price for using Haiku.
- OpenAI has introduced predicted outputs. If the output for a prompt is largely known ahead of time—for example, if you’re asking GPT to modify a file—you can send the expected result along with the prompt, and the model only generates what needs to change. Predicted outputs reduce latency; apparently they don’t reduce cost. (A sketch appears after this list.)
- Fortunately, AI Psychiatry has nothing to do with psychoanalyzing human patients. It’s a forensic tool for postmortem analysis of AI failures that allows investigators to recover the exact model that was in use when the failure occurred.
- SmolLM2 is a new small language model designed to run on-device. It comes in 135M, 360M, and 1.7B parameter versions. Early reports say that its performance is impressive.
- vLLM is a framework for serving LLMs. It works with most of the language models on Hugging Face. It claims to be simpler than alternatives, and it also claims significant performance and cost benefits from caching the key-value (KV) tensors computed for input tokens. (A minimal example appears after this list.)
- AI Flame Graphs show developers what their models are doing in detail. If you’re concerned about performance or energy use, they are revolutionary.
- Google’s Project Jarvis is reported to be the company’s answer to Anthropic’s computer use API. Jarvis takes over a browser (presumably Chrome) to perform tasks on behalf of the user.
- NotebookLM’s ability to generate a podcast from documents is impressive. Can other models do the same thing? NotebookLlama is an open source project that generates podcasts using the Llama models.
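A closer look at the Model Context Protocol item above: the sketch below is purely illustrative (it is not the MCP SDK or its wire format), but it shows the shape of the idea. A small local server exposes an explicit allowlist of tools over stdio, so the model can call only what the server chooses to offer, and data never has to leave the machine.

```python
# Illustrative only: not the MCP SDK or wire format.
# A local server exposes a fixed allowlist of tools over stdio,
# so the model can call only what the server explicitly offers.
import json
import sys

def search_notes(query: str = "") -> str:
    # Stand-in for a real local capability; data never leaves the machine.
    return f"No notes matching {query!r} (stub)."

ALLOWED_TOOLS = {"search_notes": search_notes}  # everything else is refused

def handle(request: dict) -> dict:
    tool = request.get("tool")
    if tool not in ALLOWED_TOOLS:
        return {"error": f"tool {tool!r} is not exposed"}
    return {"result": ALLOWED_TOOLS[tool](**request.get("args", {}))}

if __name__ == "__main__":
    for line in sys.stdin:  # one JSON request per line
        print(json.dumps(handle(json.loads(line))), flush=True)
```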
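OpenScholar’s retrieval pipeline is considerably more sophisticated, but the RAG pattern it builds on looks roughly like this sketch. The corpus, retriever, and prompt below are invented placeholders, not OpenScholar’s actual components.

```python
# A minimal RAG sketch (not OpenScholar's pipeline): retrieve relevant
# passages first, then ask the generator to ground its answer and
# citations in the retrieved text. The toy "papers" below are invented.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

papers = {
    "doe2023": "Open-access corpora improve citation accuracy in scientific QA.",
    "lee2024": "Retrieval grounding reduces hallucinated references in LLM output.",
}

def retrieve(query: str, k: int = 1) -> list[str]:
    ids = list(papers)
    vec = TfidfVectorizer().fit(list(papers.values()) + [query])
    sims = cosine_similarity(vec.transform([query]), vec.transform(papers.values()))[0]
    return [ids[i] for i in sims.argsort()[::-1][:k]]

def build_prompt(query: str) -> str:
    context = "\n".join(f"[{pid}] {papers[pid]}" for pid in retrieve(query))
    # Any LLM can serve as the generator; it's told to cite only retrieved ids.
    return f"Answer using only these sources, citing them by id:\n{context}\n\nQ: {query}"

print(build_prompt("Why does retrieval improve citation accuracy?"))
```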
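Predicted outputs deserve a quick example. This sketch assumes the `prediction` parameter on the chat completions endpoint of the OpenAI Python SDK; treat the exact parameter shape as an assumption and check the current API reference.

```python
# Sketch of predicted outputs: the expected result is passed as a prediction
# so the model can reuse matching tokens and respond faster.
# Assumes the `prediction` parameter of the chat completions API.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

original = open("config.yaml").read()  # file we expect back mostly unchanged

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user",
               "content": "Change the port to 8080 in this file:\n" + original}],
    prediction={"type": "content", "content": original},
)
print(response.choices[0].message.content)
```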
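And here is what serving with vLLM looks like in its offline Python API. The model name is only an example; any supported Hugging Face model should work, assuming vLLM is installed with appropriate hardware support.

```python
# Minimal vLLM sketch: load a Hugging Face model and generate in a batch.
# The paged KV cache that vLLM manages is what drives its throughput gains.
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")            # example model from the vLLM docs
params = SamplingParams(temperature=0.8, max_tokens=64)

prompts = ["Caching key-value tensors for input tokens helps because"]
for output in llm.generate(prompts, params):
    print(output.outputs[0].text)
```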
Programming
- bpftune is a utility that constantly tunes Linux system performance using observability data from BPF. It has “zero configurables” (no configuration options), imposes low overhead, and is smart enough to stay away from settings a system administrator has changed. It apparently does not use AI.
- Kyanos is a new open source network analysis tool that’s based on eBPF. Because it has access to eBPF data, it can filter packets by process or by service, and it can give precise information about packet latency.
- VMware Fusion and VMware Workstation are now free to all users, including commercial users. Broadcom will continue to develop the products but will cease providing troubleshooting support for users.
- OpenCoder is a family of language models for generating code. It’s completely open source, and training data, the data pipeline, training results, and training protocols are all available in addition to the code. Its intent is to encourage further experimentation and research on code generation.
- Mergiraf is a tool for solving Git merge conflicts by using an understanding of common programming languages (including Java, Rust, and Go) and file formats (including JSON, HTML, XML, and YAML). The authors claim that new languages can be added easily.
- A proposal has been published for Safe C++, a new version of C++ that will incorporate memory safety features.
- DataChain is a Python library for working with structured data in the context of artificial intelligence. It’s designed for building data pipelines and manipulating data at scale.
- NoCode GitHub? GitHub Spark allows users to create small “micro-apps,” or sparks, without writing any code. What may be more important than no code is no deployment; sparks are deployed on GitHub’s infrastructure and accessed through the web.
- Using Git to back up Linux’s /etc directory is obvious, once you think of it. (A minimal sketch appears after this list.)
- Ractor is an actor framework for Rust, which means that you can program in Rust somewhat as if it were Erlang. I’m impressed: it features the longest, most complicated “Hello, World” I’ve ever seen.
- Kubernetes is a platform for building platforms. And platforms need to serve both development and operations teams.
- GitHub Copilot can now use models other than GPT. Users can select Claude Sonnet or Gemini in addition to different OpenAI models. Other new features include auto–code review, an upgrade assistant for Java, multifile editing, and something called Spark that sounds a bit like Claude’s Artifacts.
- Is your AI-generated code secure? No. We’re not likely to stop using tools like Copilot and Cursor, but we need to understand the challenge: AI models were trained on publicly available code. Most publicly available code has vulnerabilities. Those will be reflected in the AI’s output.
- Does Java need another build tool? Mill is waiting to take over. Mill claims to be 5–10x faster than Maven and 2–4x faster than Gradle.
- Amphion is an open source toolkit for generating all forms of audio, including music and speech.
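On the Git-for-/etc item above: tools like etckeeper handle this properly, including hooks for package managers, but the core idea is just a repository inside /etc. Here is a minimal, illustrative sketch (it needs root):

```python
# Illustrative only: snapshot /etc into a local Git repository (run as root).
import subprocess

def git(*args: str) -> None:
    subprocess.run(["git", *args], cwd="/etc", check=True)

git("init")                                    # one-time setup; safe to rerun
git("add", "-A")                               # stage every config file
git("commit", "-m", "snapshot of /etc", "--allow-empty")
```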
Robots
- Grasso is an AI-powered trashbot: a mobile robot made of trash. It uses Llava-v1.6-mistral-7B to understand visual input from its camera, and Mistral-7B for prompts and responses. (It doesn’t understand or generate speech.)
- Meta has released several new projects for touch perception, a crucial element in building AI-driven robots that can interact with the real world. Digit 360 is a tactile digital fingertip, Sparsh is an encoder for tactile data, and Digit Plexus is a platform for building artificial hands.
- Tie two unintelligent micro robots (bristlebots) together with a short, flexible tether and they acquire the ability to solve simple problems.
Web
- Want to run Linux in your browser? You can. WebVM is a virtual machine that runs in a browser. Linux in the browser may not be that interesting; it’s more important as another example of Wasm’s abilities.
Virtual Reality
- Want to talk to Rosa Parks or Abraham Lincoln? Try ENGAGE XR, a tool that combines VR and generative AI. Whether this is actually history is an interesting question; the bus in the Rosa Parks example looks like a modern European bus, not an American bus from the 1950s.
Quantum Computing
- Google DeepMind has developed AlphaQubit, an AI system that detects errors in quantum systems. Error correction has made tremendous progress in the past year but remains a major problem in quantum computing.