Signals
Could small custom-built or fine-tuned large language models be better suited to business use cases than the generic models from large vendors? Infosys' chairman seems to think so:
Indian technology grandee Nandan Nilekani expects companies around the world will increasingly build their own smaller-scale artificial intelligence models to streamline operations and boost productivity, dampening hope of a substantial enterprise payday for more powerful generative products.
The chair of IT services major Infosys told the Financial Times he was “not so sure” companies would want to shoulder the high costs and the potential “black box” of data and copyright liabilities associated with large language models behind popular applications, such as OpenAI’s ChatGPT.
There may be something in this, especially if the reasoning models don’t quite meet expectations.
Researchers are sharing encouraging early reports about o1-preview as an aid for tackling complex scientific challenges.
Researchers at the national lab have also been surprised by o1-preview’s ability to recognize when it doesn’t have all the necessary information to answer a question and make reasonable assumptions for variables it might be missing, the person said.
The Lawrence Livermore example is similar to the positive reaction Australian-American mathematician Terence Tao shared after the initial release of o1-preview and o1-mini. Tao used the models to solve math problems and write proofs—something that a typical ChatGPT user probably wouldn’t do.
“It may only take one or two further iterations of improved capability” until such a reasoning model becomes a “competent graduate student…at which point I could see this tool being of significant use in research level tasks,” he said.
This dilemma mirrors one potentially faced by junior lawyers (more on that below): if AI handles graduate-level research tasks, how will the next generation of researchers develop their skills?
An example of a fine-tuned LLM outperforming human experts:
LLMs trained on the vast scientific literature could potentially integrate noisy yet interrelated findings to forecast novel results better than human experts. Here, to evaluate this possibility, we created BrainBench, a forward-looking benchmark for predicting neuroscience results. We find that LLMs surpass experts in predicting experimental outcomes. BrainGPT, an LLM we tuned on the neuroscience literature, performed better yet. Like human experts, when LLMs indicated high confidence in their predictions, their responses were more likely to be correct, which presages a future where LLMs assist humans in making discoveries. Our approach is not neuroscience specific and is transferable to other knowledge-intensive endeavours.
They used LoRA to finetune Mistral-7B-v0.1 on neuroscience literature.
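For a sense of what that involves in practice, below is a minimal LoRA fine-tuning sketch using Hugging Face transformers and peft. The corpus file, adapter ranks and training settings are illustrative placeholders, not the values used for BrainGPT.

```python
# Minimal LoRA fine-tuning sketch with Hugging Face transformers + peft.
# The dataset file, LoRA rank and training settings are illustrative only.
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer, Trainer,
                          TrainingArguments, DataCollatorForLanguageModeling)

base = "mistralai/Mistral-7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(base, device_map="auto")

# Attach low-rank adapters to the attention projections; only these are trained.
lora = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05,
                  target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
model = get_peft_model(model, lora)

# Hypothetical corpus of neuroscience abstracts, one JSON object per line.
corpus = load_dataset("json", data_files="neuro_abstracts.jsonl", split="train")
tokenized = corpus.map(
    lambda x: tokenizer(x["text"], truncation=True, max_length=1024),
    remove_columns=corpus.column_names,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="braingpt-lora",
                           per_device_train_batch_size=1,
                           gradient_accumulation_steps=16,
                           num_train_epochs=1, learning_rate=2e-4,
                           bf16=True, logging_steps=50),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
model.save_pretrained("braingpt-lora")
```

The appeal of LoRA here is that only the small adapter matrices are trained, so a 7B model can be specialised on a domain corpus with relatively modest hardware.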
Generative AI is increasingly able to perform entry-level work in white-collar jobs, which will impact workforces far into the future.
Across the white-collar economy, entry-level jobs are suddenly vulnerable to automation because they involve low-stakes assignments of the sort that generative AI is best at. AI could therefore sever the career ladder of industries like finance and law, forcing many would-be bankers and lawyers to look elsewhere for work.
Having used some of the more recent AI meeting tools as a meeting scribe, I can attest that they are increasingly able to perform routine tasks with a good degree of accuracy. Compared to the early days, when we’d have a good laugh at the software’s attempts to summarise, it’s now a useful tool for taking notes and extracting action items.
Consider the legal field. Law is among the industries most exposed to generative AI’s capabilities because of its orientation toward language. Traditionally, the first few years of a newly accredited lawyer’s career is spent working under the tutelage of more senior lawyers and engaged in routine tasks—missives like “document review,” basic research, drafting client communications, taking notes, and preparing briefs and other legal documents. Advances in AI-powered legal software have the potential to create vast efficiencies in these tasks, enabling their completion in a fraction of the time—and a fraction of the billable hours—that it has historically taken junior lawyers and paralegals to complete them.
If we don’t need to train up junior lawyers, how do we grow the legal workforce? Or do we need to rethink the role of a lawyer?
Stripe APIs are adding payments and metering capabilities to LLM agentic workflows:
In case you want an agent to perform purchases:
Agentic workflows need not have exclusively virtual outcomes. Imagine a travel agent that can book flights for your company. Using LLMs and function calling we can assemble a set of agents that can search for flights online, return options, and ultimately identify a booking URL. With Stripe, you can embed financial services and enable the automation of the purchase flow as well. Using Stripe Issuing, you can generate single-use virtual cards that agents can use for business purchases. This enables your agents to spend funds. The Issuing APIs allow you to approve or decline authorizations programmatically, ensuring your purchase intent matches the authorization. Spending controls allow you to set budgets and limit spending for your agents
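As a rough sketch of what the Issuing part looks like with Stripe’s Python library (the cardholder id, limits and merchant category below are placeholders, and the agent toolkit wraps these calls for you):

```python
# Sketch: issue a virtual card with tight spending controls for an agent to
# use on a booking. IDs, amounts and categories are placeholders.
import stripe

stripe.api_key = "sk_test_..."  # placeholder key

card = stripe.issuing.Card.create(
    cardholder="ich_...",          # existing Issuing cardholder (placeholder id)
    currency="usd",
    type="virtual",
    spending_controls={
        # Cap each authorization at $500 and restrict spend to airlines.
        "spending_limits": [{"amount": 50000, "interval": "per_authorization"}],
        "allowed_categories": ["airlines_air_carriers"],
    },
)

# Authorizations can also be approved or declined programmatically,
# e.g. from an issuing_authorization.request webhook handler:
def handle_authorization(auth):
    if auth["amount"] <= 50000:
        stripe.issuing.Authorization.approve(auth["id"])
    else:
        stripe.issuing.Authorization.decline(auth["id"])
```

Because the card is virtual and the spending controls are enforced by Stripe, the agent’s worst case is bounded by the limits you set rather than by its own judgement.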
Additionally, it can be used for metering and billing:
Conducting agentic workflows have material cost – typically measured by token use or time. With usage-based billing, you can charge based on a customer’s usage of your product. The toolkit provides middleware to easily track prompt and completion token counts and send billing events for that customer.
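Under the hood this amounts to reporting usage against a billing meter. A minimal sketch, assuming a meter named llm_tokens has already been configured and using Stripe’s Billing Meter Events API; the customer id and token counts are placeholders:

```python
# Sketch: report token usage as a billing meter event after each LLM call.
# Assumes a usage-based price backed by a meter named "llm_tokens" exists.
import stripe

stripe.api_key = "sk_test_..."

def record_usage(customer_id: str, prompt_tokens: int, completion_tokens: int) -> None:
    stripe.billing.MeterEvent.create(
        event_name="llm_tokens",
        payload={
            "stripe_customer_id": customer_id,              # placeholder customer
            "value": str(prompt_tokens + completion_tokens),
        },
    )

# e.g. after a completion call:
record_usage("cus_123", prompt_tokens=842, completion_tokens=317)
```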
The Issuing API sounds particularly useful for stopping an LLM agent from buying travel tickets to Yellowstone National Park, or worse. From the Claude announcement on computer use:
In one, Claude accidentally clicked to stop a long-running screen recording, causing all footage to be lost. In another, Claude suddenly took a break from our coding demo and began to peruse photos of Yellowstone National Park.
Ben Thompson on Shopify’s questionable self-awareness when they ventured into logistics:
Here’s the thing: logistics were and are a big challenge for merchants. It is a problem that needs to be solved. In this case, however, Shopify was too laser-focused on merchants; logistics was a situation where they needed to not just understand the problems being faced by their merchants, but also understand themselves and what problems they were actually capable of solving.
Specifically, Shopify is a software company, and that puts one hard constraint on your business: you need to go horizontal, not vertical. Software requires huge amounts of R&D but it also scales infinitely; software businesses profit by maximizing leverage on their investments, which means serving more customers with the same capabilities. Logistics, though, means the physical world, which means variable costs and a limited addressable market; this limits the leverage you get from software, without decreasing the need for R&D.
Self-awareness of one’s core strengths is as crucial as customer focus, and Shopify’s failure to stay within the bounds of what it could effectively manage as a software company resulted in significant setbacks.
Generative AI excels in codebases that are relatively clean, well-structured, and adhere to best practices. In these environments, LLMs can easily follow patterns, understand context, and generate useful suggestions or boilerplate code. Companies with younger, high-quality codebases will likely see the biggest productivity gains, as AI tools can navigate their code with precision and speed.
The team at Gauge.sh see the same thing:
There is an emerging belief that AI will make tech debt less relevant. Since it’s getting easier to write code, and easier to clean up code, wouldn’t it make sense that the typical company can handle a little more debt?
The opposite is true - AI has significantly increased the real cost of carrying tech debt. The key impact to notice is that generative AI dramatically widens the gap in velocity between ‘low-debt’ coding and ‘high-debt’ coding.
Companies with relatively young, high-quality codebases benefit the most from generative AI tools, while companies with gnarly, legacy codebases will struggle to adopt them. In other words, the penalty for having a ‘high-debt’ codebase is now larger than ever.
High-debt codebases are often a patchwork of custom solutions, undocumented hacks, and interdependent modules. This complexity is kryptonite for AI models, which struggle with code that deviates from standard patterns. LLMs are trained on large datasets filled with best practices and typical coding paradigms, not on the idiosyncrasies of a company’s specific, legacy system. As a result, AI-generated suggestions are more likely to make faulty assumptions, or even worsen existing issues in high-debt environments. In practical terms, this means the productivity boost that AI offers in low-debt environments simply isn’t likely to translate to high-debt codebases.
This widening gap has turned tech debt into an arguably more urgent and strategic problem. Back in 2003, Nicholas Carr's provocative Harvard Business Review article “IT Doesn’t Matter” argued that as IT became ubiquitous, its strategic value diminished. Carr's point was that once a technology is available to everyone, it ceases to be a source of competitive advantage.
While Carr was correct about IT’s ubiquity, he couldn’t have predicted how this ubiquity would lead to layers of accumulated complexity in codebases. Today, many companies, especially in industries like finance, are shackled by these legacy systems. For decades, banks and investment firms poured billions into proprietary trading systems, risk management platforms, and customer-facing applications. These were the “crown jewels,” meant to give them a competitive edge.
But ironically, the very systems that once differentiated them have now become corporate concrete shoes. They are mired in layers of custom code that can’t easily be modernized or replaced. Rather than enabling innovation, these systems prevent it. Companies find themselves allocating massive resources just to maintain the status quo, with little bandwidth left for new projects or innovation.
The problem isn’t just that tech debt exists; it’s that the cost of carrying it has escalated. Generative AI tools are a force multiplier, but only for those who are already well-positioned to take advantage of them. For companies with modern, well-maintained codebases, AI is a powerful accelerator. For those with tangled, legacy systems, AI is not a shortcut but a spotlight, highlighting the inefficiencies and fragility of their code.
In this new reality, tech debt is no longer just a drag on velocity—it’s a strategic risk. Companies that fail to address their tech debt may find themselves falling further behind, not just in their ability to deliver software but in their capacity to leverage the next wave of AI-driven innovation.
New AI models are saturating existing benchmarks, highlighting the need for new ones.
Companies conduct “evaluations” of AI models by teams of staff and outside researchers. These are standardised tests, known as benchmarks, that assess models' abilities and the performance of different groups' systems or older versions. However, recent advances in AI technology have meant many of the newest models have been able to get close to or above 90 per cent accuracy on existing tests, highlighting the need for new benchmarks. “The pace of the industry is extremely fast. We are now starting to saturate our ability to measure some of these systems [and as an industry] it is becoming more and more difficult to evaluate [them],” said Ahmad Al-Dahle, generative AI lead at Meta.
See also LiveCodeBench, covered later in this stream.
Researchers examining how GitHub’s Copilot affects distributed work patterns found several key trends in developer behavior. High-performing developers shifted toward core coding and away from project management tasks. This shift stemmed from increased autonomous work (requiring less collaboration) and more exploratory behavior rather than exploitative approaches. The study also revealed that AI’s influence on task distribution had a more pronounced effect on developers with lower skill levels.
This study seeks to shine light on the importance of AI, and in particular generative AI and its consequences on distributed work. Going beyond the first-level understanding of whether or not it increases productivity, we dig deeper to understand how it changes the nature of work processes of adopters. We find that top developers of open source software are engaging more in their core work of coding and are engaging less in their non-core work of project management. Both of these main effects are driven by two underlying mechanisms — an increase in autonomous behavior (and a related decrease in collaborative behavior) and an increase in exploration behavior (and a related decrease in exploitation behavior). In particular, the reduction of the need to collaborate with other humans leads to humans circumventing collaborative frictions and transaction costs that would otherwise occur during their work. We further find that the programming generative AI Copilot shifts the task allocation of developers with lower ability more than those with higher ability.
In highly specialized trading systems development, current LLMs fall short of practical utility. Without training data from modern financial trading architectures—which companies closely guard as proprietary—these LLMs cannot provide effective assistance. My testing has shown no current LLM can correctly implement complex systems like multiprocess Aeron-based architectures.
Tariffs are a cost on American companies and consumers, and could make or break the United States' ability to compete in artificial intelligence.
Congress does not need to approve tariffs with existing law:
On the campaign trail, candidate Trump was a vigorous advocate of increased general tariffs ranging from 10 percent to several hundred percentage points or more. While some assume approval from Congress might be required for any tariff increases, this view is misguided; existing law, in fact, enables swift presidential action on tariffs.
This matters because the USA relies so heavily on imports for hardware and semiconductors:
In a September 2024 report, UBS, an investment banker, predicted both tech hardware and semiconductors to be among the top four sectors that would be hardest hit by a general tariff. Their analysis is spot on. Many of the hardware components that make AI and digital tech possible rely on imported materials not found or manufactured in the United States.
Perhaps Elon will be able to inspire the necessary policy adjustments to prevent a tariff-induced hit to the nascent AI industry.
Another use of formal verification, to great claimed benefit, in the realm of distributed systems. I have yet to see anyone in finance using any formal verification techniques.
Design reviews, code audits, stress testing, and fault injection are all invaluable tools we use regularly, and always will. However, we’ve found that we need to supplement these techniques in order to confirm correctness in many cases. Subtle bugs can still escape detection, particularly in large-scale, fault-tolerant architectures. And some issues might even be rooted in the original system design, rather than implementation flaws. As our services have grown in scale and complexity, we’ve had to supplement traditional testing approaches with more powerful techniques based on math and logic. This is where the branch of artificial intelligence (AI) called automated reasoning comes into play.
Their use of automated reasoning focuses on code correctness and maintainability rather than performance optimization. While I initially expected it might automatically identify efficiency improvements, its primary value lies in making the codebase more robust and adaptable to changes.
The reason is that the bug fixes we make during the process of formal verification often positively impact the code’s runtime. Automated reasoning also gives our builders confidence to explore additional optimizations that improve system performance even further. We’ve found that formally verified code is easier to update, modify, and operate, leading to fewer late-night log analysis and debugging sessions.
One other challenge I have with formal verification is that I have yet to see source code that is a perfect implementation of the formal specification.
A completely open source code generation LLM family. Their paper provides performance results against HumanEval, MBPP, BigCodeBench and LiveCodeBench (mentioned earlier in this stream). Qwen seems to be the best performer, but having access to the training data and the ability to reproduce the results is a big improvement over other open source models.
OpenCoder is an open and reproducible code LLM family which includes 1.5B and 8B base and chat models, supporting both English and Chinese languages. Starting from scratch, OpenCoder is trained on 2.5 trillion tokens composed of 90% raw code and 10% code-related web data, reaching the performance of top-tier code LLMs. We provide not only model weights and inference code, but also the reproducible training data, the complete data processing pipeline, rigorous experimental ablation results, and detailed training protocols
For Java, the 1.5B model is on par with Qwen, but the 8B model is a bit behind.
I tested the 8B model (including the Q6_K, Q8_0 and F16 variants), and while it gave a workable (but not great) answer to the prompt “write a Java function to connect to Aeron and send Hello World over a publication”, on one occasion it tacked on a whole lot of additional complaints about PRAM and SMC resets on Macs running El Capitan 10.11.6 and finished off with a recommendation to upgrade my macOS to the latest version. That amount of hallucination was a first for me.
A team from IIIT Delhi published a paper on the performance of several LLMs at generating unit tests.
ChatGPT operates with a focus on understanding and generating content in natural language rather than being explicitly tailored for programming languages. While ChatGPT may be capable of achieving high statement coverage in the generated unit tests, a high percentage of the assertions within those tests might be incorrect. A more effective approach to generating correct assertions would be based on the actual semantics of the code. This presents a concern that ChatGPT may prioritize coverage over the accuracy of the generated assertions, which is a potential limitation in using ChatGPT for generating unit tests, and a more semantic-based approach might be needed for generating accurate assertions.
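To make that failure mode concrete, here is an illustrative example (entirely made up, not taken from the paper) of a generated test that exercises the code under test yet encodes the wrong expected value:

```python
import pytest

def apply_discount(price: float, percent: float) -> float:
    if percent < 0 or percent > 100:
        raise ValueError("percent out of range")
    return round(price * (1 - percent / 100), 2)

# Plausible-looking "generated" tests: together they cover every statement,
# but the first one asserts the wrong value.
def test_apply_discount_generated():
    assert apply_discount(200.0, 15) == 185.0   # wrong: the code returns 170.0

def test_apply_discount_rejects_bad_percent():
    with pytest.raises(ValueError):
        apply_discount(200.0, 150)

# A semantically grounded assertion, derived from what the code actually does.
def test_apply_discount_correct():
    assert apply_discount(200.0, 15) == 170.0
```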
When prompted to generate tests, ChatGPT will happily produce tests that raise code coverage to the specified level, but the assertions are often incorrect. I’d be curious to see how the results would change if the study were repeated with a more recent model. Which brings me to a frustration with papers like this: the paper was published in 2024, but studies ChatGPT running GPT-3.5, which was released in late 2022. Academic papers are often published long after the LLM models they study have been superseded. While still useful in the academic context, it means that companies wishing to adopt the latest LLM models are left to their own devices to figure out how to do so and how to interpret the results.
An interesting post from the Google Project Zero team on how they used an LLM to find real-world vulnerabilities in SQLite. I expect the cost implications of this approach will be significant, but it’s an interesting outcome.
Today, we’re excited to share the first real-world vulnerability discovered by the Big Sleep agent: an exploitable stack buffer underflow in SQLite, a widely used open source database engine. We discovered the vulnerability and reported it to the developers in early October, who fixed it on the same day. Fortunately, we found this issue before it appeared in an official release, so SQLite users were not impacted.
The approach uses an agent model, which I expect will be a growing trend across businesses in the coming years.
It appears LLMs have developed quite the appetite for LeetCode-style puzzles, becoming remarkably good at solving them. Given that most LLM benchmarks test against similar puzzle-based tests, this overfitting is not that surprising. Unfortunately for actual software developers—who spend surprisingly little time reversing binary trees in production—this expertise is somewhat less useful.
Enter LiveCodeBench: a benchmark that continuously harvests fresh programming problems, ensuring LLMs can’t simply memorize their way to algorithmic enlightenment while leaving real-world programming challenges in the dark.
LiveCodeBench is a holistic and contamination-free evaluation benchmark of LLMs for code that continuously collects new problems over time. Particularly, LiveCodeBench also focuses on broader code-related capabilities, such as self-repair, code execution, and test output prediction, beyond mere code generation
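In practice, the contamination control amounts to filtering problems by release date against a model’s training cutoff. A rough sketch of the idea; the dataset path and field name here are assumptions based on how the benchmark describes itself, not verified identifiers:

```python
# Sketch: evaluate only on problems released after a model's training cutoff.
# The dataset path and "contest_date" field are assumed, not verified.
from datetime import datetime
from datasets import load_dataset

CUTOFF = datetime(2024, 1, 1)  # hypothetical training-data cutoff for the model

problems = load_dataset("livecodebench/code_generation_lite", split="test")

fresh = [
    p for p in problems
    if datetime.fromisoformat(p["contest_date"]) > CUTOFF  # assumed field name
]
print(f"{len(fresh)} problems post-date the cutoff and are safe to evaluate on")
```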