AI Trends 2026: OpenClaw Agents, Reasoning LLMs, and More [Sebastian Raschka]

--- Summary:

Video: AI Trends 2026: OpenClaw Agents, Reasoning LLMs, and More [Sebastian Raschka] - 762
Views: 1,221; Published: Feb 26, 2026
The “reasoning revolution” shifted focus from pre-training to post-training, utilizing verifiable rewards in math and code to scale model intelligence without human-labeled data.
Inference scaling, including extended “thinking” time and parallel techniques like self-consistency, allows models to trade increased compute for significantly higher accuracy.
LLM usage is evolving from simple chat interfaces toward agentic loops and multi-agent systems that autonomously interact with tools, execute code, and refine their own outputs.
A primary practical takeaway is using LLMs to build custom, deterministic software tools (e.g., Mac apps or scripts) to automate specific professional workflows rather than relying on LLMs for every task.

--- Transcript (English captions, best effort):

the R&D like the research and development of the focus of the research team I I think it’s more focused nowadays on the post training like getting more performance out of that because it’s more like the newer paradigm and there are still lowhanging fruits to be picked where in pre-training it’s already pretty sophisticated and you will still get better results if you use more data optimize the data mix maybe multi-token prediction and these types of things but most of the interesting things are happening now on the [music] post training front in the reasoning realm basically. So I think we will see more there. [music] All right, everyone. Welcome to another episode of the Twimmel AI podcast. I am your host Sam Cherington. Today I’m joined by Sebastian Rashka. [music] Sebastian is an independent LLM researcher. Before we get going, be sure to take a moment to hit that subscribe button wherever you’re listening to today’s show. Sebastian, welcome back to the podcast. It’s been a little bit. >> Yeah, thank you for inviting me back, Sam. I’m happy to Yeah. be back and to chat about LLMs, AI, and whatever you have in mind. I had a lot of fun last time, so I hope we can make it fun and interesting again. >> You know, my joke around this time, it’s getting a bit old, but it’s like, uh, the last time we spoke was three years ago. Not much has changed, right? [laughter] >> Well, all good good things come in threes. I think there’s a saying, right? [laughter] >> And in fact, a ton has changed. And we’re going to be focusing on the most recent and most important of those changes. In particular, what’s new with LLMs and what to expect with LLMs in 2026. This is an area that you spend a lot of time focusing on with your research and education work. Um, you know, maybe we can start with just, you know, kind of top of mind like if you think about, you know, very big picture where we are now compared to where we were a year ago, you know, what what uh is your broad reflection about the evolution of the space? >> I would look at today compared to one year ago. It’s almost like the anniversary of uh DeepSeek um the big DeepSseek version 3 model accompanied by the R1 model. The reasoning I would say reasoning revolution in quotation marks. It’s still LLM. It’s um still the same base model but we have now more techniques on top of that to make the models smarter in terms of solving more complex problems. And um the other so I would say architecture wise LLM architectures are looking still relatively similar but the reasoning training is one of the new things if we compare today to last year and then also I think there’s a more heavy focus on tool use. So back then uh when chat GPD was launched or also the first um iteration of LLMs the focus was mainly on uh general purpose tasks but then also having the LLM answer all the things we are curious about like from memory like if we ask it a math question or a knowledge question the LLM would basically draw from its memory and then write the answer but that’s not always uh let’s say the most effective or accurate thing to do similar for for us humans. Um I mean LMS are different from how humans think but we as humans if you asked me a complicated math question like or just like a multiplying two large numbers I would pull out my calculator and calculate that on a calculator. I wouldn’t do that in my head. Um I maybe could but it would take a long time. It’s more errorprone and so forth and there’s no need to do that. And the same with LLMs. Now with more modern tooling it becomes more and more popular to use or to have the LLM use tools too. Uh it requires training the LLM to use those tools but with that um I think we can reduce like hallucination rates not completely getting rid of those but reducing those and then also making answers more accurate. So uh and then with reasoning uh capabilities it’s essentially giving the LLM more time in quotation marks to think through a problem. So these are I think the two main um I would say um yeah knobs that we can tune and to make progress on in the last year if we look particularly like last year and now um the difference yeah >> we’ll dig into the technical aspects of like how we’ve evolved in reasoning and how we’ve evolved in tool use among other things but before we do that I was thinking it might be interesting to talk a little bit about from a practical perspective, how do you think where we are today is is different and and has shifted and you know it’s it’s super interesting. We’re talking in kind of you know second week of February and already this year in 2026 there’s been a ton of uh you know new news, new models uh Opus 4.6, six open AAI 5.3 um you know there’s been the whole open claw multbot you know talk a little bit about you know the the what we’ve seen already this year uh but in the context of like where you see LLMs are from a practical perspective >> um yeah I would say yeah that’s a good point we are just in the second uh week of February and that means the Chinese New Year hasn’t even uh like occurred where I think there will also another batch of releases but I think um like on the open weight front but I think uh that that is like a separate thing where um you have now companies developing the tooling around LLMs that is becoming more and more mature and then you have better LLMs yourselves and I think I would also almost like separate those two. So my hypothesis is if you would take the best um open weight LLM and put it into let’s say a CHPT or Gemini or claude um interface you would almost get the same type of quality performance and everything like where I think a lot of uh use cases evolve around the tool wrapper around the LLM nowadays. That’s this idea that was popularized kind of towards the end of last year around harness engineering. >> That that is also something how we changed using LLMs because before it was just simply um yeah like a very simple pet interface. Yeah. Yeah. And then it became you know more sophisticated. You could upload files and PDFs. And so for my personal use case, I use LMS mostly uh for like actually it sounds weird but like proof reading uh checking things and these types of things. So just uh before we recording here I was finishing uh writing a chapter and I wanted to update the table of contents and then I just uploaded the PDF to the chatbd interface and say hey can you give me the headers so I don’t have to pull that um out myself and then you can just double check also that it is correct but like little convenience tasks like making work a bit simpler like these um tedious things. Um but then like you said there was also the new Oppos model and then um JPD released um codeex 5.3 and Mac OS app with that and I think that is also like yet another leap in terms of what um these models are capable of. I mean before there were also coding LLMs and it became more popular to use LMS for coding but it’s always you know more and more and getting better and better and so uh before I used um Visual Studio Code uh I mean because I I uh just used Visual Studio Code the code editor for like years like maybe five years now 10 years and before that I was using Vim and other things but I’m very familiar with the um the UI basically. So I have my git tree. I I know where I have a terminal inside and and that stuff. And so I actually liked uh having the LLM as a plug in there where you sometimes can say, “Okay, can you uh I have a buck. Can you just double check?” You know, it’s just like another layer of um tools you add to your workflow. So the LM doesn’t have to be front and center. It can be also this little helper, you know, like uh before you I mean I still debug things myself, but often it’s actually quite nice and fast to ask the LLM to double check things. And um what I like about it is when it’s like a second pair of eyes, but it’s also like um it’s not completely taking over and doing everything. It’s but it’s making your work better in a sense like you have additional checks and you can ask hey um can you suggest um improvements to make my code let’s say more performant still I mean you as the person you still have to um kind of ask the right questions and you still have to run actually the experiments to see whether it actually makes the code faster. So it’s it doesn’t mean like the LM does everything for you but it suggests useful things. I know a lot of people also use it for coding things. Um so that also works with the new um let’s say the codeex plugin but also the codeex app. What’s new is I mean a year or two ago people were uploading code files to chat GPT or um Gemini or claude and then getting some feedback and then you had to manually incorporate that and now it’s more in line like you >> I think it’s been a while since folks have been doing that. >> Yeah. Right. So that is uh that is uh now more native where you can see the file diff uh you don’t have to leave your coding environment but then on top of that also now when you run these tools locally you can give it access to your to your whole folder let’s say your whole git folder and then it can see the context of all the files you don’t have to manually upload anything and then on top of that uh so you it can also nowadays use tools itself so it you can give permission for to the LLM to run certain commands to run a unit test by itself and these types of things and that um together I wouldn’t say there is a single thing that is like groundbreaking or like a gamecher but all these little things they add up to make the LM more capable because it’s more and more getting more and more sophisticated and I think that’s what we have been seeing um in recent um uh months and yeah maybe yeah the last few quarters months where people like they develop these these types of cap capabilities instead of just making the model better. So there’s a lot of performance we can get from the LLM by yeah making the the the interface better basically. Yeah. Have you found that either of these new models um you know you you just said that you know there’s no break you know breakthrough changes there but do you find uh did you find yourself surprised with you know some new capability in either of these two models or is it you know very much incremental to what you’re already doing? >> For me it’s personally more incremental. Um, it’s just more like the convenience. They’re just getting more robust and and better where I wouldn’t say there is anything where it’s like there’s no wow effect to me like where it’s like oh my previous model was not able to do XYZ. It’s just a bit better you know it’s it’s getting more robust better and then I also develop a bit more trust in the results. It’s more like a gradual uh improvement I think. Um the one thing is still we still have the distinction also between the different uh reasoning efforts in terms of uh it’s like a a slider in terms of how like how much time the LLM should spend on getting you the results and there are different you know settings from low or no reasoning effort to high reasoning effort and that changes the time it takes for the LLM to generate results and I remember like half a year ago or a year ago if you wanted to have good results you almost always had to use the highest settings uh the high uh reasoning modes which took forever. Nowadays, uh even the lower modes I feel like are pretty good like where for most tasks it’s sufficient to use these um medium high reasoning efforts instead of the extra high ones and then you get results faster and I think that’s that’s also like a quality of life improvement for these models where um before you ran them maybe occasionally because you don’t want to wait five minutes but now it becomes more routine that they are part of your uh workflow basically. I would expand on that and say that the LLMs have gotten really good at themselves knowing how much effort is required to provide a a good answer to a a query. And so I find myself, you know, in the vast majority of times just, you know, typing my prompt into, you know, chat GPT, for example, and not specifying a model or level of thinking and letting it figure it out. And if I want more, I’ll tell it I want more. Um, but it does a fairly good job of determining when to just give me a quick answer, when to use a search tool, when to, you know, you know, do more thinking, that kind of thing. >> I agree. I have my setting on CPT an auto the auto mode where it automatically or by itself decides whether it should use more or less thinking efforts. The same thing. The only uh context where I still use the pro mode is when uh coming back if I uh to the chapter I mentioned when I have a chapter written like a 40page PDF I would upload it there and say hey can you check for any inconsistencies uh incorrect numbering and and all that type of stuff and then I set it to the pro mode uh like the one that takes 20 minutes I go have lunch or have dinner come back and look at the results where there it’s like it’s like a rare thing that I do that I mean once a month I finish like a chapter or something and or like if I write something important where I want like the maximum um let’s say quality check on that but like you said for most tasks it’s sufficient to use the the light effort. Yeah. Or the automatic one where it decides by itself essentially. Yeah. >> And I mentioned Moltbot and the release of that tool. Have you spent much time digging into that? >> Well, yeah. Multbot uh I think now now called Open Claw. >> Open the name. Yeah. Yeah. it changed a bit quite a bit. Um, it’s it’s interesting. It’s like this local uh agent that people can now run on their own computers. Um, where I I think that what I find interesting about it is it it gets people excited about things. Uh it’s um almost like back then when uh Deep Mind had Alph Go like the go playing um like the it’s like a board game like the go playing um yeah model where yeah it it got really exciting because not people people I mean I not many people let’s say in the grand scheme of things that at least in my uh my circles who played go before but it got people like my family and everyone really excited to see this type of progress when it was playing against the world champion. I think with Moldbot, it’s kind of like similar where it gets people interested in checking these things out and excited. Um, I think there’s also a lot of genuine use cases around it where you can run it u I mean to organize your calendar and emails. For me personally, that’s something I have not done. maybe I have a little bit of a trust issue um where I’m like yeah I mean personally it’s like yeah well I don’t know if I trust it enough to do my finances or my calendar uh a bit I’m a bit hesitant still to adopt something like that but I think it’s a cool um demonstration of what and like to kind of like show someone who is let’s say also not developing LLMs what these LLMs can do and what the what the purpose of those is also in a sense. I think I think that’s actually quite cool. Yeah. >> Any other, you know, tools or services that are, you know, largely kind of wrappers around LMS that you have come to depend on or do you find yourself mostly turning to the models themselves or, you know, like the dev environments? Yeah, it’s mostly still uh for me for my workflows, I don’t have anything like super automated where I’m like um you know I need to run something incrementally or in an agentic type of setting. What what I’ve been doing a lot though is developing my own apps uh like productivity apps. I think back in the day I grew up as like a a coder like using bash uh the terminal and Python and that stuff where I was writing myself like for myself like scripts for all kinds of things to automate things. Um and now with LMS um I kind of changed uh that a bit towards um developing native Mac OS apps like I always wanted to learn Swift uh coding coding in Swift. I’ve never had the time because yeah, I mean there are so many other more important things to do for me where like um that was an opportunity to say hey I want this what I have as a script as a native Mac OS app because it’s just more convenient. Um for example just the other day my wife also has a podcast. Uh it’s like a book um book club uh podcast and I help her with the episodes basically um like uploading everything and and that and editing and uh like just a workflow in in general because she’s not like a tech person. And then um I had like a script to add these chapter um marks to the podcast. And now I made a just the other day a native Mac OS app where you can just add the timestamps and click a button and it adds the the chapter marks to the audio file. like simple things like that and then I can share it with her and she can use it now and it’s just like these little quality of life things in in your everyday life where you instead of just doing things manually you can just automate them now I mean not necessarily I mean this is not running the LLM but it’s uh using the LLM to develop something that behaves deterministically in a sense so I I I’m more like of that uh more like a person like that who does that for for example I also have when I read social media feeds I mostly as a researcher interested in papers. So, and I often end up bookmarking a lot of archive links um uh links to PD archive PDFs or the abstracts and then I have my markdown sheet where I have a lot of these links and now I wrote myself a native Mac OS app where I just put in these links and it pulls out the um the title, the date, the author names and the link in in like a nice format and just making my my my life easier so I don’t have to click on them individually. I get a nice list and see the titles and um yeah and I think little things like that I feel like LMS are super cool for like to to develop these tools that I would not have time to develop otherwise basically. Yeah, >> that parallels my experience uh quite a bit. I think some of the the most benefit I’ve gotten out of uh LLMs in the past year or so has been writing kind of custom workflow tools. So uh primarily around the the podcast like one of the things that we would do when we work with sponsors is like pull these analytics reports and you know it was repetitive and um time consuming and so I created a web-based tool that will you know hit the API where we get the analytics um and pull information about episodes and you can choose an episode and then we’ve got uh you know pull a bunch of data into pandas and like do some analysis and then generate a um a spreadsheet like a Google doc and like it’s not it doesn’t the app isn’t using an LLM but an LLM was used to create it. Uh and that’s one example of probably like half a dozen you know fairly significant tools that um you know have a big impact in our workflow. Yeah, and that’s a good point also like that you said again also that in these cases the LLM is not doing the let’s say regular work the task task it’s more developing the tool to do the task and I think that’s also an important point in my opinion that well um the LLM is very useful and very capable but there are tasks where it it’s almost like wasteful to use an LLM for that like it’s like uh it’s like all if all you have is a hammer everything becomes like a nail um type of situation where I do think if you have like a dut deterministic task, it still makes sense to develop a deterministic tool. You can use an LLM for that, but it is almost like wasteful to also even ask an LLM what is oneplus 1 or something like that. You can use a calculator. Um so it’s like I think it’s still important kind of to recognize what is what is the nature of this problem and what is the best tool for that problem basically. Yep. >> I I’ve also done some tools where um I’ll use LLMs in like you know almost like a classifier like a very simple use case. I have one where um you know it’s like there’s the name of the guest uh so you know your name you know and then I pull a bunch of uh recent directories from the Google Docs API and say find the the directory that corresponds to the project for this particular you know guest. Um, and it’s like, you know, a reax or like a text pattern match doesn’t always work because they’re kind of they can be different sometimes. Um, but an LLM like you know can do it pretty easily and you know with a very high level of uh repeatability and and a low error rate. >> Well, it’s like where you need almost like a human or like some less structured approach then LM are great for that. I had actually had a similar uh project as um a college student. I was doing a sports prediction as a side project for fun just for fun like uh daily fantasy sports like predicting outcomes of like how who which player scores a goal in in the Premier League soccer on the weekend uh like daily fantasy sports and um for that um I was also developing this very sophisticated uh thing which was pulling information about the players from different websites and looking at how who’s injured who’s in I know good form and these types of things and for that and um now I kind of revived that project just for fun using an LLM. It’s kind of like the same uh problem you mentioned with the names because um some players have um like the spelling of the name is slightly different. There are these accents over certain letters and sometimes not in certain databases and then just getting them lined up in the database. It’s really hard uh with regax or just deterministic uh things. So that is actually a great use case for an LLM to kind of these unstructured uh almost like yeah vague data uh set parsing things that depend also a bit on the context basically. Yeah. >> So maybe kind of pulling back into like where we are with you know LLMs from a practical perspective. I think both Sam and Seb are using them and very kind of well I think two main things came out of this one. you know, if you um I was going to actually caveat this by saying if you are are a development mindset, but you know, I think we’ve seen with like vibe coding that even less technical people or non-technical people can, you know, get a lot of value by creating custom tools to automate, you know, specific parts of their workflows. So, that’s, you know, a huge thing that I think has been very impactful for both of us over the past uh year or so. Um and uh otherwise you know just kind of uh you know taking advantage of the the improvements in models by you for me it’s just like I can’t really articulate like a rule set but like I you know for if I’m confronted with a particular thing you know I have kind of a soft mental model for yeah I think I’ll start with chat GPT here or I’ll start with Claude for this or that. Um, you know, so I think the the the takeaway is that neither of us are, you know, using uh open claw or, you know, any, you know, particularly, you know, slick LLM wrapper agentic tools with any regularity. Um, you know, maybe the the caveat for me would be something like a circle back or granola to do like meeting summaries, but beyond that, uh, it’s mostly, you know, like you described, you know, use cases through the the native chat interfaces, um, and, uh, the development oriented use cases. >> I would add maybe also one more thing uh, you mentioned. So it’s also most mostly a slider like you can use LM not at all you can do still everything manually then you can only use LLMs like uh I know some people who develop uh let’s say even a company just uh based on um LLM code people call it VIP coding but like I I think VIP coding doesn’t even do it justice anymore but like don’t not doing any manual coding anymore just using LLM like having the LMS build the website the product and everything so like these two extremes dreams and I think we are more like in the middle where we kind of adopt LLMs but we are not like let’s say going full LLM and uh I think for me also I think there is still like a I mean I would say also for people who are nowadays learning how to program and like as a uh like is it worthwhile and I think it is um actually still worthwhile to learn math and coding even though there are LLMs that can do that because it makes makes your life also still more efficient and it makes uh you better at using these LM LM because like an example I had also I was using um an LLM for my website to add a dark mode. So I that’s something I always wanted to do. I wrote the website myself like 12 years ago. But then well I knew uh HTML and CSS and JavaScript much better back then than I do now. And I always procrastinated on adding a dark mode button because it’s I knew it it would take me like a month maybe to do it well or something like that. And it’s not my main let’s say job. So I was like, “Okay, I well, I can’t spend that much time on it.” But then I was like, “Hey, let me try using [clears throat] an LM for that.” And it did a really good job adding it, but it was not perfect. So the button was misaligned and everything. And then I was like, “Hey, make it a bit higher, make it a bit lower, edit, move it to the left and okay, this is actually I thought very inefficient. Why don’t I just go into the uh HTML or CSS file in that case and adjust the settings there?” And because I still knew a bit about CSS files, I was more effective to make these adjustments uh myself instead of having the LLM do everything and just brute force telling the LM, oh, move it that way, move it this way, and I could just just change them on myself and refresh the page and see. And I think in that sense it does make sense still to have like an understanding of how these things work because then there are cases where it is just more efficient to do things yourself um still then prompt the LLM to you know redo everything and so I think like I think the the what I wanted to say is that there’s a there’s a middle ground basically where I do think there’s still value in learning how things work. Yeah. >> I wonder what your experience is. I’ll often see around these new model releases on social media. Oh, I oneshotted this. I oneshotted that. Like, I’m trying to remember the last one uh that that um I had this experience. And then I’ll go and try and oneshot the same thing. And the results that I get are horrible. Like nothing like what is reported in social media. And you know like hey is it me [laughter] or is it just people like reporting these successes for engagement and they’re not really there or they’re fake like what’s your sense for do you experience similar things? Yeah, I would say so. I think that uh I mean I mentioned my native Mac apps even like something I have a Mac app where I just put in a PDF and it exports the PNG, WEBP and PDF versions in a certain uh resolution and it took multiple tries even with uh there was back then codeex 5.2 to get really uh everything all the buttons working correctly. Like you said it was not oneshoted at all. It was multiple iterations to get it to work and uh even something simple like that. And then I sometimes wonder are my instructions maybe bad or uh maybe I wasn’t clear. Maybe you have to say please test everything thoroughly and make sure everything works and blah blah blah. Maybe you have to be super explicit about that and we are not that explicit because we kind of assume it would make sure everything works or maybe these cases we see are just lucky, you know, like sometimes on certain things it just happens to work very well. So, I I don’t know for sure, but I agree with you that um it’s not all what it seems when someone shows you, oh, I one-shoted this. I I don’t think that’s um reflective of how things work today. Um yeah. Uh so, let’s switch gears a little bit and talk through some of the key areas that you expect to see continued innovation around with LLMs in the upcoming year. And then you know for each of them we’ll dig in and talk a little bit about the the recent history and where you expect to see things going. >> I would say it’s still going to be the reasoning uh we can maybe go into more detail there because it’s a very broad topic. So like push uh like pushing more on the reasoning front the post training. Um the second one I would say is also inference scaling like more sophisticated techniques that u they are partly related to training but mostly how to use the LLM after training and then I also think we will see more of this um yeah this agentic type of use like how because right now mostly LM are focused on like a turn by turn and how to like people will double or companies will double down on this uh loop basically running L&M as a loop like mold bot and optimizing for that and I think these three things will be mainly the the biggest I guess focus areas for companies. Yeah. >> Awesome. So let’s dig into reasoning uh to set the stage for where you think we’ll be heading in 2026. What do you think were the big advancements in 2025 around reasoning? So yeah, the biggest advancement was I mean first uh Open AI1 which got everyone excited about it and then um OpenI1 was using both inference scaling and I mean no one knows for sure because there’s no paper but likely also training techniques but then R1 deepse R1 they published their uh pre-train sorry the uh reasoning pipeline and I think that that was like um really something that took off where a lot of other companies also doubled down on that but it’s still very new in the grand scheme of things. It’s just like a year old and there have been um I was recently working on a chapter on reasoning. There were so many improvements to the algorithm. Uh I mean I just the other day compiled a list of 15 different tweaks and improvements from basic uh things changing sequence level lock props to token level but then also GDPO by Nvidia. Lots of progress there and I think we will see more of that. But first also one reason is that with pre-training we have seen basically that I mean it it still works and I think it’s still the biggest part of the pre-training of the whole training pipeline because it’s just so much data and very expensive but um the R&D like the research and development of the focus of the research team I I think it’s more focused nowadays on the post training like getting more performance out of that because it’s more like the newer paradigm and there are still lowhanging fruits to be picked where in pre-training it’s already pretty sophisticated where yeah, you still need a lot of data. You still need a lot of compute, but there is no let’s say like there’s nothing you can really do much there compared to post- training in terms of changing up the algorithms to get more performance. Of course, you can still do that and you will still get better results if you use more data, optimize the data mix, maybe multi-token prediction and these types of things. But most of the interesting things are happening now on the post training front in the reasoning realm basically. So I I think we will see more there. >> On the reasoning front, the one topic that I heard about um I heard come up quite a bit uh last year is the idea of verifiable rewards. And I think that led to a lot of the advancements or contributed to a lot of the advancements that we saw in terms of coding models. Can you talk about that as a paradigm and some of the big milestones that we’ve seen there uh over the past year? >> Yeah, thank you for the question. That’s actually I uh it’s a really really uh important point like so the reasoning training is uh essentially mainly based on verifiable rewards which means um there are tasks where you can verify the answer. So for example in Deepseek R1 the verifiable rewards were um coding and math. So with math um for example you ask the model to output the final answer in uh a boxed format in the latex it’s like a latex command like boxed and um then you can have like a deterministic like a rex or deterministic um code to extract the answer and then you can use something like warm alpha or simplically to a reference answer like if uh 2 over 3 matches 2 over 3 or 4 over 6 matches 2 over 33 it’s essentially the same answer but you can symbolically double check the answer and get a reward signal whether it’s correct or not and uh this is actually great because you can kind of like infinitely uh you can evaluate infinite numbers of answers because before with uh I mean it’s still an important point but the re uh reinforcement learning with human feedback um yeah you need human feedback essentially you can train a reward model to approximate that and it’s part of the training where you get get a score for each answer but it’s not quite as let’s say um accurate as a really correct answer like where you can verify the answer like there’s an absolute it’s math you it’s either correct or not and um you can have if you have something like that where you can uh verify the answer deterministically cheaply you can have the LLM generate infinite answers you can say okay generate 60,000 answers for this problem and then you can calculate the reward on all of them in in a fraction like a very short time it’s still expensive to generate these answers but you don’t have um vagueness and you don’t have to have let’s say human evaluating these answers and so I think that helps um scaling these things and the same with code where with code um in the deepse R1 paper the original approach was to um have the code and make sure that the code compiles basically like if it compiles correctly uh and you can use also a code interpreter for that um I think I mean both are are great but I think this is just the beginning I mean we will probably see this uh being extended to uh to have more than just the correctness of reward. I mean there are already other types of rewards that are being added. For example, a formatting reward where you want the model to use I mean not it’s not required but you some u companies prefer to have the the thinking in the think tag. So they have like a token think and then a closing token think closed like in HTML like the opening and closing um tag. It’s not required but it can be helpful to have it because then you can parse out the intermediate stuff and do something with it where um you can train the model to output this structure and this is like it’s called a format reward. So you can have multiple types of rewards added to this uh thing in in addition to the correctness reward. And I think we will maybe also see interesting things there where people will come up with um formatting rewards or like auxiliary rewards that help the overall model to to learn. And uh one thing is also they tried that in the deepsec R1 paper to evaluate the the answer explanation instead of just looking at the final um score or the if it’s correct or not. Um making sure or like evaluating if the the reasoning like the explanation is correct or not and >> process reward. >> Exactly. Yeah. they use a called process reward model that it’s basically another model that you train to give a score for this explanation but um I remember I mean it’s been a while since R1 came out in the paper um they had a section that they listed that as a failed attempt or unsuccessful attempt so they tried it but they thought okay this um increases the chance of reward hacking and then uh it was just not worth it it’s more expensive it resulted in reward hacking the model exploiting thing because it’s um yeah it’s easier that way for the model to kind of cheat to kind of mislead the model that evaluates the model basically and um so it is still tricky to do that but I think also in the recent months there were some more uh yeah interesting success stories like deepseek math version 3.2 too. They used something like that where they evaluate also like the whole answer with a rubric have another model for that and then they have another model that evaluates that model that has the rubric and so forth. It’s like multiple levels and but that seems to work and they had like ablation studies that showed that this is actually helping and I think we will see also more of that. It’s just like a very new paradigm like u making the reasoning training more sophisticated essentially. Yeah. Right now the verifiers are focused on like math and coding and that’s because you know they’re you know for a given response there’s a concrete uh ability to verify. Do you see this verification paradigm expanding beyond math and code? And and I think you know in part you know the focus on math and code is successful because you know even though not all LLM responses are about math and code those things kind of have an inherent you know logic or um reasoning capability in them and so the ability for the model to reason generalizes to you know non-math encoding problems. Um but do you is do you see a focus on expanding this idea of verification beyond math and code types of problems? >> Yes. So it’s actually a very uh interesting and important point. How can we I mean you mentioned that if you train the model on math problems uh reasoning on math it will also become better at reasoning in general. But then um it would be even better if you have a target domain to train the model specifically on that target domain on reasoning in that target domain. Um I think you you’re right there will be more of that. For me right now I just lack the creativity right now to come up with um example of um problems that can be verified. But I would say um maybe something even like biology related where for like a drug design like a pharmaceutical drug design or protein structure modeling where you have like physical constraints. So there are um like the angles between atoms, they can only have a certain angle and so forth where you could probably have like a physics types type of equation that double checks whether the generated molecule adheres to these certain types of um formats and then have the have that as a form of reward when you’re training the model. Um I mean this is maybe not a typical case of reasoning because well what is the reasoning explanation when you’re generating a model right I mean but um in general like something like that for other fields and in the worst case you can always uh I mean this is more like a rough approximation but you can always train another model that provides the correctness reward this I think this is more challenging though because um it’s susceptible to reward hacking even going back to back in the day uh generative adversarial networks where it’s easy for the generator to collapse. Um you have the discriminator which says um is this image real or generated and then was like the setup where you train a generator to fool the discriminator and the discriminator gets better at distinguishing and you have like almost like a similar setup. you can use it to say give a reward or not but then the model may or may not exploit it at some point learns how to learns a trick if I only generate this one word or something like that uh then I fool that evaluator but I think maybe we’ll see also more more of that like uh developing like AI based uh reward models essentially that can be used then in other fields to train better reasoning models >> beyond increased focus and and tweaks to the uh verification models. Are there other areas that you see as contributing to uh stronger reasoning going forward? Um yeah I do think it’s also um I mean the training is one part but the other one is the inference scaling that uh you can get much uh better performance if you use simple let’s say not simple I mean simple in quotation marks but if you after training if you uh spend let’s say if you have the model if you spend more compute essentially like inference scaling is all about the the definition is essentially spending more compute after training during inference when someone uses the model to generate the answer and you can do it in multiple ways. I mean reasoning models themselves they are already kind of like a form of inference scaling because they generate more tokens than regular models. Uh the explanation has is longer than a regular model provides and but it helps often the LLM to reach the correct answer but this is more like a sequential inference scaling. You can also have parallel forms of inference scaling where you just generate multiple answers and that’s called um self-consistency. So for example, if you have a math problem, you can have the LLM with different temperature settings, answer the question multiple times and then you take a majority vote or something like that. I mean there are different ways you can do it. There’s also there are different scoring methods or other LLMs that uh look at all the answers and give you the uh the most likely correct answer. And with that you can also boost the performance of the model. It’s more expensive though. So it’s always like this um yeah it’s not like a one-sizefits-all. You don’t want to use it all the time. It’s uh you use it when you need it. But I think what will be interesting is to improve the way to tell when it’s leaned. I think when CHP was it 5.1 or five launched they had like that automatic setting that we uh uh talked about in the beginning. it it was very bad at the beginning but I think it got much better over the months and years and I think um I I’m not quite sure we have anything like that in the open source open weight ecosystem maybe listeners may correct me here but I think something like that I can see also being more important because yeah on the one hand we are developing these very expensive models that can solve very hard problems like in this math olympiad but we don’t want to use them all the time because they are slower and more expensive expensive and um there’s also going to be more like a focus at the same time on cheaper models. So for example um just the other week uh Gwen 3 coder next or next coder sorry Gwen 3 next coder came out which is Gwen 3 new Gwen 3 is one of the most widely used open weight models um because they have like a lot of really high quality models in all different types of sizes but they also the next model it is essentially like a hybrid it’s not like a pure transformer anymore it has um like it it’s inspired by state space models to make things cheaper. But then um it’s like always this trade-off. People are developing higher accuracy models, people develop cheaper models. And I think um yeah, I mean one way would be changing the architecture to control the quality and price. The other one is inference scaling. But uh I think right now it’s in the open weight ecosystem. It’s not quite as popular yet. So I think we will also see more more of that in in local tools and and so forth. I don’t know that I know of any like a open-source project or a model that incorporates this but from conversations I do get the sense that a lot of companies that are building around um you know the Quen models for example and these open weight models uh commonly have like a router component in their architecture that tries to assess the complexity or category of a a prompt and routes it to the right, you know, model and prompt uh that is, you know, either most economical or, you know, maybe post-trained for better responses, that kind of thing. My sense is that that’s the common approach to um to addressing this challenge that you’re describing. >> Now that you mentioned that, another example came to mind. It’s the GPT OSS model, the open- source model by OpenAI, which came out last summer. And in that model even if you use a very simple inference uh or like a simple tool like Olama or uh any comparable tool you can set the reasoning effort in the system prompt. So you can say um no mild medium high reasoning effort and then scaling inference um based on the reasoning effort but um I I don’t think there’s any other technique really automatically incorporated like um self-consistency or self-refinement. It’s m it’s mainly you have to as the researcher do it yourself most of the time. >> Can you talk a little bit more about the self-refinement and self-consistency and how folks use those techniques? >> Yeah. Uh so self-consistency and self-refinement are two examples of inference scaling. Um I would say the biggest difference between the two is one is a parallel technique where self-consistency as a parallel technique generates multiple answers. um and you choose let’s say the correct answer based on majority vote or you can have a scorer who assesses the S answers but then people call that technique best of n like best of n answle technique um like classic ensembling almost and the other one is self-refinement where you have the LLM generate the answer and then you feed the answer to another LLM or to itself self um and say um here’s the answer. This is the question. Write a summary if the answer is likely answered correctly and what are weaknesses and like like a rubric almost. You have a you provide a rubric with certain things that the LM should check and then it gives you back a report and says well this is could be better. this is likely incorrect. The explanation doesn’t match the final answer. And then you feed that output back to the original LLM and say, “Hey, um, look at that report and refine your original answer based on the report here.” And often this can lead to the LLM improving its own answer. It’s almost like this phenomenon. Uh, sometimes you ask CHP something and uh, it gives you something and like, wait, that can’t be right. Um I don’t know you ask something about like when was one of a certain model released and you know okay this can’t be right the year is totally wrong and you tell judge be hey you are incorrect you made a mistake and oh yeah you are right I made a mistake and then it tries again and it’s better next time and it’s almost like that same mechanism where um yeah it self-refineses its answers it can I mean based on my experiments it can also sometimes make answers worse like it will overthink or it was originally correct but then it I don’t know the feedback is weird or bad and then it makes the answer incorrect. So it’s not like a like well not a foolproof technique. It’s also with caveats but um in the deepseek math version 3.2 two paper where they had a self-refinement um in a more sophisticated way where they had a third model evaluating the evaluator um they really showed I get a nice graphic or plot where they showed how much the accuracy can improve and uh basically from a I don’t know the numbers off top of my head but the if they cranked up the self-refinement and self-consistency they were able to um have like gold level performance in certain um math competitions which was very impressive given it was still the same model as they used before, but they just cranked up the inference scaling. Basically, >> one thing that’s interesting kind of reflecting on these themes that we’re discussing is just how they’re all very interrelated, you know. So, reasoning is a key theme. Reasoning is enabled by inference scaling. Inference scaling. uh a lot of what we’re hearing as we talk about this is like loops and recursion and uh those kinds of ideas and those are uh key ideas and the third key theme you mentioned which is kind of agentic uses of LLM um you know with that as a segue you know talk a little bit about you know what you’ve seen thus far around Agentic and what you think is exciting in that space >> um I would Yeah, the agentic use cases it’s even like simple I mean no again in quotation marks simple things like codeex or cloud code where it does just multiple iterations to solve a problem. It’s not just like one uh shot. It’s uh like it’s more like uh doing a task um like rather than uh just providing an answer. And I think um I mean multiport would be another example of agentic um systems that have I mean agentic is also like a I would say almost like a not well defined term because people use the term differently but um for this uh podcast maybe we can think of agentic as something that runs in a loop. uh and I think um yeah that is something we will see more of recently cla code and uh GPT 5.3 codeex the app they added a lot of these tasks where you can even schedule something and it does something uh on a reoccurring basis for example and I think uh we will see more of that it’s just like the beginning it will be more like plugins and I mean it’s still the same LLM it’s just like how we use the LLM and um how to get the most out of it out of the context feeding back the context and I think there has not been that much focus on this in the open weight open source community the focus there is more on developing the LLM itself where companies though that like you know openai claude they are more like okay let’s um build these tools so we can actually do more and more impressive bigger things with these LLMs and I think maybe by the end of the year we will have systems that can reliably um book a trip to, you know, some holiday vacation destination where this becomes more more common. I mean, there were already tools that promised to do that. I forgot the names, but I think it was called Devon, something like that. It might still exist or Manis. >> Oh, yeah. Man, right. Yeah. Yeah. But I think it’s just like the beginning. And also, most people, I don’t think they need like a full-blown thing that can do everything. they just maybe need like a plugin for Excel to have certain uh intervals where it updates certain things and then Excel spreadsheets goes into the let’s say goes to the internet and pulls the recent stock price or something like that but like in in a in a type of loop type of setting essentially one of the things we heard a lot about in the context of agentic uses of LLMs I think over the past year maybe two years um is the idea of like multi- aent systems and like decomposing uh a problem into indiv independent agents with kind of their own personas and that kind of thing. And I think that um you know the the whole open clawed uh or open claw uh idea like even like today I’m seeing you know a lot of hey I created my you know AI team and there’s this employee my AI employees right and there’s this employee that employee that employee and they talk to each other using you know Slack or notbook or whatever and like Um what have you seen with regards to um you know kind of from a a concrete you know builder or technical perspective this like multi- aent uh use case are you finding you know folks getting a lot of value out of that >> um to be honest I wish I had a really good answer or interesting answer but this is something I’ve not explored personally where um most of my experiences are like single use case where it’s one LLM that provides solutions or tackles a specific task, but it’s mostly not interacting with other agents. I think um here though I mean I I also see it’s more like a context engineering problem where how do I mean it so the LLMs themselves I don’t think they are the bottleneck. It’s more about how you let’s say provide the result get the results and provide them to another LLM. I mean in in that sense it’s almost like a form of when you do image or video generation where you have one LLM parsing the text or improving the input and then passing that to the L or to the part of the model that generates the output the diffusion model part or the transformer based diffusion part where I think it’s more like a sophisticated form of that how do we provide the right context to the different agents and it could the uh from basic databases to using slack where one our model outputs something there and with the via the API the other model ingests it and I think well that would be that is I think also something that is just getting started also with moldbot and open claw and I think we’ll be seeing a lot more of that but yeah so that that’s all I can say to it because I personally don’t have any concrete experience I haven’t worked on this myself yet do you have a sense for where we’ll see focus and innovation around these kind of aentic uses in the upcoming year or you know maybe what the gaps are what really needs to be worked on in order for them to kind of come into their own. Um I do think it’s still like each LLM has its own kind of like failure rate at some point where so the progress is usually measured. Um how long the LLMs can work autonomously like how how long can they work until they fail. Um and the more models you add, the higher the risk that one of them fails if they depend on each other. And I think um improving the model itself here will also help improving the whole system basically as uh as the main way to improve the performance. Then I can also see right now most uh as far as I know based on what is publicly available these are still the the vanilla LLMs that are in claude or in other um APIs they’re not specifically trained uh to interact in a multi- aent setting and I think uh in that sense uh if you prepare data for training these agents in a multi- aent setting like a fine-tuning type of situation I think you can also get more performance out of them. We have seen that I mean even for simple things like um codecs um so uh GPD 5.2 or 5.3 codeex is not the same as GPD 5.2 and 5.3. These are models that they um forked off and then specifically um trained to work with the codeex app basically and I think um something like that we will also see for these agent models. It’s just harder for let’s say the consumer to do that because you don’t or we don’t have access to these models. So we kind of like dependent on the person who owns the LLM who hosts the LLMs to to do this type of training basically. So yeah I think I can see companies also developing something like this. I mean, if I had to bet uh Claude and OpenI, they really paid attention to what Moldboard or Open Claw is doing and maybe coming up with their own version of that that is maybe even more capable because they control the model and they can fine-tune it for certain interactive multi- aent types of environments. Yeah, one of the things that’s interesting looking back is that a lot of the, you know, things that we might look back and see as uh big advancements over the past year or two years from an architecture perspective, they’re relatively incremental. like the fundamental core architecture, you know, there have been a handful of proposals of like where we might go beyond LLMs, but the the core has been fairly stable. You know, do you do you agree with that? Where do you see um you know, how do you think about the the future of LLM architecture? >> Yeah, that’s a interesting question. So, I would say everything I’m saying here with an asterisk because uh Deepseek version 4 is not out. it might change everything completely in terms of what [laughter] I’m saying. But if we just look at 2025 up to the second week of February, I I don’t think there were any fundamental changes in terms of the state-of-the-art architecture. So I think one thing we have to distinguish between is like there are architectures that are more geared towards doing the same thing more efficiently and there are architecture changes that are geared towards let’s get more modeling performance accuracy out of the model. If we for first if we look at those models that push the state-of-the-art the modeling performance there haven’t been that many changes really recently. I mean um looking at 2025 uh mixture of experts uh models have been making a comeback. I mean there were other models like mixtrol and uh deepsee before but they really became popular after deepseek version 3 came out and deepse version 3 ca became popular because of deepseek r1 which is basically a fine-tuned version or trained post-trained version of deepseek version 3. Um but then a lot of companies adopted this architecture. I think Kimmy uh straight up used that architecture and they scaled it from 670 billion to one trillion parameters or even like the European company Mistrol AI they used the deepseek version 3 architecture. So a lot of people are I would say they are not gambling in that sense in let’s try something different. they will or they take something that works and try to make progress or differences in terms of changing the data and the the algorithms. Um but that doesn’t mean there are no new ideas. So I mean deepseek version 3 besides the mixture of experts they did have uh the multi head latent attention. I think it was also in one of the previous papers but multi head latent attention is essentially like a tweak of um the attention mechanism where you have like a intermediate smaller compressed state of the uh keys and value uh queries and keys no sorry keys and values keys and values I know all of them actually are but the keys and values are important to compress because then your KV cache becomes smaller uh so you don’t store the full keys and values in the KV V but a compressed form but then you reconstruct um the keys and values from the compressed form in inference. So you are basically trading off compute with memory. Uh but but also maybe to explain this a bit better you can think of it as a lura like the low rank adaptation. Um so basically you project it down into a compressed space and then you project it up again. So that’s basically a multi latent attention. That’s so that’s like an interesting tweak. um I think in 2025 26 that people adopted and then uh it was again deepseek version 3.2 too that had another um I would say tweak um sparse attention. I mean sparse attention is also not new but there’s always been this um like research on how we make attention cheaper because it scales quadratically with the sequence length and there have been hundreds if not thousands of papers. Um but you know with papers I’m always a bit um I mean the ideas are interesting but I’m always a bit um careful and I’m waiting always to see that in in production in quotation marks but I mean is uh to see it in a flagship model like uh because the idea might work well if you are only focused on a small model but things may fall apart once you scale a model to 500 billion 600 billion 1 trillion parameters. So, and with DeepS here is a nice um case study because they do have this flagship model and if they use something in that flagship model, you basically know it works at scale and they have uh their own version of sparse attention. uh I think they call it literally deep sparse attention and it’s um yeah it’s uh instead of so they have like a lightning indexer like a small cheap um model in a sense instead of for one token have paying attention to all the previous tokens it’s more selective it selects which tokens it um pays attention to. So it’s kind of like a mask. So you are calculating a mask over all the tokens to select a subset to make it cheaper to make it um scale sub quadratically basically. And um there have been like these types of tweaks but it’s not fundamentally changing how attention works. It’s still the same attention mechanism but how do we yeah how do we make it cheaper basically? Um so I think that is something um where people hone in on what works at the moment but we will see maybe in 2026 maybe one of the flagship models will have a fundamentally different approach. Um little companies I mean not little sorry companies but little changes have al also been made in terms of um alternative architectures. So we mentioned Quen 3 earlier. So Quen 3 is one of the flagship models. Uh right now it’s maybe not at the top anymore because it’s a bit older. It came out uh in summer but usually when Gwen 3 or when Gwen models came out they are usually top of the leaderboards. Um they had also a parallel version of their model. Uh they called it Gwen 3 next and that one tried something different. they had like a hybrid attention mechanism with um a gated delta net um basically to to have more like a statesbased model approach where it’s more like linear where people are trying but it’s not necessarily like their flagship model like they are in parallel trying other things and I think this makes sense because um yeah you don’t want to put all your eggs in one basket basically you want to uh have a good model and then maybe on the side try something and then maybe scale it up later if it works well. >> Yeah. What about continual learning? That comes up frequently as an opportunity um you know particularly before we got really good at incorporating in tools and the ability to do searches because models would get stale um you know very quickly. Uh but you know there’s still this interest in having a model that you know we can you know we can keep its training data updated. We can delete I we can delete things uh we can incorporate new knowledge like do you foresee significant innovation in that area? Yeah, I think this is like maybe the biggest uh dream in in the sense of like hey, how can we make the model improve itself? Like the the biggest uh I guess achievement right now that could be made if that gets if someone finds out a way that this works. But I think right now there is no even no pathway to this uh like there’s no yeah there’s nothing really that is like where you would say oh that’s the thing that will give us reliable continual learning but that being said I think I mean there are already forms of continual learning I would say I mean even something like um well I mean it’s more like controlled like instead of the model automatically updating itself people would collect data uh from like the recent internet or recent uh tasks and then carefully update the model essentially. I think it’s more like that where um it’s not that we don’t update models but we also don’t do it fully automatically. It’s like a semi-automatic almost type of thing. I think also it’s it’s like that not only because that’s more reliable. So because yeah it’s risky to just update a model um on on new data but it’s also because of resource constraints because um for example I don’t know how many copies of the model OpenAI has but um I mean it’s not you you can’t definitely not have a single copy per user that would be way expensive. I mean everyone would have to have a little supercomputer at home or like a 50,000 $100,000 and be a lot of work and it’s it’s it’s I mean even though it’s simpler nowadays to train your own LLM. It’s not something you can do routinely on a weekend. But the goal of that book was to teach people still how that workflow works to understand how LLMs work because that help you that helps you to let’s say use LLM better to understand what is the context what’s the limitation of the context or how does you know attention work and why is it more expensive if my input gets longer and it’s it’s just like if you build the LLM yourself you you kind of like get a real clear understanding compared to just uh explaining it in a more I would say free form based approach and so um yeah a lot of people like that and uh it’s like a very popular textbook also for teaching now and I was then really excited to kind of because it’s only one book I could only cover so much to work on the SQL so uh right now I’m working on build a reasoning model from scratch which is kind of like the sequel there’s no overlap between the books it’s basically I mean it can be read as a standalone book but it’s mainly focused on the reasoning techniques we talked about the uh reinforcement learning with verifiable rewards the gRPO algorithm inference scaling like all these techniques that once you have a pre-trained LLM so the book starts by starts given um so there’s a given pre-trained LLM we use quen 3 the smallest quen 3 model and then adding inference scaling and uh the reinforcement learning so the first uh 360 pages are already uh in the early access and I’m hoping to finish I mean there’s only one more chapter left uh by April Well, I mean the chapter is a lot of work because you have to run all the experiments. So I’ve been running a lot of experiments especially for the gpo algorithms because there have been so many um different papers and improvements and trying them out in practice. It’s been a lot of fun but uh it’s also a lot of work. So um yeah um so I I’ve been mostly running experiments in the last um couple of weeks and months and yeah it’s quite exciting actually. And so can folks pick up the second book and run with that or do you expect folks to have read the entire first book before they start with the second? >> Um, I would say either way works. You don’t have to read the first book if you um, so the second book, it uh, uses a pre-trained LLM. So you don’t have to pre-train your own LLM or you don’t need the first LLM to uh, sorry, the first book to train the LLM for the second book. So it’s kind of independent like that. But the second book doesn’t explain in detail the pre-training or the ar architecture. I mean I have an appendix on explaining the architecture but it’s not quite as detailed as the first book. So um I think if people want to understand the whole let’s say the whole life cycle of anm from pre-training to postraining I I think it would make sense to read them sequentially but you could also start with a second book learn about inference scaling and reasoning and then if you’re interested in the pre-training you can fill in the gaps later on. I think either way works basically. >> Well, very cool, Sebastian. It’s been great catching up with you and we need to do it more often than every three years. Um, but thanks so much for jumping on and sharing a bit of your perspective on kind of where things [music] are and where things are going. >> Yeah, thank you so much for the invitation, Sam. I had a great time. I love talking about LLMs and AI. So, well, that was a treat and thanks for having me on. >> Thank you. >> [music]

Keen's Clippings

Explorer

AI Trends 2026: OpenClaw Agents, Reasoning LLMs, and More [Sebastian Raschka] - 762

Graph View