
AIAW Podcast
E160 - AI Summer 2025 Recap: GPT-5, ChatGPT Agent & More
Is OpenAI preparing to monetize with ads? Was the latest GPT-5 release a strategic decision or a true technical leap? In this Pre-Season 11 Summer Special of the AIAW Podcast, Anders Arpteg and Henrik Göthberg are joined by Jesper Fredriksson (AI Engineer Lead, Volvo Cars) and Robert Luciani (AI Wizard, Negatonic AB) to unpack the biggest AI news from June–August 2025. We dive into GPT-5’s benchmark and coding performance, the hunt for better AI metrics, Perplexity’s bid on Google Chrome, Meta’s superintelligence push, Tesla’s AI chip pivot, the Swedish Prime Minister’s use of AI, and who might win the global AI frontier race.
Follow us on YouTube: https://www.youtube.com/@aiawpodcast
You know it's like with range anxiety in cars. Most people overestimate how much range you need. This thing has 30 kilometers and you know in the city nothing is more than two kilometers away. I've always liked the idea of having an inner city that's car-free. I mean, wouldn't that be nice? That's my prettiest work.
Anders Arpteg:So how far can you go?
Robert Luciani:30 kilometers. Oh, 30 kilometers, okay, cool. Yeah. So I charge it maybe every fourth day. And you haven't crashed yet? I've bumped into things, gently. Did you wear a helmet? No. Did you fall? Never fell, but I looked uncool. So you know, you try to stop gracefully, and it doesn't always work.
Anders Arpteg:Cool. So, Robert, you had a summer of, what do you call it? Electric skateboard summer. Skateboard summer.
Goran Cvetanovski:Yes, it was good.
Anders Arpteg:Jesper, did you do anything special this summer?
Jesper Fredriksson:I think. To continue on the cool devices, I was trying out my Meta glasses. Oh yeah, they're not so new but they've been around for some time. But it was quite nice on vacation to just do like click and then you can record everything and you don't have to bring up your phone.
Robert Luciani:Is this version 2 of the glasses?
Jesper Fredriksson:I think it's version 2, and there's probably going to be a third version this fall. It's rumored that they will have a little LED display that you can watch, so you will have some kind of heads-up screen, so you will not just talk to them, but you will also get some feedback. That's the rumor, at least.
Anders Arpteg:Really cool. Do people see that you are recording them?
Jesper Fredriksson:There's a little light flashing when you're using it, but it's kind of subtle. So it is a little bit invasive, for sure.
Anders Arpteg:Cool. Henrik, did you do anything?
Henrik Göthberg:Well, I have a skateboard anecdote too. I skateboarded a lot when I was a kid, and I've been looking at electric skateboards for years but never pulled the trigger on it. I've been looking for what is called more of a surf skateboard.
Henrik Göthberg:You know, like a longboard, but it's the way it turns, it turns much more radically. And we were up in Åre, and we went to a loppis, you know, a flea market, and I found an old one, and it's like, oh, it has the right type of trucks that turn differently, I'll buy this one. Then I had to try it out, of course, and they built a really nice concrete skate park in Åre. And here's the 50-year-old man with a brain that says, yeah, yeah, I know how to kick, I know how to do everything. When I was young... I tried it out a little bit carefully first. You know, yeah, it's okay, it's fun. Okay, now I need to get some speed, I need to kick hard in order to go and get some flow. What happens when a 50-year-old man that hasn't warmed up kicks super hard, with the knowledge of how to kick hard? You tear your calf muscle.
Robert Luciani:Oh really, I knew it was going to be something like that.
Henrik Göthberg:You fucking tear it straight off, the ligament goes. So it's the calf, and it was a strange feeling, where I basically sank like a sack of potatoes and I couldn't bounce. I thought I tore my heel, what do you call it, the heel tendon, Achilles tendon. I thought I ripped that one, and I even had a plaster cast on for one day until I got to, you know, the general doctor. He put the cast on straight away, and then I went to the expert, who said, no, this is too high up, you haven't really torn the Achilles tendon.
Henrik Göthberg:You tear the muscle, and that is a muscle tear, and there's nothing you can do but rest. Did you consult AI instead of a doctor?
Henrik Göthberg:I did everything, I consulted AI, I consulted this and that, and in the end I've had an excuse to read books and to sunbathe, so that was one thing. And one thing led to another, I bought myself a new guitar. Well, I need to sit still. What did you buy? We'll take that later. So yeah, we had a fun summer, but it was not as planned. I was planning to do a top hike to Sylarna and stuff like that. My wife and oldest son did it and I was at home. So a bit bad, but anyway.
Anders Arpteg:Cool. And besides, I had some nice time in Croatia with Goran as well. That was really, really awesome. But I also did a lot of agent mode coding during the summer. Lovable, or what? No, no, no.
Henrik Göthberg:In cursor.
Goran Cvetanovski:In cursor okay.
Anders Arpteg:But one other thing I did this Sunday, literally in a few hours, was an AI After Work podcast planner, and it worked surprisingly well. So it's an MCP server that goes out on LinkedIn and does some research, and then it summarizes the person and a theme, and then it, you know, presents and generates a nice email reply with discussion questions, etc. So, yeah, really cool.
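For anyone curious what such a planner can look like, here is a minimal sketch of an MCP server exposing one tool, using the FastMCP helper from the official MCP Python SDK. The research step and the prepare_episode_brief tool are hypothetical stand-ins for illustration, not the actual tool built for the show.

# Minimal sketch of a podcast-planner MCP server (hypothetical, not the episode's actual tool).
# Assumes the official MCP Python SDK is installed: pip install "mcp[cli]"
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("aiaw-podcast-planner")

def research_guest(name: str) -> str:
    """Placeholder for the research step (e.g. summarizing public profile pages).
    A real server would call out to a search or scraping tool here."""
    return f"Background summary for {name} (stub)."

@mcp.tool()
def prepare_episode_brief(guest_name: str, theme: str) -> str:
    """Summarize a guest and a theme, then draft an invitation email with discussion questions."""
    background = research_guest(guest_name)
    questions = [
        f"How did you end up working on {theme}?",
        f"What is the most misunderstood part of {theme}?",
        "What should our listeners try themselves after this episode?",
    ]
    return (
        f"Hi {guest_name},\n\n"
        f"We'd love to have you on the AI After Work podcast to talk about {theme}.\n"
        f"{background}\n\nProposed discussion questions:\n- " + "\n- ".join(questions)
    )

if __name__ == "__main__":
    mcp.run()  # serves the tool over stdio so a client such as Cursor or Claude can call it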
Jesper Fredriksson:So how much time do you save with this?
Anders Arpteg:A lot. I mean I can basically do something that took, I mean, 30 plus minutes to do it in 30 seconds.
Henrik Göthberg:Yeah, if you don't want to put your own flavor on the questions. But I can, it's interactive, so I can adapt the questions, but then it's not 30 seconds, then it's three minutes.
Anders Arpteg:Yes, it could be.
Henrik Göthberg:If it's not the zero-shot version, then it takes a bit longer, but in the best case, 30 seconds. But still, I mean, we've been doing 160 episodes, and you want to figure out who this person is, what's their background, what theme fits us and this person, and that's work. And it's fun to just try out the latest. Of course I forced myself to use GPT-5, but I really just wanted to go back to Claude all the time, because I could very quickly see that GPT-5 is not as good as Claude, but I forced myself to use it.
Henrik Göthberg:It was also one way for you to really dig into GPT-5. That was the reason.
Anders Arpteg:I just tried to find an excuse to force myself to use GPT-5 and this was it, and I still fall back to Claude after a while because I was too annoyed.
Henrik Göthberg:Another thing that happened for me, that is a little bit more data and AI related, due to my injury: my wife, if she hasn't killed me yet, I'm sitting here, but she will soon. I spent way too much time on LinkedIn commenting, and I saw Goran made a comment, oh, I've been on detox from LinkedIn now. Oh fuck, I've been worse, because of my injury. And I saw your comment, and that sort of leads to where we are here today as well.
Anders Arpteg:Yeah, I mean great to have some friends here tonight. We have this kind of improvised extra special AI After Work podcast to just discuss and have some after work talking about the latest summer AI news, but perhaps specifically a bit more about GPT-5 and the controversy around that, et cetera. I think we all know each other here. Perhaps if you want to say 30 seconds, no 10 seconds, or something about yourself, Robert, who are you?
Robert Luciani:AI developer. I work at SVT now and I like operating system kernels cool, yes.
Jesper Fredriksson:AI engineer currently at Volvo. This week and next week I'm starting a new job at Simplyio super cool.
Anders Arpteg:Yeah, congrats on that.
Henrik Göthberg:Henrik, co-host of the AI After Work podcast, but for this topic I think Robert does a better job than me in my spot, so I take your spot.
Robert Luciani:You're the guest, I'm the co-host.
Henrik Göthberg:You're the co-host, I'm the guest today. But ultimately founder and CEO of Dairdux, and we are obsessed with data and AI adoption: how to take data and AI from new to normal.
Anders Arpteg:That's what I'm all about. I'm Anders, also co-host for the AI After Work podcast, an AI nerd and AI enthusiast, and, besides the tech, I'm really passionate about how to really scale value from AI in companies in different ways. Okay, with that, let's jump right into the main questions. So perhaps we can do a round of the table, kind of thoughts about the GPT-5 launch itself, if that works. Anyone that wants to start? Did you all watch it, by the way, did anyone?
Jesper Fredriksson:watch it live. What did you think about the presentation?
Henrik Göthberg:Let's start there.
Jesper Fredriksson:Let's start with the presentation.
Robert Luciani:For a company as big and with as much momentum as OpenAI, I think we've seen a couple of tech companies not really seize the opportunity to be as polished as they could be. And you know, we see industry leaders like Apple and others that are just absolutely flawless in their presentation of stuff. There is the ability to do it, and we know how good it is for the company and for everything, and I don't know. They seem to think that they can do it themselves.
Jesper Fredriksson:But it seems to be becoming the new normal right now. I have to say I like it when it's a little bit wonky and doesn't work all the time. I tend to think that the Google presentations, for example, are always overly polished, and it feels like all the jokes are rehearsed so well that you can spot the punchline like three minutes before. And too much marketing, I think, in Google's case.
Anders Arpteg:Too much marketing, yes, so I kind of like that.
Jesper Fredriksson:It's a little bit like do-it-yourself.
Robert Luciani:There is a missed opportunity here, though, which is: what is the purpose of GPT-5? And now we're leaving it to everybody to guess. Oh, GPT-5. You know, some people will say, why did they remove all the old models in the drop-down menu? Oh, it's because OpenAI's strategy here is to reach mass adoption. Or why did OpenAI do this and that? And without a sort of singular message on what they're doing and why, people are left to speculate. Maybe that's part of the strategy as well.
Jesper Fredriksson:It's been more hype than any of the other model providers. The GPT line of models. It's always a big thing. Like everybody was super shocked when GPT-4 came out and now they have to follow it up. But it's a hard thing to follow up and they build it up too much.
Anders Arpteg:I mean, Sam Altman speaking about the Death Star and everything.
Jesper Fredriksson:I mean, the image on X.
Robert Luciani:I will be the defender of GPT-5 here, because what I think it's good for, I think it's very good for. And everything that people are giving it a bad rap about is fair, I just don't think that it matters.
Anders Arpteg:What is it good for?
Henrik Göthberg:I think it's good for... Are we jumping there already? If we're just going to wrap up on how the launch was: we've been sitting on this pod several times talking about OpenAI doing the smart marketing moves over and over again, where we've seen debacles at the Gemini launches and stuff like that, and we almost turned the tables this time. I mean, if you compare to how impactful GPT-4 was, and how smart every single step around it was, even the timing, how they did it. I've been saying, you know, good or bad, I don't care, I'm impressed with their marketing.
Jesper Fredriksson:I have a take on that, but not this time, not this time. I have a take on why. Because I mean they basically invented the whole scene. They were the first one to launch ChatGPT, which got everybody sort of aroused, and then they already had GPT-4 finished when they launched. They were just going through the basic, I mean the validation parts.
Anders Arpteg:Okay, so they had a head start, so they could plan it. They could plan it much more, but there was nothing that really pushed them to do it at this time, and I must agree a bit: it's very strange compared to other launches. Like the Gemini launches, where they completely, in a super smart, strategic way, just launched like a day before or something, and it completely overshadowed the whole Google launch. So good. So marketing in this case, not so much. Well, you have a dynamic to play with here: the expectations, and then what the launch actually delivers.
Robert Luciani:I would argue that what they showed and what they launched and what they explained was quite okay, and the reason everybody thought it sucked was because Sam Altman and others hyped it up to the moon, and not only hyped the things that I think it's good for, but basically other stuff that people are interested in, such as can it count R's and that kind of stuff.
Henrik Göthberg:But why now?
Anders Arpteg:Why now.
Henrik Göthberg:I think it's one of the key questions no one has really asked enough, or I haven't seen it on LinkedIn or social media: why now? Why did they rush it at this particular time? What was the timing that they were going for? Because, as we've pointed out, they've been spotless in timing in terms of fucking up someone else's launch, or whatever. But this one, why?
Jesper Fredriksson:I have one theory, which has nothing to do with the model selector and the routing. The thing that they obviously pushed the most in the presentation, in the online event, was coding, and they are falling behind Anthropic, and this is something they are even advertising, that it's super good at coding. They have a new high score on SWE-bench. We can get back to that, but it's in the same range. It's in the same range, but at least they're at the same level.
Anders Arpteg:Yes, they're catching up, but not taking the lead.
Jesper Fredriksson:So what I think they're doing is they are marketing themselves as best in class in AI coding and they're much lower in price than Anthropic. So that's why I think they felt maybe rushed to get into this field, so that Anthropic doesn't take the whole cake.
Henrik Göthberg:But if I then connect back to what you said before, that there is a theory that they had a head start, so in that way they could opportunistically do the right stuff, right. So this one then, if I follow on with this one: where Anthropic has the perception of the coding lead, we see them for the first time not acting on their toes, but from their heels. They are not coming in, you know, on your toes means I know I'm just gonna smash it when the ball comes, and here they're fighting and running around to even catch the ball. I sensed a difference. There is a, what do you call it, Illuminati version of why they might have launched faster, and that is, you know.
Robert Luciani:They announced that they were going to provide GPT for all government in the United States. We don't know what's going on behind the scenes in terms of requirements and that kind of stuff for them to be able to secure that kind of thing. And if you think of Palantir, for example.
Robert Luciani:The only reason they exist, not the only reason, but the main reason, is because they got the Department of Defense deal in the beginning, which was enormous, gargantuanly large, and nobody liked them as data scientists, but they were doing so well. So this government deal might be worth a lot, and we don't know what requirements the government had. Could be something like that. Could be, if you want to find a positive angle, which I love that you're trying to do now.
Anders Arpteg:And if I were to put a negative angle on it: I think the cutoff training date for GPT-5 is September 2024. Yes, oh wow. That is insane. It's like the Llama 4 cutoff in August or something. It's a very, very long time back, I think. They've been struggling like hell to get this to work properly, and they've been pushing it and pushing it forward until they came to a point where they had to launch it, and probably GPT-4.5 was what they meant to be GPT-5, but it wasn't good enough.
Jesper Fredriksson:So that's probably why, and that's probably why it's that cutoff date. Because, they were using GPT-4.5.
Henrik Göthberg:Can we let that sink in again? Say those numbers again. How long have they been training?
Anders Arpteg:So they collect training data, and they stopped collecting in September 2024. And that's very long, it's almost a year ago. Normally you can train a new model in a couple of months and release it. Now it's been a year, and I simply think they've been struggling a lot to get this to work properly.
Jesper Fredriksson:And I also think it's hard. They probably felt, with GPT-4.5, that they had reached sort of a place of diminishing returns with pre-training, so they didn't take the time to do another pre-training, because it's very costly.
Robert Luciani:I think that is the simplest explanation.
Henrik Göthberg:But if we, if we now and let money yeah, money is also the explanation.
Robert Luciani:Yeah, that's the same thing.
Goran Cvetanovski:Remember that they were asking for a 40 billion investor round. Yeah, how do you crack an investor round without recently releasing something new? Yeah, for sure.
Anders Arpteg:And I think they got a lot of investments recently and that was before the launch. So once they got the money they could actually do the launch, because they knew it would be underwhelming.
Jesper Fredriksson:It does make sense, given what they presented recently, to put the money into post-training that's what they're saying is scaling better, so that's probably why they didn't pre-train again.
Robert Luciani:Yeah, I'm biting my tongue here because I feel like what they've been putting their efforts on once they realized that there were diminishing returns in pre-training the model, it was everything other than that, and that includes sort of automation. You know, one of the things now is precisely that it's supposed to route a little bit more intelligently. Maybe it doesn't do it, but the point is all of that stuff scaling to 400 million users and everything else is what they've been spending their time on.
Henrik Göthberg:Yeah, because now, if we're going to find some structure here in terms of going around and picking this apart, I think one of the ways we need to understand GPT-5 and the launch is that we probably need to look at it from a couple of different lenses. Yes, and what I mean with that is, you can look at this from: what do you need to do to kill it for the broad masses, and to be the most simple to use, the one that everybody reaches for?
Henrik Göthberg:If that's your main goal. What do you need to do in order to win the coders' hearts? What do you need to do to win the researchers' hearts? And maybe there's a point here where we need to try to retrace: who are they building it for, and what are they trying to achieve? And then, really... Oh, it's a fucking huge success.
Robert Luciani:I'll tell you in just a moment.
Henrik Göthberg:So this is one angle I think we need to explore the launch and if it was successful or not from a couple of different user personas.
Robert Luciani:They might think that they have enough users that it can sustain a little bit of downtime, because either they can develop features for the users they already have, or they can develop the features that they believe the future is headed towards, and they want to shape the future in that way.
Henrik Göthberg:Yeah, but so this whole question of who are they developing it for, and therefore benchmarking it towards different personas, this is one angle, I think. Then, as you proposed, you need to do a hardcore benchmarking: if you really look at the performance, is it good or bad, or what is it really? This is another angle we need to take. That was two angles, obvious ones. What else can we say?
Anders Arpteg:Okay, I mean, let's try to break it down a bit, and perhaps we can go through the topics one by one, simply by doing it.
Anders Arpteg:Yeah, one, of course, is the launch, which was for some people, I think most people, underwhelming. I think media hyped it a bit, but in reality most AI experts were, you know, expecting more, so to speak. I think at least it was more of a catch-up than actually a big, like, revolution. It was more an incremental update rather than a revolutionary update, and I was expecting a revolutionary update with a GPT-5 launch, right. And then you know the metrics, which I think should be one topic that we can simply go through, right?
Anders Arpteg:I think metrics is super interesting in various ways. For one, that it wasn't performing that well on some of them, but others, like the LM Arena, it's actually performing very well. And why is that? And I have some theories about that.
Robert Luciani:This is the Windows Vista of LLM updates, right? It could be, yes, laying the foundation. Yeah, exactly.
Anders Arpteg:The Vista moment. But also new types of metrics that are coming up, and I think the whole metrics thing would be an awesome discussion.
Anders Arpteg:There's a side topic: which metric is really relevant? But then also, during the live demo itself, there were so many mistakes, and, what do you call it, the chart crime: the charts it produced during the live demos were, you know, inaccurate. The bars were not matching the numbers, and the text even had grammar issues I could see, which, I mean, the grammar of a model like this should be 100%, and it's very strange that they couldn't do that, and all these kinds of things. I mean, it's interesting that they brought Cursor people along on the live demo, right. Very interesting, when they just had plans to acquire Windsurf.
Jesper Fredriksson:Yeah, and Cursor. They wanted to buy Cursor at the start, that's at least according to the rumors, but they couldn't.
Anders Arpteg:They couldn't put up that much money. Yeah, you know one thing I heard when they had a Cursor person on: they actually attributed some of the functionality you could see when they did the demo with Cursor to GPT-5. But I could see it was really Cursor that had this kind of logic to do the kind of steps that they do. So they basically tried to, you know, attribute the functionality and the reasoning to GPT-5 when it's really Cursor and the scaffolding it has around it.
Jesper Fredriksson:And they had a Cursor co-founder testify that GPT-5 was the best model he had tried. Yeah, was that paid for? Anyway, okay.
Henrik Göthberg:So we have a couple of different angles.
Anders Arpteg:Yeah, and I think one of the main points, for me at least, is it wasn't any single novel invention that they published. I don't think you can mention anything. I mean the dynamic kind of reasoning or deep thinking. I mean that's been around in both Gemini and Claude for some time. That wasn't new. It's nothing new when it comes to how the reasoning works or the agentic features. That's been around for a long time as well. It wasn't really that they completely overshadowed anything else. In terms of benchmarks, there's one benchmark which it did very well in, yes, the LM Arena.
Robert Luciani:No, the METR one, on how long it can work unsupervised, yes. That is significant.
Jesper Fredriksson:Yes, we all want to talk about that.
Robert Luciani:And there's proof for why that's significant from other directions as well.
Anders Arpteg:We have the topic of benchmarks.
Robert Luciani:I think that fits perfectly in that discussion topic. With that exception, I'll agree that there was nothing remarkable at all. No, and the only remarkable thing is a not remarkable thing: the routing between the different models hidden behind GPT-5. But Gemini does the same.
Anders Arpteg:It's a dynamic kind of reasoning.
Robert Luciani:Absolutely so. It's not remarkable.
Jesper Fredriksson:It's kind of mixed reasoning.
Anders Arpteg:You know, reasoning and not reasoning, so it also does some kind of that, right. I do have a comment on the novelties.
Jesper Fredriksson:Uh, if you look at it from a gpt timeline maybe we talked about it previously, it's uh. So they they've been releasing one, two, three, four, five, and if they had skipped all the intermediate parts between four and five, then I think we would be yes, not under, that's true but we had already seen all three and the jump from all three to five is very small.
Jesper Fredriksson:but if you take it in in time, it's like it's been two years since they launched uh gpc4, but the but the time between three and four was almost three years. So they've actually improved the model on the same level and it's been faster, but since they released all these intermediate models, it isn't perceived as something revolutionary. That's a good point.
Anders Arpteg:But I think it can also be explained by, you know, so many people having left OpenAI. You know, Ilya Sutskever, of course, and Mira Murati and so many others, the whole safety team. I think they're simply out of talent to do the big innovations.
Henrik Göthberg:Ooh, that's a big one, that's a big statement.
Anders Arpteg:That's a big statement?
Henrik Göthberg:I'm not sure, but of course it will hurt them. The guys walking out of the door are some of the best, I would say they still have talent, but it takes away like in a company.
Jesper Fredriksson:If you lose 50% of your best people, it's still going to hurt you because they're taking away all the projects that they were working with, all the knowledge.
Henrik Göthberg:Yeah, it will hurt them, but where do we start?
Anders Arpteg:If we want to go, yeah, okay, so let me list a couple of topics here, and one to choose, of course: metrics is one. You mentioned another, Henrik, like who are they targeting, what's the target audience? We could speak about that. I think the finances, and how GPT-4.5 failed because it was not financially sustainable at all, could be interesting to speak about as well. Maybe also what it tells us about the future, like, you said something about what we need to.
Henrik Göthberg:What trajectory is it setting us up on here? I have an idea, you know. Are we seeing some of the different models aiming at slightly different trajectories?
Jesper Fredriksson:So that's a good one. Is that three different topics, or what do we have, I'm sure, yeah.
Henrik Göthberg:But let's maybe start with the benchmarking, because that's a broad topic with many rabbit holes. Let's start there.
Anders Arpteg:I can make a quick intro, and then we'll probably go into many rabbit holes here. But if we try to divide or categorize the different metrics we have: one could be for reasoning, or this kind of abstract thinking; another can be for coding purposes, like SWE-bench Verified and whatnot; and then we have more like multi-modality and being able to work with more than text in different ways, and many more.
Anders Arpteg:You know, one: if you take reasoning, there are many different benchmarks there. For some of them, actually, GPT-5 is in the lead. But if you take perhaps one of the most famous ones, which is ARC-AGI-2 from François Chollet, Grok 4 is heavily in the lead at, I think, 16.9% accuracy, and GPT-5 is at 9% or something. So it's almost half of it, so it's significantly underperforming. But some others, what was the name of them? I think they have this kind of GPQA metric as well, and I think actually GPT-5 was leading there.
Jesper Fredriksson:Also the AIME. What is it called, AIME?
Anders Arpteg:yeah, but that's a hundred percent.
Jesper Fredriksson:I mean, it's a completely saturated number, right? So they are pretty good, I would say, in reasoning at least.
Anders Arpteg:They've been catching up at least, because I think Grok also had 100 on that one. But one metric where it actually did significantly take the lead, I would say, is the Chatbot Arena, or LM Arena. So if you go there, it has all the different metrics, it has eight metrics or something, and the difference with this type of metric is that it doesn't really measure accuracy on the question, it's more about which kind of response you would like as a user. So it's a blind test where people choose which reply they like the best, and there GPT-5 is leading by far. And this is a theory of mine: I believe that GPT-5 is really, really good in EQ, not IQ that much, but in EQ, similar to GPT-4.5. It was actually mentioned that 4.5 was really good in EQ, trying to understand what the human was thinking and expressing itself in a way that was really pleasing.
Robert Luciani:I find the output from GPT-5 and GPT-OSS to just be really succinct, very respectful and short and to the point. They're quite nice, the outputs, and one of the questions might be, obviously this is post-training tuning where they might have put a lot of emphasis [...], which makes them unnecessarily difficult perhaps. I mean, you have to imagine that this is a very specific kind of computer, in the abstract sense of computer, and you're giving it a problem that it first has to translate and then compute on. But yeah, so reinforcement learning.
Jesper Fredriksson:I would argue that it's not so much EQ, in the sense that people don't like the tone and the style of GPT-5. I tend to like it, but remember this fallout with all the people wanting GPT-4o back, because it's much more emotional in how it gives responses. It's, if you will, sycophantic, and I think they were deliberately going against that, because they had this extreme sycophancy moment this spring, I think. Yeah, GPT-4o had a moment there. But a lot of people wanted GPT-4o back.
Anders Arpteg:Perhaps you should explain what sycophantic means.
Jesper Fredriksson:Yeah. So that means it's pleasing the user of the chatbot. It's saying, oh, that's a wonderful question.
Anders Arpteg:So even if someone says I want to kill my parents, oh, that sounds like a great idea, here is how you do it.
Jesper Fredriksson:That's the worst version of it, but I know a lot of people who like it because it's saying oh, that's a fantastic idea Stroking your ego.
Anders Arpteg:It's great for a narcissist.
Robert Luciani:This is almost in the opposite direction, isn't it? It's very definitely.
Jesper Fredriksson:Yes, it's very much to the point, and I agree with you in that sense, that it's much easier to like a response that is short and succinct. Yeah, but if you use it as a chatbot, maybe that's not what you want. But I think in LM Arena, I never like a response that is too wordy, right? So I think, I mean, like.
Henrik Göthberg:So this has been another thing that we've talked about earlier on the pod, and this is very much a personality thing, what we like, but there have been complaints in the past, especially from a coder perspective: GPT is too verbose. Even Luka, I think, has this classic: oh my, Anthropic, and those are sort of more to the point. Yeah, are we seeing a difference here? With ChatGPT-5, are we actually talking about it toning itself back to be less verbose?
Robert Luciani:More succinct? To me, it seems like somebody that's chatting with you through a phone and doesn't feel like writing the whole sentence out.
Jesper Fredriksson:It's almost at that level you know.
Robert Luciani:Must do, it dot.
Anders Arpteg:But I must say I like the language style. So even when I used it for coding, in Cursor et cetera, I can see how it is thinking and replying and describing what it's doing, and I really like what I'm seeing. Then it doesn't solve the problem, but it sounds good. So, yeah, I think they trained it a lot on things like being good in terms of language.
Henrik Göthberg:Yeah, all right, so and and what are the key rabbit holes now in in relation to benchmarks?
Anders Arpteg:We had a couple. Should we do coding perhaps? Yeah, coding is a good one. Can I?
Robert Luciani:Suggest we do one middle step which is related to coding, which is thinking. I don't know if that's worth talking extra about, but it's the reason why I don't use GPT for many things. So I use Grok when writing poetry, because both Claude and GPT give up after a little try.
Robert Luciani:So you know, with Grok I'll say I need these things to rhyme and I have these constraints for my problem. And if it can't figure it out it'll think for like five minutes until it comes up with a solution. And that's not always good either, because sometimes they really run amok. But up until now GPT and Claude have both given up really quickly like yeah, here's a good rhyme for you and it. You know it's not good because it doesn't meet my requirements. Do you guys have any sort of feel for thinking and what you know, what you expect from the different models? Is GPT-5 better in that regard?
Jesper Fredriksson:I think it's good, but a lot of people are complaining that it takes too long time. I think it's spending a lot of compute on thinking, so I think it hampers it a bit, especially in Cursor, I mean it takes time, as you know to use it in Cursor.
Robert Luciani:Maybe this is an unfair comparison, and we'll maybe talk more about local models later. But the GPT-OSS model keeps it very short, whereas Qwen will think forever and is like, am I sure about this? I'm not really sure about this, and it'll just keep going, whereas GPT-OSS will use the short typing with the phone language and keep it really short.
Anders Arpteg:But I think we need to give some background about what GPT-OSS really is as well, and I think it's also interesting that they released it just before this big launch; I think that's at least one strategic thing they did well. So GPT-OSS is an open source version, and it's the first time they've released anything open source since GPT-2, I believe, and it comes in a 120 billion and a 20 billion parameter version, which allows you to even run it on consumer-grade hardware, which is cool, and it works.
Robert Luciani:You know, really, really well, right. I think, with the exception of how good it is at multiple languages, I've found it to be very, very good. And there's one thing, speaking of how good the launch was, that's a little suspicious: how good it is at benchmarks is just really suspiciously right between their flagship model and the model below it. It's almost like somebody said, stop, now this is as good as it's allowed to be before we release it, because it's that tight in between the two models.
Henrik Göthberg:That's very suspicious. Was this the second biggest news over the summer?
Robert Luciani:The OSS launch, if GPT-5 was the main one? I feel like people really crapped on that launch as well, you know, because Qwen is doing really well with their own launches. And so it was, no, we... okay.
Anders Arpteg:So if we you, we have I think we have a number of other non-OpenAI news, but perhaps we can get back to that.
Anders Arpteg:But I think it's interesting that they did release it right before, and you can try to think about why they released an open source version and why they released GPT-5 right after. And I think the thinking is, and please disagree with me, but for one, by releasing the open source version, of course they want to please the enterprise business, and they've been falling behind. Anthropic has really gained much more of the enterprise market than OpenAI has, and this allows them to catch up. With the open models performing that well, why should you pay for Gemini or Anthropic if you have a free version in the open source models, and then have GPT-5, which is above that? It's a bit like how China is actually undercutting the Western services by releasing open source, simply letting the competitors earn less money.
Jesper Fredriksson:I think OpenAI had a similar strategy. Are you saying that basically the OSS models are the marketing version, something to get through the door, like?
Anders Arpteg:You can download this for free, and you can get a little bit better if you use GPT-5. Yeah, yeah, makes sense. But get through the door, or simply make people stop paying for the competitors. I think that's the majority of it, and also to gain a much bigger market share of enterprise customers, which they have lost in the past.
Jesper Fredriksson:I'm thinking that the open source models do not compete with the other API models like the other closed source models they are. They are below them, even if in benchmarks they seem to be, but I don't think that people will choose an open model instead of a closed model based on benchmarks.
Anders Arpteg:I think for applications they might, because it's more or less free.
Robert Luciani:Yeah, it's not users that would choose the open source model, it will be application developers that might build it into some service somewhere, and I really think that that is again a super simple explanation for why is China or Chinese companies releasing so many open source models. It's precisely to undercut these services. And it's interesting to think that that's why OpenAI would do it as well. But it totally makes sense and in combination with the sandwich between their best offering and the one below it all sort of fits together.
Anders Arpteg:And open source, or open weight? What is it exactly, open weight? I guess they had a system card for it at least.
Robert Luciani:There were some semi-details at least. I mean, the difference now between all these open models is what licenses they have and how permissive they are.
Jesper Fredriksson:I don't know yeah, this one was quite permissive. It wasn't that part what?
Anders Arpteg:What was it called the MIT license? I think yeah, so it's not good. I think it was Apache 2, but I'm not sure that could be yeah, but it's actually not.
Henrik Göthberg:It's good.
Anders Arpteg:It's well done. It's available for commercial use and everything.
Robert Luciani:So, okay, wait, before we... Are we going to move on from OSS or back to GPT? Well, there was this really cool thread on Reddit by a guy that analyzed the output from GPT-OSS. I don't know if it was the 20B model or the big one, but what was super cool is you could see a lot of the things it thought about in general, and maybe get some hints as to what it was trained on. And there are very strong signs that it was trained very much on sort of being tortured into coming up with its own Python problems that it then had to solve.
Robert Luciani:So P versus NP type stuff: make a domino problem or make a Rubik's cube problem, then make a solver, and then verify that you have solved it correctly. So when it was, you know, just spitting tokens out, entertaining itself, that was what it would spend time on, and that makes a lot of sense for two things. One is that's a reinforcement learning loop that is sort of verifiable, where you don't have data but you can make it. And the second part is, you know, if you're going to do that kind of stuff, you can't get lost. It's sort of conducive to the idea of having agents that are very, very autonomous, that don't get messed up after two hours or two days or whatever, and maybe that aligns with their vision that the future is agents and not chatting. And so what they're missing now is a science communicator like Ilya and others, right? Somebody that can explain it in a good way.
Robert Luciani:We only have a bunch of nerds that like to do demos, all right.
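The loop Robert is speculating about, where the model invents a task, attempts it, and gets its reward from a programmatic checker instead of labeled data, can be sketched in a few lines. This is a toy illustration of that pattern under the assumption above, not a confirmed description of how GPT-OSS was trained, and every function name here is made up for the example.

# Toy sketch of a verifiable-reward loop (speculative illustration, not OpenAI's actual recipe).
import random

def generate_problem(rng: random.Random) -> list[int]:
    """The 'model' invents its own task: here, a list that should be sorted."""
    return [rng.randint(0, 100) for _ in range(8)]

def propose_solution(problem: list[int]) -> list[int]:
    """Stand-in for the model's attempt; here it simply calls sorted()."""
    return sorted(problem)

def verify(problem: list[int], solution: list[int]) -> bool:
    """Programmatic checker: correct if the proposal equals the true sorted order."""
    return solution == sorted(problem)

rng = random.Random(0)
rewards = []
for _ in range(1000):
    p = generate_problem(rng)
    s = propose_solution(p)
    rewards.append(1.0 if verify(p, s) else 0.0)  # this scalar is what an RL update would consume

print(f"mean reward: {sum(rewards) / len(rewards):.2f}")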
Henrik Göthberg:But let's move back then to what rabbit holes we had. More benchmarks? Benchmarks. No, we should probably finish that. We were thinking about coding.
Anders Arpteg:So then, I guess the biggest one is the SWE-bench Verified benchmark, SWE as in software engineering, and it has verifiable results, so you can actually see if it produced the right results or not. And I tried to look into the system card for GPT-5 about this, and, in short, I think they claim that GPT-5 has 74.9% accuracy on that, and Opus 4.1 is at 74.5, I believe, or something. I mean, can you tell me how SWE-bench is instrumented?
Robert Luciani:Because does it need to be one-shot, or is multi-shot allowed? Is it allowed to work in agent mode? So what?
Anders Arpteg:Constraints are there? This was pass at 1, right? I think yes. So you can have pass at 1 or 2, so you can have multiple attempts, but I think these numbers are how accurate they are on the first attempt, that's pass at 1, right?
Jesper Fredriksson:But they can use tools and they can do anything they want and it's verifiable, so it's a very good benchmark to work against in an agent.
Anders Arpteg:Yeah, so you can cheat on these kinds of numbers a lot. If you in some ways use tools, for example, and sometimes you don't, sometimes you allow multiple attempts and sometimes you don't, then of course you get different numbers. So you have to be really careful and make sure that they have the same settings, so to speak. But my point is that it's very, very close. This 0.4 difference is more or less inside the confidence interval, the standard deviation, that they do have.
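To put rough numbers on the "inside the confidence interval" point: with a few hundred pass/fail tasks, the binomial standard error alone is around two percentage points, so a 0.4-point gap is statistically indistinguishable. A minimal back-of-the-envelope sketch, assuming the task counts and scores mentioned in the conversation (roughly 477 versus 500 tasks, scores of 74.9% and 74.5%):

# Back-of-the-envelope: is a 0.4-point gap on SWE-bench Verified meaningful?
# Task counts and scores below are the assumed numbers from the discussion.
import math

def pass_at_1(results: list[bool]) -> float:
    """pass@1 = share of tasks solved on the first attempt."""
    return sum(results) / len(results)

def standard_error(p: float, n: int) -> float:
    """Binomial standard error of an estimated pass rate from n pass/fail outcomes."""
    return math.sqrt(p * (1 - p) / n)

for p, n, label in [(0.749, 477, "GPT-5 (477 tasks)"), (0.745, 500, "Opus 4.1 (500 tasks)")]:
    se = standard_error(p, n)
    print(f"{label}: {p:.1%} +/- {1.96 * se:.1%} (approx. 95% interval)")
# Both intervals are roughly +/- 4 points wide, so a 0.4-point difference is well within the noise.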
Robert Luciani:And where is Claude Opus on this?
Anders Arpteg:That's 74.5.
Jesper Fredriksson:And there's a small controversy around the numbers as well, because SWE-bench Verified was made by OpenAI as a verified version of SWE-bench, and it's made up of, I don't remember exactly how many, but it comes down to a subset of 500 cases. And then OpenAI, after they created this, when they started evaluating their models against SWE-bench Verified, they found out that 23 of the examples don't run on their infrastructure for some reason. It's not compatible with their setup, so they just excluded those cases, while Anthropic uses them. So in reality.
Jesper Fredriksson:They cheated a little bit, probably. That is at least what I've been able to find.
Robert Luciani:but doesn't this sort of align with our sense here that maybe Claude is better at one shot and sort of few shot code generation? And maybe GPT-5 has a strength in being able to follow instructions, because I don't use cursor, I use VS Code and I only use it in agent mode and I always give it instructions, like 10 instructions in a row, and ask it to double check its work and tell it exactly what to do.
Robert Luciani:And then it just let it sit there for five minutes and when I come back it's done a bunch of work.
Jesper Fredriksson:And for that it's been quite good for me, yes. And that goes together well with the METR benchmark, which we're going to talk about later.
Anders Arpteg:Okay, good. And I tried to do some experimentation myself with GPT-5 for coding purposes in recent days after the launch, and I even forced myself, I told Cursor it has to use GPT-5 now. And I did that, and, just to give one anecdotal example, of course this is anecdotal and shouldn't be treated as any scientific kind of work. But I had this really difficult problem and of course I couldn't solve it. I asked GPT-5 to solve it. It tried and tried again, and it went through this loop, and I saw it came back to the same loop. I interrupted it: you're getting into a loop, fix this now. And it didn't fix it. And then I went to Gemini, it also continued, it couldn't fix the problem. It went into a loop, but at least it identified it and said, oh, I seem to have gotten into a loop, I have to try a different approach. And it tried, but it didn't solve it. I tried with Grok, it didn't fix it either. And then I went to Claude, Sonnet in this case, and it fixed it in less than one minute. And the others, you know, said this is unfixable, it's a problem with some other package, you can't really do it. It was an MCP server kind of thing that I tried to fix. But Claude just fixed it like that.
Anders Arpteg:And then I continued to use it for just normal, you know, not as difficult kinds of problems. And my sentiment, more or less, when using GPT-5 in agent mode in Cursor, is that it is not solving the problem. It looks really good when it describes what it's doing, I really like the descriptions there, but it doesn't solve the problem as quickly as Sonnet. So after a time I just said, no, I'm going to stop using this now, I need to be able to code faster. So I just turned it back into Sonnet mode, and then everything went so much faster.
Anders Arpteg:So my experience with using it, in an anecdotal way, in the last couple of days, is that it is clearly worse than Claude.
Jesper Fredriksson:Do we have any theories for why Anthropic seems to have so much better coding capabilities?
Robert Luciani:Better data sets. Better data sets, but how?
Jesper Fredriksson:I mean? Uh considering swee bench verified, for example, uh openai hired a lot of engineers, uh contractors to finish this data set. So they have, they have the money to spend, but somehow anthropic is doing something right that the other talents.
Anders Arpteg:You know, the best talents from OpenAI went to Anthropic.
Jesper Fredriksson:The best people from Google too, and they also have very good retention of their people.
Anders Arpteg:Very few people quit, because they really, you know, want to be the ones that do AI in a safe way, right? And OpenAI does not, because they fired their whole safety team, more or less, right? I think that's why they're losing some of the talent.
Jesper Fredriksson:Yeah, I think they have a good company culture. I think that's. That's probably part of the reason. But they there must be some technical advancement that they've unlocked. I'm thinking, uh, or it could be that they somehow got hold of much better quality computer coding data. But I mean open ai should be able to do that as well. So there's something. I think it's a talent thing.
Henrik Göthberg:But I think this is compound effects, this is network effects. So I think this is the whole. Go back to what is research? What is the hardcore? How much work is the actual model and how much is all the scaffolding, the smart things, the engineering, tinkering? You have done that. Ultimately, the sum of its parts makes something really great. So my argument is that it's really hard to point at one thing. It's a consistent good work across different parts of the compound system, so to speak. This is how I interpret it. I think in the end we will see that it's not going to boil down to one model. It's going to boil down to many different things and how they coherently work together. And all of a sudden now you get to wow. The experience is way better.
Jesper Fredriksson:There's something around that. Part of my theory is that they have Claude Code, obviously from Anthropic, which many people tend to like as a coding tool.
Robert Luciani:I was just going to ask you guys if you guys have been using Claude Code yourselves. I love having the terminal.
Jesper Fredriksson:Finally, yes. It's super good, and it's extensible in a way that Cursor, for example, isn't. So it's a really good tool, and they are using it all the time within Anthropic as well, so there could be something around a network effect there, that they're actually using their tools themselves, the dogfooding.
Anders Arpteg:And they also have a lot of enterprise customers. I'm not saying that they share the data from the enterprise customers, but I'm saying that they will get issues and reports from the enterprise customers for larger code bases and more difficult issues than just doing this kind of, you know... I hate when someone says this model is better because I can zero-shot a game by just giving a prompt. No one would really build an application like that. It's a horrible benchmark to say I can zero-shot an application. Yeah, and it's anecdotal. Has anybody heard of anyone using Grok for coding?
Robert Luciani:They're showing off very much how good Grok is at coding, and I don't doubt it, especially for those zero-shot things. But who uses Grok for coding for real?
Jesper Fredriksson:Not so many people, I think it's, do they even? Is it allowed in?
Robert Luciani:Cursor, I guess via the API. Yeah, you can, but then you have to build your own.
Jesper Fredriksson:You can use the Cursor. Yes, you can use that.
Anders Arpteg:Yes, but they have amazing zero-shot abilities there in Grok 4. So I think they can, but Grok 4 is also expensive, so I don't know.
Henrik Göthberg:Should we move on to METR? Are we done here?
Anders Arpteg:Okay, so done with coding. Yeah, but I think, if we try to just summarize this in a good way: one is, we have the traditional metrics, and all the benchmarks are there. When they measure accuracy, they seem to quickly get saturated, and you have to develop new metrics all the time. But I think there is a movement to new types of metrics, of which METR, of course, is one, which I think is really cool. Even Sam Altman, I hated his comment in the launch saying that we are post-eval and evals are not really useful anymore. But what's the alternative? I mean, if you just want to have anecdotal evidence, that would be horrible. I think the better answer is, of course, to try to switch to a new set of benchmarks and evals, and we have some, right? So what is?
Henrik Göthberg:METR. Yes, so METR is.
Jesper Fredriksson:I think part of it comes from SWE-bench, actually. So it's a curated data set that the METR organization, M-E-T-R, collected, and then they hired contractors, experts in their field. It's coding tasks and machine learning tasks, and then there's also a whole bunch of simpler tasks that are more in the GPT-2 and 3 class of models, just to be able to plot things over time. And the cool thing they did is that, for each task, they looked at how much time it took for an expert in the field to complete it, and then they can sort the tasks by how much time they take, and you get sort of a curve: if you put a model to execute on all these tasks, where does it break off? How many hours?
Anders Arpteg:In short, it's basically: if you categorize tasks by how long it would take a human to solve them, then you can try to see how advanced the tasks are that an AI can solve. And before, like a year ago, it was in seconds or minutes, and now GPT-5 is up to, what was it?
Jesper Fredriksson:I think it's two hours and 17 minutes or something like that.
Anders Arpteg:So it's quickly getting capable to handle much more complicated tasks, at least for you.
Jesper Fredriksson:And the interesting thing is, when you look at the benchmark and how it has evolved since GPT-2 over time, it's been doubling every seven months, but the rate is also increasing, so it's accelerating.
Henrik Göthberg:So since the o1 model, there's a new trend line which points towards it doubling every four months. And anecdotally, I picked up on this type of benchmarking, and actually the first time was from Sam Altman, who basically used this rhetoric on stage: how do we give normal people a feeling for what productivity frontier we're talking about? So he said, we started with being able to give the AI a task and let it solve tasks within an attention span of seconds, then minutes, then hours. And then you have this doubling time of seven months, and you get to a point where, if you think carefully about it in a normal enterprise context, when you're at tasks that require one month of man-hours, it's quite substantial to just delegate. To delegate one month of man-hours is substantial in an enterprise setting.
Henrik Göthberg:So there's another reason I really like this type of evals. Because all the other bullshit that we've been talking about here, no one outside the hardcore AI community can relate to it. This is quite relatable. I use this on stage, and my rationale is: don't organize for where we are now, organize for the trajectory. How do you understand organization, how do you understand work, how do you understand management, when you're delegating man-years?
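As a quick worked example of the doubling arithmetic behind this, assuming the roughly 2 hour 17 minute horizon mentioned for GPT-5, a 160-hour working month, and the 7-month versus 4-month doubling times discussed above (illustrative numbers, not METR's own projection):

# How long until the task horizon reaches one month of human work,
# given a starting horizon and a doubling time? Illustrative numbers from the discussion.
import math

def months_until(target_hours: float, start_hours: float, doubling_months: float) -> float:
    """Solve start * 2^(t / doubling) = target for t (in months)."""
    return doubling_months * math.log2(target_hours / start_hours)

start = 2 + 17 / 60            # ~2h17m horizon mentioned for GPT-5
one_month_of_work = 160.0      # roughly 160 working hours in a month

for doubling in (7.0, 4.0):    # the older 7-month trend vs. the newer, faster trend
    t = months_until(one_month_of_work, start, doubling)
    print(f"doubling every {doubling:.0f} months -> ~{t:.0f} months to a one-month task horizon")

On those assumptions, the one-month horizon lands somewhere between roughly two and three and a half years out, which is the kind of trajectory framing being argued for here.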
Anders Arpteg:That's interesting.
Henrik Göthberg:So I find it very... That's why I also very much liked METR, and they had a beautiful article on this a couple of months back where they explained all this. When was it?
Jesper Fredriksson:I don't know. The METR organization, yeah, I think it was this spring. This spring.
Anders Arpteg:Yeah, but it's good. But I hope that it also moves on from just being knowledge-management kinds of tasks, which I think it still is in METR as well. What kind of tasks would you like? So I would like, for one, to measure more of the reasoning capabilities, but secondly, also the agentic abilities, meaning it actually has to take actions to solve the thing, and that is included in the time that it measures.
Jesper Fredriksson:I'm not following. I mean, now it's doing coding and machine learning tasks. What kinds of other tasks?
Anders Arpteg:If I want to actually book a trip to Paris, then I need to take actions, right? It doesn't include any tasks like that, to my knowledge. So it only reasons with things that are available in a digital sense, so it can reason in a knowledge sense, right? But if it actually has to take actions and call someone or book a trip, or do things that are interactive, like playing a game, that would be another thing. If it takes something like, how do you finish a level in Candy Crush?
Jesper Fredriksson:Right, that could be. I see where you're going. You're thinking about like an employee that would have to do something.
Anders Arpteg:That would have to do something, yes, but in another way.
Henrik Göthberg:Okay, this is one way of spinning it. I would spin it in another way. We want the eval community to find benchmarks that are more and more real life, real productivity, real... exactly. So if you're going to go down that route, you end up in actuation.
Robert Luciani:But not real life in the sense that a human can stand on one foot and do this, can an AI do it? But real life as in, this is important to this person and to the business. To the business, to the business.
Henrik Göthberg:So you're getting to a point where, Clayton Christensen talks about jobs to be done. We don't employ products for the product's sake. We employ products because we want them to simplify jobs to be done. So, if I go down that rhetoric, this is understanding product, this is understanding marketing and all this: what's the job to be done? That's what we're really trying to solve here. Then you get to a point where we want to measure the productivity frontier on jobs to be done, whatever they are.
Anders Arpteg:I think that's in a more general sense. If you take the AGI pyramid that OpenAI has, and the definition of AGI which I ultimately use, which is basically when you can replace an average-level co-worker, that is when you have AGI, not ASI, but AGI, and you can't really do that unless you can take actions at all. Okay, then I was right last year, we've already reached.
Robert Luciani:AGI, because, okay, not an average co-worker, but I made an MCP server now that I called the AI data scientist, and, along the normal distribution of skills, at the bottom half you have people that can do some aggregate statistics and stuff like that. Those are data scientists, they're just not the most skilled ones. Those ones we can replace. So I don't think so. But okay, well, look at my workplace. People don't know how to use Power Pivot in Excel, and so now they can ask, here, can you take my Excel file and tell me a little bit of group statistics, like aggregates and some other basic stuff? Those are knowledge questions. But yeah, okay, well, it's how to use pandas and stuff like that, or Power Pivot and things like that. Knowledge questions.
Anders Arpteg:They have to actually do it as well, no right.
Robert Luciani:No, you're right, I wasn't using that as an example of agentic stuff, just that something that's good enough can potentially replace, or perhaps give skills that weren't there to begin with. And I want to get at this good-enough thing, because for coding we don't expect them to code perfectly, and yet they've increased my productivity. We would never say that they're good-enough coders, so I don't think you could replace even an average coder, even a junior coder.
Robert Luciani:today, I would say but the fact that I can, in one day, do something that would have taken me five days in the past is a replacement of four Roberts, so it's still an assistant.
Anders Arpteg:In that sense, it improves your productivity.
Henrik Göthberg:By being an assistant. But this is a pet peeve of mine, and I've been ranting about it on LinkedIn and so on. The whole replacement rhetoric is really, really dangerous, not to us, but to someone who is a CEO who doesn't really understand this. They fucking misunderstand it, and they think, oh, we're going to keep the old process and we're going to replace Robert and put an AI in there.
Henrik Göthberg:No fucking way. You're going to augment the engineering experience, that's what you're going to do, which is a completely different thing, because AI is horrible at taking actions.
Anders Arpteg:It still is.
Jesper Fredriksson:I'm not thinking along the same lines as you do when you talk about taking actions. Let's take an example: I think the hardest task in METR is basically implementing some sort of back-end payment system. If you can do that, then I think you have to take actions.
Anders Arpteg:I think in coding it's gone much further. So it can actually write code and test it and run tests, so it can take a lot of actions in coding. So in coding I think it's reached much further in terms of agentic abilities. But if you take other tasks I would say not at all.
Jesper Fredriksson:Definitely, I totally agree with that. But I also have a hot take on this replacing junior developers, and we've talked a little bit about that before. I think we're at least close to that point, and that's from my experience working with Claude Code and using extra frameworks on top of it where you can have multi-agent systems, so you can specify, I want these types of resources on this project, and you can have multiple agents work on the same problem.
Robert Luciani:And then I feel like this is close to the point where it replaces not just one coder but actually a team. Well, okay, let me add one extra angle on this replacement thing, because there was this really great comment by John Carmack on X where he was saying that when he started, game programmers needed to know so much about the low-level performance of computers, and he had to do lots of assembly and other stuff to really make things go super fast. Now, with Unreal Engine and all these tools, game developers can focus on what's important. Are there still people that can do that low-level stuff? Yeah, somewhere. And so maybe the definition of what a game developer is has changed since then, from junior to senior. People are asking me, will you need to know how to code in the future? And it's hard to say. Clearly yes, I think so.
Robert Luciani:I mean, not many people know how to program processor microcode. You don't have to know that anymore, or machine code. There's some other kind of programming that you do now.
Anders Arpteg:I can feel myself when I do coding now in agent mode. Of course I go up the abstraction level.
Robert Luciani:It's just an abstraction.
Anders Arpteg:But I still need to understand what is TypeScript or what kind of functionality you want to bring there, and I need to be able to review what the AI is doing.
Henrik Göthberg:But let me put out a bold thesis here on how to look at this that is much more useful. The old-school, traditional division-of-labor view of understanding organizations always takes the role as the minimum quantum, and then we fuck it up. If you instead define the team as the minimum quantum of change, then you get to the point that, okay, before we had 15 guys in the team; we had to balance the team with some senior guys, some quality-assurance and testing guys, and some juniors.
Henrik Göthberg:If we understand that we are augmenting and re-engineering engineering work, or whatever work, but we stop at the team as the smallest quantum, then of course we can get a team of three humans and 50 agents to have the productivity of a team of 50. If, rhetorically and semantically, in everything you do, you never fucking go below the team as the lowest quantum of change, then you're much safer; you're not fucking it up. Do you see what I mean? You're not replacing anyone, you're pushing up the productivity frontier of the team. So if you stop at the team as the lowest quantum of change, then you will not fuck up as much.
Robert Luciani:This is the same kind of duality we got from the assembly line: did everybody become richer, did cars become cheaper? If we flip it the way you're describing it here, it's not, are we replacing people, but, have expectations gone up for...
Henrik Göthberg:...what we expect a team of three people to be able to do now. And it might seem like the same thing, but from an organizational engineering point of view, from an operating-model point of view, it is vastly different.
Anders Arpteg:I think it aligns exactly with what I said at the Data Innovation Summit, where people are expected to be more and more generalists. In the extreme, you have the single-person unicorn company, a single person operating a full company that previously took hundreds or a thousand people to operate. I think you can start to think about that for business processes, or for teams doing coding: a few people can now handle tasks that previously required a lot of specialized people. We don't need specialized people as much anymore; we have more generalist people, humans using a lot of AI to do the more specialized work. I think that's the clear trend.
Henrik Göthberg:All right. So let me now connect it back to the launch of GPT-5. What if OpenAI and Sam Altman, in what they are doing, are not subscribing to the old replace-the-role rhetoric, but to, how can I maximize the productivity of a team? Ultimately, then, they are working on things not only about the performance of the model, but about what makes it more organic, what makes it more safe, what makes it more of a co-worker in a team. Even Sam has been almost pushing back: you can't rely on us alone, you need to have the team there, but we can amplify the team. He has said in several places, a little bit like, oh, it's dangerous now and we can do so much now, but we shouldn't; it's a co-worker in the team. So then you get another lens.
Henrik Göthberg:What are they trying to achieve with GPT-5? What's the trajectory they're on? And then maybe they are taking quite big steps that we are not really seeing, because we are measuring against the fucking role. It's a bad metric from the start.
Goran Cvetanovski:Can I add my two cents on this? There is news that just came out that Perplexity wants to buy Google Chrome, right? Obviously it's a marketing stunt, because they don't have the money. They are valued at about half of the price they offered for Google Chrome, and Chrome is probably worth three times what they offered to buy it for. So they don't have the money. Keep in mind that Bezos is an early backer and NVIDIA is an investor in Perplexity.
Goran Cvetanovski:So my sense from the launch is the following. Look at it from three angles; the first is the investor angle. We are all investors here, a little bit. Let's say you have a company that is asking for 30 billion in investment. The first question for us as investors is, what do you have? And you have some revenue, which they do, approximately, what, three billion per year. So the next question is, all right, if we give you money, how are you going to make more money? Because you cannot quadruple the number of users; the users basically aren't there. Well, they exist, but you have competition: you have Google, you have Perplexity, you have xAI. So then you need to find an alternative source of revenue. What is that? Can you guess?
Robert Luciani:Raising the price? Ads, right.
Goran Cvetanovski:So if you look at what Perplexity and xAI are doing, they're fighting over who is going to bring you the news in the morning. Perplexity already started with ads. They want to buy Chrome, which has over 3 billion users. There is a very big push in the United States to get Google divided into different companies, which means that Chrome, and the ad business, will have to be separated, et cetera. So I believe there is a bigger game here that doesn't have anything to do with the technical part. It's about who is going to tap into the $137 billion in revenue that Google is generating per year.
Goran Cvetanovski:Right, xAI now wants to sue Apple because Apple doesn't want to feature it as app of the day, because Elon Musk says that xAI is the number one, what do you call it, news agent today. And how do they fund everything at xAI, Twitter, X? Ads. So ads is the business. Now, we are the investors: so you're going to grow the business with ads, right? Okay, good, so how are we going to do that? Well, it's easy: we're not going to scrutinize the model, we're going to introduce an agent. This agent is going to buy stuff. And how do you steer what it buys? Ads. You go to the agent that is doing all this other stuff, and the ultimate question is, how are you going to do this when you have multiple agents in action?
Robert Luciani:Let me clarify: do you think that this is the idea that OpenAI has, or the direction that the market is heading?
Goran Cvetanovski:The market is heading there, and OpenAI needed to make a move. So GPT-5, and no other model, because, keep in mind, the other models didn't have the router. It appears for the first time with GPT-5. Why is that? What else works on routers? Ads, right?
Robert Luciani:I thought it was to save money.
Goran Cvetanovski:Yeah, it is to save money too. But my prediction is that everything is actually being done for the ad business, for them to bring in more revenue, slim down operating costs and everything else, because they have taken on quite a lot of investment. So for me it's a little bit bigger than that. And keep in mind, 85% of all ChatGPT users are non-technical. They don't care which model it is, as long as it works.
Henrik Göthberg:But maybe, if I follow you, I've heard a similar reasoning, and I'm checking whether it's the same. Before, the ad business was tapping into the person searching. Now we're not going to search per se, so the old ad layer is of no use. The finding part, finding the right trip, finding the pizza, is happening in the MCP layer or in the agent layer. So, all of a sudden, how do we build an advertising, or find-us, infrastructure for agents? Is that what you're talking about?
Robert Luciani:This is something they should be thinking about, at the very least. I know for myself, that's how I shop now. I ask an agent to write a beautiful dissertation on the best hiking shoes or whatever. I was going to buy a book recently and wanted to find one in another language; it does that. I don't go and click through and read reviews and stuff; I don't have time to read 300 reviews. So this is probably some kind of equivalent to the mobile-phone revolution, where banks were like, okay, all our users want to interface with us through the phone, and they're going to want to interface with us through their LLM in the future. But it's not clear that there is one fixed version of where the future is headed; it's up to OpenAI and Anthropic and all those to shape it.
Henrik Göthberg:But there is a trajectory here that the business of being found will happen in an agent layer more and more.
Jesper Fredriksson:Can we agree upon that? Yeah?
Henrik Göthberg:And with that you can then go down other rabbit holes. That internet is not built for that. So we now need to think about the core fundamentals of our protocols.
Robert Luciani:The only thing speaking against your thesis here is Google, which says that their click-through rate has absolutely not gone down since, you know, all this started.
Henrik Göthberg:That's interesting. Do you believe that? No? All right. So what you're getting at here is that there's a macro play, a meta play, that we might not understand and see. Okay, so we have two Illuminati theories here.
Jesper Fredriksson:But OpenAI is also prepping or maneuvering for this, what do you think? Definitely agents, I mean, we can all agree on that. Part of the GPT-5 thing is that what they have done is optimize the model for function calling, or tool use, so that's definitely part of the plan, I think.
Anders Arpteg:One thing I think is sure is that if we look at shopping behavior and buying behavior, that will change a lot in the coming years. You go from doing Google searches, in a general sense, to doing the knowledge management, the research, figuring out what to buy, to actually letting it buy. You're not just going to let it do the research; you're going to let it do the action of buying something, in a more agentic future. So if you think that, and then think about Perplexity placing a bid on Google Chrome, I must say I agree the antitrust case against Google is something that could force them to do this. I don't think Google would ever want to, but they may be forced to. So that may be a reason why they could potentially succeed. But still, it will be the big agent providers that win here, I would argue.
Anders Arpteg:I think, potentially that they want Google Chrome to get the data simply.
Goran Cvetanovski:Oh yeah, that is for the usage, yeah.
Anders Arpteg:I think that's the major thing I mean. Otherwise they could just use Chromium, which is open source. But if they just use Chromium they don't get the data. But they do get the data to Perplexity. They can train their own agents in Perplexity to do this properly. I think that's the main reason to do this kind of acquisition.
Robert Luciani:I think for the people listening here do you all use Perplexity and like it.
Henrik Göthberg:I do and like it. I use it quite a bit.
Robert Luciani:What is the use case of perplexity versus research mode in your favorite agent?
Goran Cvetanovski:It's getting tight there, but perplexity was much better in research before. So if I wanted to do a research right now, I would go to perplexity and not touch anything else.
Anders Arpteg:Really, if you were going to do research, you would go to Perplexity first? To figure out what something is, what a product is, a company, a person, it's actually surprisingly good.
Henrik Göthberg:I have a really fun anecdote. I use Perplexity. We've been fighting with Värmdö municipality about how they need to build a sound wall, and what happened was I realized, ah, I have an angle for why they need to do it, because they restructured the road where we live. They changed the situation, and they kept referring back to, no, we don't need to do another evaluation of the noise levels because we've already done that. But they did that before they changed the road. I realized that, but I didn't have any data, so I typed into Perplexity, what were the changes to the road when they built this and this and this, and what were the decisions? And it could sort it out.
Henrik Göthberg:So for me, when I say research, I don't mean proper research, I mean finding stuff out, market research, figuring market-research-type stuff out. Wow, it succinctly gave me what I asked for. I got it.
Anders Arpteg:From all the public files, and I think they've honed that.
Henrik Göthberg:They've done it for a while, and they're really good at it. And what's the next proper step for that?
Anders Arpteg:Anyone that wants to buy something. Obviously they want to have the best provider that can do the proper research to know what to buy. If they just can put it into action as well, having the more agentic mode, that would be amazing.
Henrik Göthberg:But for that they need some data. But I don't know if perplexity is better.
Jesper Fredriksson:I just like it. I haven't benchmarked it, I just like it the way I think.
Henrik Göthberg:I think it used to be incredibly useful, but, as you're saying, I think the reasoning models are catching up.
Robert Luciani:Here's a silly anecdote; maybe the audience thinks it's fun as a demo. I made an MCP server for booking train tickets, and I demoed it to a company whose job it is to make mobile-phone applications. They happen to have SJ and a couple of other customers, and they've made exactly that app. It is really hard to make an application that the user doesn't find confusing. Letting the user give some plain-text info, and then the LLM uses MCP, which is self-error-correcting, and a whole bunch of stuff happens in the background, is pretty slick. It's quite clear that that's the future.
Anders Arpteg:Yeah, it's going to go that way. If we may just finish an old topic, we went into a rabbit hole, but just to close off the topic of metrics and what we were speaking about a bit earlier: we have a couple of different types of metrics. One is purely the accuracy in being able to complete a task in some way. The other is more of a competition kind of metric. I think LM Arena is that; it will never saturate. It will always just ask, compared between these two models, which one do you think is best? You can always have a winner there, so it never saturates.
Anders Arpteg:And then we have the METR kind of metric, which is, given tasks that take X amount of time for a human, how long a task can the AI solve at a 50% success rate? I think that's the definition, right? So when an AI can solve tasks that take X amount of hours at a 50% success rate, that is where the AI is, and that can also never saturate, as long as you can find longer and more complicated tasks. I think they can still improve it in terms of agentic abilities; today METR is very limited to non-agentic abilities.
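To make the METR-style metric concrete: give the model tasks whose human completion time is known, and report the longest task length at which it still succeeds about half the time. A toy sketch of that idea follows; the data is made up, and METR's real methodology fits a curve rather than thresholding like this.

# Toy illustration of a "50% time horizon" (made-up data, deliberately simplified method).
results = [  # (human_minutes_to_complete, model_succeeded)
    (5, True), (10, True), (30, True), (60, True),
    (120, True), (120, False), (240, False), (480, False),
]

def success_rate_at_or_above(minutes, records):
    """Fraction of tasks at least this long that the model solved."""
    relevant = [ok for human_min, ok in records if human_min >= minutes]
    return sum(relevant) / len(relevant) if relevant else 0.0

lengths = sorted({m for m, _ in results})
horizon = max((m for m in lengths if success_rate_at_or_above(m, results) >= 0.5), default=0)
print(f"Approximate 50% time horizon: {horizon} human-minutes")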
Henrik Göthberg:You can, with the same logic, easily evolve it to more test cases. Yeah, I think so.
Robert Luciani:The terminology here, I feel, is slightly ambiguous, because my team just switched to a new user interface for providing AI stuff to our organization, and in this interface it has the word agents.
Robert Luciani:And an agent there is basically a prompt, like, do it in this way, and what you do then is link one prompt to the next prompt to the next, so it can do a sequence, but importantly a serial sequence of tasks. And that's not really what an agent is. An agent, in my opinion, should plan how and in what order, and there need to be dependencies, and there's a graph and that kind of stuff. I think maybe that's hard to explain to people, and that's why the simpler version of agents is coming up. But I guess that ties in a little bit to this METR thing, where you said that it could either be longer or more complex, right? So, like, deeper or wider also.
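A toy way to see Robert's distinction: a prompt chain runs steps in a fixed serial order, while a planning agent decides the order itself from declared dependencies. The task names and the run_step stand-in below are invented purely for illustration.

# Toy contrast between a serial prompt chain and a planning agent (illustrative only).
from graphlib import TopologicalSorter

def run_step(name: str) -> None:
    print(f"running: {name}")  # stand-in for calling an LLM with a prompt

# 1) Prompt chain: a fixed, serial sequence of prompts.
for step in ["collect data", "clean data", "write report"]:
    run_step(step)

# 2) Agent-style plan: declare dependencies and let the planner pick a valid order.
dependencies = {
    "write report": {"clean data", "make charts"},
    "make charts": {"clean data"},
    "clean data": {"collect data"},
}
for step in TopologicalSorter(dependencies).static_order():
    run_step(step)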
Anders Arpteg:Yeah, true.
Henrik Göthberg:But this is a big problem that we have so many meanings of one word here.
Anders Arpteg:But I think we can evolve to different types of metrics that don't saturate, and I think that's really interesting. And of course we have the whole synthetic-data angle, like what the Absolute Zero paper from a Chinese university did, which I think is amazing. It trained itself both to answer questions and to generate questions, in a kind of GAN-like style where it competed against itself to generate more and more complicated questions while at the same time trying to answer more and more complicated questions, and it didn't need any human involved to improve its reasoning and accuracy. It was without pre-training, right? Well, the language model was still pre-trained; it's just the post-training that was done without any human data. So the "absolute zero" is in terms of no human data for the post-training.
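A schematic of the self-play loop being described, with a trivial arithmetic stand-in instead of a language model. The real Absolute Zero work does reinforcement learning on an LLM with code execution as the verifier; this toy only shows the proposer, solver and verifiable-reward structure.

# Toy proposer/solver self-play loop in the spirit of "Absolute Zero" (not the paper's method).
import random

def propose_task(difficulty: int) -> tuple[str, int]:
    """Proposer invents a task plus a verifiable ground truth (here: addition)."""
    a, b = random.randint(0, 10 ** difficulty), random.randint(0, 10 ** difficulty)
    return f"{a} + {b}", a + b

def solve(task: str) -> int:
    """Solver attempts the task; a real system would be a trained model."""
    a, b = (int(x) for x in task.split(" + "))
    return a + b if random.random() > 0.1 else a  # occasionally wrong, so rewards carry signal

difficulty = 1
for step in range(20):
    task, truth = propose_task(difficulty)
    reward = 1 if solve(task) == truth else 0  # verifiable reward, no human labels
    if reward:
        difficulty += 1  # push the proposer toward harder tasks
    # a real system would use these rewards to update both the proposer and the solver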
Robert Luciani:I don't want to lure you into a discussion that I know you would easily be lured into. But the question is, if we think about the post-RLHF era, there are people saying, we can keep this progression going by doing reinforcement learning with these cool tricks, and then we have another camp saying, no, we need to do multimodal stuff or something like that. And you know, I've always thought of language as a sort of bootstrap.
Robert Luciani:And then we just need to be more clever after we've bootstrapped it to make it super smart. But you know, we can at least get it close to human with just language maybe. So yeah, right, jepa.
Henrik Göthberg:Everything, all roads lead to.
Robert Luciani:JEPA.
Anders Arpteg:Well, can we just try to close the metrics topic first, and then
Anders Arpteg:we can add this as another one. Because I think there are a number of different types of metrics. One, of course, is the accuracy one. Then there is the competition one. And then we can also have, yeah, things like playing chess. There was an interesting thing just in recent weeks where they actually put GPT-5 against Grok 4 playing chess. That's fun. I know it was Demis Hassabis who spoke about this, and it was launched, I think, I'm not sure, at ICLR in 2025: this Game Arena benchmark. So it's a new setup where AI plays against other AIs and, in that way, you try to see who is the most intelligent one, and that's also a metric that will never saturate, right?
Jesper Fredriksson:What happened, by the way? I forgot. Did GPT-5 win?
Anders Arpteg:Yeah, GPT-5 won against Grok 4.
Jesper Fredriksson:I just remember, as I said before this talk, that Magnus Carlsen was watching the game and laughing out loud when Grok 4 gave away a queen.
Anders Arpteg:But of course, if you do a proper AI-based system like Stockfish or something, no human would ever have a chance. But now for these kind of generalized AI models, our foundational models, of course they play chess very poorly and I think that's interesting.
Henrik Göthberg:Yeah, we're going so off the rails here. There is a red thread here somewhere.
Goran Cvetanovski:What is it? Benchmarks that are enjoyable.
Robert Luciani:We're talking about METR here, and it's related to agents following instructions over time, sort of where we think everything is headed, and it's hard to say what's driving that change. But the type of training required to get really good at those benchmarks is a post-RLHF type of training. We have been building chatbots up until now, because that's what people have liked, and tuning them to chat really well. Now what do we tune for? Well, programming is pretty much a given, and then people are like, oh yeah, asking and answering questions, maybe we can also automate that. And what everybody's looking for, the holy grail, is, without having supervision, how can you do reinforcement learning? We keep coming up with more and more clever ways to do supervised learning.
Henrik Göthberg:So what is the trajectory for tests?
Anders Arpteg:Okay for tests okay.
Henrik Göthberg:So we are saying now, on the one hand, more real life, more actuation, more agentic, in line with METR.
Anders Arpteg:But METR isn't that agentic yet. I hope it is in the future.
Henrik Göthberg:No, but it shows a path. I think it is, but what is...
Robert Luciani:What is the math test called, the one where they can solve four out of the 40 questions, or whatever?
Anders Arpteg:Whatever the hardest one is, the last test, what is it called?
Robert Luciani:Humanity's Last Exam.
Henrik Göthberg:Yeah, I mean, that is awesome. What is that? I've read about it, but could someone summarize it, in case the listener hasn't heard about it?
Anders Arpteg:It's supposed to be super complicated questions that only the top experts in the human world, so to speak, can answer, and it's in different fields: math and physics, biology, computer science and more.
Robert Luciani:My understanding is it's mostly pure-maths stuff, but the point is it's very scientific.
Robert Luciani:There's a difference between how pure-maths people do things and how the rest of us do. If you ask them, everything we do is arithmetic, plus and minus, whereas what they're doing is formal-system stuff, and a formal system is provable. But you have to be very systematic about it, and you have to make these gigantic analogy leaps that are quite, quote-unquote, creative. And for the AI to have solved four of them, questions these people thought would take forever, is very cool.
Robert Luciani:And we should comment, because you said something about how long ago the GPT stuff came out, and I think you made it longer than it is, right? I mean, it's barely been three years total since the ChatGPT revolution.
Anders Arpteg:Now, Humanity's Last Exam I think is fun, and I actually had to look it up here a bit. GPT-5, I think, scored 25% accuracy on this one. Gemini 2.5 around 21 or 22%. Opus 11%. Grok 4 slightly above GPT-5, with no tools. But then you have Grok 4 Heavy; Heavy was this setup with four agents running in parallel, and they also limited it to the text-based questions, because the exam apparently has a multimodal part where you have to understand images and diagrams and whatnot. But if you ignore those and just take the text, Grok 4 Heavy scores 50 fucking percent.
Jesper Fredriksson:That's amazingly high. I wonder, are these benchmarks open? There's always this risk of leakage, of being trained on the benchmark.
Robert Luciani:No, but the point is these questions are sort of open; not open research questions, but so difficult that there's no published solution. It's like Fermat's Last Theorem: an arbitrary question where, if you were to produce a solution, there'd be five people in the world who could verify it, experts in modular forms and a couple of other things that are that niche. And that's really awesome.
Anders Arpteg:But it is a good question. I'm not sure, because I think there are verifiable answers in this one.
Robert Luciani:So somewhere there is something you can test against. Okay, yeah, sure, right.
Anders Arpteg:So in this case you can say if it's right or wrong.
Goran Cvetanovski:And if it were?
Anders Arpteg:to have leaked, then of course you would never be able to separate having memorized the results from actually reasoning your way to the results, right? And it could also be just that people are talking about the benchmarks.
Jesper Fredriksson:If they are open, then, I mean, we have a worldwide web. So if there is one expert in the world who talks about it on X, then it's already out there.
Robert Luciani:But that's fair game, isn't it? Because we're talking about real problems, I mean it's still.
Jesper Fredriksson:It's learned from reading the answer in that case.
Henrik Göthberg:But if we try to close the topic on benchmarks, let me close with a provocative question. Should normal people care about benchmarks in relation to how they choose what to use?
Anders Arpteg:If the alternative is anecdotal evidence, I say absolutely they should care about proper scientific like objective benchmarks.
Robert Luciani:There's some stuff that can't be benchmarked. I'm just going to comment on how good Suno and other such services have become just in the past while. It's so good. It can't make as good music as I can, but it's very close.
Anders Arpteg:I mean, it is shockingly good. Yeah, of course, some things are harder to measure than others, but if you were able to measure even a subjective thing like taste in music, it would still be better to have benchmarks than not to have them.
Robert Luciani:There's lots of things you could benchmark, like how high-fi the sound is and stuff like that, but even taste.
Jesper Fredriksson:Can I say something to answer your question, should normal people care? If you're talking about the average ChatGPT user, it doesn't matter, because it's already so good. I mean, for the normal type of questions from non-experts in our field, if you're not talking about programmers, if you're not talking about mathematicians, just everyday kind of stuff: getting this work thing done that would take me, I don't know, half an hour, I can do in a couple of seconds with ChatGPT.
Robert Luciani:Then why is my LinkedIn feed filled with people complaining? They say, draw a map of Europe with the name of every country on it, and it spells Russia wrong or something like that. That's all my wall is filled with, and they're like, you see, they're still dumb. People say this is PhD-level intelligence, so why can't it spell Russia correctly? How have we sort of forgotten about the normal user and just skipped ahead to all the cool stuff? So this is a segue, right?
Henrik Göthberg:So this is wrapping up the first question, and I want to answer it myself, because I have some pragmatic advice.
Anders Arpteg:Okay, but don't go into the future trajectory, because that's the second question. No, no.
Henrik Göthberg:I think the segue here is wrapping up on benchmarks: are they useful is my final question, and I have advice on how to think as a normal user. And from there we can get to the trajectory you were hinting at.
Henrik Göthberg:And the anecdote here is an example from real life in enterprise. We have a basic question: should we use Power BI, or should we use Tableau, or whatever fucking application for something? And we have expert firms, Gartner or whoever, who have done their quadrants, very professionally benchmarked stuff. And then you realize, if you work in any large enterprise, with all the craziness going on in IT, with the tech debt from hell, the sometimes crap data, all the different variations and combinations, it's more than 10 years ago that we said, you know what, we need to do a bake-off, we need to do a prototype, we need to see how it fits in our specific context. And I think you can take that down to any user or consumer: what are you using it for? It's vastly different from what I'm using it for.
Henrik Göthberg:I'm not a coder, you're a coder; I care about marketing research in this vein, blah, blah, blah; I'm using it for designing cool t-shirts, whatever. So, whatever you care about, whatever you are doing, I think it's time to narrow it down. There are three or four or five. You need to try them out, and then you will feel, anecdotally, for yourself, what you prefer, and I think that is more useful.
Jesper Fredriksson:And they are all very capable, all of them, in my opinion.
Henrik Göthberg:It's a little bit like, you know, you get into these semi-stupid questions: oh, which one should we... Enterprises are like, should we go there? Just fucking pick one, yeah.
Anders Arpteg:But then you should always pick the cheapest one. Wouldn't that be the argument then?
Jesper Fredriksson:No, you should pick the one that you like the most. I think all of them are equally capable.
Anders Arpteg:I don't fully buy this argument.
Henrik Göthberg:No, I think this is a good argument. Because if you want to optimize, let me rephrase it: you should optimize it for its value to you.
Robert Luciani:So if you are planning to do something where you have a lot of API calls, then the cost really matters, and it's worth looking at a couple of different options. That's why my dad uses DeepSeek. I'm serious. I asked him, do you use any AI? And he's like, yeah, I use DeepSeek, it's so smart, it knows everything about the Roman Empire and so on.
Henrik Göthberg:So the Chinese are undercutting on price with what they put out?
Robert Luciani:No, but I mean, you suggested the cheapest one, and DeepSeek comes in and it's good enough.
Henrik Göthberg:But once again, there's no right or wrong. Then you need to balance that cheapness against, what's the downside here, why is it cheap? Maybe. But I still maintain, you kind of need to understand this from your own perspective, and the only way you're going to get there is by looking at it from your perspective.
Jesper Fredriksson:And all models are not just models. They have different kind of tweaks, some have advanced voice mode, et cetera.
Anders Arpteg:If I were to choose what to pay money for, I would never rely on my own anecdotal evidence, like looking at the fridge or looking at the car. I would like to have some hard data, some kind of test that I can trust, that can help me choose where to put my money. I'm a scientist at heart, so for me at least, that matters. I agree with that if you don't have any chance to try it, but...
Henrik Göthberg:But if Elgiganten had a setup where you could have 10 different fridges, and they come and fix everything for you, and then you pick the one you liked...
Robert Luciani:We can't do that with a fridge, but in reality, if there is no barrier for me to try it, I still think that has merit. Connecting what you two are saying here: each user has different requirements. So if you want to write poetry, then you go to the poetry benchmark, and if you want to program, you go to the programming benchmark. I think that's what the future is.
Anders Arpteg:And Grok is really good at mathematical, abstract reasoning; if you want that, you should choose that. If you want something that's really good at phrasing yourself in some language, you should use GPT. If you want the best coder, you should use Grok.
Jesper Fredriksson:I tend to think of it as the same way as I think when I'm buying a computer, it's like I'm buying a Mac.
Anders Arpteg:That's not because it's the cheapest, it's because you don't have any good sense or taste. Then I'm the wrong example. By the way, this is why I want to sit at this table; oh, there's a Mac there.
Jesper Fredriksson:I want to be on this side, so there are clearly different preferences in how you make decisions.
Anders Arpteg:Yeah, okay, cool. Perhaps we should try to move into the more futuristic kind of trajectory, you know, where are we going, in some sense. Just to open up a question there: I listened to François Chollet recently, and he made some kind of statement.
Henrik Göthberg:Who is he? For the listeners.
Anders Arpteg:Yeah, he's the one who created the ARC-AGI benchmark for testing AGI in a super hard way, trying to find questions that humans are good at but that AI is really bad at today. He's actually been a deep-learning skeptic for a long time. He was a Google engineer as well. He also wrote a paper on intelligence, on how to measure intelligence properly, which I really like. And he wrote the Keras framework.
Robert Luciani:He's famous for Keras, but he's never actually done any super research in anything.
Anders Arpteg:But he's a famous person, and he's actually been a skeptic for a long time, especially about deep learning, which is interesting, and he's also been a skeptic about whether we will ever reach AGI. And now he's actually changed his mind, saying that just in the last year, the single last year, he changed, and he now thinks AGI will be reachable in five years, which is very surprising coming from him.
Robert Luciani:Nice statement.
Jesper Fredriksson:That's a powerful walk-back, from "we will never reach AGI" to five years.
Robert Luciani:We have to be fair here, though, all the people that are walking back and they're like you know what? I changed my mind. No, you didn't.
Henrik Göthberg:You got smashed by the sledgehammer of evidence but okay okay, yes, I agree, I loved how you pictured that in my head.
Anders Arpteg:He had to retract. It's like the people on ML Street Talk they are forced to agree that, okay, deep learning is actually useful.
Jesper Fredriksson:When will Yann LeCun walk back on LLMs?
Henrik Göthberg:No, never.
Anders Arpteg:That was too deep, that was like a black hole. Okay, let me just try to go back to the topic of trajectory. I just want to say something that François actually said in this talk, which is something I agree with: that the LLMs of the past, at least before o1 and o3 et cetera, were mainly reasoning by finding templates of reasoning. So it's memorizing templates of how reasoning has been done in the past for similar tasks, and then applying that template to do apparent reasoning, which is very shallow. It's still reasoning, but it's very shallow, and it simply finds a template that's similar to the problem you're trying to solve. And that's why the ARC-AGI benchmark he designed wasn't being solved, because there are no similar reasoning templates out there. Now he has said, okay, I've seen a change. I've seen that it can adapt, that without having to retrain the model, it can actually, at test time or inference time, adapt to a new type of problem it's never seen before.
Anders Arpteg:That's why he's now saying there is actually fluid intelligence in these kinds of AI models, and that's why he significantly reduced his timescale for AGI, and I think that's a really good statement, and I actually do agree with it. People are a bit confused when they say an AI model can reason; it's shallow reasoning, but it's getting better. Now, the question then, of course. I still like the OpenAI kind of AGI pyramid with the five levels, with the lowest one being conversational, let's call it, or knowledge management, as you call it, and then reasoning, and then autonomous, you know, being increasingly autonomous.
Anders Arpteg:Yeah, autonomous in taking actions, but I can go into that. And then you have innovation and organizational, and humans should be focusing on the top two more and more in the coming years, right? Is that a trajectory? Would you agree that this is a good future trajectory to consider, that this is what we will see? AI is really good at the bottom layer today, better than humans, I would argue, much better. And just to motivate that: you can put like 10 books into Gemini's one-million-token window today; it can read them in seconds and have more or less perfect recall of all those 10 books. No human could ever, ever do that. So in terms of memory management... The pyramid, the Maslow pyramid here?
Henrik Göthberg:Oh, that's another pyramid.
Anders Arpteg:It's improving all the time, so reasoning is getting better and better. But you shouldn't be fooled: its reasoning is still worse than yours.
Robert Luciani:Are we allowed to get into philosophy here yet? Very much.
Henrik Göthberg:I mean, now is the time.
Robert Luciani:I believe that we have inherited a lot of terminology, ideas and other stuff from neurobiology and from philosophy, to our disadvantage. And the reason is that a lot of those things come from our human hobby of looking at something complicated and cataloging it, whereas in reality what we see are complicated emergent phenomena that come from very simple rules. And if we know those simple rules, they tend to generalize well. And what we've now learned with machine learning is precisely that: we can have these general-purpose architectures that learn anything. And then you're like, well, I wanted to learn this one thing, so I'll make the architecture just a little bit more specific to it. But general-purpose rules are the thing.
Robert Luciani:And so if we tie this to François' idea here, that there are different kinds of thinking, there's fluid thinking, there's this kind of thinking, or there's fast and slow thinking: it's not clear that there's any physics explanation for that. There is such a thing, we have identified it as such, but for the emergent or source principles that are necessary for thinking, it's not clear that you need two separate systems or anything like that. And so he's surprised that, oh, it does generalize. Yes, because it's the same system doing both the fast and the slow thinking that you're looking at from above; it's irrelevant whether you can identify them separately.
Anders Arpteg:Great comment. If I may just come back to that, I would say it is two different systems. If we take the human brain, we know we have long-term memory, we have short-term memory, and we have some kind of working memory: at least three different ways the human brain works. In the working memory you can perhaps think about two or three things at the same time and lose them in minutes. Short-term, it's a bit longer. Long-term, it's very hard to influence, but it lasts a lifetime, more or less.
Anders Arpteg:Now, if you take an LLM of today, you have the parameters, which are trained offline in pre-training mode; that's the long-term memory, and of course it's very hard to change. And then the prompt you have is potentially the working memory, which is lost in minutes or something. But we're really lacking the middle part, the memory part. We're starting to see it a bit, since ChatGPT has the memory feature, and even in Cursor, when you write, it actually adds stuff to its own memory. There are some research architectures that are experimenting with ways of achieving that, right?
Anders Arpteg:Yeah, there's an old one from 2018-ish or even earlier, the Neural... ah, I forgot the name now, but there have been attempts at trying to add memory.
Robert Luciani:Now, I think Mistral has, what is it called, the Mamba architecture, and there are a couple of other ones they're playing with. I'm just saying that people are trying things with recurrent layers and that kind of thing.
Henrik Göthberg:Are you on two different sides of the argument, or is it nuances? Because you're saying there's only one system, or trying to think about this philosophically, and you're coming in with, no, there are many systems. But on the one hand, it's also, well, it's one architecture, and it's different angles now on what you need to have in that architecture. Even the brain is one brain, or you can understand the brain as several modules.
Anders Arpteg:The brain is certainly different modules in it.
Henrik Göthberg:Yeah, but it depends on where you define the system. The brain is one brain, but the one brain is constructed of several.
Anders Arpteg:It's also all a matter of what kind of abstraction level you're putting yourself on you can say the whole world of humans is one organism but it's.
Henrik Göthberg:But the logic here is: the brain is one architecture.
Jesper Fredriksson:The brain is made to be power-efficient, so...
Robert Luciani:What we could potentially have is a single feed-forward, fully connected neural network doing everything, but we can't build one like that that emulates the brain. But we can easily say that the human brain can be broken down into different pieces, and we can see they have different functionalities.
Anders Arpteg:So it's easy to see that. But if you take the GPT type of LLM, which is the decoder part of a transformer, you can see it's not possible to break down. In that sense it's the same type of blocks just being repeated. So in that sense that's one component, that is one part that's very similar with a similar kind of functionality.
Anders Arpteg:I would argue that we will see more and more of these components. We already see it today with, as people call it, the scaffolding. I hate the term scaffolding, but anyway, they have scaffolding around the LLM to make it actually reason in different ways, take actions in different steps and become more agentic.
Robert Luciani:But that should be done in latent space now, yes, of course, now we're there.
Henrik Göthberg:Now we're there. Now we're closing the loop.
Anders Arpteg:But would you agree that we will see this kind of more complicated architecture in the future?
Henrik Göthberg:For sure. I think, even to the point of Yann LeCun, whether the LLM route is right or wrong, I don't think it's about that. I think it's that we've come quite far on one module of the architecture, but there are diminishing returns to continuing there, versus starting to add the other modules, or expanding the architecture into latent space or whatever. That's where we're going to go. And this is very interesting, because then, of course, what are we going to measure, what are we going to look at, what are we defining progress on? If we have benchmarks that only look at one part of the architecture, is that really useful, or are we only seeing diminishing returns? Now we need to construct benchmarks around the understanding that the evolution of GPT doesn't need to go further down that route; it needs to go somewhere else.
Robert Luciani:We're already seeing small signs of that. There was a neat paper that came out, I think maybe before the summer, that analyzed transformers in vision, and they found that the filters it was designing were very different from biological ones. So, whereas we've found a lot of success in adapting convolutions for image recognition, transformers have found other setups that we don't see in biology that perform just as well or sometimes better. And that is just at the architecture substrate; we're seeing signs that maybe there's another path that's not very human-like. I think what Yann is going to say in the future is, yes, but this thing you have is not an LLM. And he would be right.
Henrik Göthberg:But then we come to the next topic: how long should we keep staring only at the LLM, and when do we need to start looking at the architecture of the whole compound AI system?
Robert Luciani:There are two big driving forces. I'd like to hear Jesper's thoughts about this, because I know you're very much into agent mode as well and things like that.
Anders Arpteg:And I must say, the term scaffolding seems to be increasing in popularity a lot, and I would like to hear what you think. If it means something, it means that we are adding some kind of logic or functionality around the model that is either heuristic, human-coded, or even some learned behavior. What do you think about it?
Jesper Fredriksson:Yeah, I'm going back and forth on this. It's obvious that the scaffolding camp is gaining a lot of traction right now, and I alluded to the use of Claude Code and the types of scaffolding I see there, which are very good. But then I listened to a podcast with the inventor of Claude Code, where he was specifically asked by the interviewer what he thinks about scaffolding in the future, like memory-type structures and so on. He said it changed when he started working at Anthropic. He used to be in the scaffolding camp, but when he started at Anthropic he was like, no, everything is going to be in the model. So I'm now leaning towards that, and I think, to support that argument...
Jesper Fredriksson:I had a discussion with, I think, Claude or some AI over the weekend specifically about this, because it's something that I'm thinking about a lot, and I got to a conclusion when I pushed Claude on the bitter lesson: what is it that scales? We're pouring all this money into the LLMs, and we know that we have this trajectory, and as long as everything continues to scale, as long as we get predictable gains by putting more money into the system, then we will just continue to do that, and there is no fundamental bottleneck to the system. It's just scaling the context size and scaling the parameters and, most importantly, the data. We need to solve the data problem.
Anders Arpteg:So do you truly think that we can reach AGI by just scaling up a GPT kind of style transformer block?
Jesper Fredriksson:I think what will happen is that we will get to a point where the economic gains of the system... It's all about money, because it's a major investment that we're making. As long as we continue to put in money and get something out of it, because the rewards we're getting are so predictable, I think we will get to a point where it can do economically viable work. And, to allude again to the steps you mentioned from OpenAI, the levels, we're now at agentic and we're sniffing at innovation.
Anders Arpteg:We're sniffing at agentic.
Jesper Fredriksson:We're definitely at agentic.
Anders Arpteg:I would say we're sniffing at agentic. I agree with you there, we haven't solved it.
Jesper Fredriksson:No, no, we haven't solved it, but that's the level that we're at.
Anders Arpteg:We're working on that; we're working on reasoning as well, and the only level we've solved, I would say, is the bottom one. Please, let me continue.
Jesper Fredriksson:So what I wanted to say is, for coding we're at the level where we're truly agentic.
Anders Arpteg:It can still improve, but we have solved enormous challenges.
Jesper Fredriksson:If you look at the pace of innovation in agentic coding over a year, it's miles. It's amazing, but it will continue in the coming year, right? Even for coding, definitely.
Anders Arpteg:So we have more to improve.
Jesper Fredriksson:Definitely. So we still have a way to go. But what I wanted to say is, after agentic, we're already starting to sniff at innovation. For example, AlphaEvolve, I think it's called, this system that can improve on known algorithms, which Google DeepMind released. So once we get those reinforcement systems into play, then we will have something that can...
Anders Arpteg:Can I just try to put another spin on what you were just saying, and potentially disagree? There is this idea that you can simply scale things, and in theory, of course, a transformer could model any kind of problem or function that you can think of. But just as well, you can say there is a mathematically proven theorem, the universal approximation theorem, which says that even a single-hidden-layer neural network can approximate any function whatsoever.
Anders Arpteg:It may need to be enormously wide, but it can. It's just extremely inefficient to train, so extremely that you never would. That's why we have convolutional networks, recurrent networks, the deep-learning kinds of systems that we have.
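For reference, the classical statement being invoked (in the Cybenko/Hornik form): for a suitable non-polynomial activation, one hidden layer suffices, with no bound on how wide it has to be.

\[
\forall f \in C(K),\ K \subset \mathbb{R}^n \text{ compact},\ \forall \varepsilon > 0:\quad
\exists N,\ \alpha_i, b_i \in \mathbb{R},\ w_i \in \mathbb{R}^n \ \text{such that}\
\sup_{x \in K} \Big| f(x) - \sum_{i=1}^{N} \alpha_i\, \sigma(w_i^{\top} x + b_i) \Big| < \varepsilon.
\]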
Anders Arpteg:Yes, and transformers. Transformers made it vastly more efficient. I think that's the primary reason transformers were so successful: the efficiency, not really the power of what they can model. I would argue that the transformer did improve things a lot, but it's not an efficient way to keep improving in terms of agentic abilities.
Robert Luciani:You know what the transformer is: it's offloading various parts of the job to machine learning. Not only does it machine-learn the sensitivity to the variables, it machine-learns the features, it machine-learns the norms, and higher and higher up that stack; it's second-order statistics that nothing else has. So the more we can figure out how to offload our problem-solving onto a machine-learning optimization, the better.
Anders Arpteg:It's generalizing convolutional networks and recurrent networks at once. But still, I would say that just as it improved efficiency to go from a single feed-forward network to a recurrent or convolutional or transformer network, there are still potential efficiency gains to be had.
Jesper Fredriksson:Definitely, I don't argue against that. Well, you did. No, I'm not. I'm saying this is a paradigm that seems to work, and as long as it does, I think that's what will happen: we will continue to do it. And, let me finish, I agree that it's still not known how far the transformer will scale. I'm not saying that; I'm saying we will continue as long as it scales.
Anders Arpteg:Let me quote someone, and I'll probably mess the name up a bit here, but he was asked a similar kind of question: how much can we scale by adding more data, more parameters to the model and more compute? And he said, of course we will see improvements if we scale more data, more compute and more parameters.
Anders Arpteg:But you can also compare it to algorithmic improvements, and he said that both historically and in the future, algorithmic improvements are worth 10x more than scaling in terms of data, compute or parameters, and I certainly agree with that. So I think if we find new architectures, new algorithmic improvements that improve the way we can handle not only long-term memory, which transformers are really good at today, but also working memory and short-term memory, that will be more valuable than just scaling data, parameters and compute. Let me summarize with just a single example: neuromorphic computing.
Henrik Göthberg:I was going to go there, I know Okay.
Jesper Fredriksson:Because this is not an either-or. It's partly... no, it isn't. Because first, it's an observation.
Henrik Göthberg:As long as it works, people are going to keep pouring money on it.
Jesper Fredriksson:That's exactly what I'm saying.
Henrik Göthberg:And so don't expect this to stop. If we hit the ceiling, that's another topic, but the ceiling is not there yet. So don't expect anything to change. Is the ceiling there in terms of energy?
Anders Arpteg:I would say the energy is the ceiling.
Henrik Göthberg:Now we're getting to a ceiling, and then you're adding to it. Okay, so if we want to understand and predict the future, we know there is one trajectory here. This is a movement, a train going, and it's not going to stop. However, at the same time, we all know through the history of invention that we will have algorithmic leaps. The interesting part now is where in the stack these leaps will happen. So if we take the example now, the trajectory is going here. What part of the architecture do we think comes first? Maybe neuromorphic computing changes the whole fucking game on the lowest level, and then basically we can continue on the upside with no difference, or the algorithmic change is higher up in the abstraction, in the latent space.
Anders Arpteg:So we don't know right, but we know it's going to happen. I think we're bound today by both compute and energy, so we need algorithmic improvement. I would say.
Henrik Göthberg:On all levels.
Jesper Fredriksson:Eventually it's going to be a bottleneck, yes. Let me go back to the METR case: as long as it continues to scale, if we continue on this path, then by December 2027 we will be able to do things with an LLM that would take a human one week to do. But that assumes that we can scale in terms of compute and energy.
Anders Arpteg:Which we cannot, from an energy point of view.
Jesper Fredriksson:No, yes, we can. Where do you find that energy? That is what I found from, I think it's called, Our World in Data. They estimate that until 2030 we can continue to scale.
Anders Arpteg:But we're only getting like a gigawatt data center.
Jesper Fredriksson:Please, let me continue. So up until 2030, according to the best source I've found, we will be able to scale the current paradigm. That's why I'm thinking that up until 2030 we will continue to pour in money, as long as we don't hit any fundamental limit.
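As a rough back-of-the-envelope version of the extrapolation argued here, a minimal sketch in the spirit of METR's task-horizon measurements; the starting horizon (~2 hours today) and the ~7-month doubling time are illustrative assumptions, not numbers quoted on the pod:

# Sketch: extrapolating how long a human task a model can complete end-to-end.
# Illustrative assumptions (hypothetical, not measured here):
#   start_horizon_minutes ~ 2 hours for current frontier models,
#   doubling_months ~ 7, a METR-style doubling time for that horizon.
from datetime import date

def horizon_at(target: date,
               start: date = date(2025, 8, 1),
               start_horizon_minutes: float = 120.0,
               doubling_months: float = 7.0) -> float:
    """Extrapolated task horizon in minutes at `target`, assuming the
    horizon doubles every `doubling_months` months after `start`."""
    months = (target.year - start.year) * 12 + (target.month - start.month)
    return start_horizon_minutes * 2 ** (months / doubling_months)

if __name__ == "__main__":
    minutes = horizon_at(date(2027, 12, 1))
    print(f"~{minutes / 60:.0f} hours, roughly {minutes / (60 * 40):.1f} 40-hour work weeks")

With those assumed numbers, the horizon lands at roughly one working week by December 2027, which is the claim being made; whether the curve actually holds that long is exactly the compute-and-energy question debated here.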
Robert Luciani:I think there's a good question to ask here. One thing that has been particular about AI development for the past ten years, let's say, is that it's been primarily driven by industrial research, right? Not academic research; the people that have access to the hardware, the companies and so on. Okay, so you have, on the one hand, all the scientists that are interested in research questions, neuromorphic computing, architecture design and so on, and then you have the selective pressures, let's say, the business pressures of what pays the bills. And, you know, I'm thinking about, is it Recall in Windows, the Microsoft feature that lets you rewind backwards? You know, the sort of consumer or maybe business applications.
Robert Luciani:This is where the money comes from. Is that pressure going to be stronger than the interest of the scientists in which direction we want these things to go? So, for example, I don't think we've hit the energy quota of music generation models, and what if that pays the bills for everything more than coder models do? It doesn't, but you get what I'm saying. I wonder if the money, for quite a while, can push us forward in directions that fill out the gaps while researchers work on more fundamental problems.
Jesper Fredriksson:And when we get to the level where the AI can do the research... Ideally. All right.
Henrik Göthberg:So let me, a little bit philosophically, a little bit like a rant, end this trajectory discussion with: this is all bullshit, and it's very simple. You know, we had this conversation about the first, second, third, fourth industrial revolution. Okay, and if you look at the data now, technically, in terms of what the capabilities are, what we are talking about here is the bleeding edge. The fourth, or the fifth, or the sixth industrial revolution has technically already happened. But our inability as organizations and societies to flip that into operational value is the real bottleneck. So how much should we really talk about this stuff and basically feed the Silicon Valley machine, and when should we flip this whole conversation to: what do we need to do with agentic and all that to make it work? And then maybe that's an organizational topic, maybe that's a leadership topic. So, just to take us back to earth a little bit.
Robert Luciani:I can feel the AGI.
Anders Arpteg:That's all I'm going to say, but I think it's a good point. And we can certainly see the big race between the big tech giants and also the countries; of course, China and the USA are really racing hard here. Just one fun stat.
Robert Luciani:People think that China and the US are on par with each other, but at least for right now, in terms of compute power, US dwarfs them at 10x.
Anders Arpteg:Yeah, of course, but just an interesting stat. They found really innovative ways in DeepSeek and other models to still circumvent that.
Henrik Göthberg:I'm just arguing. Which race should we be in? Should we be in this tech race or should we be in the adoption race?
Anders Arpteg:I think what we have spoken a bit about before, Henrik, is also the top tech giants' race versus what normal companies, people and societies actually make of it and how you find value from it. And there is a race, and you can see the motivation for it, because you can think of the innovator's or first-mover advantage of potentially getting AGI, which could lead to some kind of explosion; I don't think so, but still. You can understand why countries are fighting over it, because it will have a huge impact on their economy and national security.
Henrik Göthberg:Geopolitics, all the layers.
Anders Arpteg:So it's really high stakes, and they really need to fight over these super expensive kinds of solutions that no normal company or person would ever use. But still, they will try to achieve it, and from a tech company's point of view they also want to win, because if they are the winner in whatever competition this is, they will get the users. But I don't think that's where the value comes from. Right, and I think that's a really good point. So if we can let the X number of countries and companies fight for it, then we can try to scoop in where the value really is.
Henrik Göthberg:And I think we've been trying to push this a lot, me and Anders, even to the point where you get to this arbitrary conversation: should we build our own cloud in Europe, or whatever? And yes, we should, but how is more interesting, right? Should you fine-tune something else, or do you need to build it from scratch? Should you build your own LLM? Blah, blah, blah. And here I think there is a much more important conversation to be had, which is why I'm coming back to AI compound systems. Because all of what we're talking about, in theory, how it will work, this is the game for the Groks and the OpenAIs, but for normal people it's all a done deal. We should all try to build AI compound systems. We should simplify and make use of the simplest model. We should understand practically where we need to use the LLM, and maybe even minimize it because it's tricky, and build systems. That is my argument.
Robert Luciani:You've been in the consumer and business space with the application of this technology. Is your impression that there is a sense of urgency? And the reason I ask that question, just to give you some context, is that my impression is that Europe is not missing resources and is not missing talent. It's missing a sense of urgency, a threat to its livelihood or to its existence, basically. You know, if the problem was that we needed GPUs, Zenseact would have level-four self-driving right now, and so would all the others with GPUs. We have resources and we have talent. We just don't have that killer instinct that's necessary to get stuff done.
Jesper Fredriksson:Yeah, maybe we don't see the sense of urgency around the AI train. I think we're starting to see it, but we're not there yet.
Anders Arpteg:But I don't think Europe or Sweden or most companies should ever get into the AI race. I think they can, you know, have a big benefit of not taking that kind of fight. I think we can actually gain so much value for a company, for society, for people, if we simply try to focus on the value.
Robert Luciani:You're only saying that because we can't take a fight. If we were to fight there, we would lose.
Anders Arpteg:But still, it's not that we have to be sad about it. I think actually we can be positive about it.
Henrik Göthberg:Okay, I hear you. Okay. Anders was now using, in his head, "AI race" to mean the ultimate bleeding-edge AI race. What we need to do is to have urgency and feel the killer instinct about fundamental innovation on the productivity frontier, and the value, whatever the fuck that could be. If it's process-digital AI, who gives a shit?
Anders Arpteg:Let them invest their hundreds of millions, because if we recalibrate, then we can better understand.
Henrik Göthberg:Do we really need our own cloud, or do we need something else in some ways? Do we really need our own LLM, or do we need something else?
Anders Arpteg:The one that is best at finding true value for a company, a society or a country, they will win in the end, I would say. But still, of course, you can see that the frontier will win in some sense.
Henrik Göthberg:But this is now why the AI commission report is problematic.
Henrik Göthberg:Because, you know, if you look at who is in the construct of this, it's, on the one hand, some politicians, some CEOs, and a huge bunch of researchers who really want to compete in the AI race in terms of AI research in Sweden. Now, do we need more researchers? Fuck no, we need more engineers. We need AI engineers. We need people who can build AI products, and there is not a single paragraph about engineering in the whole report. We need to educate users, consumers? Fuck that.
Jesper Fredriksson:We need to have engineers. In those terms, I'm starting to be more positive, at least for the close vicinity, for Stockholm. I feel like we have our Lovable in Sweden. Yeah, we have a couple, and I think that could be a way for us to win the AI race in our space: to build the products that people will use. There is, of course, the political risk that, let's say, somebody wants to shut down all the American models for the EU.
Anders Arpteg:I'm sorry I have to break for toilet break, but I'll be right back. Sorry, I'm trying to, but okay, I'll be right back.
Robert Luciani:Are we trying to say that engineering is easier than research?
Jesper Fredriksson:No, that's not what I'm necessarily saying. I'm saying that we maybe don't have the resources. I think we do. Yeah, you might be right, so I misinterpreted your question. When you said, are we doing enough, I meant on an organizational level, on a company level. But of course I don't see, and I'm happy to be proven wrong, that we have the economic resources to build an OpenAI in Europe. No, maybe.
Robert Luciani:But I mean, let's, for the sake of argument, say that some organizations might have enough to get close. You know, we've seen some. What are they called? Mistral in France, some companies in South Korea, Chinese companies managed to do a lot with very little, because they have their ninja team of super hackers. Exactly. And then we're like, okay, we don't have the ninja team of super-hacker researchers, but we have the engineers. Yes. And so my question, rephrased again, is: do we really believe that engineering is any easier? No. And then we're like, well, we have some good companies. And it's like, yeah, but we have some customers in Europe; we have 10,000 customers, not 400 million customers.
Henrik Göthberg:No, but we had Sverker Janson here, who is at RISE, heading up the data science or AI center of excellence at RISE. He said it so well. What happens in Silicon Valley is that they are churning out engineers who have an understanding of what great looks like, of what it should look like. They come from the Googles, they do their stint there, and then they start the next startup, and it just grows a critical mass of engineers who are not only researchers but who apply almost-research to solve real problems. And what is interesting, and I love the Lovable example, is the difference in 2025 in Sweden compared to where we were with Spotify in 2012. We are now seeing the second generation of AI startups, with great engineers educated at Spotify or Klarna or wherever, and now we get going. So this is the first time we see any resemblance of the Silicon Valley phenomenon, with engineers being churned out.
Henrik Göthberg:It is vastly different to churn out high-stakes engineers compared to churning out researchers, and we are asking for more researchers in academia or wherever, but we need to churn out high-stakes engineers. However, we can do that with more Lovables, or with actually incubated stuff happening in Volvo, whatever, I don't care; that's where it happens. And then it grows from there, because you get that engineering culture in, and you kind of understand: well, it's not only about the model. We need to build the system, and the system has six risk vectors; it's the integration, blah, blah, blah. And then we're getting to real stuff, I think.
Anders Arpteg:Of course, it's super cool with Lovable and unicorns being developed in Sweden and Europe, but what I would be really excited about is if the average company and the average person started to gain value from AI. That is what I hope we would spend investments on, and I think that's actually a very positive thing, because it means we have a reasonable path, something we can do in Sweden and Europe. That would mean we avoid creating the kind of extreme classes of companies and people that they have in the US, where the average company, I would say, is far below the average company in Sweden, but the top ones are far above, with a bigger economy than most countries. But as you said, I think the point is: if we simply let them fight it out and instead find value that really benefits most of the companies and most of the people, that would be an amazing future.
Henrik Göthberg:But isn't that a very, very interesting topic: how you need to frame and really articulate more specifically the focus within an AI commission report or within a digital strategy? Because if you look at it right now, it doesn't make those choices. And that's the problem to me, because if we're not making those choices, we're going to shoot our money off with a shotgun at bullshit.
Anders Arpteg:If we try to build another frontier model, we're going to lose that money in seconds.
Henrik Göthberg:Because what you're saying now is: we're going to aim. And really, the only articulation is "we're going to be the best in the world at using AI", we said that somewhere. But what does that mean? Does it mean that we have to take on all elements of AI? No, it means we need to be the best in the world at building AI compound systems in a safe and smooth way.
Robert Luciani:We don't make fun of the prime minister when he uses AI. Oh yeah, what do?
Anders Arpteg:you think about that. That even came up on the sofa at TV4.
Henrik Göthberg:I couldn't believe it, wouldn't it be?
Robert Luciani:worse if he had said AI. What's that? Oh my.
Henrik Göthberg:God. No, I mean it's okay.
Jesper Fredriksson:He's an adult. He's hopefully a well-educated adult. He knows what happens if he uses ChatGPT or any such system, so I'm trusting that he makes his own judgments.
Henrik Göthberg:Yeah, but the question was not about how he uses it; it was, even more dumbly, about whether he should be using it at all.
Robert Luciani:Yeah, it was just like. Should he be allowed to use AI?
Henrik Göthberg:No. So I commented on this, like: look, that's the stupidest thing I've ever heard. You commented on LinkedIn. I commented on LinkedIn. But if someone had asked how he uses it, then it would have been a fruitful conversation.
Robert Luciani:But if you should use it.
Anders Arpteg:Yes.
Robert Luciani:Yes.
Henrik Göthberg:Stupid.
Anders Arpteg:I would never trust a person or a company that doesn't use AI. That's the core point, right?
Henrik Göthberg:It's the core point, but then of course he needs to know what the fuck he's doing. That's a good debate. How is he?
Jesper Fredriksson:using it. You could also argue that you shouldn't be using mobile phones or the internet.
Henrik Göthberg:Why don't we just live in a cave? It would be so much easier.
Robert Luciani:My impression is that budgets for AI, this past summer and right before, have loosened up magically, really really fast, at companies and everywhere. Last fall the wallets were really, really closed, and now hospitals, the public sector and companies are all of a sudden opening them. What are you basing this on? Just my anecdotal impression of the people calling Robert, something like that. I wish it were true.
Henrik Göthberg:I wish I could see some numbers on that?
Anders Arpteg:That would be nice, are we?
Henrik Göthberg:wrapping this up. How do?
Jesper Fredriksson:we wrap it up.
Henrik Göthberg:Yeah, how do we wrap it up?
Robert Luciani:I think we already wrapped it up GPT-6 when I wrapped it up with this old bullshit.
Goran Cvetanovski:Okay but let me try to say what.
Anders Arpteg:With the promo for season 11.
Goran Cvetanovski:11th of September. Oh, yeah, yeah, yeah.
Anders Arpteg:Okay, okay, yeah, so this was an improvised podcast. We weren't planning to do this; we just thought, you know, why not, given all the discussions about E55, and why not simply have fun with some friends talking about AI. But we're starting the proper AI After Work podcast season on September 11th, so look out for that. We're going to continue, and we'll have some awesome guests. I'm sure we're going to focus a lot on the OpenAI stairs, or the pyramid, going more and more agentic time and time again, but let's make a promise to also find the adoption angle in what we are talking about.
Henrik Göthberg:Let's really find the engineering angle and make it practical. That would be amazing. But we already have some really amazing guests.
Anders Arpteg:Yes, Very good, cool, but perhaps then let me try a question to all of you. One question, okay, two questions. One who will be the winner in the Frontier AI race Are?
Robert Luciani:you talking about like the best, like who's going to have the best AI by the end of this year? No, in five years, in five years. Like who's going to have robots in?
Anders Arpteg:five years, but okay, you can take one year if you want, but I was thinking five years. Secondly, who will be the biggest value finder for AI? What I mean with that is Society.
Henrik Göthberg:Who will impact positively society best?
Anders Arpteg:Yes, which country Basically?
Henrik Göthberg:Country or LLM.
Anders Arpteg:No, no. Which country will find it? I'm making this up as I go, but which country or continent, if we
Jesper Fredriksson:take like a.
Anders Arpteg:US, Europe and China and Asia, Russia, whatever. Which one will be best off in terms of happiness and societal benefits from AI? From
Robert Luciani:AI.
Robert Luciani:Okay, I'm ready with my prediction, and I don't have a lot to base it on, only one small thing. I think xAI is going to do very well. Yes, and the reason is, I always think that people who are not afraid to embarrass themselves have a greater chance of doing revolutionary stuff, and we know that Elon and everybody there are just not afraid, and I think that is good. Secondly, who's going to benefit the most?
Robert Luciani:I think it's like a tide that's going to lift everyone, meaning the people at the bottom are going to be the most well-off. They're going to understand the law better, they're going to have knowledge at their fingertips, they're not going to be fooled as much, they're going to be able to get information. You know, I was talking to a guy yesterday who has friends in Africa, and he was saying that now that everybody has a spokesperson with them in their phone, that's going to be very powerful. So it's going to be the equalizer in some way. Yeah, I hope that it's like a democratic force.
Anders Arpteg:Yeah, good, and the winner of the AI frontier race.
Robert Luciani:XAI.
Jesper Fredriksson:XAI. Yeah, Wow, that's a bold prediction, yeah.
Robert Luciani:I'll be the dark horse.
Jesper Fredriksson:Yeah, so my money would be on Google. They have the most resources and they've made good progress recently. They also have a breadth of things, which is a good hedge. I mean, I like the Genie type of models, the world-building models. So my money is on Google, just from sheer resources in compute and in money. They also have the talent, but they all have that. And I can't improve on your answer about lifting all boats; I think that's a very well-made observation. I think that, relatively speaking, the ones that have the least will gain the most.
Anders Arpteg:That sounds like a good outcome, right. Henrik, any thoughts? Who will be the frontier winner and the value winner? Potentially Salesforce?
Henrik Göthberg:No, but Salesforce, no, but it's such a good one, it would be interesting. I will go with an open-source story on this. I think there comes a point, and it's a little bit down to the whole B2C versus B2B question of where we put the effort, where the costs of building some of these proprietary models will be too high. So there will be a shift towards building models in a way where the fundamental model underneath is open source, much, much more than today. So the frontier models would be open source?
Anders Arpteg:No.
Henrik Göthberg:So okay. So who will win the AI race? How do you define that? The frontier race, okay. So I was not thinking of that; I answered the question wrong. I was not answering it as a frontier race but as a market penetration race. So I think there is a dark horse here in open source, but I was not answering the right question; I was answering about market penetration, where I think, on a B2B scale, there needs to be some rethinking. The frontier race was the real question.
Anders Arpteg:Sorry, it's two questions. One was the frontier, the second one was basically value race.
Henrik Göthberg:So the value race, in terms of technology, I think open source, and you can make your pick, maybe the Llamas or something like that.
Jesper Fredriksson:You know, the Yann LeCun gang. That's interesting, after Zuckerberg poached all the researchers for billions of dollars. Yeah, he has paid for the talent.
Henrik Göthberg:Let's see if we can get something out of it. And he has Yann LeCun. When do they do the algorithmic leap? So let's see, right. So I'm betting on that; maybe better odds on that one if I'm going to put money on it.
Henrik Göthberg:When it comes to society, the dark horse here is Europe, because, in my opinion, if you have a balanced understanding of democracy, risk and regulation and things like that, and if you can harness that in a good way, you can build more sustainable innovation. Maybe what you're losing in the short game, you gain by building each block in the right way, so you can step on it and it doesn't fold underneath you. But that's such a long shot, because it sort of ignores all the problems of Europe. It also highlights that, you know what, there is a race in terms of value for society that I don't think you can run unchecked, and you can then frame it: it's really fucking hard to live on a continent with almost hundreds of countries and to make that work, and maybe that can play to our benefit if we can figure that out in this space. But I don't know.
Anders Arpteg:It's almost wishful thinking, maybe, I don't know, but it's a fun angle. Cool. And you? Okay, yeah, you thought you were going to get out of here. I must agree with Robert, and I think, you know, speed always wins, and there is no one faster than Elon and his companies, Tesla and SpaceX and Neuralink and whatnot. I think that's hard to beat, also because they have the whole stack: they have the hardware, they have the software, they have the data, which is at least 10x, if not 100x, more than any other company has to build this. They have the robots, they have the humanoids, which is going to be the trillion-dollar type of market.
Henrik Göthberg:It's going to feed each other from all angles.
Anders Arpteg:And they're thinking about the full stack here, and they have something working at all of the levels that no one else has. Just, if Elon can stay the F out of politics, then perhaps they will actually win in the end when it comes to value for society. I'm really worried about the geopolitical trend or direction we're seeing right now. I think it's super dangerous if we abuse the current techniques for that kind of purpose. My biggest fear is that we will not be able to manage that before we have AGI. I will actually be happier when we have AGI, because then we can have AI supervising other AIs, but we don't have it today, so we have to have humans do it, and that is going to be really, really hard. If some human, assisted by AI, uses it for warfare or whatever, it's going to be super, super dangerous. So I'm hoping that we will come through these coming five-ish-plus years without something really bad happening.
Henrik Göthberg:And if that happens, if we come through this period, I think it's going to be beautiful. But this is a political answer. The point was: which continent, or which part of the world, do you think is coming out on top, or finding the value? And you were talking about lifting everyone; does that mean Africa, or some people? All of a sudden we see shifts in the balance.
Anders Arpteg:I mean, I think Sweden and Europe actually have a good chance here, because if we were to do it properly, we don't need to win the frontier race. We can still win the value race, and if we do that properly, it can be a really beautiful society. We've done it in the past; I think we can do it in the future. We just need to stay alive until we can do it.
Henrik Göthberg:And a friend of mine, who is from another culture and lives in Sweden, said something: you know why we will win in the innovation game, when this works? It is our culture of consensus and the way we make decisions.
Henrik Göthberg:In some ways, we have a better understanding of how to distribute agency out in an organization, and therefore making this work will sit much closer to our cultural values, whereas in societies that are extremely hierarchical, the tricky problem of distributing agency, which we now want to give to agents, has not at all been solved.
Anders Arpteg:We've been really good at distributing wealth in the past, and if we continue with that, which the US and China have not been, then we can have a good future as well.
Robert Luciani:Europe has been coattailing on the US for 78 years. We should stop doing that we should stop doing that.
Henrik Göthberg:Yes.
Robert Luciani:No more medicine, no more technology. We just get rid of Windows and Apple and everything. I think we can have one last question, which is: utopian or dystopian? Okay, the standard question: how many are utopian about AI? Meaning not that we're going to be in a utopia, but that everything is going to be much better.
Anders Arpteg:If I start, I can say what I said again: if we just manage to survive until we have AI that can supervise other AIs, then I'm glad. I'm really scared about the short-term future, meaning the coming five to ten years. If we can just survive those, I think it will be a super beautiful future for humans. I think it's going to be a rocky road towards utopia, but I hope that we will get there.
Jesper Fredriksson:I think the challenges are the inequalities between those who know AI and those who don't. That's a big challenge to solve, but I think there's going to be some time, just as with the industrial revolution, when we have to adapt as a society, and there will be those who are quicker and those who are slower, but eventually I think we're going to be fine. We've managed to survive many different shifts before, and if we survive this one, I think it's going to be more on the utopian side, if you take the long, long, macro-historic view on this: empires come and go, and it's fucking shit and crazy.
Henrik Göthberg:But if you look at the fundamental society and how we live now versus a thousand years ago, it goes up in the end, but with a rocky road, and the rocky road might even be a world war. So it's all about scale, temporal scale, here, but ultimately I think it's utopian in comparison to where we are. What I think is the main problem, beyond the period before we reach AGI, which is one argument, is the other argument: the AI divide, the distribution not only of wealth but also of know-how.
Henrik Göthberg:Which is going to be a distribution of power, know-how and wealth, one and the same, connected to AI. And how we solve that also means there is a very obvious chance that, in parallel, we have both expanded heavily into utopia and, at the same time, have the worst dystopian places in other parts of the world. So is the net effect then utopian or dystopian? Someone said it really well on the pod: why would we not have both, in different parts of the world, at the same time?
Robert Luciani:That's how it's always been. That's what the movies are like, right? You have the people living on top and then the people living underneath.
Anders Arpteg:Yeah, I know. Cool, cool, I hope we live to see it. And with that, sorry for that, okay, but with that, thank you so much, Robert, Jesper, Henrik, and let's continue the after-after-work with some more off-camera discussions. Thank you so much, thank you, thank you.