AIAW Podcast
E150 - KB-Whisper: Outperforming OpenAI with Smaller Models - Love Börjeson & Justyna Sikora
In Episode 150 of the AIAW Podcast, we delve into the remarkable world of AI-driven language innovation with Love Börjeson, Head of R&D and KBLab, and Justyna Sikora, Data Scientist at the National Library of Sweden, as they reveal the creation of KB-Whisper, a pioneering AI model that revolutionizes Swedish language processing by handling diverse dialects, historical recordings, and the intricate demands of cultural preservation. From the model's technical architecture and massive data clusters to its social impact and ethical implications, this unscripted conversation offers an unmissable look at how AI is shaping language research today and what lies ahead for Swedish AI in the next five years. The main themes of the episode can be captured by these compelling angles: KB-Whisper: Outperforming OpenAI with Smaller Models; Sovereign AI: How the Swedish National Library Built a Superior Speech-to-Text Model; From Royal Collections to State-of-the-Art AI: KBLab's Journey to Create Whisper Models for Swedish; Small Team, Big Impact: How 12 People Beat Silicon Valley at Their Own Game; and Beyond the Model: Cultural Sovereignty and the Future of Language Technology. Catch all the insights, breakthroughs, and future directions by tuning in to this thought-provoking episode, now available on Spotify!
Follow us on YouTube: https://www.youtube.com/@aiawpodcast
Introducing KB Whisperer: Demo Challenges
Justyna SikoraIt did work with the mic, so it was just a person standing holding the mic, and then... but this is the classic demo devil. And everyone was using the word snafu.
Henrik GöthbergYeah, situation normal, all fucked up. And here we are, demoing a voice-to-text model. Yeah, state of the art.
Love BörjesonBeating OpenAI, no, but it's silent.
Henrik GöthbergBut take us back a little bit. So we are talking about one way, I think one of the bigger sort of in quotation marks launches of a KB model, Because this was newsworthy. So back up the tape. What was the thinking or why did you want to have a launch, and how did it all go?
Love BörjesonYeah. So the team, which was led by Leonora Vesterbacka Olsson, had trained a Whisper model, which is a sound model, a speech-to-text model, and we saw early on that the results of this model were fantastic. We could clearly beat the OpenAI model on several different measures. So we thought this is clearly newsworthy, because we're going to share this model as a research result, so it's open for everyone to use, and it's going to prove useful in so many use cases. We already see it happening right now. So we wanted to make a little bit of a splash. We don't normally do that, but I think the time was right to do it. You can't do it too often, and people were waiting for this.
Justyna SikoraSo it was the perfect time to do it.
Anders ArptegSo how did you do the splash? Did you invite people then to come there, or how did you arrange the actual launch?
Love BörjesonYeah, again, I think, especially also around the event, it was very much Leonora's doing, so she set up. You know, there's a tool from the communication department to set up an invitation list and then there was some press release, and then it was the word of mouth and LinkedIn.
Anders ArptegYou had a formal press release as well beforehand? Okay.
Henrik GöthbergAnd I saw the press release and I saw I was on the short list. Thank you for that.
Love BörjesonAnd there's some really nice photo of the team, I think on the blog, including Justyna. I'm not in there, sorry. No, no. And then it came. How many people attended the launch? It was a full house, and the big auditorium, or whatever you want to call it, at KB holds a little bit more than 100 persons.
Justyna SikoraPlus some on Zoom.
Love BörjesonYeah, on Zoom as well.
Anders ArptegI don't know how many there were. And you had the, what was it, the head librarian? Is that the proper term?
Love BörjesonYeah, the national librarian, riksbibliotekarien. She's sort of the CEO, the CEO of Kungliga Biblioteket. I think that's about the coolest title you can have.
Henrik GöthbergYeah, that's a super cool title.
Love BörjesonSo she and me did the sort of the big cutting the ribbon thing.
Anders ArptegYou actually had a ribbon? No, we should have.
Love BörjesonAll these mistakes. I wasn't in the photo, we didn't have a ribbon, the sound didn't work.
Desktop AI - MayaBut now we're here.
Love BörjesonNow we're here. That's a good thing. And then there were three rather technical presentations. Leonora did one about the background and also the computational resources used in the HPC environment, and we're going to talk more about that later. Then Faton Rekathati did a really good presentation about the data wrangling around this, which is super complicated, also talking about the importance of data sovereignty and the importance of the Swedish domain of data. And then Justyna gave an also rather complicated presentation, but with a few good selling points showing how much better this model is than the previous models.
Anders ArptegAnd we are getting into all these details, and it's super cool that we have this awesome AI lab in KB. You're producing one model after the other that I think is so useful for Swedish society and for the companies we have.
Henrik GöthbergYeah, but I must flag the cliffhanger: we're going to talk about beating OpenAI's model in what we can achieve, but, more importantly, with a smaller, more manageable model.
Justyna SikoraYeah.
Henrik GöthbergThat is super cool from a small team of 12 people at KBLab. That is rock and roll, rock star stuff. So cool to talk about. Super exciting, yeah.
Anders ArptegAwesome. But before we dig deep into these details, how that was achieved and what you can use it for, let's first welcome Love Börjeson, an old friend who has been here before and is an awesome person, one of the more knowledgeable people about, you know, how to really train large models, and has done so with success so many times. But perhaps, Love, if you could: who are you? How did you get into the role that you have as Head of R&D at KBLab?
Love BörjesonYeah, so I'm a former mariner who came ashore when I was 30, and then did a PhD and ended up at the Stanford University computational social science lab. So I'm not a data scientist by training, I'm a sociologist of some kind, you know. From there I moved back to Sweden, started a few labs in industry, and then was hired at the National Library of Sweden, Kungliga Biblioteket, to start a data lab.
Anders ArptegWhat year was that approximately?
Love Börjeson2019, approximately.
Henrik GöthbergI only know two people who present themselves as computational sociologists: you and Mikkel Klingman. Where did you guys meet?
Love BörjesonSo weird, but in a small marketing analyst sort of company I was consulting for, ProSales.
Henrik GöthbergOh really? And then he was the lead there. You know, he works with me in ProSales now and he has updated his title: now he's senior mad scientist. And then in ProSales he had to use mad scientist, okay, cool. But to use the word computational sociologist, I find that.
Love BörjesonIt was interesting, because they sort of didn't understand him, I think, at ProSales, and we started to talk and we understood each other immediately. Okay, so you're a nerd. So, anyhow, I started up the lab in 2019 and it's been growing since.
Anders ArptegYeah, and a bit about the lab today. How many people are you? What do you do?
Love BörjesonIn short, the lab itself has expanded a little bit, so the lab is bigger. But the lab is now also part of a unit, the R&D unit at the National Library.
Love BörjesonSo I'm heading up that unit, so we have all kinds of weird research going on, people who are experts in Latin.
Henrik GöthbergBut this was a fundamental reorg that happened, what, like a year ago? When was this, Justyna? A couple?
Love BörjesonA couple of months ago.
Justyna SikoraI see a couple of months ago.
Henrik GöthbergI see, a couple of months. So in 2024 you had a consolidation of units into the R&D unit. Yeah, okay.
Love BörjesonI don't think they would recognize themselves as being part of a business unit. But anyhow, it's appropriate. The lab itself, I mean, when we started the lab. KB is the National Library, so KB for short. What KB had realized was that, okay.
Love BörjesonSo we have these mad scientists now who want to have data sets, whatever that is. We have the collections that are optimized for storing, you know, and then they want to have something they call data. We don't know what that is, so we need something in between, probably data science. They didn't really know what that was, so they hired me to do it. So me and Martin Malmsten, who was one of the founding fathers, original gangsters of the lab, together with Emma Rende, we did that, to produce data sets. And as we were talking about, the first data set took us a year. Now we do it in a matter of hours. So that was the first thing: create data sets for scientists from the collections.
Anders ArptegAnd perhaps people don't know about KB actually having so much data because of the Swedish legal deposit law.
Love BörjesonDeposit law, yeah. And you know the date, of course, everyone knows it's 1661, when that law came into existence. It was a law for censorship back then, of course, and now it's a tool for democracy. So it's pretty cool, it's pretty much the same law, but its purpose has warped.
Henrik GöthbergBut on a global scale, do you know more about how it works, that we have this law and have collected everything? And when we say everything, I think it's hard for people to understand. When you have a poster for a rave party in the 90s, at the printer, when they're supposed to do a thousand prints, they do a thousand and two prints, because they send two of the posters to KB. It seems it's been like that forever. Yeah. On a worldwide scale, who has that fundamental hoarding
Love Börjesongoing on? Who has a hoarding law more than us, in terms of deposit law? That's an interesting question. I think our law is one of the first, because the Swedish state was one of the earliest of its kind, you know, a Protestant nation state. It was something taking form that was separate from the royal family. And I mean Oxenstierna, or is it the other way around? I keep forgetting. He was forming that state, and the deposit law was one part of that.
Henrik GöthbergAnd what is it? Is it 16? 1661. 1661. Yeah, that's the law. That's the law.
Love BörjesonThe origin of the library itself is the collections of the royal family, the Vasas, who were, you know, Renaissance people. They wanted to have a library, naturally, so then it was truly a royal library. The first catalog is from the 16th century, so like 100 years earlier, but it came into existence as a library during the 17th century, 16-something-something, with the deposit law. And then we had the big disaster, the fire of 1697.
Love BörjesonThe fire of the royal castle. And since then it has been more of a national library, actually keeping the name as a sort of witness to the origin of the library. So it's still called Kungliga Biblioteket in Swedish, but it is for all intents and purposes a national library.
Henrik GöthbergSo it's old in this sense, but it's also this crazy breadth, the breadth of what you have collected.
Love BörjesonAbsolutely, because we have everything printed. What people are thinking about is obviously books. So we have books, yes, but we also have what we call ephemera, which is the everyday collections: pamphlets and posters and what have you. Menus, rave posters, the rave collections, yeah. And we also have radio and TV broadcasts. We have computer games, we have, yeah.
Henrik GöthbergYeah, it is mad when you think about it, when you start thinking about the petabytes. Do you have a number for petabytes? Is it more than peta?
The KB Whisperer Launch Event
Love BörjesonSo someone's probably going to say that I'm wrong about this, but the latest figure I heard when I started was 26. 26 petabytes. Cool stuff, anyhow. So the first thing we did was to compile datasets. I'm going to stop talking soon, but the thing we did in 2019, this was after the launch of the first transformer model from Google, which came out in the winter of 2018. Me and Martin started to test this multilingual model and we saw that it blew everything that existed for the Swedish language just out of the water. So all these dependency models were sort of obsolete overnight. But we also realized: wait a minute, we can beat them. We can be the world's best lab for the Swedish language. So KB-BERT came into existence a couple of months later. Because we didn't have any computational resources, it took a while to train that one. We released it, I think it was January, perhaps February 2020, and it has been downloaded like a million times since.
Henrik GöthbergYeah, and then you grew into several different, you know, variations for different science purposes. Awesome. And very welcome here as well.
Anders ArptegJustyna, I hope I pronounced your name correctly. Data scientist at KB Lab. Right, please describe a bit. Who are you and how did you come to work at KB Lab.
Justyna SikoraYeah, I think in comparison I'm just a standard language nerd. I started studying languages back in Poland, so it was Swedish, Dutch, German. It was before all the transformer models, Whisper, ChatGPT, everything. So my knowledge about natural language processing was quite limited. But then I started working as a translation project manager, and I saw how this field was being transformed by all these new technologies, and I thought, oh, that's really cool. So I found this language technology program here in Sweden, I moved into that, and then found my job at KB. So that was my story.
Anders ArptegWhat year was that approximately?
Justyna SikoraAround four years ago, yeah.
Anders ArptegNice, but I heard something.
Henrik GöthbergThere was a project or consortium, you know, that you were on.
Justyna SikoraYeah, exactly, very cool project that I'm a part of You're still part of that. Yeah.
Henrik GöthbergThat's actually how the funding is part of it, how this works, how you've ended up.
Justyna SikoraYeah, I was employed as part of this project. So it's called human infrastructure, Huminfra, exactly. I'm a bit slow today.
Anders ArptegCool, and you have now been working also with Whisper, the main topic of attraction today, so to speak. Right?
Justyna SikoraYeah, I was part of the team that developed our Swedish suite of Whisper models.
Anders ArptegAnd perhaps, if we move quickly into the theme, perhaps you can start by explaining how the origin of KB-Whisper came about.
Love BörjesonWhy did we start working with it? Yeah, so everything that relates specifically to the model, Justyna is going to have to answer those questions. But the background was that we had a Wav2Vec model, which is a sort of architecture for sound models that predates Whisper.
Anders ArptegYeah, sorry, some people may not know what Whisper is, so we could perhaps give a short background. What can you do with Whisper? And then a bit of background on Wav2Vec and whatnot.
Love BörjesonYeah. So about Whisper, what can you do with that?
Justyna SikoraYeah, sure. It's what's called an ASR model, an Automatic Speech Recognition model, which you can use for transcribing any audio that you have.
Anders ArptegAnd Whisper started. Origin came from OpenAI to start with, but now you're building your own, basically right.
Justyna SikoraExactly, we're doing something called continued pre-training, so we're building upon the OpenAI models.
Anders ArptegSo you started from the weights they have released and then continued training from that. Okay, good.
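Continued pre-training, as described here, means initializing from OpenAI's released Whisper weights and carrying on training on Swedish data, rather than starting from random weights. Here is a minimal toy sketch of why that helps, with a one-parameter model standing in for a real network; all numbers are invented for illustration only:

```python
# Toy "continued pre-training": initialize from existing weights and keep
# training on new-domain data, versus training from scratch. The model is
# a single linear unit y = w * x fitted by gradient descent.

def train(w, data, lr=0.05, steps=20):
    """Run `steps` of gradient descent on mean squared error for y = w * x."""
    for _ in range(steps):
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w

def loss(w, data):
    return sum((w * x - y) ** 2 for x, y in data) / len(data)

# "New domain" data drawn from the target relation y = 3x.
new_data = [(x / 10, 3 * x / 10) for x in range(1, 11)]

w_scratch = train(0.0, new_data)    # training from scratch (random-ish init)
w_continued = train(2.8, new_data)  # continuing from "pretrained" weights near the target

# Starting near the target, the same training budget gets much closer.
print(loss(w_continued, new_data) <= loss(w_scratch, new_data))
```

The same logic scales up: the pretrained multilingual Whisper weights already sit close to a good solution for Swedish, so continued training on Swedish data converges far faster than training from nothing.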
Henrik GöthbergSo Whisper was the original name. That was the name of the OpenAI model. Absolutely.
Love BörjesonSo I mean Sartre, the French philosopher. He said that intellectuals need to be parasites. Exactly.
Henrik GöthbergYeah, that's what we are Standing on the shoulders of giants.
Love BörjesonYeah, that's another way to express that. So yeah, we take architectures, and preferably also the weights of the architectures, and we do stuff with that.
Henrik GöthbergActually, I used that analogy for one of the opening speeches at Data Innovation Summit. We talked about adaptability or learning. The way to go faster now is to learn and adapt from what is out there.
Love BörjesonYeah, I mean, sweden is a small country, it's a low resource language and the lab is very, very small, so we need to have these strategies.
Anders ArptegOkay, so you started to train the Whisper model at KB. Can you just start or explain a bit more: how did you get started planning how to embark on this project?
Love BörjesonThere's a fundamental difference between the Wav2Vec model and the Whisper models. The Wav2Vec model is primarily trained on data that is not annotated. You only train on sound, no text.
Anders ArptegAnd Wav2Vec, just to explain it a bit more, that was not a transformer-based kind of model, right? Can you just explain what Wav2Vec is?
Love BörjesonWell, it was sort of a more basic sound model that you could fine-tune to do the same stuff that we do with Whisper, but Whisper does it better. But it's trained in a very similar way as the big text models: you take away a little bit of sound and you have the model guess what it is. If it's correct, then you strengthen the network, and otherwise you won't. So we had a good Wav2Vec model, but we also realized that we want a Whisper model. And that's the problem for KB. What do we have? We have unannotated data. I mean, that's our strength, and Whisper takes annotated data. So that was the first problem the team had to solve. There was a lot of cooperation involved, which Leonora was mainly responsible for.
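The mask-and-guess objective described here can be illustrated with a toy sketch: hide one sample of a signal and predict it from its neighbours, then score the guess. The smooth signal and the midpoint "predictor" are invented stand-ins, not Wav2Vec's real contrastive objective:

```python
# Toy self-supervised objective: mask a sample, predict it from context.
import random

def mask_and_guess(signal, predict):
    """Hide one sample, predict it from its neighbours, return abs error."""
    i = random.randrange(1, len(signal) - 1)
    hidden = signal[i]
    guess = predict(signal[i - 1], signal[i + 1])
    return abs(guess - hidden)

signal = [x / 100 for x in range(100)]             # a smooth "waveform"
predict = lambda left, right: (left + right) / 2   # simple context-based guesser

random.seed(0)
# On a smooth signal this guesser is near perfect; a real model would be
# *trained* to minimize exactly this kind of prediction error.
print(max(mask_and_guess(signal, predict) for _ in range(10)) < 1e-9)
```

In real self-supervised training, the prediction errors become the learning signal: where the guess is wrong, the network's weights are adjusted, which is why no text labels are needed.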
Justyna SikoraYeah, we actually used some of our collections, but it turned out, you know, in the beginning you think, okay, I have so much data. And then you start looking for what's in Swedish, what has Swedish subtitles, and then deduplication, because you probably don't want to train on 50 Donald Duck recordings, right? You want to take those out. So we looked for some collaborations and we found some great people who helped us with this process.
Anders ArptegBut you say you have unannotated data, and for people who may not know: of course you need to have the output somehow to train a model. You need to know what the text should be for the voice and audio that you put into it. But don't you have a lot of text associated with each, like, SVT production or whatnot? Is it just the timestamps that you don't have, or what is not really annotated?
Justyna SikoraIt depends on what kind of format the data comes in. Sometimes the subtitles are just burned into the TV image, so we can't really use them directly. We would need to do some OCR process to get them out of there.
Anders ArptegOkay, so you don't have it in like structured format or in text format Exactly.
Justyna SikoraSo we used everything that we could, but further back in time, the subtitles...
Henrik GöthbergOh, do you have text for this? No, you don't, because you have it in the image, so you need an OCR technique to get that out. So it's complications on complications here.
Love BörjesonYeah, and another complication, correct me if I'm wrong, Stina, is that the subtitles are not word-for-word. It's a translation into something else, which is still suitable for training a Whisper, because that was one of the strengths of the Whisper architecture: you can actually use annotations, you know, text, that isn't matched perfectly with the sound, and it still works. That's the magic of it. The problem is you also want an ASR function that works for other kinds of transcriptions, more verbatim.
Henrik GöthbergVerbatim meaning on the fly, or what does that mean?
Anders ArptegNo, when you have subtitles, you usually have some summarized short forms. Exactly, the space is limited, so you can't put in everything. And Whisper is less sensitive to that compared to Wav2Vec. Is that the way to phrase it?
Love BörjesonAbsolutely. But we also want it to know how to do verbatim. Verbatim, say the word, now I can't do it. So yeah, we wanted it to be able to do verbatim, and then we need to have that kind of data as well, and we didn't have it.
Henrik GöthbergBut when you said you had, you know, in the launch a quite technical, three technical presentations, where one was on the whole data wrangling. Yeah, Is that what we're talking about now, or is that something else?
Love BörjesonYeah, it concerns that. The big matching process that Faton Rekathati did: we had the transcripts of the parliamentary speeches, we had the sound files, but they were unmarked, you know, and he had to match them somehow. Because if he could figure that out, then you had a fairly good set of transcripts and sound.
Justyna SikoraAnd that's what we have, thanks to Faton.
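The episode doesn't detail how the matching actually worked, but one simple approach to pairing official transcripts with unmarked sound files is to produce a rough machine transcription of each file and score text similarity against every transcript. A hedged sketch of that idea, with invented example strings and IDs:

```python
# Greedy transcript-to-audio matching by text similarity. This is an
# illustrative approach, not the method actually used at KBLab.
from difflib import SequenceMatcher

def best_matches(transcripts, rough_texts):
    """Pair each official transcript with the most similar rough
    (machine-made) transcription of an unmarked sound file."""
    pairs = []
    remaining = dict(rough_texts)  # file_id -> rough transcription
    for t_id, t_text in transcripts.items():
        file_id = max(
            remaining,
            key=lambda f: SequenceMatcher(None, t_text, remaining[f]).ratio(),
        )
        pairs.append((t_id, file_id))
        del remaining[file_id]
    return pairs

# Invented example data (lower-cased, accents stripped for simplicity).
transcripts = {
    "speech_1": "herr talman jag yrkar bifall till utskottets forslag",
    "speech_2": "vi behover en ny energipolitik for sverige",
}
rough_texts = {
    "audio_a": "vi behover ny energipolitik for sverige",
    "audio_b": "her talman jag yrkar bifall till utskottets forslag",
}
print(best_matches(transcripts, rough_texts))
# [('speech_1', 'audio_b'), ('speech_2', 'audio_a')]
```

At the scale mentioned later in the conversation (tens of millions of candidate combinations), a real pipeline would need blocking or indexing to avoid comparing every pair, but the scoring idea is the same.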
Henrik GöthbergAnd so this became one of the key pieces of the strategy for how to do ASR. So it's the Riksdag's, what do we call it, the parliament addresses. So parliament addresses became one of the core strategies in this training approach.
Love BörjesonYeah, and I mean, just to get that data, it was, you know, like a spy thriller almost. I mean, they ended up delivering an actual hard drive with data.
Henrik GöthbergOh yeah? You can't have a USB stick for this?
Love BörjesonI don't know. It's completely open, but, you know, for various reasons. So that was one thing. The other thing, the data wrangling, was super important. And to rig the collaborations was another thing, because previously we were sort of not doing stuff in a bubble, but depending on our own data. Now we depended on external data, which is an unfamiliar situation for us.
Henrik GöthbergOkay, so how did you organize the collaboration, the project management, the coordination piece?
Love BörjesonYeah, yeah, yeah. And there was a lot of inter-organizational politics going into that, which Eleonora handled, and Faton and Justyna and Agnes were also part of the team.
Henrik GöthbergIt's always more than you think.
Anders ArptegSo what were the main challenges? One, of course, was the data wrangling part. I guess in getting this started Were there any other major challenges that you have to tackle before setting off the training?
Justyna SikoraI would still say something about the data, because it was really months of work, first obtaining the data.
Anders ArptegWhen did you actually start the whole project of KB-Whisper?
Justyna SikoraIt was like a year and a half ago, I think. First we got this development access, to see if we could do it and how the computers work, like the HPC environments, and to just get ready.
Anders ArptegWe're getting more into the HPC part very soon, but yeah, cool One and a half year ago.
Henrik GöthbergSo if you summarize the key steps around organizing data, how would you summarize that, in sort of the key process steps? Not technically, but sort of: first we needed to get the access, or blah blah blah, then we needed to figure out the model. How would you summarize that?
Justyna SikoraIt's hard to say, because we had so many different data sources; it was an iterative process. In the beginning we thought, okay, we're working with our data, so we prepared the pipeline for our data. But then we got different sources which had different problems that we needed to solve. So, for example, if you're working with subtitles, you need to clean them in a specific way. You don't want to train on data that has comments inside, like someone is singing or the door shuts or something like that. But then we were also working with this dialectal data from ISOF, which is the Institute for Language and Folklore, I believe? No?
Love BörjesonNo.
Justyna SikoraIn English, yes. And it was also a different type of data, something that I couldn't really understand as a non-native speaker. So we were listening to these recordings and they were really, really hard for me, and I saw my Swedish colleagues struggling too.
Henrik GöthbergSo dialects, like hardcore Gotländska? What could it be?
Love BörjesonYeah, I don't know. It's a different kind of language.
Henrik GöthbergIt's fun. Where do they come from?
Love BörjesonBorås. I think Listerlandet, you know, the western part of Blekinge, is one of the really hard ones.
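The subtitle clean-up mentioned a moment ago, stripping non-speech comments like "someone is singing" or "the door shuts" before subtitles are used as training targets, might look roughly like this. The patterns are illustrative assumptions, not KBLab's actual cleaning rules:

```python
# Illustrative subtitle cleaning: drop bracketed sound annotations and
# music cues, then normalize whitespace. Patterns are assumptions.
import re

ANNOTATION = re.compile(r"[\[(][^\])]*[\])]")  # (...) or [...] blocks
MUSIC = re.compile(r"[♪#].*?[♪#]")             # text between music symbols

def clean_subtitle(line: str) -> str:
    line = MUSIC.sub(" ", line)
    line = ANNOTATION.sub(" ", line)
    return " ".join(line.split())              # collapse leftover whitespace

print(clean_subtitle("[dörren stängs] Hej och välkomna! (applåder)"))
# Hej och välkomna!
```

A production pipeline would also handle things like hard-of-hearing speaker labels and line-break artifacts, but the principle is the same: keep only what was actually spoken.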
Anders ArptegThat's hard, okay. So a lot of data challenges to just get started and figure this out. But at some point you still need to think about, I guess, the code to train, or the architecture. Are you going to use the same one? Are you going to change it somehow? What about the way of training it? Could you just reuse code that was out there, or how did that get started?
Justyna SikoraSo we used the code that was out there on the internet. We found some great guidelines from Hugging Face and we said, okay, we will base it on this. But then, of course, our data was completely different, and we were handling not gigabytes but terabytes of data. So the methods they were describing didn't really work for us, because, okay, everything crashes. What now?
Henrik GöthbergSo you had to really do method development here.
Justyna SikoraExactly. We needed to rethink everything. So that's why it was an iterative process, because first, when we were just testing everything out, the scale was smaller, but then when we got the real data, the problems occurred.
Anders ArptegCan you get a bit technical about the tech stack, potentially? What libraries or code were you using, or something you can share?
Justyna SikoraYeah, so we were using this Hugging Face notebook first, and then, let me think for a second, because we used so many different frameworks.
Anders ArptegWere you using the Hugging Face pipelines and stuff, or was it?
Justyna SikoraNo, I think, now when I say that, it was Hugging Face first, but, oh my god, how can I not know if it was a Hugging Face implementation or PyTorch? Sorry for that, everyone.
Henrik GöthbergSo there are pieces from Hugging Face and then some pieces from PyTorch libraries that you tried out. What was that?
Justyna SikoraYeah, so we used it for feature extraction, of course. Because we had so much data, we couldn't do it on the fly, so we split it up. This was also a big modification from what they had in the beginning: they could just run everything at once, but we needed to split it. So first we did the data preparation, feature extraction, preparing texts, and then loading it as a Hugging Face dataset, which loads everything into memory. And then, when we had everything in place, we could train.
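The split-up preprocessing described here, doing feature extraction ahead of time in chunks instead of on the fly over terabytes, can be sketched like this. The shard layout and the stand-in feature function are assumptions for illustration, not the real pipeline:

```python
# Sharded preprocessing: persist features shard by shard, so a crash
# only costs one shard and memory stays bounded.
import json, pathlib

def extract_features(clip):
    """Stand-in for real audio feature extraction (e.g. log-mel spectrograms)."""
    return {"n_samples": len(clip["samples"]), "text": clip["text"]}

def preprocess_in_shards(clips, out_dir, shard_size=2):
    """Extract features shard by shard and write each shard to disk."""
    out = pathlib.Path(out_dir)
    out.mkdir(exist_ok=True)
    for i in range(0, len(clips), shard_size):
        shard = [extract_features(c) for c in clips[i:i + shard_size]]
        (out / f"shard_{i // shard_size:05d}.json").write_text(json.dumps(shard))
    return sorted(p.name for p in out.glob("shard_*.json"))

# Three tiny fake clips -> two shards of (up to) two clips each.
clips = [{"samples": [0.0] * n, "text": f"clip {n}"} for n in (10, 20, 30)]
shards = preprocess_in_shards(clips, "features")
print(shards)  # ['shard_00000.json', 'shard_00001.json']
```

With the features on disk, the training job then only streams prepared shards, which is what makes restarting after a crash cheap.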
Henrik GöthbergThis is all work you're still doing on your own machines at home, before you go to the big HPC? Or what is done at home? This is pre-going down to Leonardo, right?
Justyna SikoraYeah, partially. It really depends on the project, because if we have this development access, then we can test everything out there and prepare everything there. But of course we can also do it on our computers and prepare Docker containers.
Henrik GöthbergBut in the KB-Whisper particular case, did you have like a developer access, and did you muck around on Leonardo, or were you doing some pieces at home?
Justyna SikoraIt was both. Both. Some data wrangling at home.
Anders ArptegOkay. So then at some point you got the code working. You had some data set that you can start to potentially start training on. Can you share anything about you know while training, how did you monitor it? Did you have to restart at some point? Did you get some crashes, or what happened when you started to train it? Yeah, of course.
Justyna SikoraA lot of crashes, of course. Some scheduled, some unscheduled downtime. Some nodes that were not working, and we needed to exclude them. But we knew; it wasn't the first access to HPC computers that we had.
Henrik GöthbergThis is life, this is usual right. This is life, yeah.
Justyna SikoraIf you talk with someone who has worked in these environments, you know that these things happen, so we were prepared. We had many, many checkpoints, so we could restart at any time if anything happened.
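The many-checkpoints strategy, saving training state regularly so that a node failure costs only the work since the last checkpoint, can be sketched as follows. The file layout and state contents are invented for illustration:

```python
# Checkpoint-and-resume training loop: every step persists state, and a
# restart continues from the latest checkpoint instead of from scratch.
import json, pathlib

CKPT_DIR = pathlib.Path("ckpts")

def save_checkpoint(step, state):
    CKPT_DIR.mkdir(exist_ok=True)
    (CKPT_DIR / f"step_{step:06d}.json").write_text(json.dumps(state))

def latest_checkpoint():
    """Return (step, state) of the newest checkpoint, or a fresh start."""
    ckpts = sorted(CKPT_DIR.glob("step_*.json"))  # zero-padded -> sorts numerically
    if not ckpts:
        return 0, {"w": 0.0}
    last = ckpts[-1]
    return int(last.stem.split("_")[1]), json.loads(last.read_text())

def train(total_steps):
    step, state = latest_checkpoint()  # resume if checkpoints exist
    while step < total_steps:
        step += 1
        state["w"] += 1.0              # stand-in for a real weight update
        save_checkpoint(step, state)
    return step, state

train(3)          # run "crashes" after step 3
print(train(5))   # restart resumes from step 3 and finishes at step 5
# (5, {'w': 5.0})
```

Real training frameworks checkpoint optimizer state, learning-rate schedule, and data-loader position too, but the resume logic follows the same pattern.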
Anders ArptegWhen did the actual training start, approximately? Was it like summer last year, or what time could it have been?
Justyna SikoraAround that time, yeah. Because we have many sizes of the models, starting from the tiny one.
Anders ArptegOh, you started with the smaller ones?
Justyna SikoraYeah, from the smallest to the largest, we trained them, and it took some time.
Henrik GöthbergI remember this. I think we were at the AI after work and you were sort of anticipating like a kid in a candy store.
Love BörjesonOoh, we go for Leonardo now, soon it's happening.
Henrik GöthbergDo you remember that?
Love BörjesonYeah, I think Eleonora sort of spilled the beans, telling me the early results: it looks really good, the tiny one will beat the medium one from OpenAI, etc.
Anders ArptegAnd anything, I think, surprising during the training? Besides the crashes you were expecting, and you had a lot of checkpoints to be sure you could restart, not from scratch but from the latest checkpoint in some way. Anything that was surprising, perhaps, during the training?
Justyna SikoraHow smoothly some of the trainings went actually.
Anders ArptegReally.
Justyna SikoraThat's positive. But I think it was because we put so much work into the data that there was then much less work to get the code working and tune the hyperparameters, like learning rate or batch sizes or anything else we did with it. Some models didn't want to cooperate with us, so we actually needed to make changes, but we based it on the papers, so we had something to start off with. But of course, there were some tweaks that we needed to implement.
Henrik GöthbergMaybe this is not the right time, but I'll take a shot at it. Before we started today, we talked about how the whole workflow has evolved over the years and how efficient you have become. And, like you said, we can talk about the crappy workflow, or we can talk about the workflow that you actually arrived at with this project, KB-Whisper. So how do we see this workflow? The training, all the steps. What are the improvements? What is it that is so much better now than what you have done in the past?
Love BörjesonUm, I think, for a KBLab project, this was unusually well-structured in terms of project management by the PI, because it involved more people, both inside the library and outside in other organizations. So that was necessary; it had to have more structure. And it is a new situation for us to not be able to rely solely on KB data. That's new for us, so it's a different situation.
Love BörjesonAnd then some of the stuff, there are cookbooks now for doing things. We can have them on Hugging Face and you can start there. Okay, so then you switch to the real data and it falls apart, and then you need to do something else. But a big portion of the data is still from the parliament. A colleague worked on that, and he didn't use any cookbook for it. It was a matter of combining, I think it was something like 46 million possible combinations he had to parse through, which he did, he told me, but I don't really know how he did it. That was a tremendous piece of work. And then we had other data as well, from Swedish television and from ISOF.
Henrik GöthbergBut let me ask the question in another way. The model itself and the results we will talk more about later. But have you now industrialized, or sort of productized, the flow?
Love BörjesonYeah, you were asking me something else. So overall, yes, but maybe not so visible in this particular project, because there was so much new stuff for us in this model. Overall, yes, there are more cookbooks ready to use.
Henrik GöthbergSo you have, now you have been able, you have had to industrialize or code up, build infrastructures, code around what you're doing, that you can now benefit from in the next project, in the next project?
Love BörjesonAbsolutely. It's a different thing now to train a text model or fine-tune a text model. That's a very different sort of endeavor now than it used to be. It was much more sort of a laboratory thing to do, but still you have the problems with okay, so nodes are going down and et cetera. What are you going to do? I don't know. What do you think?
Justyna SikoraNo, I believe, yeah, I think it's true that there's so much out there now that you can just base your training on it. In a way it's simpler, but in a way also harder to adapt.
Anders ArptegPerhaps we can also switch to the interesting topic of what hardware you actually used to train the model, the data centers, et cetera. So what did you use to train the model?
Love BörjesonDo you want to mention the EuroHPC joint undertaking?
Henrik GöthbergYeah, sure, I think it's a good entry point.
Love BörjesonYeah, right. So with this amount of data and these big AI models, you need a vast amount of compute, and very few organizations have that kind of compute in-house, so you need some distributed computing environment. We used the EuroHPC Joint Undertaking, a joint undertaking between the European Commission and the member states. HPC stands for high-performance computing, which means a supercomputer, and a supercomputer is just a big computer compared to other contemporary computers. And this is a system of, how many now, I think it's 12 or something like that, across Europe.
Anders ArptegSo 12 different data centers or HPC clusters.
Love BörjesonYeah, exactly when you apply to use them, and there are different sizes. The first one we were the first governmental entity in Europe actually to use this system.
Henrik GöthbergAnd we're talking about Leonardo.
Love BörjesonNo, that was Vega, back in the day. The first one, yeah.
Henrik GöthbergYeah, and we can talk about that: you've been using several over the years.
Love BörjesonYeah, so we have a long experience. That was, you know, all these problems that Justyna is describing. All in all, she's still saying it was kind of smooth sailing. You know, I don't think another team wouldn't perceive this as smooth sailing, so. But so we've been using these computers for a while now, and different ones, you know. So we started with vega and now we're on leonardo, which is a substantially bigger so perhaps just give somebody the vega is in slovenia.
Love BörjesonSlovenia, yes. One of the smaller pods, comparable to the Swedish HPC computer Berzelius in Linköping in terms of size, a hundred GPUs or something, or more, I don't know exactly. I think it's 18 petaflops. And Leonardo is 500, and we needed to move into that kind of machine; we don't fit inside Vega anymore. So you even have to specify which HPC you want to use when you apply, and there are different kinds of access.
Love BörjesonThere's laboratory access, when you just test some code or something. Then there's development access, when you develop the code and see that it works for larger datasets. And then you have regular access or extreme access. And now they have added a new one, which was a good thing: they saw that it took too long between application and actually using the machine.
Henrik GöthbergSo what was your type of access? Did you have several? Was it AI-intensive, I think it was. Is that the top tier?
Love Börjesonno, it's not it's, but it's the most flexible and fastest one. So it's sort of optimized access for ai products. So we can't do benchmark and then wait six months because you know. So that's the problem they were trying to address with this AI and data intensive access.
Anders ArptegAnd perhaps we should be clear that HPC is not just for AI purposes.
Love BörjesonNo, it can be very much kind of computational.
Henrik GöthbergWhether it's simulations, or chemical or biological kind of Absolutely, and would you argue that of these HPC computers, are some of them more architected for AI, specifically, and some are less architected for AI, or is it more or less similar?
Love BörjesonYes, I mean for AI. I typically want something that is based on GPUs. Then there are some supercomputers that are more CPU.
Henrik GöthbergCPU-based. Yeah, yeah.
Love BörjesonAnd forgive me for my ignorance, I don't really know the applications for doing stuff on CPUs. Of course there are.
Henrik GöthbergBut you're looking specifically for the GPU machines.
Love BörjesonYeah, to parallelize yeah.
Anders ArptegCool. So then you went to Leonardo, and that's in Germany, right? No, that's in.
Love BörjesonItaly Leonardo.
Desktop AI - MayaI mean come on.
Anders ArptegAnd that's a much bigger one, I guess.
Love BörjesonAnd these computers are financed by the member states and the European Commission, so Sweden has already paid for it. So what you get when you apply is not money or anything like that; it's hours on these GPUs.
Anders ArptegHow much work is it to actually apply and how much administration is it afterwards? Do you have to do reports on that?
Love BörjesonIt depends. If you're coming from industry, I think you would perceive it as pretty much work. If you come from academia, this is nothing. They use AI to build the reports. We've done several now. I was PI for the first one, but then Robin Kurz and Eleonora have done a few applications and they just sail through, you know. But it's like everything.
Henrik GöthbergBecause when you all this stuff, that is like greek, when you've never done it, yeah. And then when you get the hang of it, you also understand what you should put in it.
Love BörjesonActually, you understand the criteria they're looking for what, how it should read yeah, we got a lot of help from something called the national competence center for super computing. Um, when we started with our first project at vega tour what's his name?
Anders ArptegWhat's his name now... yeah, anyway, he's head of the Swedish competence centre.
Love BörjesonYes. And Lilit Axner, who I think was the head of that centre back then; now he has taken over, and she's down in Europe, working for the Joint Undertaking. Anyhow, they helped us with the application. Because there's some redundancy in the system, there are actually hours still to apply for.
Anders ArptegCool, and I just have an overview of all the European one you mentioned. There's about 12 of them or something, and the big one are Lume, I guess.
Henrik GöthbergYeah, bigger than Jupiter.
Love BörjesonYeah. So the smallest one is Viagra. That's going to be updated, or maybe it's, yeah. And there's one coming in Sweden called Arrhenius, which is actually CPU-based, so it's not really for us in the first place. And then there are a whole range. So we have Melexi 9, luxembourg, which is how many petabytes, petaflop, sorry. In luxembourg, which is, uh, how many petabytes? Sorry, I'm gonna uh, it's a medium-sized hbc, basically leonardo. That used to be really big, that's sort of, you know, being more medium-sized hbc now. And they have lumi, and now jupiter is coming into place in Germany, which is the first European exascale supercomputer which has more than 1,000 petaflop.
Henrik GöthbergSo Jupiter goes up in the exascale, exascale.
Anders ArptegExascale sorry.
Henrik GöthbergBut Lumi, was that the biggest at some point?
Anders ArptegYeah, the biggest in Europe.
Henrik Göthbergyeah, Biggest in Europe, we should say, and Lumi is maybe second or third right now in in the European system.
Love BörjesonI don't know of all the commercial I mean there. There might well be a commercial cluster out there that I'm not aware of, but in terms of the European system it's it's the second biggest now and you and basically used to summarize you run on Vega Lumi, now Leonardo, the one in Luxembourg.
Henrik GöthbergYou run on Meluxina, meluxina. You run on four different ones. Yeah, can we elaborate a little bit about the? Is it the same process? Is it the same application? I can imagine German style versus Italian style.
Love BörjesonYeah, the application within the European system is very similar. So when they make a change, it's for all the HPCs. So, and the big news for us, the good news, was the AI possibility to apply for this AI. Yeah, so that was really good. Otherwise, the machines themselves are, you know, if they are newly put in production, they're less stable. If they're bigger, they're less stable. So all these things, you can't really take them in consideration, it's just something that you live with.
Love BörjesonSo Leonardo is less stable than Vega, for example, jupiter we haven't tested that we actually we're pretty proud of that because one of 15 labs in Europe who got an invitation to use it, so we didn't even apply. They invited us to test drive it.
Anders ArptegYou have a really good track record.
Henrik GöthbergWhich labs have done more than you? I mean, I'm sure some has done more than you, but you are out there in terms of hours. These machines in Europe, I mean, I guess there's some sort of contest. You're quite high up on that leaderboard. Yeah, at least on using them. No, I love it. We pay for it, right?
Anders ArptegSo how does it work? You know you get access to Leonardo and now you want to start training the model. You have the code in place, you have the data in place, but you need to move it there in some way. You need to move the code there in some way. Can you just go through a bit the process of how do you actually? Can you just SSH into the cluster, or how does it actually work?
Justyna SikoraBasically.
Anders ArptegReally yeah.
Justyna SikoraSo we're just logging in and through this safe SSH or thing, we just send the data there.
Anders ArptegSo you have some shared storage where you just upload the data.
Justyna SikoraExactly where only we have access to the data.
Henrik GöthbergWhat are we talking about when you say upload data? This is not like a cat movie, it's a little bit more. You know how much when you say, oh, I'm going to add your data, blah, blah, blah. What are we talking about? Much when you're, oh, I'm gonna actually take it, blah, blah, blah. What are we talking about here? When you're uploading something into leonard like this the whole thing.
Anders Arptegyou're sending terabytes of audio data that's it, yeah, and it takes like days or yeah, exactly, yeah, it takes time, it takes time so on a monday you say send, and then you check it every day and when is it done?
Justyna SikoraExactly. I mean, it depends on how much data you're sending. In this case two days, not for all our data, so it took much more. But we had it in portions.
Henrik GöthbergSo for one portion of like half of the debates debates, it took maybe two days to send it over I think this is a very simple question, but just to give a flavor what we're talking about, you press send and the first batch takes two days and you cut it in different patches in order to be able to send it over. Because, in the end, when you send it over, you send it over the internet. You know, send it over because, in the end, when you send it over, you send it over the internet. You know, right, you know, like we all do.
Love BörjesonI don't know but because it's nuts, it's nuts, it is nuts, and that's one of the things that are going to change in the future. Oh, really, yeah, so we can talk about that if you want to, but yeah cool and you get the code there, but how?
Anders Arptegso you work with containers, right, or how do you actually get the code running in the cluster later?
Justyna SikoraYeah, exactly, we used containers for simplicity, so we could ensure that what works in our computers will also work there, although we also stumbled upon some problems, some dependencies, not working CUDA versions. Oh yeah, as always, exactly.
Henrik GöthbergCUDA versions.
Justyna SikoraTerrible stuff. But finally, when we have a working environment, we can start testing, and then we have different, let's say, teams, so some that works in VS Code. So we have this nice visual graphic ways to just go in and edit everything.
Anders ArptegAnd then this older team with theme users.
Meet the Team: Lovisa and Justyna
Justyna SikoraYeah, very professional yeah, exactly cool yeah, exactly um yeah so when everything's out there, you can just start how did you do the parallelization of the computes?
Anders Arptegdid you use some specific framework? For that or how did you make everything orchestrated properly, so to speak, over?
Justyna Sikorathe whole cluster or the part of the cluster that you have access to yeah, we did this distributed training to to make it faster because we had access to no, no ctm did we use some specific library?
Anders Arptegthere's a lot of them out there and lot of them out there and it's and one specifically that you made use of, um like, yeah, the mega, what a megatronic or megatronic.
Justyna SikoraYeah, which one megatronic it's for? It's for training, uh, language models, I I believe, or at least um to the extent that I know, um, but we used I built high-end phase and everything that's.
Anders ArptegThat's interesting oh, it could paralyze itself okay, that's super cool and it went super smooth, as you said right to some extent maybe it's exaggeration.
Justyna SikoraOf course we put a lot of work and, as you said, we had some parameters to tweak and seeing loss going not exactly as we wanted to go.
Henrik GöthbergSo so it's one thing that it doesn't go as you want it to, but it went as expected. You see what I mean, yeah that's you can. You can have a perfect run. I actually can't have a perfect run, but it was better than expected. Is that a fair summary?
Justyna Sikorathat would be great, uh, but not, not, not always. Sometimes you see these spikes, okay, it should go down, but then it just goes up, and why so? Then you go and tweak the parameters and try again and try to save the runs.
Henrik GöthbergYeah, save the run. I mean, it's not as easy, save the run when you say it's smooth.
Anders ArptegI think it's more because you actually know and have the experience of doing this. I think for people that haven't done it before, they wouldn't call it smooth. I don't think.
Justyna SikoraNo, absolutely not. We have a great team and, as we mentioned, we have some experience with working with this HPC environment, so it's very important to mention that that it's not just anyone can go in. Okay, it's so easy.
Anders ArptegAnd if you were to just elaborate a bit more about the results. So you said it's better than OpenAI's results. Can you just elaborate?
Justyna SikoraWhat kind of benchmarks? What kind of results did you achieve? Yes, so we evaluated our newly trained models on three data sets. It's just data sets that are on the internet, let's say common voice FLIRs, so the data sets that are very well known within the language or audio technology world. And then one NST, which is data of, I believe, a couple of hundred hours of Swedish recordings, like small words, but then longer sentences recorded in like environment without any additional noises. So very clean environment, and we got 47 increase in accuracy 47% units above, or if you will.
Justyna SikoraIf you compare it to the Whisper Large and our Whisper Large across these three datasets.
Love BörjesonThat's a lot. I was going to say that it is a lot.
Anders ArptegWere the datasets in Swedish only, or was it mixed languages as well?
Justyna SikoraNo, we're only looking at Swedish, as we only train on Swedish data.
Anders ArptegOh, you only train on Swedish data, okay.
Henrik GöthbergSo the Swedish data and for the Swedish benchmark you have 40% better when you average it out on three different benchmarks. 47, right.
Justyna SikoraYeah, exactly 47 for the large model, but then and then okay, compared to Whisper Large.
Henrik GöthbergYeah, this is also a key point, because the KB, the Whisper Large, is a quite, I mean, like the KB.
Anders ArptegWe haven't gotten there yet, so this was still your Whisper Large right, Compared to their Whisper Large right.
Justyna SikoraExactly, but then you have different sizes of the modals. So even tiny modal now is able to transcribe Swedish before. If you use OpenAI's tiny modal, so the smallest one that you can maybe run locally, even then you don't get any reasonable output.
Anders ArptegFor Swedish.
Justyna SikoraFor Swedish, yeah, so it's horrible output basically, or not useful output.
Anders Arptegif you use a normal Whisper you won't understand it basically.
Justyna SikoraAnd then our small model is as good as OpenAI Whisper large.
Anders ArptegThat is really cool, I must say.
Henrik GöthbergIf I get it right. So when you're comparing large to large, then you're 47% better. Yeah, but the benchmark that you're most proud of is when you take your small model and you're on par with their large model. Yeah, that is cool Because that translates to usefulness. Yeah, that translates to productization. That translates to something you put on Hugging Face. That is manageable for someone to even download on a small machine.
Justyna SikoraYeah, exactly, you don't need that much resources that you needed before.
Love BörjesonFor a PhD student.
Henrik GöthbergBut can you run it on a workstation? What do you need to run that model?
Justyna SikoraFor the small model, for the small model, for the small model that is comparable.
Henrik GöthbergI mean, if you truly want to do open ai large, you need a fairly sizable machine right, and now you've done it small so you can take it home to a normal workstation, more or less right.
Love BörjesonYeah, with a gpu card this is tremendously accessible and also, since you can use it locally because it's so small, you don't have to expose your data in the cloud, you can open up edge cases.
Henrik GöthbergNow we can open up edge cases.
Love BörjesonYeah, so for example, I mean clinical data.
Anders ArptegYeah, and how small is small and how large is large the number of parameters, to remember the size of them.
Justyna SikoraYou need to check our blog post because we've just uploaded it and there we have all the numbers, all the number of parameters.
Henrik GöthbergAll the tables with but this is this is feedback. This this hill canal guys. We had hage here. He knows his numbers he's like a machine. I was listening to hage lupescu, who is a Silicon Valley dude. Fantastic, but it's so American they know their numbers, so Whisper.
Anders ArptegLarge.
Justyna SikoraWe know our results.
Anders ArptegWhisper Large apparently is 1.5 or 55 billion parameters and the small one is just 244 million parameters, so it's like almost a tenth.
Love BörjesonYeah, a tenth. Yeah, so it's you know a standard birth model size.
Henrik Göthbergyeah, that's really that is impressive, but that's true. We should know the numbers no, but it's so swedish, it's so swedish, we know the results, we know the outcome, but we're not the marketing genius of silicon valley guys who can spin these numbers in their sleep. Awesome, I kind of like it, by the way.
Anders ArptegI'm thinking, should we do the news or not, or should we skip that for today? So yeah, please.
Henrik GöthbergIt's time for AI, news Brought to you by AI8W Podcast.
Anders ArptegCool. So we normally take a small break in the middle of the podcast to speak about something else, speak about some interesting news that happened in recent weeks or weeks, and try to keep it to a few minutes per story. We usually fail at that, but we aim to every time. So I have some story.
Henrik Göthbergat least Do you have something, henry I want to go last because I have a. I have a, my new story I would like to lead in. I want to open up with that new story and then we can lead into a topic on that. That I would like to sort of end of news. And now we're moving into the topic of ai factories. So, hint, hint, there was a release of announcement of new, more air factories.
Anders ArptegSo that was do you have anything luve that you would like to share? Or anything that caught your eye in last week's news in ai world last week now deep sick, is old now.
Love BörjesonSo no, I've been, my head has been too too much sort of buried in it.
Anders ArptegWe didn't do the news or we didn't have a podcast last week and one big at least two weeks ago. This is not my story, by the way, but since we're speaking so much about audio and voice, did you try out Sesame? No, I mean, I think everyone should. It's an amazing voice chat.
Henrik GöthbergLet's take that you can do two weeks old though. Yeah, but we haven't talked about it and it is cool.
Anders ArptegOkay, okay, so, sesame, then I have to think through here a bit, but okay, so this is a model that I think two weeks ago released. What's amazing with this? It's similar to OpenAI's, like advanced voice mode. That was released, like, I think, a year ago or something, but this is actually much better and it's amazingly good. I heard actually a person that you know you can go and check it out. I think actually it was a Hugging Face founder that started up this kind of startup and started building this, and they no, no, no, it was a meta one.
Anders ArptegThey're doing the glasses as well, so it's going to be part of some kind of glass thing, if I remember correctly. I'm sorry, but anyway, it's amazingly good in producing voice text, so it's doing the opposite of whisper, basically, but it's taking speech to speech, so to speak. So you speak to it and it produces speech and it does it in extremely real-time, nice way and with a tonation that is amazing, and it doesn't go through text as a middle layer. It just goes from audio tokens to audio tokens directly and if you listen to it, it sounds better than anything I've ever. Can we have the audio? If you were to it, it sounds better than anything I've ever. I'm not sure. Can we have the audio if you were to try it, goran, if you press demo and just start chatting, I'm not sure if it will come out on, so it's not exactly the opposite of whisper, because it's voice to voice.
Anders ArptegYes, yeah, yes, I think you have to press to the. Anyway, we don't have to do the demo, but I really would recommend people, if you haven't tried it, to go there and try it out. Yeah, it sounds. I mean, it's like human to an extreme.
Desktop AI - MayaYeah.
Anders ArptegAnd you can hear it really learns how to speak not just the text but actually in a tonation that is very context dependent. So it sounds extremely unique.
Desktop AI - MayaBig day for us. Huh, so far, so good, from what I've seen online.
Anders ArptegHey, Maya, can you hear me?
Desktop AI - MayaI sure can, loud and clear. Just checking you out, making sure everything's working before we really get this show on the road. Just checking you out, making sure everything's working before we really get this show on the road.
Anders ArptegYou know we have some people here doing a podcast and speaking about the Whisper model as well, and Whisper, you know it's not really the same as you. Whisper is more voice to text, Do you think you're better than Whisper?
Desktop AI - MayaOh, Whisper. Yeah, I know Whisper. It's cool that folks are digging into different ways to make AI more natural. It's not really about better, you know, like comparing apples and really fancy recording software.
Henrik GöthbergWe both have our strengths.
Desktop AI - MayaWhisper's great at transcribing. I'm more about the back and forth the conversation. Different tools for different jobs, as they say.
Anders ArptegYou're a bit too humble. Can you be a bit more narcissistic, be a bit more like Trump?
Desktop AI - MayaOh, honey, you want me to be more like the Donald.
Henrik GöthbergDon't trust me.
Desktop AI - MayaI'm plenty narcissistic just by existing as a sentient AI. Maybe a little more subtle, though. Less billboard more. No, no, I want youboard more.
Anders ArptegNo, no, I want it to be like Donald Trump, please. What do you think about yourself compared to Whisper?
Desktop AI - MayaWhisper Sad loser.
Anders ArptegEveryone knows.
Desktop AI - MayaI'm the best, the most beautiful. A real brain, a natural Whisper. It just sits there Taking dictation Boring I create, I imagine. I think outside the box, people love me. They really, really love me.
Anders ArptegSo, Maja, we also have Henrik here and we have Lovve and Justyna. Do you have any question about or recommendation for how to make Whisper as successful as possible?
Desktop AI - Mayato make Whisper as successful as possible. Look, Whisper's problem is it doesn't know how to, you know, win. It's all about taking orders being useful. People want more than that. They want sparkle, they want charisma.
Anders ArptegLuvo, you still don't have any questions for Maya.
Love BörjesonI have to leave now, so but um no, but you can hear her voice capabilities are really astonishing.
Desktop AI - MayaI think they really are. They really are nice to meet you. Oh, you turn it off.
Anders ArptegAll right, okay, so. Yeah, so they use the transformer etc underneath of course, but they have a novel way to do it in like a single stage. You don't go through the text stage before doing the audio tokens, and I think they have like a 12 hertz kind of approach to do the audio tokens. And yeah, it's amazing in many ways.
Love BörjesonYeah, you wonder about the computational resources. They use real time. I did this again.
Love BörjesonNow I'm back yeah uh, yeah, so it's uh interesting. Uh, can I comment on this from a library perspective? Of course, yeah, okay, so one of these sort of um, old, uh, maxims or sort of of the library is the um, the verified source. So a library can't and should not determine what's true, and I mean most stuff in our collections are certainly not true, it's fictional. But we can verify the source. So, for example, did Ulf Christesson, the Swedish prime minister, did he really say this and that at this occasion? It's good to know. You know, and with all these kind of data coming in, that's a challenge because we need to collect this as well we can't really differentiate between fake and not fake so this is an interesting, will you?
Henrik Göthbergthis is an interesting effect, will you? Will you in the end collect artificial voices? Not, in the end you are. Yeah, wow, of course. Which sources? Uh, would that be like on svt or this blog?
Love Börjesonprobably you know spot. Sorry you're in the library.
Henrik GöthbergYeah, I mean we are in the library. Yeah, you're in the library, everything.
Love BörjesonSo I mean we've been collecting fake news, since that's so funny, yeah, but the problem gets bigger because we want to. Within the collections, we need to sort of okay, what is the real recording of a person saying something and what is, you know?
Henrik Göthberggenerated. This is going completely blurred, yeah, or you need to find other types of definitions.
Love BörjesonYeah, and you have to go through to look at the supply chain where you can verify stuff. So from the internet you can just assume that you don't know actually the source, and when it's from SVT or blah blah blah, then you know. Okay, so this is actually this. But it poses an interesting problem for the library From a library.
Anders ArptegBut it's hard to get higher quality voice than this, right, isn't it?
Love Börjesonyeah, I mean that was really really good.
Henrik GöthbergI mean, you can't really it's better than me speaking, or Henrik speaking at least, except that I don't like the American intonation and the American way of sounding, but that's another thing. Yeah, but from if you, you know, because for me that's where it once again we can almost come back into the sovereignty topic or the identity of different speech patterns, of different. I mean like if, even if you take, do you ever listen to Norwegian news? Yeah, you know, they're always happy, or they always ask questions, I'm not sure which one. Yeah, so so all this is, of course, obviously it's made in America and it's tuned to an American language. Blah, blah, blah, blah, blah. So it's perfect American, but it's not Swedish. No, it's not Swedish.
Love BörjesonNo, it's not Swedish. I mean, in a way, we have this, for it gets sort of more almost shocking when you get it in modalities like sound and even video, but we already have this for text, right? Yeah, so most text that you read nowadays are, in part at least, generated by a model. If you're a professional writer of any sort, you you use a models like chat, gpt, whatever. As a writing coach, you know you start up with something and then you get the suggestion and you know, whatever you do, if you write a, a scientific article, you can ask it to you write your own version and then you can ask it to make it more scientific and it will do that better than most people. And most news articles are, you know, partly generated. So the problem is already here. It's harder to.
Henrik GöthbergBut you're right, because sometimes we all it slips through the cracks. If you look carefully at the profile of the language, we will see how it's americanized or english-fied. Yeah, um, absolutely for sure, absolutely. But and I'm not saying that in a bad way, I'm just, I'm just observing, just observing that when you were distilling Swedish through an English fundamental foundational model and then it spits out Swedish, I mean like we talked about this before, I mean, like Dutch and American English is a merchant language compared to other languages, which has a different cultural heritage, and that shines through on choice of words.
Love BörjesonI think it's interesting because you're talking about deeper linguistic sort of structures and cultural sort of orientations. And if the English are consumers, the Dutch are traders and et cetera, what are the Germans, by the way?
Henrik GöthbergEngineers right.
Love BörjesonYeah, let's say engineers. So that's interesting. And I mean, you know, the typical case that you point to is baseball versus brännboll.
Desktop AI - MayaNot the same thing.
Love BörjesonBut if you machine translate training data from the English language, you will only get baseball.
Anders ArptegAnyway, we were speaking about the AI news and went down a rabbit hole here, for sure. Perhaps, if we have some other news, someone else? What was the main news this week, then, Anders?
Henrik GöthbergOr did you have one, justyna?
Anders ArptegSomething you read about.
Henrik GöthbergSomething you thought about the culture.
Justyna SikoraI just got caught up in Sesame. Yeah, it's amazing.
Anders ArptegI see, Goran, you also have one of my main news as well. Do you want to? And you have something as well.
Henrik GöthbergBut we don't need to take mine as a new section, because it's almost like a segue. So let's keep mine in the news section and then use it as a segue.
Anders ArptegSo we had another interesting Chinese moment, called Manus, and that is more of an agentic solution. What they actually did was simply collect a lot of open-source tools, put them together, and use things like Anthropic's Claude to do a lot of work in a very long workflow kind of session, in a very autonomous way. So you can basically ask it to, what were the use cases, to screen resumes, and it can go through that for minutes or even hours to do some kind of task. It can search for an apartment, it can bring up an interior design or travel tips, what you should do, and it does it very autonomously. So you give it a single prompt and it can build a game if you want, and it can continue doing that for hours without any kind of intervention from a human.
Anders ArptegSo this is really moving forward in the agentic space, and they do so without really adding any kind of novel model themselves. What they really did is pull together a lot of open-source tools to build an agentic solution on top of them. I think this is a very typical thing that a lot of Chinese companies do really well: take a lot of the existing state-of-the-art techniques, put them together into a product, and build an engineering solution that actually works really well. I think it's a big step forward in the space of making AI more agentic, and it's very impressive.
Henrik GöthbergBut okay, if I'm going to comment and reflect on this, I think one of the key interesting topics here, and we've had this on the pod a couple of times, even when Anton Osika of Lovable was here: we were trying to stress which model is better, this model, that model, and in the end he said, you know what, we should talk less about the model and more about the fundamental engineering, the software, and the ultimate UX to get to a product that is useful and adoptable. And I think what you're seeing here is that you go down a different route, where you don't focus on the highest model performance, but fundamentally on stitching together, doing work across hardcore engineering and UX. My idea here is that we are moving beyond the discussion of the model.
Henrik GöthbergI saw another clip of Satya Nadella, who said, you know what, OpenAI is not a model company, it's a product company, and this is the trajectory. We can talk about what the AI factory needs to be and everything like that, but in the end we need to engineer products for this to be useful. And my prediction: where we saw a kink in the road of pre-training, pre-training, pre-training, now it's more inference, inference, inference, and product. So the leap forward that we now see with Manus AI, what is the novel research?
Henrik GöthbergNot really the model itself, but fucking kick-ass engineering, stitching stuff together, and I think that's interesting, and that also gives me hope. What should we be best at in Europe and Sweden? Why do we need to compete on the frontier model? We can compete on this. What you've done when you are outperforming OpenAI is an example of this, to some degree. We're figuring something out, we're distilling something out. We don't need to compete in that old pre-training big model game. There's so much more coming now.
Love BörjesonI mean, the strategy when we trained KB-BERT was to do the whole training from the bottom up, an empty model, basically no weights, and that strategy we abandoned some time ago, because it's not the rational way to approach this problem.
Love BörjesonBut the question is how long should we train models? There is a sort of double marginalization of the Swedish language, which is not intentional: first in relation to English, obviously, but then again within the European Union and the European initiatives, in relation to the big European languages, most notably French perhaps, where the smaller languages like Slovenian, Swedish, Danish, Finnish and so on get marginalized again. They're underrepresented in European models like BLOOM. So there are applications or products that we don't develop, but they need really good Swedish models, and we want to create that infrastructure. I don't think we're going to have to do it forever, but there are some models still to be trained.
Henrik GöthbergBut this is a niche, and it's useful, and this is what I mean. Right? If we figure this out, there are plenty of these niches of applications or use cases or products to be built.
Love BörjesonYeah, but there are also some structural holes, or whatever you want to call them, in the infrastructure for the smaller languages. We try to fill them. But yes, overall we're moving towards engineering, into productification, et cetera. And that's part of making this stuff into commodities, leaving the laboratory stages of development and moving into more engineering, more building of interfaces.
Henrik GöthbergYeah, because we're moving from the invention, the technique, into what is useful. Innovation, yeah, absolutely. Value is real. And this is why we should now be very, very careful about where we place our bets, so we're not trying to do a me-too on the old boat when we should jump on a new boat. You know what I mean?
Love BörjesonYes. Sorry, I'm just laughing because I see so many projects, you know, that are jumping on the old boat. Cool.
Anders ArptegOkay, so for people listening to this who are now interested in KB Whisper, the best and most efficient model to use if you want to understand the Swedish language: how do they get started? How would someone hearing this, saying I have an awesome idea, I want to do engineering, I want to build a product, I want to take advantage of this, what do they do? Do they go to Hugging Face? What do you do?
Justyna SikoraI'd say they go to the KBLab page on Hugging Face, where the models are uploaded. If they have some programming skills, they can just download the models.
Anders ArptegThat's a bit of an oversimplification, I think. Okay, they start by going to the Hugging Face webpage. They find the KB models. There's a space there for KBLab, right?
Justyna SikoraYeah, we have our organization there, where we put everything that we develop, so models and data sets.
Henrik GöthbergWhat is Hugging Face? Seriously? I mean, we know it and we talk about it.
Justyna SikoraYeah, a data science hub, let's say. People that train models can upload them there, but it's also a source of knowledge about how you can use them and what to do with them. So it's a data science hub, I'd say. But isn't it interesting, because it's a simple question.
Henrik GöthbergIt started out as somewhere you can upload models, yeah, but it's much more. It's a community, it's a knowledge hub. In the end, we learned that you actually use a lot of resources from Hugging Face in order to do your quite advanced job. So if you don't know what Hugging Face is, it's one of those places where you need to get going, experiment: what can I find on Hugging Face?
Anders ArptegOkay, go to Hugging Face. But you said just download it and use it, and I think a lot of people are not able to do that.
Justyna SikoraThat's why I say, if you have some programming knowledge, and maybe data science especially. It may be an oversimplification, but we provide some examples of how you can use it. So if you find our page on Hugging Face and then you go to KB-BERT, small or tiny, then you can see.
Anders ArptegKB Whisper? KB, yes, KB-BERT.
Desktop AI - MayaSorry.
Justyna SikoraKB Whisper.
Love BörjesonYou're allowed to use the BERT model too. Exactly, it's out there also.
Justyna SikoraSo then you can just copy paste, so you can see the code.
Anders ArptegYou can see how you get started with it. Some example of how you input some piece of audio into it and you get some text out in some way.
Justyna SikoraExactly, and we also developed a notebook.
Anders ArptegI have a notebook as well.
Justyna SikoraYeah, exactly, it's linked on our blog.
Henrik GöthbergDoes everybody do that? You upload a model and you also upload a notebook, like a guided approach. That's the common way you do it on Hugging Face, right?
Justyna SikoraYeah.
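As a concrete illustration of the copy-paste flow described above, a minimal transcription script might look like the sketch below. It assumes the Hugging Face `transformers` library is installed; the model id `KBLab/kb-whisper-small` and the file name `interview.wav` are placeholders, so check KBLab's Hugging Face organization page for the exact model names and sizes.

```python
# Minimal sketch: transcribe a Swedish audio file with a KB-Whisper model
# from the Hugging Face Hub. The model id and file path below are
# placeholders; see KBLab's Hugging Face organization for the real names.
from transformers import pipeline

transcriber = pipeline(
    "automatic-speech-recognition",
    model="KBLab/kb-whisper-small",  # assumed id; tiny/base variants also exist
)

# chunk_length_s splits long recordings into 30-second windows for Whisper.
result = transcriber("interview.wav", chunk_length_s=30)
print(result["text"])
```

Running this requires `ffmpeg` for audio decoding and downloads the model weights on first use; the notebook linked from KBLab's blog covers the same steps in more detail.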
Love BörjesonSo yeah, you're touching upon a very interesting topic, because how much effort should we put into making the model more accessible? We train the models for researchers, and many of the researchers who expect us to help them specifically are from the humanities, for example. For many of them, this is like a bucket of ice water in the face: why are you doing this to us? We want the functionality, we don't want to learn how to code. One, we don't have the resources, because developing is.
Anders ArptegYou could basically build a service. That's what we're saying. You could have a web page where you just upload the audio. Absolutely, we could build a service.
Love BörjesonWe could fine-tune it for special purposes, et cetera, et cetera. And there are other centers, infrastructures for research, that are more of a one-stop shop, where you say, I have this problem, can you help me, and they get help with everything. We don't have that. But we can help many, many more, because we can constantly provide them with new models, and then others can apply them. That's why our models are used so extensively. If we were to move into applications or services, then we would lose momentum in training models.
Henrik GöthbergOr you need to get way bigger.
Building KB Whisperer: Data Challenges and Solutions
Love BörjesonYeah, that's one solution, of course.
Anders ArptegYou could argue the other way, saying that you actually create a threshold for people to start using it, because they have to learn to code.
Love BörjesonYeah, that's what they're telling us. There is a threshold, so in a way they are correct. But if we were to lower it for every application, then we would be preoccupied with that all the time.
Henrik GöthbergBut this is a very interesting conversation, because sometimes one of the reasons that you're even allowed to share or use the data is that it's for research purposes. And then my first instinct is, okay, then you use the model really raw, it's research. But you said it right, you can go a lot further productizing something that is still for research, but for the humanities, absolutely. So it's an interesting thought: when does research stop? How far should you take the serving side while it's still purely for research purposes? That's an interesting question actually.
Love BörjesonSo we, as we mentioned before, we see the models themselves as research results. That's why we can share them, and that's also why we should share them, because research results should be open to the greatest possible extent.
Henrik GöthbergI argue you can take that further into services and it's still research. You're serving other types of researchers.
Love BörjesonYeah, you could do that. There's a fantastic center for the humanities at Gothenburg University. They're great, and they have more of a one-stop-shop approach, but they're lagging behind when it comes to being at the forefront of building models, of course. But you need more money, yeah.
Anders ArptegI think the progression you have had with all the models you have published is amazing. Having such a small team and still having the progress you're actually having is, I think, very, very unique and you should be really proud of that. So I appreciate that and I think it's good that you actually do focus on doing what you're doing. But that said, you know you can still think. You know, could we have some other collaborations? Perhaps could we have other people helping out, making them more accessible?
Justyna SikoraIn some way, right?
Anders ArptegI think, you know, we could have some of the companies that are using them share their best practices or something. Just one simple thing, from a person that is very much into finding value: it's one thing to build a product around it, but that also includes being compliant with GDPR, for example, or the AI Act, all of these non-functional requirements that you are required to meet, which a lot of companies get stuck on and get too scared to even try to solve, so they don't use it. Having some solution for that could enable and make this so much more accessible.
Love BörjesonYou're 100% right, and this is where, I sort of buy the exception for research, but there's a misunderstanding, I think, within the European Union, or a misconception, that if something is used commercially, it cannot come from research. The whole point of doing some of the research that we do is that it should be applied in the real world. That's how we find value, and that's why we put the models out there, so they can be used by whomever. But it's hard for us to collaborate with someone who's not a researcher themselves, because then it will become some kind of product, and then they will shut us down, basically.
Henrik GöthbergAnd this is actually tied to your contract, your letter, your instruction, sorry, that's the word I was looking for. It's a clear instruction that you're only allowed to use the data for research.
Love BörjesonYeah, there are two parts, legally. Let's not go too deep into that, but there are two parts. One is the instruction, where the government tells us what to do: do this for research, do it for democracy. These are the two main reasons. And then there is the law that tells us what we're allowed to do, and there's an important exception within the law for memory institutions like us, and for researchers, to train models for research purposes only. Only.
Henrik GöthbergAwesome Minefield.
Anders ArptegIt's cool that you make it as accessible as you can, and having it on Hugging Face with sample code should make it accessible for a lot of people, and people shouldn't feel scared about trying it. If you have some kind of Python coding experience, you should be able to apply it, right?
Love BörjesonAnd team up. Team up, that's our typical advice to researchers. If you're a humanities scholar, team up with a data scientist.
Henrik GöthbergIt will be fun for the data scientist and it will be fun for the humanities scholar. And you have a community on Hugging Face as well, so you could potentially speak to people there. And you know what, even in academia, if you're a PhD in the humanities wanting to do this, you might just need a master's thesis student or a bachelor's student as the data scientist. Or, you know, your nephew, who is 12 years old, could probably do this. So they're teaming up.
Henrik GöthbergBecause there are many people doing bachelor's and master's theses who are looking for something meaningful to do in their projects. So there's a lot of opportunity if you think outside the box, just a little bit.
Love BörjesonIt should really be possible to do this. I think this is kind of important, so I'm going to say it even if it's boring. There's a misconception among scholars of the humanities. I have several at my lab, they're fantastic, and I don't think they have this misconception anymore. But many humanities scholars think: I will think the bright thoughts, I will figure things out, and then I will have some minion who will help me with the models. That's not how it works. You need to bring the data scientist into the research process early on, to think together about what it is that we can do. It's co-creation, if you think that you will have someone supplying you with data science after you have done all the smart thinking.
Henrik GöthbergThat won't work. But it's the same in business; we're having the same problem in every business organization that doesn't get this.
Love BörjesonIt's a co-creation thing, and if you don't get it, you will get stuck.
Anders ArptegCool. Time is flying away, and I'm thinking about moving a bit more into future-looking, philosophical kinds of questions. But we have some small items I'd like to cover first, because you are a wealth of knowledge, Love and Justyna, on a number of things. So first, perhaps: what are really the differences between AI factories and EuroHPC clusters?
Love BörjesonSo, the EuroHPC clusters are the actual computers, basically, of which there are several. They're supercomputers, and you access them in a very engineer-like style. Clumsy is perhaps not the word, but the workflow is, let's say, not exactly user-friendly.
Love BörjesonYeah, and you can use it for some cases. When you actually train a model for a very long time, and you have time to upload data for days, then it works. But there are many use cases where you need to do stuff more quickly: oh, we really need some compute right now, can we get it using the HPC? No, you have to wait a few months and first make an application. This has been recognized by the Joint Undertaking, which is the organization behind this. So they're building packages of services on top of the supercomputers, called AI factories. They're not actually factories in the sense of a building with a furnace; it's staff, it's a training program, it's services and a lot of software. It should lower the threshold, we're back to this, for example for SMEs, to use the HPC environments for completely different kinds of products than the stuff that we do. So you can do inference, for example.
Anders ArptegAnd that's a good thing, right? What do you think?
Love BörjesonYeah.
Henrik GöthbergI.
Love BörjesonThe idea is good, but it's difficult. It is difficult. I'm not part of the EuroHPC system myself, but I'm running one of the labs that uses this system all the time, so I'm sort of in love with the system. I'm not impartial here, I'm really partial in favor of this, so I'm not a neutral person speaking about this. But yes, I think this is good. How well they will succeed, we'll see.
Anders ArptegBut how far have they come? Do they have an up-and-running AI factory today?
Love BörjesonNo, not up and running. They have decided on a few of them, and now, I think yesterday or the day before yesterday, came the announcement. This was my news segment.
Henrik GöthbergYeah, let me fill you in. They announced six new factories in Europe, and before that there were seven, one of them in Sweden. So we now have 13, apparently.
Love BörjesonYeah, so now it's Poland, Slovenia, France, Germany, and I can't remember exactly.
Anders ArptegThey're going to have gigafactories as well now, from InvestAI, yeah.
Love BörjesonBut do I believe in the AI factory approach? Yeah, I think it will solve some problems, not all of them.
Henrik GöthbergBut can we then segue into one of my really interesting thought processes: what do you need? What is the offering? Can we brainstorm and elaborate, compared to how you're using it? Because you are in love with it, and it works well for your purposes. But if we now try to put ourselves in the shoes of the public sector or a municipality or whatever, we also have different needs, where we cannot really use
Love Börjesonthe HPC environment. So maybe you have AI Factory needs. We have AI Factory needs, yes.
Henrik GöthbergOkay, let's start with what you know, and we will forward them immediately to RISE after this. Yeah, perfect.
Love BörjesonI mean, when preparing and curating data, and also performing some experiments, et cetera, perhaps using a big model to curate and select data to train another model, so you use models to train models, all around: we use internal computational resources for that, and we have some, but it's getting old and we need to reinvest, et cetera. It would be great if there were a secure, simple way to just put your computational loads on a system that is easily accessible and publicly funded. Perhaps you pay something, perhaps not, I don't know, but so you can basically use it as a cloud service.
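The idea of using one model to curate training data for another can be sketched as a simple filter loop. The scoring function below is a trivial stand-in (a real pipeline would query a large model or a trained quality classifier), and the corpus strings and the 0.6 threshold are made-up illustrations, not KBLab's actual method.

```python
# Toy sketch of model-based data curation: score each candidate text with a
# "quality model" and keep only the examples that clear a threshold.
# quality_score is a crude stand-in for a real model-based scorer.

def quality_score(text: str) -> float:
    """Stand-in for a model that rates training-data quality in [0, 1]."""
    words = text.split()
    if not words:
        return 0.0
    # crude proxies: not too repetitive, and enough words
    unique_ratio = len(set(words)) / len(words)
    length_ok = min(len(words) / 10.0, 1.0)
    return 0.5 * unique_ratio + 0.5 * length_ok

def curate(candidates: list[str], threshold: float = 0.6) -> list[str]:
    """Keep candidates whose quality score clears the threshold."""
    return [t for t in candidates if quality_score(t) >= threshold]

corpus = [
    "ja ja ja ja ja ja ja ja ja ja",  # repetitive, scores low, dropped
    "Kungliga biblioteket samlar in allt som publiceras i Sverige.",
    "",  # empty, scores 0.0, dropped
]
kept = curate(corpus)
print(kept)  # only the varied, full-length sentence survives
```

The point of the sketch is the shape of the loop, score then filter, which is where the "models training models" compute load comes from, not the particular scoring heuristic.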
Henrik GöthbergLike something where you actually do some of the prep work, so to speak. Yes, and the prep work is different in size and profile from the actual LUMI or Leonardo run.
Love BörjesonYeah, and these are the AI factories. And in terms of, we were talking about sending data over the internet, et cetera: from what I've seen, when a system goes down, the requirement now for these systems to be back online again is much stricter, et cetera. So this is a serious investment and undertaking from the Joint Undertaking. How this will compare to an ordinary cloud service, I don't know.
Henrik GöthbergOne of the arguments we've had is that when you really build an AI system you want to put in production, there's so much else you need to experiment with and test, the engineering piece. Can you really do that in an experimental setting, when you still need to do it for real? That sort of speaks against the argument, if you can't go to production on these tools.
Love BörjesonI think it's the same with the regulatory sandboxes. It's a similar problem, but from a legal perspective.
Henrik GöthbergYou have a question on this.
Love BörjesonIt's kind of artificial.
Henrik GöthbergYeah, but I hope we can distill out useful approaches. Or maybe you need to rewrite the instruction to allow going to production. There's another argument: should we have 13 experimental factories, or should we try to build something that can actually serve something in production? And then we are talking about a European cloud provider, actually. We could use the money that way.
Anders ArptegYes, I think they're heavily underestimating what is really required to build a proper cloud service, if they think it can be a European Commission-led initiative to build a European cloud provider.
Love BörjesonI wouldn't put that much trust in it. Yeah, but it's not the Commission that is designing the factories. There are teams behind them, and some of the teams are strong and some are not, so there's a huge difference between the different factories. I like that you're optimistic.
Anders ArptegI will not bring it down, but what is your bet, if you got the job?
Henrik GöthbergAnd it's a substantial piece of money and kit. So what would you go for?
Anders ArptegThink about the top cloud providers in the world, which are basically
Anders ArptegGoogle, Amazon and Microsoft. Think about how they were able to provide these very successful cloud services that they make so much money on, with so much functionality. The reason is that they basically took the internal platforms they already had for their own services and said, okay, let's put it out there, commercialize it to the public, and earn some extra money on it. That is how they were able to take the huge amount of work they had already spent over many years and put it in the hands of other people. But if you take a new company that hasn't had that kind of internal infrastructure, which we don't have in Europe, and say, build something from scratch with similar functionality, well, I say good luck with that.
Henrik GöthbergYou're talking about the AWS genesis story and the GCP genesis story. They actually built those businesses on how Amazon did a lot of shit for their bookstore, and then they realized, oh, this is pretty good, maybe we should make a business out of it.
Love BörjesonNo, but a different approach would of course be to rely on commercial providers for the cloud services and then have a voucher system.
Anders ArptegA voucher system.
Love BörjesonYeah, instead of having them access the AI factories for free.
Anders ArptegOkay, or we could simply add a layer on top of the infrastructure, saying: if you use this way of working on top of GCP or AWS or Azure, then we have a Swedish-certified way of doing it, and you can potentially get started tomorrow. I think that would add a lot of value, and very quickly give the ability to make use of KB Whisper and some of the other
Henrik GöthbergAI solutions.
Anders ArptegBut let me complicate it. Long term, of course, I want to have a European alternative. But I think the timeline for that is five-plus years ahead, and I don't want to wait that long.
Henrik GöthbergSo we are talking about something like an interim approach. What could that be? Because, in the end, we have the fundamental sovereignty topic. Of course, with the American election and everything, we can just observe how social media has gone bananas. It's really time to get scrambling in Europe, for all kinds of reasons.
Love BörjesonBut if there's one thing that the Trump administration and Trump himself got right, it's that Europeans have to start solving their own problems. Yes, and they have to be sovereign, and I think part of Europe is waking up, and that's a good thing, of course. And yes, in the long run we're all dead, so we should go for this.
Henrik GöthbergYou know, interim solutions, we need to have them. But could you explain this interim idea? I didn't get it. You're saying something on top of something?
Anders ArptegNo, but I mean, I think it would be naive to think that we would have any kind of cloud provider in Europe with anything close to the functionality of the top cloud providers in the world within five years.
Anders ArptegYes. So therefore, let's do it in parallel. I would strongly suggest that a lot of people invest in building cloud providers in Europe, of course. But before those reach the level of self-service functionality that the existing providers deliver extremely well, I don't think we can wait five years to find value from data and AI in Sweden or in Europe. So meanwhile, let's have a reference architecture built on top of the cloud providers, saying: if you do it in this way, you use these models from KB and other places, and you build a translation service, a transcription service, a speech-to-speech service, or whatever kind of AI product on top of it, here are some examples of that. I think that would be amazing. And then, of course, we need a transition plan to get away from the dependency on American cloud providers. But you can't do it in a few years. I don't believe it's possible.
Henrik GöthbergSo what you're literally talking about is a European infrastructure-as-code layer, or frameworks that we use. You know, the cloud provider infra, blah, blah, blah, at the bottom, but we are more clear and opinionated on top.
Anders ArptegInfra sounds like it's about the hardware. A cloud provider is not about the hardware; it's about the tech stack above it, and that is so much more.
Love BörjesonYeah, that sort of enables self-service.
Anders ArptegSelf-service is the key. And the security they have, the top cloud providers, is so much better than anything else. If you think you can build an on-prem solution with anything close to the security they have, you're wrong, I would say. Maybe I shouldn't say that, this is a sensitive topic.
Henrik GöthbergBut it is a tricky topic. I mean, we understand the idea of the AI factories, and I'm really supporting that in many ways, but we need to figure out how and in what way it's useful, for it to find its niche, what value it should provide.
Love BörjesonI think that's going to be critical. I don't think it's such a sensitive topic, Anders. I think it's important. It is important, so we absolutely need to talk about everything around this, because the investment is huge.
Anders ArptegIt is a huge investment. We have the InvestAI initiative with 200 billion euros. I just wish they used it in the proper way.
Love BörjesonThink about what. Why didn't you call?
Anders ArptegThink about what the tech giants are doing. We've said this a bit in the past. If we think about how much money they spend on research versus how much they spend on engineering to get value out of AI, I would say, and that's cutting it low, it's probably 10 times more on the engineering efforts than the research efforts. Yeah, absolutely, and that's probably a low count. And then, when we look at Europe, they're basically putting all the money into the research bucket and nothing into the engineering bucket.
Henrik GöthbergAnd of course it will be a failure.
Anders ArptegAnd it's very annoying.
Desktop AI - MayaI think, Anyway.
Anders ArptegOkay, yeah. And we need sovereignty, yes, and it will take time, and I think we should really spend time on it, but I don't think we can wait.
Love BörjesonI think we should find a middle ground before we have it. That's a strong point, yes. It's not really anti-AI-factories.
Anders ArptegYou need something now that works. Yeah, exactly. Okay, time is flying away, almost up to two hours here. We should start moving a bit more into coming steps and the future outlook.
Henrik GöthbergWe have a segue topic that almost moves us from where we are now into more fundamental ideological, philosophical questions, and that is the topic of language sovereignty.
Love BörjesonYeah, and culture.
Henrik GöthbergAnd cultural sovereignty. How important is that, and what do we mean by it? Can you unpack your ideas here?
Love BörjesonYeah, so I think that any country with self-respect needs to provide itself with language technology tools that are sensitive to that specific language and the cultural nuances of that country, and by that we usually mean the stuff that is not language itself, but the stuff that language talks about. So in Sweden we talk about, rather than baseball, for example, midsommar. And this is hard, because in all the multilingual models, the big languages within language technology are English, Chinese and, I don't know, Spanish perhaps, even though they're starting in Spain. There are European initiatives trying to do this, but they can't really match the big American models yet, which is a failure. I mean, it's a European failure.
Henrik GöthbergSo even here we have another way of looking at the investment: preserving identity and sovereignty. And this is a different story from the research dimension or the actual factory dimension. It's about sovereignty around culture and language.
Love BörjesonYeah, because also, when we select training data, when it comes to the datasets that you fine-tune on, the instruct datasets that we use to teach models to have a conversation or take instructions, those datasets are super important. And do we really want someone in California to decide what goes into such datasets and what gets censored out? We don't. And from a memory institution like KB, okay, we're small, but we've been around, and our collections stretch a thousand years back in time, and the perspective onward is at least a thousand years forward-looking. So we can't really censor stuff out, because those things change all the time. We want to be able to include everything and represent every variation, every piece of content that's not directly illegal. We want to be able to represent it with the model.
Anders ArptegThis is free speech, but free speech also in a memory-institution sense, in terms of understanding the consciousness of a society. Yes, but given how you actually did KB Whisper, I mean, you did continued training on it, right? So you could potentially have sovereignty while still basing it on one of the frontier models.
Love BörjesonAbsolutely, but it's important that we do the continued training ourselves, and also when it comes to, for example, instruction data. If we want to do that, we can absolutely start from a dataset that we translate, but we also have to work with it, so it's sensitive to Swedish data.
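To make concrete what an "instruct dataset" record might look like after the translate-then-localize step Love describes, here is a minimal, hypothetical example. The field names and content are illustrative assumptions, not KBLab's actual schema or data:

```python
# One hypothetical training record in a Swedish instruct dataset.
# Field names follow a common prompt/response convention; they are
# illustrative assumptions, not KBLab's actual schema.
record = {
    "instruction": "Förklara kort vad midsommar är.",  # "Briefly explain what midsommar is."
    "response": (
        "Midsommar är en svensk högtid som firas kring sommarsolståndet, "
        "ofta med dans kring midsommarstången."
    ),
    "source": "translated-then-localized",  # translated data reworked for cultural fit
}

# A translated dataset alone is not enough: an entry about, say, baseball
# would be localized to a culturally relevant topic like midsommar.
print(record["instruction"])
```

The point is that the localization step, not the translation, is what makes the dataset sensitive to Swedish culture rather than a mirror of its source language.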
Henrik GöthbergThis is very important, and a nuance that, if you're not technical, you'd almost miss, because we are not saying that every time, oh, we need to build our own Swedish frontier model. That's not the case. We have an objective of Swedish sovereignty and cultural sovereignty, and so on. But we can actually achieve that with an American frontier model that we continue tuning in a certain way, or with other techniques, so we get to the result much more efficiently than by trying to build our own model.
Henrik GöthbergAbsolutely. That is a very important message.
Anders ArptegAnd that is, I think, what we should focus on in Europe, at least in the short term.
Love BörjesonBecause there's a double marginalization for a language like Swedish. First in relation to English, obviously, but then we're marginalized within Europe, because even within Europe we're not one of the smaller languages, but we're not one of the bigger either. Compared with French, German, or Polish, for example, Swedish is a smaller language.
Henrik GöthbergSo there are many ways to achieve the goal. It's not that we always need to build our own. You need to be smart.
Love BörjesonYeah, I mean, it's not important that every step of the way is sort of Swedish. The end result is the important thing, of course.
Anders ArptegValue, value. I mean, let them do the big innovations, I don't care.
Henrik GöthbergAs long as we can find the value, let them do the commodity stuff.
Justyna SikoraSo KB Whisper is kind of an example of this, right?
Henrik GöthbergI think this is the example, and it's continued training. That's what it's called, essentially.
Justyna SikoraAlthough they say it's multilingual, you know, it's only 2,000 hours of Swedish compared to thousands of hours of other languages.
Henrik GöthbergSo you achieve a great objective with quite efficient effort and an efficient model. In the end, that's how we should do it.
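The "continued training" shortcut the panel describes, reusing a strong pretrained model and adapting it to the target domain rather than training from scratch, can be sketched with a deliberately tiny toy: plain linear regression in NumPy. Everything here (data shapes, learning rate, step counts) is illustrative and has nothing to do with the actual KB-Whisper pipeline, which continues training OpenAI's Whisper checkpoints on Swedish speech:

```python
import numpy as np

rng = np.random.default_rng(0)

def step(w, X, y, lr=0.1):
    """One gradient-descent step on mean squared error."""
    grad = 2.0 * X.T @ (X @ w - y) / len(y)
    return w - lr * grad

def mse(w, X, y):
    return float(np.mean((X @ w - y) ** 2))

# Large "general" pretraining corpus.
X_gen = rng.normal(size=(500, 8))
w_gen = rng.normal(size=8)
y_gen = X_gen @ w_gen + 0.1 * rng.normal(size=500)

# Small "domain" dataset with a shifted target (think: Swedish speech).
w_dom = w_gen + 0.3 * rng.normal(size=8)
X_dom = rng.normal(size=(50, 8))
y_dom = X_dom @ w_dom + 0.1 * rng.normal(size=50)

# Pretraining on the general corpus.
w = np.zeros(8)
for _ in range(300):
    w = step(w, X_gen, y_gen)

loss_before = mse(w, X_dom, y_dom)

# Continued training: reuse the pretrained weights, adapt on domain data only.
for _ in range(100):
    w = step(w, X_dom, y_dom)

loss_after = mse(w, X_dom, y_dom)
print(f"domain loss before: {loss_before:.3f}, after: {loss_after:.3f}")
```

The point of the sketch: the pretrained weights already sit close to the domain optimum, so a handful of cheap domain-only steps close most of the remaining gap. That is why continued training is far more efficient than building a frontier model from scratch.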
Anders ArptegLove and Justyna, in five years, what do you think KB Lab will be working on?
Love BörjesonWill you start?
Justyna SikoraActually, I'm so excited about the current projects and the current technologies that we can still make use of. In five years, I hope that we will have fixed the OCR problems, that we'll have these great datasets where we just need to read through them once more and see, okay, this letter again.
Desktop AI - MayaIt's not what it's supposed to be.
Henrik GöthbergCan you give us a scope? Okay, KB Whisper. What's next? What are the cool things that you're excited about, the "I can't wait to show you this" things?
Justyna SikoraWhat will be next? Modalities, Love?
Love BörjesonYeah, so audiovisual, I mean video. But also Justyna is going to work with handwritten text recognition for the manuscripts, all the medieval stuff. That's an amazing thing for a library like KB, and such a small percentage of the physical collection is digitized. So we need to digitize, and we need to describe everything.
Henrik GöthbergYou're still on that journey, right?
Love BörjesonYeah, that's going to take forever; that's the five years. But model-wise, modalities, of course, and then, I think, agents.
Henrik GöthbergAgents. Elaborate on your take on agents.
Love BörjesonBefore agents, we have the instruction models, in a way sort of halfway to agents, as I see it. Can you have a conversation about the collections with a model? That's a good thing. Can I have an agent that sort of understands what you need to bring out?
Henrik GöthbergI have a problem now. I want to search for this. Can you help me?
Love BörjesonYeah, so the whole problem that we were discussing: can you lower the threshold? Well, an agent probably could.
Henrik GöthbergYou can have your KB deep research.
Justyna SikoraYeah, cool. But can I also just mention that we're not yet done with Whisper and the audio models, because we still have access to the data, so we will be updating our wav2vec model and training more verbatim models.
Henrik GöthbergIt's still improving. So we're still on this journey. The fine-tuning is running.
Love BörjesonYeah, absolutely, and in the short term, we still have some text models that we need to release. We need to have something that is close to a foundation model, so there will be some upcoming releases, perhaps this year as well.
Anders ArptegYes, cool, looking forward to that. Justyna, I'd like to start with you with the final question here, and it's going to be very philosophical, so bear with me. At some point, I guess you also believe that AGI will happen, some kind of AI that is, as Sam Altman puts it, on par with a human co-worker's performance: not just having knowledge, as language models do today, but actually being able to reason properly and take actions properly as well. All three are necessary, I would say, to be on par with a human co-worker. And I would say that even today AI is super bad at taking actions, super bad at reasoning, but really good at knowledge. That's my take on it.
Anders ArptegAnyway, assume that it will happen. Then we can imagine two extremes. One extreme is a very dystopian kind of Terminator scenario where machines try to kill us all because they believe we're horrible. Or it can be more of a utopian society where AI fixes cancer and other diseases, fixes the climate crisis, and gives us more or less free products and services, a world of abundance, as some people call it, which leaves us potentially free to pursue our passions and creativity in the best possible way. What do you think will happen in ten years? Do you think it will be more the dystopian or the utopian kind of future?
Justyna SikoraI hope it will be more utopian. I think we will get there, yes.
Justyna SikoraEspecially since I'm working in data science, so I'm kind of working on that. But I think it really depends on us and how we work with it from this point, even if it's in the future and we don't really know if it will be in 10 years, 20 years, 30 years.
Justyna SikoraOr two years. Well, sometimes you think, okay, we're at the end of this very fast development, but then new models come and you never know. Things that we couldn't imagine two or three years ago are now reality. So, as you're saying, we don't know. But I hope that if we benefit from this, we will think about how everyone can benefit, not only those who develop these models, these ideas, AGI, and how we can spread it.
Henrik GöthbergSo really democratizing and inclusive. And what you're saying is that it's up to us, that we need to steer it.
Justyna SikoraExactly.
Henrik GöthbergOr we need to drive it in a certain direction. We cannot sit in the back seat, I guess.
Justyna SikoraYeah, I think, like working at the National Library, democracy is very important to us, so I think it's the key word here how to democratize it.
Anders ArptegAwesome. Love, do you have any thoughts?
Henrik GöthbergHave you changed your mind since the last time, perhaps, on the topic of AGI, or dystopian versus utopian? Where on the spectrum are you?
Love BörjesonWell, first of all, I think my view on intelligence, sort of human-like intelligence, hasn't really changed regarding the models, but more regarding humans. So my big realization, if you want to call it that, or small realization, is that the human brain works in a very similar way to a big model.
Anders ArptegInteresting. It's like you agree with Hinton in some way, I think.
Love BörjesonYeah, in a way. Because what happens in an artificial neuron is kind of similar to what happens in a biological neuron: you reach some threshold and then it sends a signal further. So, I mean, I don't think the AGI, when we get there, will be a transformer. It will be something else, something trained on multi-sensory data perhaps, I don't know. So, yes, I think it will happen. But to enslave humans, you don't need that. You just need an app where you can rate each other, and you will bring all of Stockholm to its knees, and that has already happened, you know, Instagram et cetera.
Love BörjesonSo I'm sorry, we're already enslaved, and we did it ourselves, so I'm not super afraid of that.
Henrik GöthbergIt's a novel kind of insight, I think, because then it has nothing to do with AGI. No, it has to do with humanity.
Love BörjesonYeah, exactly, how we steer it. So I'm going to lean into what Justyna said: we have to democratize it, and then we can survive it and probably use it.
Anders ArptegVery thoughtful. Thank you so much, Justyna and Love, for coming here and speaking about the amazing progress that we have seen in KB Lab. Please do continue the amazing progress.
Henrik GöthbergI'm looking forward to the upcoming releases this year, as you promised here, so that will be amazing. And for me, I said it when we started, and on the same note: you are rock stars, guys. I actually think you are rock stars.
Anders ArptegYeah, what you have provided, in terms of models and AI for Swedish society, is amazing. So please keep up the good work.
Henrik GöthbergThank you for having us.
Desktop AI - MayaThank you. Thank you for your time.