AI WATCH EPISODE #20

Succeeding with LLM Projects: Expectations & Key Challenges

By Simon Moe Sørensen, Guilherme Costa

Listen on Spotify

Watch on YouTube

Welcome to AI Watch Episode 20!

In this episode, we explore the practical realities of implementing large language models. Data scientists Simon Moe Sørensen and Guilherme Costa, drawing from their experience with real-world client projects, delve into the crucial aspects of bridging the gap between market expectations and what’s truly achievable with LLMs today.

Whether you’re an AI enthusiast or a business leader, this episode is packed with valuable knowledge!

VIDEO TRANSCRIPT

Succeeding with LLM Projects: Expectations & Key Challenges

Simon: Hello and welcome to AI Watch here at 2021.AI. I’m Simon. I’m a data scientist here.

Guilherme: And I’m Guilherme. I’m also a data scientist here at 2021.AI.

Simon: And today we’re going to talk about some of the difficulties and challenges that we faced developing projects with large language models.

Guilherme: I think it’s important to make clear that we’re addressing large language models specifically as a starting point, and this comes from our experience in actually delivering LLM-related functionality to our customers. Large language models in themselves are not a product, right? You always need to think a bit about: are we going to use them to create a feature or a product that’s actually helpful inside our organizations? So I would say challenge number one is: first we think about what we want to achieve, and then we see whether and how we can use the technology to get there.

Simon: I assume you’re going to go into this later, right? Diving into what it actually means to implement large language models in an organization? Because it is not just plug and play. It’s not a standalone product where we just put in some data and then we can chat with it and it answers correctly all of the time.

Guilherme: Definitely. And they’re, of course, very nice and very impressive. But as you were saying, when we go out and deploy this technology, we need to build software. And as we, and everyone in the industry, unfortunately know, building software is kind of hard, right? So one thing is coming up with a demo. The other thing is deploying LLM functionality as proper software for an organization, with everything that comes along: how do you take care of authentication, logging, reliability? There is an engineering side to it which makes it much harder than just saying: “We’re going to use an LLM to produce text.”. And in our experience, I think we can already safely say there’s a bit of an expectation mismatch between what we can deliver, how fast we can deliver it, and what the market wants. The market has seen what an LLM can do, but the industry isn’t yet mature enough that we can easily wrap that functionality in a nice piece of software and deliver it tomorrow, let’s say. I think that’s the message, yeah.

Simon: Yeah, I think that’s a really important point, because just because we have a new AI tool doesn’t mean that all of the underlying problems a lot of organizations had have suddenly been solved. It doesn’t mean that everyone suddenly moved to the cloud and has a full team of experienced DevOps engineers with a very mature infrastructure and setup. It means we’re coming in with a new AI tool that has opened up more interesting use cases, I would say, especially human-to-chatbot service interactions that add a lot of value. For example, the project I’m working on with Rigshospitalet is very similar to this: the hospital gets a lot of phone calls about very mundane, simple, and straightforward questions that have actually already been answered in a written document. This use case is perfect for LLMs, because we deploy the solution using the Grace platform, and then people can interact with the application through our interface and get this information from the already written documents. And I think another point about these use cases, maybe changing the subject a little bit, is about where you put the expert in the use case. In the Rigshospitalet project, you have the expert on the application side, meaning that the large language model is the one responsible for the knowledge. But when you and I use AI and large language models every day for our programming, we are the expert, and that changes a lot of the dynamics of the use case. When you give a very powerful tool to someone who knows a lot about the subject, the tool is amazing. Right?

Guilherme: Got it. Yes, I agree. Yeah.

Simon: You don’t pay a guy to like hammer with a hammer. You pay him to hammer in the right place.

Guilherme: Many times, not always, but when you do software, even in-house software where you’re only delivering functionality for your organization and not exposing it to the outside world, you control the environment in many scenarios. But here, most of the organizations leveraging LLMs are not in a position to do that. What do I mean by that? We rely on external APIs, typically OpenAI’s or Anthropic’s or whatever, so there’s an inherent limitation to how you use this technology: you don’t control it, and you don’t control it in two different ways. First: of course you don’t control the APIs. They’re not your APIs; they can change. And second, very importantly: LLMs are by nature non-deterministic, and there’s a bit of a challenge to that. You can put in the same inputs and get different outputs. That’s normally not how you think about software, or sometimes even AI systems. And also something I think we should talk about, and we now have real experience with this: how do you properly test LLM functionality? Right?
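The non-determinism Guilherme mentions comes from sampling at decode time. As a rough illustration (a toy sampler, not any real API; `sample_next_token` and its candidate scores are made up for this sketch), greedy decoding at temperature 0 is repeatable, while temperature-scaled sampling can return different outputs for the same input:

```python
import math
import random

def sample_next_token(logits, temperature, rng):
    """Toy stand-in for an LLM's decoding step: pick one token
    from a score distribution. Real APIs hide this behind a
    single completion call."""
    if temperature == 0:
        # Greedy decoding: always the highest-scoring token.
        return max(logits, key=logits.get)
    # Temperature-scaled softmax sampling: higher temperature
    # flattens the distribution, so repeated calls can differ.
    scaled = {t: math.exp(s / temperature) for t, s in logits.items()}
    total = sum(scaled.values())
    r = rng.random() * total
    acc = 0.0
    for token, weight in scaled.items():
        acc += weight
        if acc >= r:
            return token
    return token

# Hypothetical candidate continuations for "two plus two equals ..."
logits = {"four": 2.0, "4": 1.8, "maybe four": 0.5}
rng = random.Random(42)

# At temperature 0 the output is stable across 100 calls...
greedy = {sample_next_token(logits, 0, rng) for _ in range(100)}
# ...while at temperature 1 the same input yields a mix of outputs.
sampled = {sample_next_token(logits, 1.0, rng) for _ in range(100)}
print(greedy)   # {'four'}
print(len(sampled) > 1)
```

This is why an exact-match unit test on an LLM answer is fragile: even with the same prompt, the production path may not reproduce the string you asserted on.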

Simon: Yes, indeed. But the thing about large language models is that we’re trying to put a triangle in a circular box, right? Large language models are built for hallucination. I know it sounds weird, but hallucination has only become an issue because people suddenly see LLMs as a source of ground truth. The transformer technology, in its very building blocks, has been made to hallucinate. The reason why is that we can have Eminem write a Shakespeare song, right? That is a hallucination, and that is what made these models extremely attractive in the beginning. But now people are trying to move the scope of LLMs into this very square field where you can say: “Two plus two equals four.”. For an LLM, this is not the case. Just because we changed the use case of LLMs doesn’t mean the technology follows immediately.

Guilherme: Yeah, obviously what they were trained to do was to produce, let’s call it “natural language text”. And natural language text is, by its very nature, non-deterministic, right? And I would say we can discuss testing in this scenario, because the way you go about testing these solutions is very interesting, and we’ve had experience with this. Let’s say you have an LLM to reply to some predefined customer inquiries about your organization: Where is it located? What do you do? How can I get in touch with whoever? Maybe you have a predefined list of the outputs the LLM should produce in each of these cases, but you cannot really count on getting exactly those outputs. So the question then becomes: how do we test whether the LLM is actually giving us the right answer? And I think it’s very interesting that we actually get this push from our clients, from our customers. They come back to us and say: “Okay, you’ve built this for us, but is it good? What’s the quality? How confident can I be that this solution is giving us the right answer? How do you test this on a regular basis?”. And we need to be honest: we’ve done lots of things, but I don’t think there’s a predefined standard answer in the market for this yet.
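One pragmatic sketch of such a test, under heavy assumptions (the `ask_llm` stub and its canned answers stand in for a real model call, and word overlap is a crude proxy; real setups often use embedding similarity or an LLM-as-judge), is to score answers against reference answers with a similarity threshold instead of exact string equality:

```python
def token_overlap(a: str, b: str) -> float:
    """Jaccard similarity over lowercased word sets: a cheap proxy
    for 'do these sentences say roughly the same thing'."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 1.0

def ask_llm(question: str) -> str:
    """Hypothetical stand-in for a real LLM call; in production the
    wording of these answers would vary between runs."""
    canned = {
        "Where are you located?":
            "Our offices are located in Copenhagen, Denmark.",
        "What do you do?":
            "We deliver and deploy AI solutions to organizations.",
    }
    return canned[question]

# Reference answers the chatbot *should* convey, in some wording.
eval_cases = {
    "Where are you located?": "We are located in Copenhagen, Denmark.",
    "What do you do?": "We deliver AI solutions to organizations.",
}

THRESHOLD = 0.5  # tuned per use case, not a universal constant

results = {q: token_overlap(ask_llm(q), ref) >= THRESHOLD
           for q, ref in eval_cases.items()}
print(results)
```

The point is not this particular metric, which is deliberately simplistic, but the shape of the test: assert on meaning-level agreement with a threshold, because asserting on the exact string will fail for perfectly good answers.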

Simon: No.

Guilherme: When you put two different people in a room and you ask them: “Is this sentence conveying the same meaning as this one?”, lots of times there’s no agreement, right? So we cannot really have the realistic expectation that the LLM will solve this problem for us.

Simon: Any software, or basically almost any product that you develop in a company, you test, you know? You evaluate: is this good enough? If it’s software, you write unit tests. If it’s, I don’t know, some kind of physical product, you go out into the market, select a small group of people, and see how they react to it, or run ad campaigns, or whatever. On LinkedIn, mostly LinkedIn, there are a lot of people saying: oh, you can do this with LLMs, and now you have graph LLMs, and look into agents. Yeah, cool, very interesting concepts, and we definitely tried them out and they work pretty nicely. But there’s a very big difference between sitting on your local PC building something cool for your own little personal project, where you have a much higher fault tolerance, and going into production in a company setting.

Guilherme: So I guess what we should say, on a generic basis, is that it’s hard, right? It’s hard to reason beforehand about whether something will be very successful, even if we have done this kind of nice, cool demo, as you would say, of one particular functionality.

Simon: And this gap is what we’re trying to bridge by building, for example, the red teaming framework and ensuring that we can evaluate our LLMs. You say: “Going from 90 to 99%.”. But how do you know when you’re at 90? How do you know when you’re at 99? These are some of the most important things that every LinkedIn post I’ve ever seen has forgotten. None of them has ever mentioned what it means to actually build it and move things into production so people can use it. If you were to start a new project within your company, and you have already done a little bit of research to say: “Okay, this is about natural language processing.”, either in a chatbot format, or making some content for the marketing team, or anything else. You identified the use case. It’s good. Then what do you do now?
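On the “how do you know when you’re at 90?” question, one honest answer from basic statistics is that the eval set has to be large enough for the measured pass rate to mean anything. A small sketch (illustrative numbers only, using the normal approximation to the binomial for a rough 95% margin of error):

```python
import math

def pass_rate_with_error(passed: int, total: int):
    """Observed pass rate plus a rough 95% margin of error
    (normal approximation to the binomial proportion)."""
    p = passed / total
    margin = 1.96 * math.sqrt(p * (1 - p) / total)
    return p, margin

# With 50 eval cases and 45 passes: 0.90 +/- roughly 0.08,
# far too wide to tell 90% apart from 99%.
small = pass_rate_with_error(45, 50)

# With 1000 cases and 900 passes: 0.90 +/- roughly 0.02,
# which starts to support a claim like "we went from 90 to 99%".
large = pass_rate_with_error(900, 1000)

print(small, large)
```

In other words, claiming a move from 90% to 99% quality presupposes an evaluation set big enough that the two numbers don’t overlap within their error bars.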

Guilherme: You need to think about functionalities and features that add value to the organization, number one. It is of course important for people to start getting familiar with this technology, because it’s here, it will be here in the future, and it will have a huge impact; I understand that. For every client we have, it’s hard to imagine that they won’t have a specific use case where they can use this technology with a purpose, not just to try it out. That’s my first thought. My second thought is about how you can reason about this: if there’s a situation where you would benefit from a machine that can kind of reason like a human, or produce text similar to a human, or interpret text the way a human can, then for sure there’s potential to discuss a use case. That’s my second idea. And my third idea is that you need to be talking to outside experts, people who have experience with this and can guide you on what you should be doing. From our experience, if you’re going to have something like 20 different tool calls in your LLM agent, it’s not going to work. You need to split it into two LLMs.

Simon: We found through many of our customer projects that an iterative or “agile” approach, as some people also like to call it, is crucial for the success of a project. That means being able to go back and forth, not only during development as the LLM solution evolves, but also in discussing the use case together. When you begin the project, talk with the experts who have some intuition, not complete intuition, because it’s a new technology, but some intuition about what is possible and what is not. If you do that in the beginning, get expert feedback, and find a good middle ground where you provide a lot of value for the business and use the right AI tool based on the experts’ opinion, then your projects can be successful in the end. You won’t dive into this: “Oh yeah, now we added a new use case and all of a sudden the solution doesn’t fit.”, where the customer thinks: “Oh, but it’s just one use case.”, but the technical side thinks: “Well, this use case is completely different, from an architectural point of view, from what has been developed.”.

Guilherme: It may look the same because we’re using an LLM, but from a technical point of view it may be completely different. You need people with expertise, but then there’s everything on top of that to make it proper software, and for that you can, and should, get the right tooling. We have our own, of course. And what I mean by that is: if you’re, let’s say, deploying an internal chatbot for your company…

Simon: Yeah, like how do you deploy it? Where do you deploy it? Who builds the UI?

Guilherme: The database? The logging? Go into the market and buy solutions for it. It’s a growing market and there are some technology options that you can consider.

Simon: So thank you for watching and thank you for listening. We hope you got something useful out of it. Thank you so much.

Guilherme: Thank you very much.

AI Watch speakers

Guilherme Costa

Senior Data Scientist, 2021.AI

Guilherme Costa is a Senior Data Scientist at 2021.AI with 10 years’ experience in software development and a strong engineering background.

Simon Moe Sørensen

Data Scientist, 2021.AI

Simon Moe Sørensen is a Data Scientist at 2021.AI. He is a passionate AI enthusiast with a dedicated drive to assist businesses in successfully harnessing and integrating AI and Data Science capabilities, specifically Generative AI solutions utilizing LLMs to achieve unparalleled outcomes.
