What’s the BUZZ? — AI in Business
“What’s the BUZZ?” is a live format where leaders in the field of artificial intelligence, generative AI, agentic AI, and automation share their insights and experiences on how they have successfully turned technology hype into business outcomes.
Each episode features a different guest who shares their journey in implementing AI and automation in business. From overcoming challenges to seeing real results, our guests provide valuable insights and practical advice for those looking to leverage the power of AI, generative AI, agentic AI, and process automation.
Since 2021, AI leaders have shared their perspectives on AI strategy, leadership, culture, product mindset, collaboration, ethics, sustainability, technology, privacy, and security.
Whether you're just starting out or looking to take your efforts to the next level, “What’s the BUZZ?” is the perfect resource for staying up-to-date on the latest trends and best practices in the world of AI and automation in business.
**********
“What’s the BUZZ?” is hosted and produced by Andreas Welsch, top 10 AI advisor, thought leader, speaker, and author of the “AI Leadership Handbook”. He is the Founder & Chief AI Strategist at Intelligence Briefing, a boutique AI advisory firm.
How MLOps and LLMOps Drive Consistent Results (Kristen Kehrer)
AI is moving quickly from experimentation to deployment, but what does it actually take to operationalize AI successfully?
In this episode of “What’s the BUZZ?”, host Andreas Welsch speaks with Kristen Kehrer about the operational realities behind deploying AI, LLMs, and agentic systems in enterprise environments.
Three key insights stand out:
- Production AI requires more than a successful demo
  Many organizations underestimate the complexity of moving AI systems into production. Reliable deployment requires monitoring, governance, iteration, and collaboration across business and technical teams.
- Clean data and knowledge bases remain essential
  Even advanced AI systems depend on high-quality documentation and structured information. Weak knowledge bases often lead to unreliable outputs and poor user experiences.
- LLMOps introduces a new operational layer
  Managing prompts, retrieval pipelines, evaluations, and interaction quality has become critical as organizations scale customer-facing AI systems and AI agents.
A practical reminder: successful AI adoption is not about deploying the newest model first. It is about building reliable systems, strong operational processes, and the right collaboration between people and technology.
Listen to the full episode for a grounded perspective on what it takes to operationalize AI at scale.
Questions or suggestions? Send me a Text Message.
***********
Disclaimer: Views are the participants’ own and do not represent those of any participant’s past, present, or future employers. Participation in this event is independent of any potential business relationship (past, present, or future) between the participants or between their employers.
Level up your AI Leadership game with the AI Leadership Handbook (https://www.aileadershiphandbook.com) and shape the next generation of AI-ready teams with The HUMAN Agentic AI Edge (https://www.humanagenticaiedge.com).
More details:
https://www.intelligence-briefing.com
All episodes:
https://www.intelligence-briefing.com/podcast
Get a weekly thought-provoking post in your inbox:
https://www.intelligence-briefing.com/newsletter
Everyone thinks the hard part is getting AI into your business, but what's even harder is keeping it there. So today we'll talk about how MLOps and LLMOps help you drive better results. So join me for this one. All right. Hey, welcome back to another episode of What's the BUZZ?, where leaders share how they have turned hype into outcomes. Today we'll talk about how MLOps and LLMOps help you drive better quality and better results from your AI systems. And who better to talk about it than someone who's actively involved in that and who's been working on these topics for a long time: Kristen Kehrer. Kristen, thank you so much for joining.
Kristen Kehrer: Hi. Thank you so much for having me.
Andreas Welsch: Wonderful. Why don't you tell our audience a little bit about yourself, who you are and what you do.
Kristen Kehrer: Hi, sure. So I'm Kristen Kehrer. I've been a data scientist since 2010. I started doing econometric time series analysis and forecasting in the utility industry. I moved into healthcare, doing things like helping motivate patients to get their colorectal cancer screenings. And then I spent six or seven years in e-commerce doing your traditional ML-type e-commerce use cases before going off on my own, where I had the opportunity to work for a company that was specifically in MLOps. I've helped build an LLM-powered application for a healthcare consultancy. I was the head of data science for MoneyGram. And I love building other types of things, like the computer vision example I built to detect school buses going by my house. So I absolutely live and breathe this space and couldn't imagine doing anything besides data.
Andreas Welsch: We have a lot of good topics to talk about today. I just came back from the Gartner Data Analytics Summit in Orlando a couple days ago, and I saw the conversation there was still very much about data, obviously, and analytics, as you'd expect. Whereas a lot of times when we talk about AI, when we talk about agents these days, it's usually more at the business level: how do we use agents, what do we use them for? So for me it was great to see that grounded in the reality of, hey, we actually still need data. It doesn't go away. But having been to other conferences and vendor conferences, I know that there are lots of polished demos. They look very shiny, they're quickly put together, but really the crunch point is getting AI into production safely. And that's a whole different thing. What are you seeing? What do leaders need to keep in mind as they embark on this journey of bringing AI, bringing agents, bringing even machine learning and predictive into their organization?
Kristen Kehrer: Yeah. So first of all, I had such FOMO seeing your photos from Gartner, that looked amazing. I think the big issue we're seeing is that now I built my first LLM demo and it took me a couple hours, right? But when we try to take something like that to production, there is iteration in a way that is just more intense than we see with traditional ML. The amount of collaboration required is a lot higher, and I think that leaders aren't necessarily prepared for the amount of collaboration that is involved. They hear that we're gonna do an AI project and they assume that it's gonna look like traditional ML, where you scope it, you give it to a data scientist, they're able to grab the data, and the metrics that they're working with are trying to reduce false positives, trying to improve the AUC or some loss function. And now, with the outputs, we have to be concerned with brand tone. We have to be concerned with accuracy. All of a sudden the amount of data that we're working with is different. It's not just data from a database. It might be our internal docs; it's some knowledge repository that might be unstructured text data. But then there's also the prompt templates, the retrieved context that we're getting before we send it to an LLM, the few-shot examples that we're including. There are all of these pieces, and all of them, when they change, can change the model behavior. And so there's just so much more to manage and make sure is versioned, so that what we're putting into production is actually auditable and we have some sense of confidence about what it is that we're putting out there, which is becoming much more likely to be customer-facing than it was before.
Andreas Welsch: See, I see all these great demos on LinkedIn, and people say, hey, look, this is what I'm building, this is what I've vibe coded in an afternoon. But I'm pretty sure it looks a little different if you're trying to do something like that or bring this kind of AI into an organization, into an organization that's exposed to financial, legal, reputational risk, where it's not just, oh, let me fix this agent if it doesn't do what it's supposed to, let me tweak the prompt right here. If this thing is in production, if it's live, if you're even in a regulated industry, financial services or life science, healthcare, that's a whole different ballgame, right?
Kristen Kehrer: Oh yeah. No, it absolutely is. And because, too, with the demo, or even when I was building something where the goal was to get it live at some point, you can get some fantastic output. I've created videos myself where I get fantastic output and I write it up or I put it up on YouTube, and I'm helping others understand how to build a RAG system and what to think about when you're creating embeddings, when you're retrieving. So those are useful. But at the same time, if I'm creating something for real, there's a ton of edge cases, and some of the things are really simple. If I'm going to be sending a communication to a cancer patient, I'm getting output and I know it's a letter that may go off to a cancer patient. I know that I don't want to put a ton of exclamation points in that. As a data scientist, I can make some judgment around how that communication text should feel. But at the end of the day, there's a lot there where I don't have the domain expertise. You say you want that going out at a third grade reading level. What is a third grade reading level? If you ask the system for a third grade reading level, there are multiple ways of measuring that, and the LLM isn't going to know exactly what you mean. So now what are we talking about? Is that the number of syllables in a word? Is it that we're gonna have fewer than 10 words in a sentence? We need to think more deeply. And this requires the subject matter expertise of the business around exactly how they want to communicate, which are often things they've learned through communicating with each other, but they haven't thought about how to actually instruct a computer to do. These are just a lower level of instruction that we all need to work out together. And so even though I may get some great output for one use case, when I try it for a different use case, it might not work.
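One common way to pin the phrase "third grade reading level" down to a number is the Flesch-Kincaid grade formula, which combines words per sentence and syllables per word. The episode doesn't name a specific metric, so this is just an illustrative sketch; the naive regex-based syllable counter is an assumption, not a production-grade tokenizer:

```python
import re

def count_syllables(word: str) -> int:
    # Naive heuristic: each run of consecutive vowels counts as one syllable.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_kincaid_grade(text: str) -> float:
    # Rough sentence/word splitting; real readability libraries do better.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * (len(words) / len(sentences))
            + 11.8 * (syllables / len(words))
            - 15.59)

simple = "The cat sat on the mat. It was warm."
dense = ("Operationalizing retrieval-augmented generation requires "
         "systematic evaluation of heterogeneous artifacts.")
print(flesch_kincaid_grade(simple) < flesch_kincaid_grade(dense))  # True
```

Two teams could reasonably pick different formulas (Flesch-Kincaid, SMOG, a max-words-per-sentence rule), which is exactly why the target has to be agreed on explicitly rather than left to the model.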
And so, how do we version all of these prompts, so that when it works for some cases but not others, and I've gone off and tinkered some more and now the other ones aren't working, we're still thinking about this in a methodical way? The versioning is what allows us to get there, to audit it and see what truly gave us the best responses, for something that is very iterative in a way we haven't seen before.
Andreas Welsch: So it sounds like that's the part where, as you mentioned earlier, collaboration is key. Again, coming back to business and technology teams needing to work together to figure out what it is we actually need to do, and how we can put the business requirements in technical terms and the technical capabilities to support the business.
Kristen Kehrer: Yes.
Andreas Welsch: Now, we've obviously talked quite a bit already about MLOps and LLMOps, but what are the key differences that you see between the two? Why do you even have two terms?
Kristen Kehrer: Okay. So LLMOps is an extension of MLOps. Machine learning operations is all of the operations that give you the ability to reproduce your results and audit your results. So this is tracking your data, your data versions, and the lineage, so that you can see how it changed over time and tag that to a particular model version. It's your experiment tracking. Back in the day, in 2011, I'd build these bespoke models where I'm fitting a polynomial distributed lag of price or whatever. And I'm deciding, hey, if I'm gonna include residential housing permits in my model, how many months should I lag that? And I'm fitting each variable and determining how to dummy out certain pieces of information to get a more accurate forecast. Over time, we aren't fitting as routinely. But you can imagine, back in that case, I might have fit that model 30 times before I found the ideal model, and maybe the best model was actually the 20th one that I tried. Previously, I was writing that in a paper notebook, and the hope was that I could go back and say, okay, this one had the best loss function, I'm gonna use that. But nowadays we don't have to do that, because I add a couple lines of code at the top of my notebook and now I have all these model versions, and at the end I can go and sort them, sort by what had the best loss function, what had the best false positive rate, and choose the model that was actually the best. It's also looking at CI/CD, so that everyone is deploying the same way, so that we don't have these error-prone processes where people are coming up with their own, non-standardized tests to get into production and edge cases are getting through. And it's monitoring, and it's feedback loops.
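The "couple lines of code at the top of my notebook" is what experiment-tracking tools like MLflow or Weights & Biases provide. A toy, dependency-free sketch of the idea, logging each run's parameters and metrics and then sorting to pick the best model version; the class and parameter names here are made up for illustration:

```python
import time
import uuid

class ExperimentTracker:
    """Toy stand-in for an experiment tracker: record every fit,
    then rank runs by a metric instead of a paper notebook."""

    def __init__(self):
        self.runs = []

    def log_run(self, params: dict, metrics: dict) -> str:
        run_id = uuid.uuid4().hex[:8]
        self.runs.append({"run_id": run_id, "time": time.time(),
                          "params": params, "metrics": metrics})
        return run_id

    def best(self, metric: str, minimize: bool = True) -> dict:
        # Sort all logged runs by the chosen metric and return the winner.
        return sorted(self.runs,
                      key=lambda r: r["metrics"][metric],
                      reverse=not minimize)[0]

tracker = ExperimentTracker()
tracker.log_run({"lag_months": 3}, {"rmse": 0.42})
tracker.log_run({"lag_months": 6}, {"rmse": 0.31})
tracker.log_run({"lag_months": 9}, {"rmse": 0.38})
print(tracker.best("rmse")["params"])  # {'lag_months': 6}
```

The point of the pattern is that the winning configuration is recoverable after the 30th fit, which is exactly what a handwritten notebook can't guarantee.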
And even a little bit of explainability and governance, so that our stakeholders trust our results, and so that when regulators come along, everything's auditable. So that's your MLOps. And then for LLMOps, there's an additional layer added, where now we're not just versioning the data that the model is built on, 'cause often we're not even training the model, right? We might be using a foundation model. But we still have to track our prompt templates and the retrieved context that we get, because if we have a problem with our output, maybe it's that our retriever isn't retrieving the right information, or maybe it's that after we've retrieved that information and sent it up to the model, we're not getting the right results back. We need to be able to diagnose and do root cause analysis on exactly where things are breaking down. We also have few-shot examples in there, we might have fine-tuning datasets, we have user interaction logs, and we want to be able to track all of that. And that's where we have our MLOps, and then LLMOps is the extension that takes care of all the additional pieces that weren't part of traditional ML but that we now have as part of a RAG system.
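The extra artifacts named here (prompt template version, retrieved context, few-shot examples, interaction logs) can be snapshotted per interaction so a bad output can be root-caused later. A hypothetical sketch; the field names and the fingerprinting scheme are illustrative assumptions, not a standard:

```python
import hashlib
import json
from dataclasses import dataclass, asdict

@dataclass
class LLMInteractionRecord:
    """Hypothetical audit record: snapshot every artifact that can
    change model behavior for one customer interaction."""
    prompt_template_version: str
    retrieved_context: list
    few_shot_example_ids: list
    model_name: str
    user_input: str
    model_output: str

    def fingerprint(self) -> str:
        # Stable hash over all fields: identical interactions share a
        # fingerprint; changing any artifact produces a new one.
        payload = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(payload.encode()).hexdigest()[:12]

rec = LLMInteractionRecord(
    prompt_template_version="support-v3",
    retrieved_context=["doc-17 section 2", "doc-04 section 1"],
    few_shot_example_ids=["ex-9"],
    model_name="some-foundation-model",
    user_input="How do I reset my password?",
    model_output="You can reset it from the account settings page.",
)
print(rec.fingerprint())
```

With records like this stored per interaction, "was it the retriever or the prompt?" becomes a query over logs rather than guesswork.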
Andreas Welsch: So I lost count of all the technical terms. You did mention a lot of...
Kristen Kehrer: Pieces? It's a lot of pieces.
Andreas Welsch: False positives. AUC, area under coverage...
Kristen Kehrer: Area under...
Andreas Welsch: ...the curve. There are a few others there. That sounds awfully complex. I thought it was easy. I thought that agents are taking over the world, putting all of us out of a job, and it's happening tomorrow. This sounds like real work, and there's some plumbing that you need to get right.
Kristen Kehrer: Yeah. That's always been my thing when people said that AI is coming for jobs. I always wondered if it was people who understood the complexity who said that, or if it was somebody else, maybe leaders who aren't actually in the weeds. But somebody does need to know what's going on and be able to work with these systems and know them intimately.
Andreas Welsch: So if you're listening or watching, take note of that. It's gold right there, right? We obviously need to understand what's happening underneath the surface, at least to some degree. And I think then many of these claims, many of these systems, don't seem as magical and mystical anymore when we, at the end of the day, bring it down to crunching numbers and statistics and probabilities and so on. Now, obviously, LLMs have been around for a while. Companies are experimenting with them. I even hear many of them are doing that with agents; they're just not talking about it so much publicly because they're afraid of backlash. But be that as it may, how do you see LLMs changing the requirements for managing that lifecycle? When you bring an LLM-based application into production, what do you need to be aware of? You already mentioned a few things as we were starting out, but what are some of the top things to keep in mind as you bring LLM-based agents or applications into your environment?
Kristen Kehrer: Yeah. I think, as I already said, it's the collaboration piece. But to go into it more: if you're thinking of doing an AI project, you need to understand upfront that the demo is only going to take a little bit, and that there is going to be time and iteration between the demo and actually getting to production, and you need to budget for that. You also need to set up your team correctly in the beginning, bring in the SMEs, bring in the product stakeholders, get everybody working together, and have a clear understanding of what you want the output to look like. It's back to when we used to talk about the customer, 'cause I feel like we lost sight of that for a while and things became about cutting headcount or other stuff. We need to be thinking about the customer. We need to think about exactly what we want that output to look like, and we need to be clear about that from the beginning, because otherwise there's going to be work upstream to course-correct when you've got a system that isn't working well. Because I have seen a company reduce their headcount and launch an LLM that works poorly. And then the customer reps who survived are inundated with calls, and the customers are frustrated, because monitoring wasn't in place. They were actually evaluating monitoring platforms at the time, but there was nothing in place, right? So it became this: actually, I went through myself and looked at 300 chats, and I used LLM-as-judge and compared it to what I was doing, and LLM-as-judge did work pretty well. But you've got this system online, you've gotta take it down, you've gotta do some manual analysis to understand exactly how bad it is, how we fix it, where the problem is. And a lot of times what I've seen is that something gets put into production without a knowledge base that's big enough, because people get really excited, and instead of really building out the data the way they want, the data seems to fall through the cracks, right? We need to be able to cover a hundred use cases, we need documents that are gonna cover for this, and they're like, we'll do 50, and we'll let our old solution handle the other 50, because we wanna get this out there. And it all just seems to not be optimal.
Andreas Welsch: Let me ask you a question. You mentioned LLM-as-judge. Can you talk a little bit more about what that is and why we need it?
Kristen Kehrer: Yeah. So LLM-as-judge, at least in the way that I was using it, was: we have all of these customer chats saved, right? So I could go through and see, the LLM said, "How are you today? What is your question?" "My question is this." And then it says, "Actually, we're not set up to handle that, we're gonna do this," or whatever. And as the conversation goes on, you can see somebody getting frustrated. But instead of a human going through, you can just feed that to an LLM and ask, where does it look like customers are getting frustrated? What type of topics are customers getting frustrated about? And it would give me back four bullet points: it seems like these topics were handled well; for these topics, it looked like people were using very short sentences; and other signs that people are frustrated, and it would list them, but I don't remember them off the top of my head, things like people used these four-letter words. And so you're able to get output from an LLM judging the output of your original LLM, and it is pretty good. Yeah.
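A minimal sketch of the LLM-as-judge pattern described above: wrap a saved chat transcript in a grading prompt and send it to a second model instead of having a human read every chat. No real model API is called here; `llm_client` in the comment is a hypothetical placeholder:

```python
def build_judge_prompt(transcript: str) -> str:
    """Construct a grading prompt for a judge model that reviews
    a conversation handled by the production assistant."""
    return (
        "You are reviewing a customer-support chat handled by an AI "
        "assistant. Answer in bullet points:\n"
        "- Did the customer appear frustrated? Cite the evidence.\n"
        "- Which topics were handled well, and which were not?\n"
        "- Rate the overall interaction from 1 (poor) to 5 (great).\n\n"
        f"Transcript:\n{transcript}"
    )

chat = ("Assistant: How can I help?\n"
        "Customer: Third time asking. Where is my refund??")
prompt = build_judge_prompt(chat)
# The prompt would then be sent to a judge model, for example:
#   verdict = llm_client.complete(prompt)   # hypothetical client
print("Transcript:" in prompt)  # True
```

Running this over hundreds of saved chats, as described in the episode, turns a manual review into a batch job whose bullet-point verdicts can themselves be aggregated and monitored.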
Andreas Welsch: So why do we need another LLM to judge what the first one did? Wouldn't you expect that the first one is good enough to know what you wanted to do and do it well?
Kristen Kehrer: No. So with traditional ML, we're able to say, okay, this person was denied a loan. And we also understand, if we're using explainability, what type of factors are going to make it so that this person gets denied for a loan, and those are metrics, those are numbers. But now, when you have text, you need to understand it, and there are metrics for LLMs, but they don't get at the brand tone. They don't get at those nuances of, actually, this person asked a question that the LLM probably should have understood but didn't, or this was not the proper context for what this LLM is supposed to be doing. An LLM can look at that, and it's going to be a lot faster than a human having to look at it. In the same sense, we don't have the numeric metrics that we're able to use in ML that can say this decision was right or this decision was wrong. And so it is useful for us there.
Andreas Welsch: I see. Okay, makes a lot of sense. So you have an evaluation, basically, to say, how well did it create this? How well does it align to brand voice, tone, and other requirements? Alright. Obviously, companies have data. They've got lots of data. I think it was 80 to 90% of data that is actually unstructured and lives in documents. But what do you do as a leader if you're being asked to do some AI and you find out your data isn't as clean, isn't as complete, isn't as fresh as you need it to be? Can you still work with an LLM? Can you still build a RAG system? What's your advice to leaders who are facing that situation?
Kristen Kehrer: Yeah. I think it depends on the use case and what you're trying to do, right? But in the higher-stakes scenarios, where the output maybe needs to create something that's going to go out to customers, you need to make sure that your data is clean. I've seen this when I was building an LLM application: the company had been creating these types of communications for so long, they had so many cases, but not all of them were general enough to go into a RAG system so that it would work for creating these communications at scale for other companies or whatever. And what they chose to do was actually have subject matter experts go through each of the example cases and rewrite them in a way they knew they'd feel confident about feeding to the system. Because if you're going to put multiple documents for the same type of thing in your knowledge base, and they're not exactly what you want, if they have additional information that you don't want coming up, you could put in your prompt template that you don't want that extra thing, but now your prompt template's gonna get huge. And so you really do need clean data. It is a different consideration than we've had in the past, where you were getting your data from a database; now it's docs, and those docs do need to be updated. And I think that is part of the bottleneck, why we see these systems, and I've seen systems go live with 18 documents in them as their live POC, but then it takes forever. The cleaning of the data seems to be this afterthought, where, again, leaders need to structure the team so that the subject matter experts and the product people who are working with the data scientists have time carved out in their schedules to make sure that the data that's going to go into the knowledge base actually gets done.
And that is a priority, because I do see it as something that has been getting left off. In these systems, the data scientists are off optimizing on this small knowledge base; later on, more data's gonna come, and how is that actually going to impact the system? How is the system going to work on that? Or now we've overfit for this smaller subset of documents. At the same time, maybe you can feed multiple documents to an LLM, prompt it, and get something that's really close to exactly what you want to be putting in your knowledge base. But the data does need to be cleaned, and it's no longer really the data scientist's job, whereas we largely spent so much time data cleaning before. So you still need to spend a ton of time data cleaning, but it's not in our domain the same way it was before.
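The "we'll do 50 of the 100 use cases" gap described above can be made visible with a simple coverage check before go-live. A sketch under the assumption that knowledge-base documents are labeled with the intents they cover; the names and labels here are hypothetical:

```python
def coverage_gaps(required_intents: set, kb_docs: dict) -> set:
    """Return the required intents that no knowledge-base document covers.
    kb_docs maps a document name to the set of intents it covers
    (assumed to be labeled by subject matter experts)."""
    covered = set().union(*kb_docs.values()) if kb_docs else set()
    return required_intents - covered

intents = {"refund", "password_reset", "shipping", "cancellation"}
kb = {
    "billing-faq.md": {"refund"},
    "account-guide.md": {"password_reset"},
}
print(sorted(coverage_gaps(intents, kb)))  # ['cancellation', 'shipping']
```

A report like this turns "the knowledge base isn't big enough" from a post-launch surprise into a pre-launch checklist item that SMEs and product owners can prioritize against.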
Andreas Welsch: That makes sense. Now, before we get to the end of the show, I was wondering if you could summarize the three key takeaways for our audience today. We've covered a lot of ground.
Kristen Kehrer: Yeah. So, MLOps is a necessity. LLMOps is an extension of MLOps, and there are ways in which LLMOps is still emerging and will probably be even more useful to us in a couple of years. One thing I wanna call out is that, with experiment tracking and other areas, sometimes you can add just a couple lines of code and get a lot of tracking out of the box, but then there's still work to do to log other artifacts and things that don't come standard. And LLMOps is the same: it can sometimes feel like you get a lot out of the box, and you're like, wow, this is so easy, but there is work to get the rest of it built out. And then the third piece is really that this is so much more collaborative than it was with machine learning, and you need to budget for that upfront.
Andreas Welsch: I love it, especially the part about collaboration, because from my experience, too, that's where the magic happens: when you bring business and technology together, each one learns a little bit from the other side, everyone gets a little smarter, and typically the outcome tends to be a lot better than if it's only one side that thinks they've absolutely got all of it figured out. Alright, Kristen, it was a pleasure having you on the show. Thank you so much for sharing your expertise with us. And for those of you in the audience, if you're not following Kristen yet, do give her a follow. She's awesome.
Kristen Kehrer: Awesome. Thank you so much.
Andreas Welsch: Alright, thanks. See you next time for another episode of What's the BUZZ? Bye-bye.
Kristen Kehrer: Bye.