What’s the BUZZ? — AI in Business

Fine-Tune & Operate Your Generative AI Models (Guest: Abi Aryan)

July 10, 2023 Andreas Welsch Season 2 Episode 12

In this episode, Abi Aryan (Machine Learning & LLMOps Expert) and Andreas Welsch discuss fine-tuning & operating your generative AI models. Abi Aryan shares her insights on using large language models (LLMs) and provides valuable advice for listeners looking to incorporate this new technology in their applications.

Key topics:
- Get more specific results from a large language model (LLM)
- Improve model performance over time
- When to pursue fine-tuning vs. prompting

Listen to the full episode to hear how you can:
- Evaluate the pros and cons of fine-tuning vs. prompting
- Consider the operational cost of using generative AI
- Tailor model evaluation based on the use case

Watch this episode on YouTube:
https://youtu.be/8km8_fK-enY


***********
Disclaimer: Views are the participants’ own and do not represent those of any participant’s past, present, or future employers. Participation in this event is independent of any potential business relationship (past, present, or future) between the participants or between their employers.


More details:
https://www.intelligence-briefing.com
All episodes:
https://www.intelligence-briefing.com/podcast
Get a weekly thought-provoking post in your inbox:
https://www.intelligence-briefing.com/newsletter

Andreas Welsch:

Today, we'll talk about fine-tuning and operating generative AI models. And who better to talk about it than somebody who actually does that for a living: Abi Aryan. Hey, Abi. Thank you so much for joining.

Abi Aryan:

Thank you so much for inviting me.

Andreas Welsch:

Hey, why don't you tell us a little bit about yourself, who you are and what you do?

Abi Aryan:

So I am a machine learning engineer who's currently working on building orchestration tools for goal-based agents. I also do some consulting on the side as an ML engineer for very conventional MLOps kind of work, which I've been doing for a few years now. Outside of that, right now I'm working on a book, and you'll see me pop up in a lot of venture capital meetings every now and then, but only if it's related to large language models.

Andreas Welsch:

That's awesome. Thank you so much for sharing that with us. I think it's the most technical session and episode that we've had, so I'm really excited to have you on. Like I said, you come highly recommended from the data community and leaders in the community, so I'm really excited that this worked out today. Thank you. And for those of you in the audience, if you're just joining the stream, drop a comment in the chat where you're joining us from today. I'm really curious how global our audience is again. Abi, should we play a little game to kick things off? What do you say?

Abi Aryan:

Yeah, that's good.

Andreas Welsch:

Okay, perfect. So this game is called In Your Own Words, and when I hit the buzzer, the wheels will start spinning. And when they stop, you'll see a sentence. I would like you to complete that sentence with the first thing that comes to mind, in your own words. And to make it a little more interesting, you'll have 60 seconds to do that: in your own words, and why. Are you ready to play "What's the BUZZ?"

Abi Aryan:

Yeah. I'm sorry to disappoint if I don't do it in 60 seconds.

Andreas Welsch:

You can go over too. Alright, let's get started then. If AI were a book, what would it be? 60 seconds on the clock. Go!

Abi Aryan:

Harry Potter.

Andreas Welsch:

And why? Why Harry Potter?

Abi Aryan:

Oh, because there are so many magical creatures and so many magical spells that are coming out regularly; for each of the different animals, we have a different spell. And in Harry Potter, the magical people are popular and somehow considered better than muggles, or, yes, I don't know what the ideal word is, but there are normal people and there are magicians. And I think in the same way, there are normal people in the industry and AI people right now.

Andreas Welsch:

I like that analogy. That's awesome. And it seems like there's always a sequel, right? There's always a next one coming. So as we're going through the hype cycles and the motions, there's always the next one coming, too. So I love that idea of Harry Potter. Maybe going back to something more tangible on the topic of our episode. Obviously, generative AI has been a hot topic for a number of months now, this year definitely, but already since last year with Midjourney and DALL-E and all these remarkable inventions, and the progress that we've made as an industry. And now you can get large language models like GPT off the shelf from the likes of OpenAI or Anthropic and others, and you can integrate them into your applications. And everybody's trying to figure that out, and their leaders are asking them to go figure out what we can do with this ChatGPT, generative AI type thing. But I'm wondering: what do you actually do when you need more specific results than what these large, generic models can give you, models that have been trained on vast bodies of text scraped off the internet? How can you tailor them to your specific industry or domain?

Abi Aryan:

I think there's something we should ask even before that, and it completely depends on who the audience is that we're talking to here. The first question would be: do you really need these large language models? Because they can become pretty expensive. Plus, you have reduced interpretability: not just because the models are black boxes, of course, but because there are so many more questions raised by the generative nature of these models, which is what brings up the word interpretability in the first place. That's the first question I would ask: why do you think you need a generative model in the first place? It's not ideal for every use case. It's not ideal for every single industry. Do you have the infrastructure, and have you already thought about what the downsides of it are? For example, there are some industries that do require more compliance. And the biggest problem with generative models right now is: how do we really evaluate them? If you don't have an answer to how you plan to deal with the cons of generativity, then let's not even talk about fine-tuning. Let's not even talk about language models, maybe. I know it's my bread and butter; I'm writing a book on the topic. But I would really, highly suggest not to use a generative model just for the heck of it.

Andreas Welsch:

I think that's a great suggestion. First of all, figure out if that is something you need to begin with, or are you just giving in to the hype and jumping on the bandwagon because you feel everybody's doing it, or your competitors are doing it, and you should be doing it, too. I really like that suggestion: stop for a minute and think, do we really need that?

Abi Aryan:

The way I would go about it is I would ask two questions. The first question is: why do you need it? The second question is: have you done a cost assessment of these models? And not just what the cost of inferencing these models is, but what the other costs associated with these models are. There are a few things to think about. The first thing is whether you do fine-tuning versus prompting. Let's say you choose to do prompting; then who in the team is actually doing the prompting in the first place? Because we indirectly create so much dependency on that person. Will it be a data scientist who's doing the prompting, versus annotators back in the day? Is that the best use of a data scientist's time, to be doing prompting? Now, every single time you have changes in the API itself, or any time you're updating the model, a lot of the work that you've done, or a lot of the infrastructure that you've built around the prompt itself (for example chain-of-thought prompting, which is one of the techniques of prompting), might or might not be useful anymore and needs to be rewritten. Is that worth the work as well? And all of those things come with a cost, because again, you're paying people; everybody is on a payroll for all those months while you're figuring it out. So that's an important question to think of. The second thing to think of would be that there's a cost for evaluating these models right now, because almost everything needs to have a human in the loop, depending on the use case. If you're doing copywriting, then I don't think we have that much need for having humans in the loop, other than having some baselines that are taken care of. The first baseline would be toxicity reduction. The second baseline would be that we do some sort of sentiment analysis on it, so the answers aren't really very strongly political or ideology-based; but they also can be, depending on who you're serving. For example, if the people that are using your platform are mainly journalists, then it's fine. But then the model requires constant fine-tuning and constant monitoring as well. So all of these things would pile up into cost eventually down the line. Fine-tuning has a different cost compared to the cost of inferencing or prompting the model itself; it's almost four times the cost. And unless you have a way to evaluate a model, you don't really know if fine-tuning is working or not. So yes, you're spending that much money without really knowing what the output of it is. And this is a question that we kept asking in the ML industry: let's tie the models back to the business KPIs. So I think the same question of tying the models back to the business KPIs comes up again, and justifying that cost, before we eventually go into fine-tuning, before we actually establish that there's some benefit out of doing domain-specific fine-tuning.
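
As a rough sketch of the kind of cost assessment Abi describes, the following back-of-the-envelope calculation compares ongoing prompting cost against a one-off fine-tuning cost plus (typically pricier) fine-tuned inference. All prices, token counts, and workload numbers are illustrative assumptions, not figures from the episode or from any provider, and the people cost she emphasizes (prompt engineers, annotators, human-in-the-loop evaluation) is not captured here.

```python
# Back-of-the-envelope cost comparison: prompting a hosted model vs. fine-tuning.
# All numbers below are illustrative assumptions for the sketch, not real prices.

def prompting_cost(requests_per_month, tokens_per_request, price_per_1k_tokens):
    """Monthly cost of calling a hosted model with (possibly long) prompts."""
    return requests_per_month * tokens_per_request / 1000 * price_per_1k_tokens

def fine_tuning_cost(training_tokens, train_price_per_1k,
                     requests_per_month, tokens_per_request,
                     inference_price_per_1k, months):
    """One-off training cost plus recurring fine-tuned inference cost."""
    one_off = training_tokens / 1000 * train_price_per_1k
    recurring = requests_per_month * tokens_per_request / 1000 * inference_price_per_1k
    return one_off + months * recurring

if __name__ == "__main__":
    # Hypothetical workload: 100k requests/month, ~1,500 tokens per prompted request.
    prompt_monthly = prompting_cost(100_000, 1_500, price_per_1k_tokens=0.002)
    ft_total = fine_tuning_cost(
        training_tokens=5_000_000, train_price_per_1k=0.008,
        requests_per_month=100_000, tokens_per_request=500,  # shorter prompts after tuning
        inference_price_per_1k=0.012, months=12)
    print(f"Prompting, 12 months:   ${prompt_monthly * 12:,.0f}")
    print(f"Fine-tuning, 12 months: ${ft_total:,.0f}")
```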

Andreas Welsch:

I think that's a great point that you're making, especially about the cost. Everybody's so excited about the possibilities, and we want to try this out, we need to try it out, and might overlook the operational cost in the long run, specifically around fine-tuning. So I think it's important that you highlight that. And I really like how you've described it, that we need to think more about that as we're venturing into generative AI. So there's more to it than just calling the model.

Abi Aryan:

Yeah. And knowing whether we are doing an experiment, where there's some sort of expectation attached to it, versus thinking these models are going to change the world. Or thinking we should build a large language model just because everyone else is, and somehow model creators would have some sort of moat that others wouldn't. That's one of the common themes that I've heard from a lot of people: if we are using the OpenAI models versus creating our own models, somehow we're not as sophisticated unless we are building a model ourselves.

Andreas Welsch:

Yeah. So what I'm hearing then is maybe not everybody needs to build their own model or needs to do the fine-tuning. But when should you even think about fine-tuning an LLM, and what are the prerequisites that you see?

Abi Aryan:

So I think there are two levels of understanding in this. The first thing is: what is your infrastructure like, or who is the end user? There are use cases where you can use a proprietary large language model out of the box. For example, you can use GPT-3 for the chatbot on your website if it's just a chatbot that needs some sort of interaction, versus if we are really trying to give it data that can make it domain-specific. For those, we have chaining tools that allow us to feed in data from our databases for some specific scenarios, and also to cache some results, in whatever database you intend on saving them in, to be able to pull up those questions, for example FAQ question answering. If you're an e-commerce company, you're not interested in giving answers on who's the president of the country right now. It's just not relevant. So it's important to have that one distinction: do you think you need an out-of-the-box model, or do you want to give it more data internally? Now, if you want to give it more data, then what's your use case? If the use case is just question answering, then you probably don't need tons of fine-tuning if you know what kinds of questions you get almost all the time. A lot of people are now exploring using large language models for recommendation systems; even for those, you don't need extensive fine-tuning. Let's take copywriting: let's say there's a service that allows you to write college essays. That's a very particular use case. That can be trained on a lot of data, and that can require extensive fine-tuning, based on how accurate you want the results to be.
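
A minimal sketch of the pattern Abi outlines for domain-specific question answering: serve curated FAQ data and cached answers first, and only fall back to a model call when needed. The `call_llm` function and the FAQ entries are hypothetical placeholders, and the fuzzy matching here is a deliberately simple stand-in for whatever retrieval or chaining tooling you actually use.

```python
# Answer FAQs from your own data first, cache what you can, and only fall back
# to a (hypothetical) LLM call when nothing local matches.
from difflib import get_close_matches

FAQ = {
    "what is your return policy": "You can return items within 30 days.",
    "how long does shipping take": "Standard shipping takes 3-5 business days.",
}
cache: dict[str, str] = {}  # previously generated answers, keyed by normalized question

def call_llm(prompt: str) -> str:
    """Placeholder for whichever hosted or self-hosted model you use."""
    raise NotImplementedError("wire up your model provider here")

def answer(question: str) -> str:
    q = question.lower().strip().rstrip("?")
    if q in cache:                                   # cheapest path: cached answer
        return cache[q]
    match = get_close_matches(q, FAQ.keys(), n=1, cutoff=0.8)
    if match:                                        # next cheapest: curated FAQ data
        return FAQ[match[0]]
    reply = call_llm(f"Answer this customer question briefly: {question}")
    cache[q] = reply                                 # pay for the model call only once
    return reply
```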

Andreas Welsch:

Awesome. So definitely weigh your options and understand what it is that you're trying to solve for, whether it's more of a generic answer or something more specific to your domain or to your use case. I'm taking a quick look at the chat and I see we indeed do have a very global audience, from Connecticut in the U.S. to Israel, London, L.A., Tamil Nadu, and other folks from India. Thank you for joining us. You mentioned fine-tuning: certainly, for certain use cases, it makes sense. But we also know it's very resource intensive; it's very cash heavy. If you want to do that, remember that for probabilistic models you would collect data, you would improve the model during the next training, and then you push it out again. How do you do that for foundation models and for fine-tuned models? How does it change, or what of it maybe even stays the same?

Abi Aryan:

I would say there's one key difference between generative models and discriminative models. With discriminative models, most of the time, if you were not doing some sort of fine-tuning, you weren't really getting any specific results. A lot of effort was spent on tuning the hyperparameters, to the extent that there was almost a level of craziness around hyperparameter tuning. But it depends; most of the value add wasn't really coming from fine-tuning the parameters. Sure, you could get a small bump in accuracy, if that's generally what you cared about, but that wasn't what made the big difference. The big difference was: what were the features that we were looking at? How do we do feature engineering, right? If you've done data collection, you've identified the key features, and you understand the correlation as well as causation between the different variables in the model, then that is where the biggest value is. Once you were past that initial limit, then it was more like, okay, let's think about fine-tuning our models and improving the accuracy of the model. I think fine-tuning was almost always just about optimizing for that 0.1% gain, and that wasn't the most important thing. Sure, from a research perspective, for all of us who grew up on Kaggle in a way, we enjoyed that as a practice, but when it came to real machine learning projects, at least in the industry, that's not where a lot of effort was really invested.
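
A toy illustration of that point about discriminative models, using synthetic data: one engineered feature (the interaction the label actually depends on) moves accuracy far more than a hyperparameter grid search on the raw features. The dataset and the numbers are made up purely for the illustration.

```python
# Feature engineering vs. hyperparameter tuning on a synthetic, interaction-driven label.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 2))
y = (X[:, 0] * X[:, 1] > 0).astype(int)             # label depends on an interaction

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Option A: hyperparameter tuning on the raw features (barely better than chance here).
grid = GridSearchCV(LogisticRegression(), {"C": [0.01, 0.1, 1, 10, 100]}, cv=5)
grid.fit(X_tr, y_tr)
print("tuned, raw features:       ", accuracy_score(y_te, grid.predict(X_te)))

# Option B: default model plus one engineered feature (the interaction term).
X_tr_fe = np.hstack([X_tr, (X_tr[:, 0] * X_tr[:, 1]).reshape(-1, 1)])
X_te_fe = np.hstack([X_te, (X_te[:, 0] * X_te[:, 1]).reshape(-1, 1)])
clf = LogisticRegression().fit(X_tr_fe, y_tr)
print("default, engineered feature:", accuracy_score(y_te, clf.predict(X_te_fe)))
```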

Andreas Welsch:

Maybe on that note, I'm taking a look at the chat, and I see one of the questions here is: what are the models that can be used in insurance? And maybe there are other domain-specific models, or maybe your answer addresses different use cases, different domains, different industries. Is there anything available that you know of that's public, or how can you maybe customize a model based on your industry?

Abi Aryan:

So I think there are two kinds of generative models right now. One are the OpenAI models, which you can call general models in a way, because despite the fact that they're trained on a very specific kind of unstructured data and there's a lot of data that they haven't seen, they're constantly improving. And the second are large language models that have been trained on very different data sets. For example, BloombergGPT comes to mind, which has been trained specifically on financial data. I've not really seen something which is specifically used in insurance itself. But depending on who the creator of the model is and how much data they had access to, a financial model, which is a model trained on financial data, is slightly better in performance compared to a generalized model.

Andreas Welsch:

One of the other questions here is: Are there any resources available that lay out the evaluation of the different models that you've described?

Abi Aryan:

To evaluate the discriminative models, we had tons of metrics available to us, and very obvious ones: you are looking at the F1 scores, you're doing some sort of cross-validation, you're looking at the R value, you're looking at the precision, you're looking at the recall. With those, it was very easy. With the generative models, the evaluation is something that is specific to the use case as well. So as of now, if you're looking at the evaluation frameworks out there, there are two kinds of evaluation frameworks that are available to almost everybody on the internet. The first are evaluation frameworks like Stanford HELM, which in a way breaks down the parameters that you should be looking at, but it's very generalized as well, and they've compared the performance of one model versus the other. The second thing is there are also some leaderboards that are a little bit more domain-specific. For question answering, there's this company called AI21 Labs that has created their own leaderboards as well: for example, which model performs better on question answering. But again, a word of caution I'll give: all the evaluation frameworks that are provided there are best used to establish some sort of baseline only, that is, whether the model is good enough to perform well on question answering; it doesn't really mean that it will answer well in your specific domain as well. There needs to be more work in terms of domain-specific evaluation done on top of the question answering as well. So as of now, evaluation is more like an open question, and a lot of people are looking at how to evaluate your models. There are three things that you're technically looking at: one is, what is the accuracy of the model? The second is, what is the inference speed of the model? And third is, what is the latency of the model? These are things that you can measure, which I would, quote unquote, put under performance; these are the performance metrics. But outside of performance metrics, there are so many things that need to be taken care of, because the next question comes: how do we define the accuracy? What is a good enough inference speed? So on each of these, on accuracy, on inference speed, on latency, you're establishing some sort of baseline: is it acceptable enough for us? If we're working with a chatbot and our model gives us an answer in, let's say, 30 seconds or one minute, that's a very long waiting time when you're interacting directly with the customer, versus if we are using a large language model to do knowledge retrieval on a bunch of internal documents, where we can wait. If we are able to do natural language prompting on a lot of Excel sheets or a lot of PDFs that we have internally within the organization, the team internally wouldn't mind waiting that much. It depends on who the end user would be and what would be an acceptable level for each of those things. While lower latency is almost always preferred, it doesn't mean that it's always the most important thing to optimize for. Different use cases and different users can have very different baselines on each of these things.
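
A small sketch of how the performance baselines Abi lists (latency and inference speed) might be measured against a use-case-specific budget. `call_model` is a hypothetical placeholder for whatever model or API you use, and the tokens-per-second figure is a crude word-count proxy rather than a real tokenizer count.

```python
# Measure latency and a rough inference-speed proxy against a per-use-case budget.
import time
import statistics

def call_model(prompt: str) -> str:
    """Placeholder: plug in your chatbot or document-retrieval model here."""
    raise NotImplementedError

def measure(prompts, budget_seconds):
    latencies, tokens_per_s = [], []
    for p in prompts:
        start = time.perf_counter()
        reply = call_model(p)
        elapsed = time.perf_counter() - start
        latencies.append(elapsed)
        tokens_per_s.append(len(reply.split()) / elapsed)  # crude proxy for tokens/s
    p95 = statistics.quantiles(latencies, n=20)[18]         # 95th-percentile latency
    return {"p95_latency_s": p95,
            "median_tokens_per_s": statistics.median(tokens_per_s),
            "within_budget": p95 <= budget_seconds}

# A customer-facing chatbot might use budget_seconds=2, while an internal search
# over PDFs and spreadsheets might tolerate 30 or more, as discussed above.
```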

Andreas Welsch:

I think that again puts it in a very good context, especially if you're familiar with the metrics that you mentioned, like F1, precision, recall, and everything that we've done before generative AI was the big thing. And now what I heard you say is: make it use case specific. If it's more end-user focused or more customer-facing, then there might be different expectations regarding latency than if it's an internal scenario.

Abi Aryan:

Maybe I'll say this as well: as of now, there are two kinds of benchmarks that are available to you. The first are the closed benchmarks, and as we go further there will maybe be more closed benchmarks as well, where you're able to feed your model into the evaluation framework to evaluate it. And there are open source frameworks as well. What I really mean is, on the proprietary or closed side there is something called OpenAI Evals, where you're giving out your model and checking how it performs. And the second are more publicly available benchmarks, which would be things like HELM or Eval Harness, which is by EleutherAI, which are completely open source, and it depends on how you are actually able to use them. That second kind comes from communities and from researchers, where you don't have any managed architecture to be able to see everything. It's almost like they've given you an evaluation framework; it may or may not be maintained, it may or may not be adapted to your particular use case, and the performance might not be that good, but it gives you a way to think in that category. I would say Stanford HELM or Eval Harness do fall into that category. But also, more people are looking at different methods as well. For example, one of the things that a lot of people are now looking at is some sort of scoring-based methods, to see if the models are performing well or not. But again, there's not a lot of framework, there aren't a lot of providers, and there's not much open source work on that particular thing. Yes, you have some things which are available, but it's almost like they may or may not work; still, they'll give you a way to think about the problem.
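
One possible reading of the scoring-based methods Abi mentions is a simple rubric-scoring loop, where a judge (a human rater or another model prompted with a rubric) scores each answer against a reference. This is only a sketch of that idea under that assumption; the `judge` and `generate` functions and the example data are hypothetical.

```python
# Rubric-scoring sketch: a judge scores each generated answer 1-5 against a reference.
def judge(question: str, reference: str, answer: str) -> int:
    """Return a 1-5 score. Could be a human rater or an LLM prompted with a rubric."""
    raise NotImplementedError

def score_model(eval_set, generate):
    """Average rubric score over an evaluation set of (question, reference) pairs."""
    scores = [judge(q, ref, generate(q)) for q, ref in eval_set]
    return sum(scores) / len(scores)

# eval_set = [("What is our refund window?", "30 days"), ...]
# baseline = score_model(eval_set, generate=my_model_fn)
```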

Andreas Welsch:

I see. So it sounds like on one hand it's use case specific, but also depending on the frameworks that are available or which ones you choose. It might be a bit early still to get to something that's standardized.

Abi Aryan:

For Elo-based benchmarks, it's very early. What I've really seen is more like benchmarks around toxicity detection, around sentiment analysis, around one of those generalized use cases. But I think the Elo-based benchmarks are the only ones that are very domain-specific. The discipline around that is not really developed yet, though. I think it will take about six months to a year for there to be an established framework around Elo-based methods.
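
For context on the Elo-based benchmarks mentioned here: the underlying mechanism is the standard Elo update applied to pairwise comparisons, e.g. which of two models' answers a rater preferred. A minimal version of that update looks like this; the starting ratings and K-factor are conventional defaults, not values from the episode.

```python
# Minimal Elo update for pairwise model comparisons.
def expected(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1 / (1 + 10 ** ((r_b - r_a) / 400))

def update(r_a: float, r_b: float, score_a: float, k: float = 32) -> tuple[float, float]:
    """score_a is 1 if A's answer was preferred, 0 if B's was, 0.5 for a tie."""
    e_a = expected(r_a, r_b)
    return r_a + k * (score_a - e_a), r_b + k * ((1 - score_a) - (1 - e_a))

# Example: two models start at 1000; model A's answer wins one comparison.
r_a, r_b = update(1000, 1000, score_a=1)   # -> (1016.0, 984.0)
```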

Andreas Welsch:

I see. So we're getting close to the end of the show. Thank you so much for sharing all those insights on where you see we are with large language models, how you can adapt them and fine-tune them, and what changes when you want to operate them. I was wondering if you can maybe summarize the three key takeaways for our audience today before we wrap up.

Abi Aryan:

I would say the first is: think about whether you really need to use a large language model. The second thing to think about is whether you want to do fine-tuning versus prompting, whichever is the better fit for your use case. Whenever you can have a chain-of-thought way of thinking about a problem, it's much better to think about prompting, as compared to when you're looking at your data and you realize the function that you're trying to optimize for doesn't really have those very obvious connections. It's a very complex function if you try to write the function that you want to optimize for in terms of the inputs and outputs. If it's a simple function, do prompting. If it's a little bit more complex, where there's more nuance within the variables themselves, then I think fine-tuning is going to be slightly better in terms of performance. Once you've done that, think about how you will evaluate your models, who the end user for the model is, and what the risks are that come with any sort of evaluation. So, for example, any time that you're evaluating using OpenAI Evals, you're exposing your data entirely to OpenAI, and they can use that to train their models. So you're, in a way, leaking out your data. And this was brought to attention by Percy Liang from Stanford, who said this is not ideal.
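
As a small illustration of the chain-of-thought style of prompting Abi refers to, the prompt below asks the model to reason in explicit steps before giving a final answer. The wording and the example question are hypothetical.

```python
# Illustrative chain-of-thought prompt: ask for explicit reasoning steps, then an answer.
COT_PROMPT = """You are a support analyst.
Question: {question}

Think through the problem step by step:
1. Restate what the customer is asking.
2. List the relevant policy facts.
3. Derive the answer from those facts.

Then give the final answer on a line starting with 'Answer:'."""

prompt = COT_PROMPT.format(question="Can I return a laptop I bought 20 days ago?")
# Send `prompt` to whichever model you use and parse the line starting with 'Answer:'.
```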

Andreas Welsch:

So, there's also the IP component and the data privacy component to attend to. Awesome. Abi, thank you so much for summarizing it. Thanks for joining us today and for sharing your expertise with us in such great depth. I know it's an exciting topic, and I wanted to make sure that our joint audience has an opportunity to hear from you and to learn from you, and especially to go deeper than the "everybody should be doing generative AI and get on the bandwagon" message, for those of you in the audience. Again, thank you, Abi, for joining.

Abi Aryan:

Oh, thank you so much.