What’s the BUZZ? — AI in Business

Adopting Data-Centric Machine Learning For Small Datasets (Guest: Jonas Christensen)

July 21, 2024 Andreas Welsch

In this episode, Jonas Christensen (Senior Data & Analytics Leader) and Andreas Welsch discuss adopting a data-centric Machine Learning approach for small datasets. Jonas shares learnings on improving data quality before building models and provides valuable advice for listeners looking to improve their results with small datasets.

Key topics:
- The common challenges businesses still run into with Machine Learning
- Options for working with small datasets
- How data scientists can get more data and better data
- Delivering business value from Machine Learning at a time when everybody wants to chase Generative AI

Listen to the full episode to hear how you can:
- Work toward better data quality over improving your model parameters
- Approach data science as a team sport that requires actively influencing your business stakeholders
- Share with your stakeholders how you use the data downstream to get them excited about their contribution

Watch this episode on YouTube:
https://youtu.be/C_GlTgRmeL8



***********
Disclaimer: Views are the participants’ own and do not represent those of any participant’s past, present, or future employers. Participation in this event is independent of any potential business relationship (past, present, or future) between the participants or between their employers.


Level up your AI Leadership game with the AI Leadership Handbook:
https://www.aileadershiphandbook.com

More details:
https://www.intelligence-briefing.com
All episodes:
https://www.intelligence-briefing.com/podcast
Get a weekly thought-provoking post in your inbox:
https://www.intelligence-briefing.com/newsletter

Transcript


Andreas Welsch:

Today we'll talk about adopting a data-centric machine learning mindset for small datasets. And who better to talk about it than somebody who has recently written the book on it: Jonas Christensen. Hey, Jonas, thank you so much for joining.

Jonas Christensen:

Thanks for having me, Andreas. I'm so excited, because we've been communicating over the years; we met on LinkedIn and formed a good camaraderie there. So I'm excited to finally be on your show. Thank you for the invite, I really appreciate it.

Andreas Welsch:

Perfect, likewise. And now that we're a little closer together time zone-wise (I'm in Germany at the moment, and I know you're in Australia), it made for the perfect time, and I'm super excited that we get to do this together. So thanks for your time as well. But hey, why don't you tell our audience a little bit about yourself, who you are and what you do?

Jonas Christensen:

Sure. You've already revealed that I'm in Australia, so it's actually night here, and it's also winter, not summer, so it's very different. From a career background, I've worked in data and analytics for almost 20 years. I fell into it randomly, like a lot of people did early on, because back then there wasn't really a career path; there were no degrees in AI or analytics or anything like what we have today. I came from a finance background, so I was very much looking at the numbers, and data was a part of my role. Then I took a job as an analyst in a bank, and from there on I saw the light. I saw what hundreds of thousands, even millions, of rows of data could do, and I was sold. Since then I've grown in this sphere of analytics into various leadership roles, leading analytics and data science functions, starting them from scratch and building them up in industries like financial services, legal services, utilities, and also consulting. I've also done education in this space, so I've built various training courses for corporate organizations and universities. And then I've written a couple of books. One is a non-technical book called Demystifying AI for the Enterprise, which I co-authored with a bunch of other authors and which was published about three years ago. And then earlier this year, I published a more technical book, which includes Python coding as well, about data-centric machine learning. So that's part of the topic today, Andreas. That's the background in a nutshell.

Andreas Welsch:

Wonderful, thank you for sharing. So like I said, I'm excited to have you on and looking forward to learning more about this topic. Usually we play a little game to kick things off. What do you say? Are you ready?

Jonas Christensen:

I'm slightly scared, but let's do it.

Andreas Welsch:

Perfect. So this one is called In Your Own Words. When I hit the buzzer, at least the virtual one here, the wheels will start spinning, and when they stop, you'll see a sentence. I'd love for you to answer the question you see on the screen with the first thing that comes to mind, and why, in your own words. To make it a little more interesting, you'll only have 60 seconds. For those of you in the audience, I'd love to hear your thoughts as well: what do you think, and why? Put it in the chat, and we'll take a look at that in a second. Now, Jonas, what do you say? Are you ready for What's the BUZZ?

Jonas Christensen:

Let's do it.

Andreas Welsch:

Good. Let's try this out.

Jonas Christensen:

If AI were a fruit, what would it be? Ah, you have tricked me. If AI were a fruit, it would be a grapefruit. Grapefruits are, to many people, a little bit sour. They start very sour, but if you allow them to ripen, they are beautiful, juicy, full of flavor inside, and very healthy for you. AI is a little bit like that. If you start your journey too early, if you bite into the fruit before it's ripe, you will end up with some very sour results. But if you let it ripen, you eat it the right way, and you make sure you go through the motions with AI to mature to the right level, then it can give you some wonderful results, wonderful flavor sensations.

Andreas Welsch:

That's awesome. Perfect answer, I love that. Thank you so much. I didn't know that's where it was going when you said grapefruit; I only had the sour part in mind.

Jonas Christensen:

Many AI projects and grapefruits start and finish sour.

Andreas Welsch:

Yes, that too. Exactly. Wonderful. Still better than lemons, right? All right, with that out of the way, why don't we jump into the topic of today's episode. Everybody is talking about generative AI, but we all know that, depending on your business problem, machine learning is still the more accurate and the more economical choice. But I'm wondering: what challenges do you see businesses still run into with machine learning? Maybe that brings us back to the grapefruit.

Jonas Christensen:

It does. And it's funny, when you pose that question, I think that not two years ago, machine learning was the AI we were all trying to do, and now it seems like the old-school AI. But in many cases, we still haven't done machine learning well enough yet, and it's the bread and butter for many. Generative AI is a wonderful tool, but I think in many cases its proper development will be reserved for very specific use cases, and probably for companies that have the right kind of scale, because developing those models is very costly. For the average business out there, machine learning is still the thing you'll solve your business problems with, right? The generative AI that we know today, such as ChatGPT or Midjourney or what have you, that can create content, is very valuable for specific use cases. But when it comes to the bread-and-butter automation, fact finding, and understanding relationships in data, machine learning is still what we'll need to do. In the book that we have written, my two co-authors, Nakul Bajaj and Manmohan Goswara, and I talk about some typical problems that people run into, and we think that a data-centric approach is the solution to that, of course. And so perhaps, Andreas, I should first define what a data-centric versus a model-centric approach is, because I often hear from people: isn't machine learning data-centric by default? The difference is this: a model-centric approach is the traditional approach where you take the data that you have as a given. This is the dataset you have, and your opportunity is to build the best model on it, by trying different algorithms, hyperparameter tuning, parameter selection, all that sort of stuff to make the model as good as it can be on that data. What we are saying is that there is much more opportunity in actually improving the data. It's probably harder to do, but there's much more opportunity for uplift in the data than there is in the algorithm itself.
The typical challenges that businesses face in this space can be categorized into technical and non-technical bottlenecks. In terms of technical bottlenecks, I would say that the model-centric approach is a bottleneck in itself. There's so much more we can get out of the raw data if we improve the data quality, and if we don't, that's a bottleneck. The potential of any machine learning model is really capped by the quality and quantity of the data you have available. If you were the best chef in the world and you had the most wonderful restaurant, the best waiting staff, the most elaborate dining room, the best equipment in the kitchen, the best sous chef and colleagues around you to cook that meal, but you didn't have the right ingredients in the fridge, then none of that matters: you cannot cook the world's best meal. It's the same with machine learning. If your raw material isn't up to scratch, it doesn't matter how good all the other bits are. You're capped by that, limited by that. So that's an obvious technical bottleneck. Another bottleneck is that we can't always just get more data. In many cases we can't just go and get another million rows, or even another thousand rows. A lot of machine learning problems in modern organizations are not actually what I'd call big data, right?
You might have a problem that is machine-learning-worthy, but you only have a thousand observations a year of a certain event that you're trying to predict. So you don't have data at scale, and you have to do something to that data to get as much signal out of it as possible.
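
To make that model-centric versus data-centric contrast concrete, here is a minimal sketch (not from the book; it assumes scikit-learn, and the dataset and the 15% label-noise rate are invented for illustration). It simulates a small, noisily labeled dataset and compares tuning hyperparameters on the noisy labels with simply training a plain model on corrected labels; the clean-label model typically scores higher.

    # Minimal sketch: model-centric vs. data-centric iteration on a small,
    # noisily labeled dataset. Assumes scikit-learn; data and noise rate are
    # invented for illustration.
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import GridSearchCV, train_test_split

    # Small dataset (1,000 rows); flip 15% of training labels to mimic poor
    # data quality at the source.
    X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
    flip = np.random.default_rng(0).random(len(y_tr)) < 0.15
    y_noisy = np.where(flip, 1 - y_tr, y_tr)

    # Model-centric: keep the noisy data fixed, search over hyperparameters.
    grid = GridSearchCV(LogisticRegression(max_iter=1000),
                        {"C": [0.01, 0.1, 1, 10]}, cv=5)
    grid.fit(X_tr, y_noisy)
    print("model-centric (noisy labels):", grid.score(X_te, y_te))

    # Data-centric: keep a plain model, invest the effort in fixing the
    # labels instead (simulated here by restoring the corrupted labels).
    plain = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    print("data-centric (clean labels):", plain.score(X_te, y_te))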

Andreas Welsch:

So I'm getting curious, especially with your cooking or kitchen analogy. What I think I'm hearing is: not everybody has a huge farm where they can plant the veggies that they need, but maybe you have a small patch in your backyard where you can, I don't know, plant your radishes and your lettuce and whatever, right? That's the difference, or one of the differences.

Jonas Christensen:

Yeah, quite right. One of the things we're saying is: don't think of this dataset as fixed, and don't just get more data, get better data. Better means you actually work on improving the quality of the data that you have, because it's not always possible to just get more data. Let me give you a concrete example. In the book, we describe a team of researchers and machine learning engineers who attempted to predict cardiac arrests by building models on phone calls to the equivalent of 911, the emergency number: someone calls in and says, I'm feeling this and that, maybe I'm feeling heavy in my chest, I'm struggling to breathe, and so on. The algorithm is meant to pick up that this is a cardiac arrest and that you need to send the ambulance now. We can't just get more data, right? We can't make up more cardiac arrests. And it's high stakes, so the model also has to be very accurate, and therefore the data quality is so important. If I contrast that with some of the big tech organizations, let's pick YouTube. YouTube has, I would say, infinite amounts of data, and that data is complex, but it's created inside their own platform. They have control over that data quality, they can manipulate what kind of data is captured, and they can run lots of experiments inside the platform to create new kinds of data. At the same time, the thing they're trying to predict or prescribe, say a video recommendation, is not high stakes. It's not a cardiac arrest, where, if you get it wrong and you don't send the ambulance, you're in trouble and the algorithm carries a lot of responsibility. If you get the video recommendation wrong, it's okay; you can keep experimenting and figure out why you got it wrong and what might have been right instead for that particular user. These are the situations that mean that, although we'd like to be, most companies are not going to be Facebook, Amazon, TikTok, Netflix, what have you, because their datasets are not as controllable, they're not as large, and the stakes are much higher for lots of businesses.

Andreas Welsch:

Thank you for sharing that. Especially the part around control of the datasets, the governance, and being able to influence what data you capture and how you can actually get more of it, I think that's really key. Now, that part gets really interesting, right? Because if you do start with a model-centric mindset, I think that's kind of like data scientists saying: "It is what it is. We need to work with the data we have." How can they actually change that? What do they need to do to get more data and better data, as you just mentioned?

Jonas Christensen:

Yeah, there are many ways to do that; our book is almost 350 pages of various ways you can do it. There are technical approaches and non-technical approaches, so let me describe some ways and then give you a specific example of where we've done this. We've written this book based on our own experience as well, so there are lots of examples in it of how we've done some of these things. We have the technical bottlenecks I talked about, and we also have some typical non-technical bottlenecks that any data practitioner will recognize. Data quality is typically the problem that blocks us from doing something; that's the number one complaint, almost universally, across any business: the data quality is not there for us to do whatever it is. And that comes from a range of things. First of all, data is typically not collected or curated for machine learning purposes; that is a secondary purpose, so it doesn't contain the information we'd like it to. I've worked in lots of services businesses where you have an ongoing relationship with a customer, say banking or utilities or telecommunications or legal services. The customer calls in, they tell you something, you make some notes in a CRM or some system, you tick some boxes or what have you. The information collected there is not as rich as it could be for machine learning purposes. I'll give you an example in a minute of how we've gone back, actually changed the source system to ask better questions, and worked with the frontline staff to get more data out, because we needed it to have more predictive power. That connects to another non-technical bottleneck, which is that the people collecting the data typically don't know what we, the data scientists, are actually going to use it for. They don't have an appreciation that doing it a certain way leads to a certain result, and that the data might be worthless because they haven't done something specific to it or have cut corners somewhere. And the third non-technical bottleneck is that the people collecting or labeling the data have all kinds of biases in their heads. They're often interpreting information, and they interpret it a certain way based on their background, their beliefs, and so on. So, an example of what we've done in the past: Manmohan, one of the co-authors of the book, and I established a project in a business where we needed to label particular legal cases that had a very high likelihood of being a certain type of case, one that needed a certain type of treatment and was very high value for the business, but also for the client. This required us to predict, or identify, from textual information that certain elements were present in the case. Now, when the client came to the law firm, they would have an interview-style conversation: tell us what happened, tell us about your situation, explain the case in your eyes and what we can do for you. All of that is, of course, unstructured. So we have maybe 40-45 minutes of someone talking, but it's completely unstructured, and finding the right keywords in that is possible with algorithms, but it's very complex.
At the end of the day, we worked with a subject matter expert, a lawyer, and asked: how do we figure out, using machine learning, exactly what's needed to pick out these legal cases with high accuracy? We built a model, we iterated, and she reviewed the output and told us when the model was right or wrong in classifying a case. In the end, we asked: what are the things that have to be in place for a case to be one of these special cases? And she said there are these five things. So what we did was go all the way to the front line and say: when that customer, that potential client, comes in with their legal case, you must ask these five things in a particular way, so that each is either quantified numerically or answered in a yes/no format. And all of a sudden, this very complex textual analytics problem became very simple, barely even machine learning, right? You can practically just follow the yes and no answers, because we structured the data much, much more clearly. It's an example of how a very complex task was made reasonably simple just by reformatting and rephrasing the data all the way up front.
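
As a small illustration of that restructuring (not the firm's actual system; the five intake questions, field names, and threshold below are hypothetical), capturing the subject matter expert's five criteria as structured answers at the front line turns the classification into a trivial rule, with no text mining required:

    # Hypothetical intake record: the five structured questions asked at the
    # front line (field names and threshold invented for illustration).
    intake = {
        "injury_reported": True,             # yes/no
        "incident_within_time_limit": True,  # yes/no
        "third_party_involved": True,        # yes/no
        "medical_treatment_sought": True,    # yes/no
        "estimated_loss": 25000,             # quantified numerically
    }

    def is_special_case(record, loss_threshold=10000):
        # With structured answers, "classification" collapses to a rule
        # agreed with the subject matter expert.
        required = ["injury_reported", "incident_within_time_limit",
                    "third_party_involved", "medical_treatment_sought"]
        return all(record[q] for q in required) and record["estimated_loss"] >= loss_threshold

    print(is_special_case(intake))  # True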

Andreas Welsch:

That's awesome. And it sounds like it also has to do with asking the right questions or asking better questions that lead to the data that you need to make those better predictions, right?

Jonas Christensen:

That's right. Data quality is often a reflection of the underlying business process; I'd say it's almost always a reflection of the underlying business process. If the data quality is not good, it's because the business process that captures the data is either non-existent or not very good. You tell me a data problem and I'll tell you a business process problem. People think, ah, it's hard and effortful to go in and create better data at the source. But actually you're creating a double win, because you're getting better data and you're also fixing a bad process. That's my experience doing this.

Andreas Welsch:

I think that speaks volumes, because, to your earlier comment, every company is complaining about bad data or poor data quality, which in turn means there are probably lots of opportunities to improve your processes as well. Now, from there, maybe segueing over to the next topic: what can leaders actually do in a business? How can they deliver value from machine learning, knowing that they have smaller datasets and should take a data-centric approach, at a time when everybody just wants to talk about generative AI and the next chatbot, summarizing and creating text and whatnot? What can leaders do when it comes to machine learning and taking a data-centric approach?

Jonas Christensen:

That's a very important question. In this conversation I've mostly gone over the technical approaches to making your data better: as a data scientist you can do things like data cleansing, you can oversample or undersample data to draw the signal out, you can use synthetic data; all of those are important elements. But that's a technical approach. Another important thing that we talk about through every chapter of this book is how machine learning can really work for the masses, so not just for big tech companies, but for companies in the long tail, which is most of us. What do I mean by the long tail? Very simply, most businesses don't have one machine learning use case that's worth 50 million dollars, but many businesses have 50 machine learning use cases that might each be worth a million dollars. Each of them takes a long time to do, and each of them has a potentially much smaller dataset, so they're harder to reach. How do you unlock that value? That's basically the question for most businesses. For me, that means data science becomes a team sport in the organization, and this is where leadership comes into it; this is what leaders need to do. Data quality is everyone's problem. And when you recognize that data quality is also a reflection of the quality of the business process, the conversation actually changes and it becomes much more of a business improvement exercise. It creates a virtuous cycle: you get better data, so you can predict better things, but you also get better business processes. Some of the problems you're trying to predict actually go away, because the problem was the process itself. I've seen examples of that many times. So it's about leaning into that conversation, bringing your stakeholders along, and really being champions for data quality. If you're a data scientist, you really have to think about how to affect people at the top of the funnel and at the end of the funnel. This is a hard job, a really hard job, what I'm describing. You have to understand the business. You have to get your stakeholders and subject matter experts involved. You have to really work with data engineers and push them to help you with this technical work. And you also have to build use cases at the other end that work for the business and make sense. So it's not an easy job, but it is what will unlock this long, long tail of things that never get touched, because we always say, oh, the data's not good enough and we don't have enough of it. Well, what are you going to do about it? That's the thing.
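
As a quick illustration of the oversampling option Jonas mentions (a minimal sketch, not code from the book; it assumes scikit-learn and uses a made-up imbalanced dataset), a rare event class can be randomly oversampled before training so the model sees enough positive examples:

    # Minimal sketch: random oversampling of a rare class in a small dataset.
    # Assumes scikit-learn; the data is synthetic and for illustration only.
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.utils import resample

    # Small, imbalanced problem: 1,000 rows, roughly 5% positive events.
    X, y = make_classification(n_samples=1000, n_features=10,
                               weights=[0.95, 0.05], random_state=42)

    # Duplicate minority-class rows until they match the majority class size.
    X_min, X_maj = X[y == 1], X[y == 0]
    X_min_up = resample(X_min, replace=True, n_samples=len(X_maj), random_state=42)

    X_bal = np.vstack([X_maj, X_min_up])
    y_bal = np.concatenate([np.zeros(len(X_maj)), np.ones(len(X_min_up))])
    print(X_bal.shape, y_bal.mean())  # classes are now roughly 50/50

Undersampling the majority class or generating synthetic rows (for example with SMOTE) follows the same pattern.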

Andreas Welsch:

So plenty of work to do, that's what I take away from this, and plenty of opportunity. What's also resonating really well with me is that if you start to get better data, or look at how you can get better data, you actually address the most likely source of the problem, which is the process, and then, like you said, you get the virtuous cycle, which is awesome. Now, hey, we're getting close to the end of the show, and I was wondering if you could very briefly summarize the three key takeaways for our audience today before we wrap up.

Jonas Christensen:

The first key takeaway is that there is a lot of gold at the end of the rainbow if you start working on better data quality over improving your model parameters. Some of this work is very manual; sometimes you have a dataset where you have to intervene manually, but you actually find out what you need to find out. So don't be afraid to be the data janitor and get your hands dirty with that stuff. That's the first one: there's lots of gold at the end of the rainbow with data quality. The second one is that data science is a team sport. Don't sit back and say: the data quality is what it is, the data we have available is what it is, I'll just take that and do the best job I can with it. That's a passive approach. You have to go and actively influence your business stakeholders. And the third one, which I haven't really described, Andreas, if I had more time I would tell you some of these success stories: when people learn what the data they collect is used for downstream, they get really excited. When I work with call centers and say, look, if you collect this data this way rather than that way, I can use it for this and I can predict this, and therefore these people will not call anymore and ask these stupid questions, people are like: wow, I didn't know. I wish you had told me six months ago; I would have asked different questions, I would have taken a different approach. So involving people throughout the business in the team sport of data science, making it a team sport, not just saying it is one but making it one, is key to changing the way your business operates, takes on the value of data science, and realizes what it can do.

Andreas Welsch:

Thank you for sharing, Jonas, and thank you so much for joining us today and sharing your expertise with us. Really appreciate it. There was a lot of good information there and some really good nuggets, like the one 50-million-dollar business problem versus fifty one-million-dollar business problems. So thank you for sharing that. And for those of you in the audience, thank you for joining us and for learning with us. Jonas, it was great having you on.

Jonas Christensen:

Thank you for having me. Really appreciate it. And for everyone listening, really appreciate it. Feel free to connect on LinkedIn and continue the conversation there.