Implementing Generative AI in Health Systems
Video Transcription
Only the true scientists stay to the end. I'm so excited about this next session, and you guys are in for a treat, because we have probably the tip of the spear when it comes to AI implementation here to talk with us from San Diego. It's my pleasure to introduce my friend, Karandeep Singh, who is the Chief Health AI Officer at UC San Diego, and the Joan and Irwin Jacobs Chancellor Endowed Chair in Digital Health Innovation at UC San Diego. In his role, he's crafting AI strategy and governance for the health system, and he leads lots of different AI initiatives across the innovation center at UC San Diego. He did his medical education at the University of Michigan, which we're very proud of, and he has a master's degree in biomedical informatics from Harvard University. He serves on lots of different editorial boards, including New England Journal of Medicine AI and a responsible AI steering committee, serves as co-leader of the generative AI working group for the Coalition for Health AI, and represents UCSD on lots of different health AI partnerships. But that's not all. He's also a longboarder, and I was trying to convince him to actually come on his longboard, which he didn't do. Apparently, his longboard traveled from LA, where he was, all the way to Michigan, and now is back in San Diego, which, I guess, is his new home. But more importantly, he's an accomplished musician with a rap album that didn't go quite platinum, but we're happy to share it with you as he comes on board. So welcome, Karandeep Singh. Thanks, everyone. I see you've got your headphones on to make sure you don't have to listen to my droning on about rap. Excited to be here. I wanted to talk to you today about the world beyond ChatGPT. I think a lot of folks in the room are familiar with going online and working with ChatGPT, and I want to take you a little bit under the hood of what we can do with generative AI in health care, and how we can see if it actually makes a difference in improving clinicians' lives. So I want to rewind back to November of 2022, when ChatGPT launches and becomes available as a web interface to OpenAI's GPT-3.5 language model. Now, what's interesting about this model is that, unlike the language models before it, it's the first model available to the general public that has the ability to follow instructions, not just predict the next word. Just to give you an idea of how much of an advance this is: at the time, all the language models were essentially designed to predict the next word, given the prior context. So you could use them to autocomplete words, autocomplete sentences, autocomplete paragraphs. But what you couldn't do was quite get them to do what you wanted them to do. And to show you that, here's a language model that is one of those ones only able to predict the next word, and I give it a little bit of a task to do. I say, briefly tell me the top three diagnoses for someone presenting with dyspnea, in the form of a list. So before ChatGPT, this is what it felt like to use a large language model. It went and looked some stuff up on dyspnea. There's no internet access here; it's just what's in the training of the language model. It pulls out some platitudes about how I might want to diagnose dyspnea, things I might want to do as part of my workup, that my treatment might depend on whether the patient has underlying heart failure. But notice what it didn't do. It didn't actually answer my question. The other thing I asked it to do was to give the results in the form of a list. And if you look at it, you originally think it's kind of doing that, because it says A, B, C, D. Like, oh, this is kind of a bulleted list. And then you suddenly come to the realization that what's actually happened here is that it trained on multiple choice questions. And so after giving me these platitudes, it says, by the way, the correct answer is D, which doesn't make any sense.
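As a rough illustration of the kind of base-model demo described here, a minimal sketch of prompting a next-word-only model from Python, assuming the Hugging Face transformers library and a small base model such as distilgpt2 (neither is named in the talk):

```python
# Minimal sketch: prompting a base (non-instruction-tuned) causal language model.
# Assumes the Hugging Face `transformers` package and a small base model such as
# distilgpt2; any causal LM without instruction tuning behaves similarly.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "distilgpt2"  # illustrative choice, not the model used in the talk
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = ("Briefly tell me the top three diagnoses for someone presenting "
          "with dyspnea, in the form of a list.")
inputs = tokenizer(prompt, return_tensors="pt")

# A base model simply continues the text; it has no notion of "following" the instruction.
outputs = model.generate(**inputs, max_new_tokens=80, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```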
So again, the key advancement with ChatGPT and GPT-3.5 wasn't just making this available. It was making it so it could follow instructions. So you take that same model, and you put it through a process where, instead of predicting the next word, you additionally have it learn how to predict the intended output given a set of instructions. And OpenAI did that with a process called reinforcement learning from human feedback. But there are all kinds of instruction data sets available now that you can use to do instruction tuning on your own. So you take that same model, instruction tune it, give it the same instructions, and now you actually get something that's a lot more helpful. What's interesting here is that when you ask it this question, it not only answers your question, it actually reflects your question back to you. It says, here are the top three diagnoses for the thing you asked me about. It gives you the three diagnoses: pneumonia, asthma, and right ventricular failure. And what it also does is give you a little bit of description about each of those conditions, which is not something that was explicitly asked for, but it's something that's actually kind of helpful. So this was the magic of instruction tuning. You suddenly have helpful responses coming to you without having to spell out everything that you want. And this really marked the first time that this was available. And what's impressive here is this actually isn't even GPT-3. This is a model running locally on my laptop that I saved. But this is essentially the level at which I think GPT-3.5 was performing when it first came out.
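And roughly the same request sent to an instruction-tuned model through a chat template, as a minimal sketch; the specific small instruct model here is an assumption, not the one demonstrated in the talk:

```python
# Sketch: the same request to an instruction-tuned model, wrapped in the chat
# format the model was tuned on. TinyLlama's chat variant is used purely as a
# small stand-in that fits on a laptop.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

messages = [{"role": "user",
             "content": "Briefly tell me the top three diagnoses for someone "
                        "presenting with dyspnea, in the form of a list."}]

# apply_chat_template adds the instruction formatting the model expects
input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True,
                                          return_tensors="pt")
outputs = model.generate(input_ids, max_new_tokens=150, do_sample=False)
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))
```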
Now, of course, people realized that to use this in health care, we have to have information that's accurate. And in the first round of people playing with GPT-3.5, it could speak in a straight line. It could come up with plausible sentences, things that looked believable, things that sounded very confident. But there were still situations where it made mistakes, and people started labeling those mistakes hallucinations. And it's not surprising there are hallucinations. This was trained on essentially the entire swath of the internet, and for every good piece of information on the internet, there is a bad piece of information on the internet. So the fact that it could even speak in a straight line is impressive. But I think people really thought that language models were going to struggle for a long time to get to anything that was remotely accurate, where it could be useful, because the internet isn't a curated data source. So it was kind of surprising when, four months later, after the release of the original GPT-3.5 model, GPT-4 comes out. And it not only is a little bit better in terms of getting you the sort of stylistic response that you're looking for, but it's also a little bit more accurate. It's now able to answer Step 1, Step 2, Step 3 USMLE questions at a level that would typically be considered a passing rate if it were a human taking the exam. So this is interesting. And I think this finally gets people thinking: OK, this can't do everything, but it can do certain things. And even though it doesn't have access to the internet, just the language model itself understands enough context that it can accurately answer questions. It still makes mistakes, but now it might actually be useful. So how far have we come from the days of March 2023, when GPT-4 came out? The next iteration of GPT-4 actually came out in June of 2023. They've been updating it every couple of months, and they kind of retrain it and re-release it. And if you look at that June version of GPT-4 from last year, 2023, that is currently the 46th best language model available today. And what's interesting, and we can talk later about how this ranking actually works, is that GPT-4 was rumored to be something like 1.7 trillion parameters. And so it was this really slow model running on, you know, there were chip shortages at Microsoft because of trying to have enough servers that could run this model. So it was not only slow, it was hard to access, nothing you could ever run on your home computer. And if you look two spots above it, there's the Llama 3.1 model that was released recently by Facebook, or Meta. And that's actually the smallest version of Llama 3.1, which is something you can run on your laptop today. So I think, in my mind, the biggest progress that's been made is not necessarily in getting more accurate things or being able to do more things, which we can do. There are a lot of other language models that have caught up to the state of the art. But what's really interesting here is that things that used to be really slow, that required you to go to a server powering these, you know, immense multi-trillion parameter models, can now be done with far fewer parameters on a laptop, which really makes it so that all the other tech we use can essentially embed this technology in a way that's cost effective and fast. So what can we do with large language models in health care? Particularly when we're talking about text models, not necessarily, you know, signal models or other kinds of foundation models. Some of the key things we can do require enterprise solutions that are very bespoke to the use case. Others are things where you really just need your frontline clinicians and staff to have some basic understanding of how prompting works. So I'm going to show you a little bit of both of those things: tasks that are relatively straightforward, where, as long as you have secure access to these tools, you can train your frontline staff to do some prompting, and some of the more difficult tasks. I think the thing that's probably the most underutilized is actually this first use case, which is information extraction. So think about all the places we do chart review, right? We review charts for quality measurement. We review charts for research. We review charts for patient safety events. And so much of that is done in response to some event happening.
But with LLMs, we can actually proactively read many, many, many more charts at a fraction of the time, with a reasonable amount of accuracy compared to what humans can do. So I think there's a real opportunity there, because information extraction is probably where some of the lowest hallucination risk is. Then there's simplification, making things more readable for patients. Summarization, I think some of that can be done in a GPT-type tool; some of it requires tight integration, because a lot of the data we want to summarize is sitting in the electronic health record, so we have to really work with it there. Generation, or question answering, is one of the more difficult things that large language models can do, in the sense that it's prone to hallucinations. But this is also where there is integration into electronic health records for being able to reply back to patients, and we can talk a little bit about how effective that is and whether it actually saves time. The last thing I left on here, just for people's awareness, is that even though large language models generate text, they don't go directly from text to text. They actually go from text to a bunch of numbers, and from a bunch of numbers to text. And it turns out you can actually use that bunch of numbers, which is called an embedding, to do prediction directly. So there was a paper out of NYU called NYUTron where they showed that you could directly use large language models for prediction. So in a lot of cases where we were carefully curating a set of predictors to feed to a model to train it, we can now skip some of that step, and we're trying to understand the implications of that, where we can have a large language model reading text and generating numbers that do the prediction. And that's potentially close in accuracy to the very highly manual process that's a lot more time consuming.
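A minimal sketch of that embeddings-as-predictors idea, not the NYUTron method itself; the embedding model, toy notes, and labels are assumptions for illustration:

```python
# Sketch: turn clinical text into an embedding ("a bunch of numbers") and fit an
# ordinary classifier on top of it. Generic illustration of the idea, not NYUTron.
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

notes = [
    "Admitted with community-acquired pneumonia, improving on antibiotics.",
    "Elective knee replacement, post-op course uncomplicated.",
]
labels = [1, 0]  # toy outcome labels (e.g., 30-day readmission), purely illustrative

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # small general-purpose embedding model
X = encoder.encode(notes)                          # text -> numbers

clf = LogisticRegression().fit(X, labels)          # numbers -> prediction
new_note = ["Admitted with sepsis from a urinary source."]
print(clf.predict_proba(encoder.encode(new_note))[:, 1])
```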
So I told you that with GPT-3.5, one of the key advances was that it can follow instructions. So how do we instruct an LLM? How do you tell an LLM to do something? In this case, I want to ask the question: if we have this as a snippet from a patient's chart, how do we extract whether this patient has a definitive infection? This matters a lot in our quality measurement and patient safety, because if someone has an infection in the hospital that's new, we really want to make sure it is in fact new and wasn't something that was previously known about, because that makes a big difference for whether we consider it hospital acquired or an infection that someone walked in with. And so we might want to know, is it a definitive infection? And if you fed this into an old language model, or into a prediction score that looked at just the words, it would see cough and pleural effusion and say, oh, that's got to be an infection, because those are two of my buzzwords that have a high coefficient for infection. But with language models, we can actually instruct it. And the way to instruct a language model is to use a prompt framework. So essentially, if you're a clinician, you've written SOAP notes. If you're in leadership, you've dealt with SBARs. And I call RISEN, essentially, the SOAP note or the SBAR of AI. This isn't necessarily something leaders need to know, but it's important for frontline folks who are actually using these tools. So when you're trying to instruct an LLM, it's important to tell it: what role is the AI playing? It seems silly, but it actually gets you better performance if you start with that. What is the input you want it to reason about? That might be a snippet out of someone's chart. That might be a document that you've copy-pasted in or that's getting read right in. But that's where you would put that. Then, what are the steps you want it to follow? Importantly, what is the expected format of what you're looking for? What do you actually want out of the thing? Do you want a yes-no answer? Do you want a table? Do you want some kind of computational format that you can read into another program? And then, inevitably, the language model will produce something that you don't want, and you need to add something to your prompt that says, here's what not to do: the narrowing step. So: I don't want you to add an explanation, I don't want you to write more than two paragraphs. Those are the sorts of things you might put in the expectation or the narrowing. So here's exactly the prompt that I would use to try to get at whether that patient has an infection. I would start with a role: you're an infectious disease specialist reviewing a chart. Consider the following text. I would copy-paste that piece of text and insert it right into the prompt. Answer whether there is definitive evidence of infection. Explain what led you to that conclusion. Return the results in the following format: presence of infection, yes or no; explanation, explanation. And I might later add a narrowing step, but there's no narrowing step here, because I haven't yet seen what my output's going to look like. So let's actually go to that and see what it comes up with. Interestingly, it comes up with presence of infection: no. Here's a patient who has a cough. They have a pleural effusion. But they also have COPD. They also have heart failure. They also have chronic kidney disease. And so there are other explanations for potentially why they would have a chronic cough, and other explanations for why they might have a pleural effusion. And what's impressive about this isn't that you could do this with GPT-4, because I think a lot of folks who were playing with it could find that it could do some basic level of medical reasoning that seemed plausible even with GPT-4. It's that this is actually Llama 3 running on a laptop. And so the fact that you can do this with an LLM that can run on your laptop, and probably not too far from now on your phone, means that you can open up a set of use cases that were really out of bounds, impossible to deal with even a year ago.
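Roughly what that extraction prompt looks like when wired into code, as a minimal sketch; the wording mirrors the prompt described above, and `call_llm` is a hypothetical stand-in for whatever secure local or enterprise model endpoint you have:

```python
# Sketch: build the RISEN-style infection-extraction prompt and parse the
# structured reply. `call_llm` is hypothetical; swap in your own model call.
import re

def build_prompt(chart_snippet: str) -> str:
    return (
        "You are an infectious disease specialist reviewing a chart.\n"      # Role
        f"Consider the following text:\n{chart_snippet}\n"                   # Input
        "Answer whether there is definitive evidence of infection and "      # Steps
        "explain what led you to that conclusion.\n"
        "Return the results in the following format:\n"                      # Expected format
        "Presence of infection: <yes/no>\n"
        "Explanation: <explanation>\n"
        # A narrowing step (e.g., "Do not write more than two paragraphs.")
        # would be appended here once you have seen output you want to rule out.
    )

def parse_reply(reply: str) -> dict:
    infection = re.search(r"Presence of infection:\s*(yes|no)", reply, re.IGNORECASE)
    explanation = re.search(r"Explanation:\s*(.+)", reply, re.DOTALL)
    return {
        "infection": infection.group(1).lower() if infection else None,
        "explanation": explanation.group(1).strip() if explanation else None,
    }

# reply = call_llm(build_prompt(chart_text))   # hypothetical model call
# print(parse_reply(reply))
```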
Another thing that you could do with it, and this is just another kind of information extraction task, but a little bit more open-ended, is put in a note and tell it: name all the potential documentation errors in the above text. So there are a lot of things that are wrong here. This is a patient who I've said has HFrEF, which means reduced ejection fraction, but the EF is actually 60%. So that seems inconsistent. They're on Lasix and furosemide, the same drug; one is a brand name, one's a generic name. They're on way too high a dose of Tylenol and way too high a dose of ibuprofen. You hope those are documentation errors and not actual medical errors. But either way, it could lead to a medical error, even if it's just a documentation error. So you do that, and again, this is a language model running on a laptop. And it comes up with all of those things. It says the ejection fraction is 60%, but you said it was reduced, and that doesn't quite fit. It picks out the fact that Lasix and furosemide are both loop diuretics. And even though it doesn't say they're the same drug, it says they have similar mechanisms and it's not clear why they're on both. It picks out the issues with the Tylenol dosing and even gives you a maximum recommended dose, which, again, could be wrong. You can't just believe it. But in this case, it turns out to be reasonable based on the information that it has, although it doesn't know the patient's age and comorbidities. And similarly, it identifies the issue with the ibuprofen. So that's just to give you a sense of things you can do to empower frontline staff and clinicians who have secure access to one of these generative AI tools. Not even ones that you have to pay for, but ones that you can run on a company device, on a laptop, and do all sorts of things that are actually relatively useful. And I showed you this in code, but there are no-code solutions for this that you can use that essentially look like ChatGPT but are running fully locally on a computer. One of them is called LM Studio, but there are several other ones.
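Following up on the no-code, local-server point: a minimal sketch of running the documentation-error check against a locally hosted model. It assumes LM Studio's local server, which exposes an OpenAI-compatible API (commonly on port 1234), with some small instruct model loaded; the port, model name, and note text are assumptions:

```python
# Sketch: documentation-error check against a model served locally through an
# OpenAI-compatible endpoint (as LM Studio and similar tools provide).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-needed-locally")

note = (
    "Patient with HFrEF (EF 60%). Meds: Lasix 40 mg daily, furosemide 40 mg daily, "
    "Tylenol 2 g every 4 hours, ibuprofen 800 mg every 4 hours."
)

resp = client.chat.completions.create(
    model="local-model",  # many local servers map this to whatever model is loaded
    messages=[{"role": "user",
               "content": "Name all the potential documentation errors in the "
                          f"following note:\n{note}"}],
    temperature=0,
)
print(resp.choices[0].message.content)
```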
So, the million dollar question. One of the reasons we argue we need large language models in health care is because they will save clinicians time, and therefore they will make clinicians happier. So does it actually do that? The first set of evidence is finally coming out, looking at a couple of different use cases. One of the use cases that we looked at was replying to patient messages using AI. That was GPT-4 functionality integrated into the Epic electronic health record. And at UC San Diego, we ran the first randomized study looking at clinicians using this and not using this, and seeing whether it actually saves time. Interestingly, it doesn't save time. It actually increases time, because now you are not only reading the message that was sent to you, you're also reading the drafted reply to decide if you want to use it or not. That drafted reply happens to be much longer than what you might have written, so there's actually more stuff to clean up. And so, as of the first several months of this technology being available, it seems like it doesn't actually save time. Interestingly, there was a parallel publication that came out of Stanford that showed that it did reduce cognitive burden. So I think the way to think about this technology is not that it saves time and therefore makes clinicians happier. It's that it doesn't actually save much time. It might add a little bit of time, a couple of seconds per message, so it's not a huge amount. But what it does do is it might make clinicians happier. So rather than viewing it as an efficiency or productivity thing as a health system, we might want to view this technology as a clinician retention technology: we reduce burnout, we make it easier for them to practice medicine, we make them not want to leave. And that has a separate ROI to it, but it's a different ROI than the way you might think we would resource this if it actually were time-saving. Another place large language models have made a difference, and this is the last example I'll give before I close out, is in ambient documentation, or AI scribing. It used to be the case that with ambient documentation, you would record a visit with a patient. That recording would get sent off to an offshore person doing a manual or assisted process of turning that recording into a clinical note, not just a dictation. And about 30 to 40 minutes later, or maybe an hour or two later, that note would show up in your chart so you could look at it and sign it. That was the state of the art probably a year or two ago. When these generative AI models came out, what became clear is that if you can transcribe the visit in seconds, which you can, and then use generative AI to do the reformatting of that transcript into what is a clinical note, through a set of complex prompting strategies and other things that you might do, you can actually get a draft note generated within 30 to 40 seconds. And that totally changes things, because what that means is you see a patient as a clinician, you walk out of the room, or they walk out of the room, and 30 to 40 seconds later you have a note in front of you that you can look at and act on. And what that means is now you actually might be able to fix and finish that note before your next patient, which was previously never an option, certainly not after your last patient of the day. And so when they rolled this out at Kaiser for 10,000 staff members, what they found is that where it actually saves time is the time between 7 PM and 7 AM, when physicians are sitting at home catching up on all their documentation. And so again, it isn't necessarily a productivity tool. It's definitely a reduced-burnout tool, and that might make it a physician retention or clinician retention tool. And so I think that's an important way to look at this problem. There is an ROI associated with it, but it's a different ROI than what you might expect from a productivity standpoint. So obviously, a concern that came up is: are these AI-generated notes accurate? I think they picked 35 notes from several different specialties at Kaiser, and they looked at the notes' quality based on several metrics like accuracy, thoroughness, usefulness, completeness. And essentially, they were scored relatively high. It's not that they were perfect, but they were higher than I think you might have expected going into it. And so this looked like a very reasonable replacement for the previous process we had, with an actual clinician benefit to it.
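A rough sketch of that two-stage pattern: fast local speech-to-text, then a generative pass that restructures the transcript into a draft note. The open-source whisper package and the `draft_note_from_transcript` wrapper are assumptions; real ambient documentation products layer consent, compliance, and review workflows on top of this:

```python
# Sketch: ambient documentation as transcription plus prompted reformatting.
import whisper

asr = whisper.load_model("base")                          # small local speech model
transcript = asr.transcribe("clinic_visit.wav")["text"]   # seconds, not 30-40 minutes

prompt = (
    "You are a clinical scribe. Convert the following visit transcript into a "
    "draft SOAP note for clinician review. Do not invent findings.\n\n" + transcript
)
# draft = draft_note_from_transcript(prompt)   # hypothetical LLM call
# print(draft)
```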
So I hope what I've shown you is that things have come a long way since ChatGPT. Yes, GPT-4 still looms large. GPT-4o is currently the best large language model out there. But where the innovation is isn't necessarily only at the top, where Google's caught up, Anthropic's caught up, a lot of the other folks have caught up. It's actually in the middle, where we've got these much smaller language models running locally that can do all kinds of tasks, and that we can now use for much cheaper than it might have cost to use something like GPT-4 at an enterprise level. So I'll close with that. Thank you. I'm looking forward to the conversation. Oh, yeah. I'll turn it around a little bit here. That was terrific. I think you guys can see why you're one of the few people I know of who can actually do the work, but also talk the talk and do the science. So a couple of questions to start off, maybe on the structural side a little bit. How are you thinking about when to use generative AI versus predictive AI? And what are the downsides associated with each strategy? And how are you thinking about what problems you're trying to use each one for? Yeah, so I think the main reason to consider using predictive AI, which is still there, is when you want to act based on risk and you want to reallocate resources in a more fair way. If you think about it, a lot of the models that we have, even clinical models, like models that predict whether someone's going to have sepsis: what does the intervention arm of that model look like? What it looks like is essentially someone gets alerted. Someone reallocates their time to spend more time with that patient than with another patient they might have been spending time with, with the idea being that we're being more equitable in how we distribute our time. We're focusing on the people with high risk, which might mean, as a result, we're focusing less on folks with low risk. So any time you reallocate resources, there are implications for fairness and equity. Who are you not seeing? Who are you potentially giving more attention to? And is that the right population to give more attention to? But I think of predictive AI broadly as a mechanism to reallocate resources, where a lot of the resources in a lot of the health care models are actually just someone's time. But for other things, like staffing models or forecasting models, that reallocation of resources might be encouraging patients to go to this other ER, if we have a multi-hospital system, to try to offload this ER. So I think there is still an important role for predictive AI. I highlighted some of the use cases where I think generative AI is useful. I think the most underused one, honestly, is information extraction. There's still tons and tons of manual extraction that happens for things that we could probably reasonably extract with even some of the intermediately scoring models. And so I think that's where we're focusing a lot of our efforts: not real-time stuff, but after-the-fact stuff where we have to go back and verify, hey, did something happen or did something not happen? We're also interested in: someone's going to be discharged from the hospital; what are some of the barriers to discharge identified in a consultant note that we might not be tracking in our regular dashboards, but that we can easily pull out, like issues with transportation identified by case management, or someone needing antibiotics that wasn't accounted for in the regular planning, or placement issues? So I think the answer is: reallocating care is what I think of as predictive AI, and you'd better be sure that you're looking at the issues of unintended consequences. And then extracting information, summarizing, turning unstructured information into structured information, that's, I think, where generative AI shines, where some of the older natural language processing stuff really got you part of the way there but never all the way there, with the one caveat that you can use generative AI to do prediction. And I think there's very little of that actually happening, but I think that's something we're going to see as an area of growth. Now, let's get a little bit more practical about how you actually do your job as a health AI officer. How are you thinking about whether I should build this model, and if I decide that this is a model that I want to build, how I should build it? Should I build it myself, or should I go outside and buy it? Yeah, so I think that's a great question.
So the build versus buy thing, I think everyone in the room has probably dealt with that and been on one side of it or the other. And when I think of the build versus buy question, what I'm often thinking about first is, health system-wise, how much of a priority is it? Because the reality is that you have to prioritize. And so one of the things that's been helpful to me is to be involved with our chief clinical officer, to be involved with our chief quality officer, to try to understand at a system level what the burning issues are. Are there any burning issues in subspecialty clinics that won't be on our radar? And so you go to the appropriate business owners. But I think that's one of the key things: how is this coming to our attention, and is it actually meeting a major need that we have? If it's not meeting a major need, if it's just something someone wants to do, we may still do it, but we're going to prioritize it accordingly. The build versus buy decision to me is actually figuring out, how difficult is this thing to do? And if we were to build it, how reproducible is that build going to be? Are there going to be eight other problems that rhyme with that problem, and so we should do it? Or am I really trying to tackle something that's going to be a one-off, where no real value comes out of it beyond just that one use case? It's really hard nowadays, because in the space of large language models, a lot of companies that come to you are actually selling you a really nice prompt on top of a language model and not anything on top of that. And so a lot of what I'm doing is trying to understand, what is the innovation that you're offering? Usually, it's some integration that you've thought of that we can't do. It's some workflow and change management thing you've thought of that we can't easily do. So that's where I'm not going to try to build it in-house. But where we have the workflows figured out, where we know what information we need, and we have that information sitting in a data lake, it almost makes no sense to try to work with a vendor or partner to do it, when in that case we can do it ourselves, especially when the output is in a form that I know others will be able to reuse for other things. So sepsis alerting, for example, I wouldn't have tackled in-house. I know we have an in-house model developed by one of our research faculty that has now spun off into a company that we're using at UC San Diego. But capacity prediction, all that stuff, our mission control, command center operations, all those models are happening in-house. Now, if I decide to build my own internal model, what does this look like from a regulatory perspective? Should I worry that someone's going to come after me, compliance? Or how does that actually work? Because it's within an EMR where you're deploying it. Or is it a medical device? Am I thinking about this the wrong way? Yeah, so I think that when you're building in-house, regulatory issues are there. But the big issue is not that you're building it in the EMR. The big issue is you're building it without a plan to market it. So the questions that usually come up when this happens are: is this in the scope of the FDA? Is this in the scope of existing regulation? If it's a clinical trial, it might be. If it's not a clinical trial, it very well may not be, if you're internally building it without a plan to commercialize. The other question that comes up is, is it a medical device?
And with the recent AI/CDS guidance coming out of the FDA, a lot of things that predict risk are now considered medical devices, because they don't meet one of the four criteria that you have to meet for something not to be considered a medical device. So the reality, I think, is that a lot of those things are medical devices, but they might not be in the scope of the FDA. And I think what's important for health systems is to work with your compliance team, work with your office of general counsel, and figure out what your local interpretation of that is. When I've had conversations with some of the leadership in digital health at the FDA, even though there are really great tools to use, a lot of those tools are aimed at companies, not at health systems doing internal innovation work. Now, you showed a great slide, actually, that was comparing all these models against each other. And the rate of change is incredible. Can you give me an idea of how you're actually comparing these models? What are the metrics that you're looking at? And then, if you have to look at your crystal ball, what does that look like in five years? Yeah, so the chart that I showed was from a website called Chatbot Arena. And Chatbot Arena is an effort that spun out of UC Berkeley and UC San Diego, primarily at Berkeley, and a group called LMSYS, which is not a company but an informal organization. And one thing they realized was that, as the number of large language models is growing, how do you actually rank them? How do you pairwise compare them? It seems like an impossible task. So the insight they had was to borrow from chess rankings, where two of the best players might not play each other, but you have enough pairwise comparisons of people playing each other that you can arrive at, essentially, that's the best player, that's the second best, that's the third best player. And the way they do it is they have a website where anyone can go on there. It's lmarena.ai, which is kind of their new shortened version of the website address, which was originally unwieldy. You go there. You go to the arena battle. You ask a question or type in anything that you want, a medical question, whatever you think is interesting to you. You see two responses, and it's blinded. It doesn't tell you which language models those two responses are coming from. And your job is to pick out which one you like better. So you pick out which one you like better, and that goes into their scoring system. And every day, they're updating their rankings based on these scores getting put in. And I think it's something like four to five days between when a model is first released and when they have enough ratings for it to have a ranking associated with it. So it's really quick. A model comes out, and within a week, we have a sense of, oh my goodness, this is the next state-of-the-art model, or, oh, this was a bunch of fanfare, and it's not that good. But again, what are you rating it on? Are you rating it on accuracy? Are you rating it on hallucinations? I think if you really ask people what they're rating it on, they're rating it on vibe. It's essentially, does it look like what I want? Maybe that's accuracy for some tasks. Maybe for other tasks, it's: you asked for a list, did it actually give you a list? And so it's really kind of a holistic ranking. It's not really granular.
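A toy illustration of the chess-style rating idea behind those rankings; this is not Chatbot Arena's actual scoring code (which uses a related Bradley-Terry style fit), and the model names and votes here are made up:

```python
# Sketch: Elo-style ratings from blinded pairwise votes. No model plays every
# other model, but enough comparisons let the ratings converge to a ranking.
from collections import defaultdict

ratings = defaultdict(lambda: 1000.0)  # every model starts at the same rating
K = 32                                  # update step size

def record_vote(winner: str, loser: str) -> None:
    expected_win = 1 / (1 + 10 ** ((ratings[loser] - ratings[winner]) / 400))
    ratings[winner] += K * (1 - expected_win)
    ratings[loser] -= K * (1 - expected_win)

# Each tuple is (model the rater preferred, model shown alongside it)
for winner, loser in [("model-a", "model-b"), ("model-a", "model-c"), ("model-c", "model-b")]:
    record_vote(winner, loser)

print(sorted(ratings.items(), key=lambda kv: -kv[1]))
```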
So I think in health care, some of the work we're doing with Coalition for Health AI is around coming up with more specific things that from a health care perspective, we should be applying to try to judge these tools on specific things like accuracy, like fairness. What does it mean to evaluate one of these tools for is it fair, is it biased, is it equitable? And I think those answers aren't figured out, but we're hoping with kind of national experts at the table from industry and from academia and from public health systems that we can work together to kind of gradually figure out and define some best practices. Now, can you speak a little bit about this structure that you have in place, like granularly? What does AI governance look like at UCSD and how are you actually deciding what project you should take on a little bit? Yeah, so I think we have three arms to this in terms of how AI is structured at UCSD. One arm is an arm that we recently launched called AI ThinkShop. So this is our intake arm. So this is run by the project management office because if we decide to move forward with something, it becomes a project. And so they're at the table. It also has relevant information services expertise there around who would this create work for, how much work would this create? It also has some AI expertise there and other products expertise there to try to figure out why this product, if it's a product, or why this model built by a researcher, not something else. So what are the alternatives? So we have kind of a process that we follow to try to figure out very early on, triage it. Do we need more information? Do we have enough to move forward as a next step? Because the next step isn't gonna be simple. It's gonna be potentially procurement, compliance, legal, build. Lot of things that are gonna happen simultaneously, information security review. Lot of things that will happen kind of all at once. So we try to handle that in AI ThinkShop. Then we have AI Governance. If AI ThinkShop is the, how do we do this and are we capable of doing this? Is this feasible? AI Governance is, should we do this? What guardrails do we need to put around our doing this? We've committed to doing this, but we need to make sure it's safe to do this. So what should we make sure in terms of who has access to this tool? So as an example, we might have a model that we use to help us triage denials to claims. We'd wanna make sure that our schedulers don't have access to that model because they might say, oh, this patient has a high risk of denial to claim. We shouldn't even schedule them. So it's thinking about those kinds of unintended consequences so that first we make sure, is this something we should do? And if we should do it, who should have access to it and for what purpose? What's inbounds, what's out of bounds? And then the last arm I would say is implementation and that's run by our project management office in the way that we implement everything else. So there's nothing unique to AI in terms of when it comes to implementing it. Yes, we have to have education, we have to have training, but anything information technology requires education, training, and usually AI is not the solution. AI is baked into some larger solution, some larger intervention. And so I think those are kind of the three arms for coming in from a new project all the way to getting it triaged, evaluated by governance, and then actually implemented and deployed. 
We were talking with Jacqueline yesterday, and whenever I hear AI governance, I try to run the other way. So can you break it down for me a little bit? What is AI governance, and what does it look like exactly on the ground floor? Yeah, so what AI governance looks like really depends on the organization. At some organizations, the concepts I laid out, AI ThinkShop and AI Governance, are kind of one thing. And when I was back at Michigan, it actually was more of one thing, where we had our build experts as part of AI governance to help us figure out: can we do this? Is this a lot of work? Who would the work fall to? What are the alternatives? Do a market scan, and so on. So I think a lot of the function of AI governance is, one, to make sure leaders understand how we're using AI and don't have magical thinking about it. So a lot of leaders are there as part of AI governance. The other folks who are there are folks with expertise in AI ethics. We have someone who's research faculty whose entire portfolio is AI ethics. We have folks who understand health equity issues. The health equity issues don't fall to them; we've made that very clear. They're a shared responsibility. But our intake process really focuses on: what are you trying to do? Is this actually a good solution for it? What are the alternatives to using AI? What are the potential downstream negative consequences or harms, whether that's overall or to specific subgroups? And what is your plan to try to prevent that? That's something we actively try to go through. We know compliance does that to a degree. We know legal does that to a degree. But this is a more ethical, broader take on it than even compliance or legal can give us. So we do this in partnership with compliance. It doesn't replace compliance, but I think it makes sure that our healthcare leaders understand potential negative consequences, so we're not magically thinking our way into using AI for things where we need to think twice about: is this really the right thing? What harms could this have? And what does that look like? So I bring something to your group. There's a bunch of people and they vote. How does that work? Yeah, there's an intake form. We have a meeting where we would bring you in with a smaller planning committee, where we'd walk through the intake form and make sure you understand what you've put there. One of the things that's been illuminating for us: we have a question about what the potential harms are, and for the first several intake forms we got, the answer was none. And so it was a matter of walking through it with people. So what workflow are you planning to change? Who does that affect? And I think people thought, oh, that's a back office function. That has to do with billing. But I said, well, whose bill does that impact? Does that impact a patient's bill? It's impacting someone. And so getting people to understand there's a human being on the end of that model who's going to get impacted was something that this process really helped us do. So we do that as a planning committee beforehand. Once we've gotten to a stage where we think we've worked out some of the initial issues, we bring it into an AI committee meeting for discussion. So you present, you walk through the intake form. We go through and try to identify: do we feel comfortable with this? Do we have all the information we need from our vendors?
So a key question we have is, for any predictive AI model, we need a list of all predictors and outcomes. That's a basic transparency thing that we won't work without. We've had vendors say, we can't give that to you. And I tell them, we don't need your algorithm. We don't need your secret sauce. But we do need to know what things you're using in your model. And if you can't give that to us, we can't move forward. And most vendors, understanding that our interest is transparency, that we want to know we're not doing something unethical, will usually tell us. And 90% of the time, the decisions they made are very reasonable decisions about what to include. But that serves as an enforcement mechanism for a principle that we espouse, which otherwise would have no enforcement mechanism within the health system. So I think it goes to that. And then yes, then it goes to a vote. And depending on the health system or depending on the environment, you can decide what that voting threshold is. For us, two-thirds must vote in favor of something for it to move forward. I named some folks who are part of AI governance; I didn't name others. Some of the other ones are patient experience. We have a patient experience officer who's part of that. We are aiming to enroll a patient as an advisory member, who will be identified by our patient experience officer. And then we're also going to have labor relations and probably some of our union reps as advisory members, again, to make sure that they understand what technologies are coming, and that the technologies are being thoughtfully done. And if there are issues in how a technology impacts, you know, nursing workflows in a negative way, or could have unintended negative consequences, they have an opportunity to raise it, so that those issues get raised before we're actually using it. So now, you know, say we build a model, we go through all the hoops and we deploy it. How are you thinking about monitoring the outcome of that model? And are you worried about drift of that model? And maybe you can elaborate on your point of view on drift a little bit. Yeah, so I think... Explain maybe what drift is. Yeah, so drift is the idea, people call it dataset shift or data drift, that you deploy a model and then over time its performance changes. Usually people view it as performance deteriorating to a point where that model is no longer useful. That's at least how people classically view data drift, or calibration drift, or the different terms that people give to it. One of the challenges is just, do you have the mechanisms in place to actually monitor for it? And it's really challenging to do all that in-house, because all our models don't live in the same place. Some of these models live in our EHR. Some of the models live in our imaging software. Some of the models live, you know, in other devices we have, or things that are linked to medical devices. And so we can't have the in-house capability to do all that monitoring. So my goal has really been to be very pragmatic and say, we don't have to monitor every single model, but someone does. And that might be the vendor that we turn to and ask for periodic updates. It's hard to scale that to 80-plus models. So I think we're still figuring out, how do we prioritize the kind of actions taken in response to monitoring and identifying red flags?
The thing I will say, though, is that my personal view on this is that when you deploy a model and actually implement it into a clinical workflow, if the thing that you're monitoring is the model's performance, and the workflow is actually doing something, the model's performance should change. So I'll give you two examples of that. Let's say you have a model and a workflow where your model identifies high-risk people, and your workflow lowers their risk. It prevents some bad outcome. If you deploy that model clinically and your intervention is actually effective, what you should see is that your model now predicts people at high risk who don't experience the outcome, which makes the model look bad. And so your area under the curve might drop. Your other metrics might get worse. That's this idea of confounding medical interventions. So just because a model is looking worse isn't necessarily a bad thing. It might actually be a sign that your model is working when it's linked to a workflow. So that's why I think it's important to do a silent period of observation if you can, and then, if you have an intervention, to very carefully monitor performance during that intervention, but don't overreact to it. The flip side of that is, what if you have a model that's telling you to do something? So it's saying, hey, if you get a score that's above this, start this medication. Now, let's say I deploy that model and I start to measure, what is the accuracy of this model? What I might find is that the model is perfectly accurate. Everyone who the model says should get a medication started gets that medication started. That's not necessarily a sign that good things are happening, but what it does mean is people are adhering to the model's recommendation, as a process measure. And so in that case, the area under the curve might go from 0.6 to 1, because everyone's following exactly what the model is telling you to do. So the way I view it is that it's not as simple as deploying a model, watching some performance metric, and reacting directly to that metric by saying, oh, we need to retrain the model. If your model's saving lives, or it's telling you to take an action that you're taking because you're responding in accordance with your training, you might be getting changes in model performance that are purely an artifact of the fact that you implemented it. You're not saying don't monitor it, right? Just monitor it, don't overreact. Understand what you are trying to do with the model. And one of the key aspects of monitoring is actually monitoring uptime. You'd be surprised: a model is built on a Jupyter notebook, linked to this workflow, linked to this cron job, linked to a bunch of downstream things. One little thing breaks, and that model doesn't produce a score, and no one discovers it till a week goes by. So I think monitoring uptime is actually a really basic thing. Then monitoring alerting frequency and alerting patterns. And then maybe as a last thing, monitoring performance, because performance is potentially the least actionable of the group, because of the complexity.
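A minimal sketch of that monitoring order, uptime first, then alerting volume, and performance last and interpreted cautiously; the column names and the 24-hour staleness threshold are assumptions for illustration:

```python
# Sketch: basic monitoring checks for a deployed model's score feed.
import pandas as pd
from sklearn.metrics import roc_auc_score

scores = pd.DataFrame({
    "timestamp": pd.to_datetime(["2024-07-01 08:00", "2024-07-01 09:00", "2024-07-02 08:00"]),
    "score": [0.82, 0.15, 0.64],
    "alert_fired": [True, False, True],
    "outcome": [1, 0, 0],
})

# 1. Uptime: is the pipeline still producing scores?
now = pd.Timestamp("2024-07-03 08:00")
hours_since_last = (now - scores["timestamp"].max()) / pd.Timedelta(hours=1)
print("stale pipeline" if hours_since_last > 24 else "scores are fresh")

# 2. Alerting frequency: daily alert counts, to catch sudden spikes or silence
print(scores.set_index("timestamp")["alert_fired"].resample("D").sum())

# 3. Performance, last and with caution: an effective intervention can make the
#    area under the curve look worse even when the model is doing its job.
print("AUROC:", roc_auc_score(scores["outcome"], scores["score"]))
```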
Yeah, it's a lot to think about when you're looking at model performance there. You did show an example of a model you were running on your laptop. And when we think about LLMs and the lots and lots of parameters involved, can you speak a little bit about the difference between a large language model and a small language model, and when you would think about using either one of those? Yeah, so the question is, what is the difference between a large and a small language model? I don't think there's a legitimate cutoff; it's somewhat arbitrary. But I would say that anyone who has a model that's 3 billion parameters or less is actively marketing it as a small language model. And that's by virtue of the fact that it can usually fit in most computers' memory, and so it's something that most people could access. There are a lot of models in the 7 to 8 billion parameter range that are essentially called large language models, but they're small enough to run on a subset of laptops. As for the key innovation that I talked about, there are really two things. One is figuring out better ways to train models so that a smaller number of parameters can actually learn the information. Some of that's come through better curation of data. Some of that's come through using additional synthetic data on top of the data from the internet to get better models, surprisingly, which I think has never really worked elsewhere, but for large language models it's surprisingly quite effective. And then the last thing is that it used to be that these parameters were stored as 32-bit numbers. And people figured out that you could actually reduce the precision of these numbers, these parameters. So instead of 32-bit, you could make them 16-bit, so there are fewer significant digits. You can't just cut them off, but if you do it in the right way, you can get a model that's almost as good as the 32-bit model, which is the full model. And you can do that for 8-bit and 4-bit. And so most of the tools that you can download today, there's a program called LM Studio, there's a program called Ollama, will by default give you the 4-bit versions of these models. And every time you go from, say, 32-bit to 16-bit, it takes up half the memory. So suddenly you can get to 4-bit models where a billion parameters takes up half a gigabyte of RAM. And now you can run a 70-billion parameter model on a machine that has something like 35 gigabytes of memory, and the 8-billion parameter one you can run in 4 gigabytes of memory. And so I think that has now changed the game for these peripheral devices, where iPhones used to come with one gigabyte of memory and now it's growing to four and eight, so we'll be able to run a lot of these things there as a first pass. It's a lot like speech recognition: the first pass will probably be local, and there might be a second pass that sends it to the cloud and gets you a more complete answer if you're willing to wait.
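The back-of-the-envelope memory math behind that point, as a sketch; it counts only the weights and ignores activations and runtime overhead, which add more in practice:

```python
# Sketch: halving the bits per parameter roughly halves the memory needed just
# to hold the weights.
def weight_memory_gb(n_params_billions: float, bits_per_param: int) -> float:
    bytes_total = n_params_billions * 1e9 * bits_per_param / 8
    return bytes_total / 1e9

for params in (8, 70):
    for bits in (32, 16, 8, 4):
        print(f"{params}B params at {bits}-bit ~ {weight_memory_gb(params, bits):.1f} GB")
# e.g. 70B at 4-bit ~ 35 GB and 8B at 4-bit ~ 4 GB, consistent with the laptop numbers above.
```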
Yeah, you live in California now, so I'm sure you get this question all the time. These models use a ton of compute, lots of energy. What are the environmental impacts of these? And how are you thinking about it? Are you worried about that at all as a health care system? Yeah, so in July of 2022, Microsoft came out with a report that they were using 34% more water than they had the year before. And it was largely because all of OpenAI's models run on Microsoft infrastructure, and July of 2022 was when they were in, I think, the final stages of training GPT-4. So that was an intense period of time. So the things that I would differentiate are: there is a certain amount of compute you need to train these models from scratch. There's a certain amount of compute to take an existing model and fine-tune it, which is actually a lot less. And then there's a certain amount of compute you need to actually just run the models, which is, again, a lot less. So I think from an environmental perspective, a lot of these big tech companies that had made pledges to be environmentally friendly or carbon neutral or water neutral are right now on a roadmap to neutrality. None of those companies, I think, are neutral right now. Some of the innovations happening to try to make them neutral are other ways to cool things. So new chips that don't require water-based cooling are a big area of innovation for a lot of the companies, because that's where a lot of the water usage is coming from. So I think my general solution to this is: don't use a hammer when you can get away with something much tinier that's equally effective. We're often turning to these huge language models to do information extraction. And I always ask people, why would you turn to something that's expensive, uses a lot of compute, and isn't good for the environment, when for a simpler task, like information extraction, you can get away with something that's a lot more basic, and in some cases nearly as good for the really, really simple tasks? So right-sizing the models for the complexity of the use case is my current recommendation for how we should approach this. But definitely, I think the longer-term solution is going to be either pricing people out of it, because it's so expensive to use some of the larger models. OpenAI is rumored to now be charging $2,000 a month for some super mega enterprise-grade GPT-5, or Strawberry, as they call it. So that's one way you handle it: you just price people out of it. Or, I think the other way is we're going to see innovations in chip design that will, over time, potentially reduce some of the water utilization from cooling needs. Now, we always have to think about privacy and security when we're dealing with health information. So how are you thinking about using generative AI, especially when it comes to PHI? And how are you solving for that problem? Yeah, so today there are a lot of enterprise solutions that let you use generative AI on protected health information. You have to read between the lines, because a lot of them, even with enterprise agreements, have different levels of how they might use your data and how long they might retain your data. So a lot of our compliance folks that are engaged in system-level, University of California Office of the President-level negotiations are thinking about issues like: if we enter information into this tool, are you going to de-identify it? What's the actual legal process by which you will certify that it's de-identified? Are you going to reuse it for training? How might it violate the privacy of our patients? So I think the reality is that we're still very careful. And my general advice has been, we try to avoid it wherever possible. But the reality is that we have our data sitting in a specific vendor's data lake. We have access to large language models, some proprietary, some open, all behind the firewall of that data lake. And so actually, the most secure thing we can do is to use the language model from within that environment, so the data doesn't leave the system. Some of that's technology-related risk. Some of that is compliance-related risk.
And that's why what I tell people is: if you have data locally, and you have it on an encrypted computer that's managed by your workplace, where you're allowed to install software, in that case, don't start with GPT-4. Even if the bigger tools are an option, first start with the smaller models, see if they get you what you need, and only turn to the larger ones when you actually need them. And I think we have largely bypassed that product category. So ambient documentation, that's never going to be something we're going to build in-house. That is way too complicated. There are way too many moving parts. There's a recording component. There's a compliance component. There's a privacy component. There's a consent component. There's a component around inadvertently recording something that leads to a disciplinary action, which the unions are concerned about. And so I think we have to think about all the downstream consequences to a variety of stakeholders. So yes, some of those things we handle in-house. But for some of those things we have to turn to enterprise solutions, because it's not just the language model. It's the complexity of the entire workflow around it. Yeah, it's a very complicated environment there. So let me press you a little bit more on your strategy for the next five years. Every day, there's something new, and we're struggling to keep up, essentially. So how are you thinking about building capabilities that will allow you to stay ahead of the curve for the next five years? Do you have a point of view on how this will evolve over the next five years? Yeah, so one of the roles I have is a leadership role within our Jacobs Center for Health Innovation. And that is essentially a center for health innovation situated in the health system, not in the medical school, which means that our success is basically determined by our CEO. So if we have key health system priorities and the Jacobs Center for Health Innovation isn't making those priorities get met, then we're not doing our job. And so we are innovating in a lot of different things, some ourselves, some with partners, but it's largely oriented around health system needs. So the things that are complex we have to buy, or we have to look at what's in the marketplace. The things I'm trying to do in-house with a scrappy data science team focus on capabilities that are simple and reproducible. There are three capabilities that I'm really focusing on right now. One is system-level, per-day predictions: any kind of forecasting-related things that we're going to use to inform our leadership, inform our patient flow practices, and inform potentially bigger, longer-term decisions around, oh, we don't actually have enough capacity here, we need to build a clinic here to get better capacity, which will get uncovered by forecasting plus what happens in reality. So the per-system, per-day predictions are something I'm trying to do in-house. And we have at least one or two of those models already live, and we're working on several other ones. The other one I'm trying to do in-house is per-patient, per-day predictions. So a lot of companies are focused on helping you predict length of stay, on helping you predict admissions, on helping you predict discharges, on helping you predict fall risk, kind of not at a patient level but at a system level, or risk of going to the ICU at a system level. And what I'm trying to do is, those are things where, if we can do one of those things, we can do all of those things.
The level of complexity to do the second thing is dramatically lower than the level of complexity to do the first. So we've picked one of those things to focus on internally, but that's an internal build capability.

The third one is that we have a lot of our clinical documentation in a data lake. So we're trying to pick one use case, and we have one in mind, where a large language model converts clinical notes en masse into dashboards that sit alongside the dashboards we already use operationally. And just to give you a sense, our dashboards don't just go on a website; we have a set of dashboards that get emailed to all employees every single day. So they have incredibly high visibility for the things that actually need it: how many ORs are open today, how many cases are scheduled today, how many planned discharges do we have today, what is our system going to look like tomorrow given what's on the agenda for today. We want to be able to populate that kind of dashboard with information sitting in patients' notes. Like: we have 10 patients today who need rides and don't have rides arranged. That's something a case manager might know at a floor level but not at a house level, across the hospital or across three hospitals. So we're trying to centralize some of that capability, basically an LLM plus a zero-shot prompt feeding a dashboard. [A minimal sketch of this pattern appears after the transcript.]

Yeah, I mean, it's incredible. I think you're really giving me and our audience a behind-closed-doors view of what a chief AI officer does. We have about a minute left, so do you have any parting shots, any thoughts? People are super excited about what you've told them, and they're going to go back to their institutions and ask for AI governance. How should they start that conversation at institutions that don't have this capability?

So my advice is that no two institutions are the same, and what you really have to do is look at how governance of everything else works and not think of AI as something that should work entirely differently. When I was at Michigan, we modeled our AI governance after our clinical decision support committee. At UC San Diego, we already had AI governance in place; there were some functions we were doing well and some we weren't. So I borrowed some things we had done at Michigan, but I also borrowed from a quality improvement committee at UC San Diego, saying, ah, they have a set of practices that are accepted as norms, and we're going to borrow from that so that when we roll this out, it's socially acceptable because people have a way to frame it.

As for whether you need it, I tell people: what's the worst that could happen? If the worst that could happen is that your use of AI lands you on the front page of the Boston Globe or the Ann Arbor News, then you probably should have AI governance, because someone needs to be accountable for your use of AI. It doesn't have to be a chief health officer, but it does need to be some body, some governing group. That's who's going to say: we reviewed this, we signed off on it, we considered the consequences, we had mitigation strategies.
This wasn't just something we rolled out, which is what people tend to assume when they see AI being rolled out in a health system. So that's, I think, the why and the how. But I wouldn't take anything I've said as gospel, because you really have to look at your own situation and your own health system setting. A lot of it comes down to politics: what's acceptable in one setting may not be acceptable in another. And you want to figure out the reporting structure, too. One of the changes we made was to adjust our reporting structure so that we report up to an executive body that includes the CEO, so the work has a bit higher visibility, because the potential for misuse is a bit higher.

Well, that was a tour de force on AI. Thank you so much. It was wonderful.

Thanks. Thank you. Yeah. Thank you again.
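To make the "LLM plus zero-shot prompt feeding a dashboard" idea from the transcript concrete, here is a minimal Python sketch under stated assumptions: `call_llm`, the prompt wording, and the sample note are hypothetical placeholders for whatever model client and case-management notes a given health system actually has behind its firewall; this is not a description of UC San Diego's pipeline.

```python
import json

# A minimal sketch of the "LLM + zero-shot prompt -> dashboard" pattern:
# ask a model running inside the secure data environment a narrowly scoped
# question about each note, force a structured answer, and aggregate the
# answers into a house-wide count for a daily dashboard.
# `call_llm` is a placeholder, not any specific vendor API.

PROMPT_TEMPLATE = """You are reviewing a hospital case-management note.
Answer with JSON only, in the form {{"needs_ride": true/false, "ride_arranged": true/false}}.

Note:
{note}
"""

def call_llm(prompt: str) -> str:
    # Placeholder: return a canned answer so the sketch runs end to end.
    # In practice this would call the model hosted behind your firewall.
    return '{"needs_ride": true, "ride_arranged": false}'

def flag_unarranged_ride(note_text: str) -> bool:
    """True if the note indicates the patient needs a ride that is not yet arranged."""
    raw = call_llm(PROMPT_TEMPLATE.format(note=note_text))
    try:
        answer = json.loads(raw)
    except json.JSONDecodeError:
        return False  # in real use, unparseable output would be logged and reviewed
    return bool(answer.get("needs_ride")) and not answer.get("ride_arranged")

notes_today = [
    "Pt ready for discharge pending transportation; no ride identified yet.",
]
dashboard_count = sum(flag_unarranged_ride(n) for n in notes_today)
print(f"Patients needing a ride with none arranged: {dashboard_count}")
```

The design choice worth noting is forcing the model to answer in a small, fixed JSON schema, so the dashboard layer can aggregate counts without parsing free text.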
Video Summary
The session introduces Karandeep Singh, the Chief AI Health Officer at UC San Diego, who is recognized for crafting AI strategy and governance in the health system. His discussion focuses on generative AI advancements beyond ChatGPT, particularly in healthcare applications. He explains how ChatGPT evolved from merely predicting the next word to following instructions through reinforcement learning, making it more useful for tasks like information extraction and automated medical reasoning.

Singh provides examples of how AI can help in clinical settings, such as chart reviews and error detection in documentation. He also highlights practical applications like AI-assisted replies to patient messages and ambient documentation. While these technologies can reduce cognitive load and streamline workflows, they don't necessarily save time, but they can improve clinician satisfaction and retention.

The lecture underscores the importance of distinguishing between when to use generative AI versus predictive AI. It also emphasizes the need for proper AI governance, including ethical considerations, transparency, and monitoring for data drift. Finally, the session touches on environmental concerns related to AI compute resources, advocating for practical ways to mitigate these impacts.
Keywords
Karandeep Singh
AI Health Officer
UC San Diego
generative AI
healthcare applications
ChatGPT
AI governance
clinical settings
environmental concerns