AI Model Integrity
Ensuring useful AI-based simulation results requires robust data curation and training.
July 19, 2024
Artificial intelligence is rapidly expanding into engineering and simulation applications, with software tools from vendors like Ansys and Altair incorporating the technology to provide rapid simulation and automation capabilities. The use of AI and machine learning in simulation presents opportunities, but many end users are concerned about the risks of the technology.
In more mainstream use cases of AI, such as ChatGPT-type systems that leverage large language models (LLMs), the technology can hallucinate, generating results that are false or, in some cases, absurd. Another potential pitfall is model collapse, which can occur when an AI model ingests a significant amount of AI-generated data; this, too, degrades the model's performance.
For AI-based simulation use cases, model reliability is paramount. Unlike consumer-style AI applications, engineering organizations will leverage internal (usually proprietary) physical test and simulation data to train the models, often informed by physics, which provides an advantage when it comes to model integrity. To find out more about data and model integrity in this context, we spoke to Erick Galinkin, AI Security Researcher at NVIDIA.
What are the primary differences between the data considerations in a ChatGPT-type environment, versus a more enterprise-specific use case like simulation?
Erick Galinkin: I think the big difference is that with ChatGPT or other models you might see in academia or at big labs, they are typically dealing with pretty clean, well-understood data formats. I come from a cybersecurity background, so I have a lot of experience working with data that is not well-formatted or heterogeneously formatted. In industrial applications, we are often dealing with outputs of sensors, and when you are dealing with that, getting it into a format that is reliable for training a model can be challenging.
How do firms ensure data integrity for AI-based simulation and other business-focused applications?
You want the data to be as close to its original form as possible, which creates a fundamental tension with the fact that models want the data in a certain shape. LLMs want flat text. Convolutional neural networks want a rectangle. It can be hard to get data into those shapes without destroying some of the information that is encoded in the data's structure.
Depending on the specific application, you have to try and find the right balance. Keep it as close as possible to the original data while putting it in a form that can be used by the model architecture you have decided is right for the problem. Sometimes that means picking a different model architecture than you started with.
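To make that tension concrete, here is a minimal sketch of the kind of reshaping involved, assuming a hypothetical sensor log with irregular timestamps; the column names, grid spacing, and window length are illustrative choices, not a prescribed pipeline.

```python
import numpy as np
import pandas as pd

# Hypothetical raw sensor log: irregular timestamps, occasional gaps.
raw = pd.DataFrame({
    "timestamp": pd.to_datetime(
        ["2024-07-01 00:00:00.1", "2024-07-01 00:00:00.9",
         "2024-07-01 00:00:02.2", "2024-07-01 00:00:03.0"]),
    "pressure": [101.2, 101.5, np.nan, 102.1],
    "temperature": [20.1, 20.3, 20.2, 20.6],
})

# Resample onto a uniform 1-second grid and interpolate short gaps,
# accepting that some of the original timing structure is lost.
uniform = (raw.set_index("timestamp")
              .resample("1s").mean()
              .interpolate(limit=2))

# Slice into fixed-length windows: a rectangular (n_windows, window, channels)
# array that a convolutional model can consume.
window = 2
values = uniform.to_numpy()
n_windows = len(values) // window
tensor = values[: n_windows * window].reshape(n_windows, window, -1)
print(tensor.shape)  # (2, 2, 2) for this toy log
```

Every step here (averaging within a grid cell, interpolating gaps, cutting windows) throws away some of the original structure, which is exactly the trade-off described above.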
What are the risks of AI model collapse or hallucination, and how do you mitigate against that?
I will tackle those two separately because they have different causes and outcomes.
Hallucination is endemic to LLMs, because they are just next-token generators. They are trying to predict the next token in a sequence. Once a token is generated, it is not going to look back for consistency. You can end up with cases where the model decides the next best token is just not correct, and it will keep generating tokens based on that incorrect assumption.
To prevent hallucination, you can use things like guardrails to do consistency checks, or you can check if the output is contained in the source documents. You can take steps to mitigate it, but the existence of hallucination is baked into this type of model.
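One simple version of an output-grounding check is to test whether each sentence of the generated answer overlaps substantially with the source documents. The sketch below uses a crude word-overlap score; the threshold and the sentence splitting are arbitrary assumptions, and production guardrail frameworks use far stronger checks.

```python
def is_grounded(answer: str, sources: list[str], threshold: float = 0.6) -> bool:
    """Flag answers whose sentences share little vocabulary with the sources.

    Deliberately crude: each sentence in the answer must share at least
    `threshold` of its words with some source document.
    """
    source_words = [set(s.lower().split()) for s in sources]
    for sentence in answer.split(". "):
        words = set(sentence.lower().split())
        if not words:
            continue
        best = max(len(words & sw) / len(words) for sw in source_words)
        if best < threshold:
            return False  # sentence looks unsupported: possible hallucination
    return True

docs = ["The bracket failed at 412 MPa in the tensile test."]
print(is_grounded("The bracket failed at 412 MPa.", docs))        # True
print(is_grounded("The bracket survived 900 MPa easily.", docs))  # False
```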
Model collapse (or catastrophic forgetting) is something that the AI research community has struggled with for a really long time. If you change the distribution the model is training on, it can forget the stuff it used to be good at. You can think about it in terms of bell curves. If you are rolling two fair dice, the most likely outcome is seven. If you make those dice weighted and update those weights after rolling them repeatedly, then eventually you will get enough sevens that you only ever roll a seven.
You pull from the bell curve and update the curve with what you just sampled, and as you do that more and more, the tails disappear and the distribution collapses to the mean.
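The dice analogy can be run as a small numerical experiment: re-estimate the distribution each generation from rolls drawn from the previous generation's estimate. The sample size and generation count below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two fair dice: outcomes 2..12 with the familiar triangular distribution.
outcomes = np.arange(2, 13)
probs = np.array([1, 2, 3, 4, 5, 6, 5, 4, 3, 2, 1]) / 36

# Each generation: roll from the current distribution, then re-estimate the
# distribution from those rolls alone (training on your own output).
for generation in range(300):
    rolls = rng.choice(outcomes, size=50, p=probs)
    counts = np.array([(rolls == o).sum() for o in outcomes])
    probs = counts / counts.sum()

print(dict(zip(outcomes.tolist(), probs.round(3))))
# Rare outcomes (2 and 12, then 3 and 11, ...) are the first to vanish from
# the sample; once an outcome's estimated probability hits zero it can never
# come back, so over many generations the distribution typically collapses
# onto a single value near the mean.
```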
Ensuring enough diversity in the data is really crucial, because ultimately what you want is heterogeneous variance that still kind of reflects what the real world should look like.
In simulation, you have an advantage. If you are simulating physical systems, there is a lot of existing research in physics we can use. So, within some error boundaries, you can make sure the synthetic and simulated data do not look crazy to the model or in the physical world. If you have a differential equation or a heuristic, that can be a good bounding function.
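As a rough sketch of using physics as a bounding function, the snippet below keeps only synthetic samples whose residual against a known governing equation stays within a tolerance. The free-fall equation, sample values, and tolerance are purely illustrative.

```python
import numpy as np

def physics_residual(t, height, h0=100.0, g=9.81):
    """Residual against a known governing equation, here free fall:
    h(t) = h0 - 0.5 * g * t^2. Any differential-equation or heuristic
    check could play the same role."""
    return np.abs(height - (h0 - 0.5 * g * t**2))

# Hypothetical AI-generated (synthetic) trajectory samples: (time, height).
synthetic = np.array([
    [1.0, 95.1],   # close to the physical trajectory
    [2.0, 80.4],   # close
    [3.0, 120.0],  # above the starting height: physically implausible
])

tolerance = 2.0  # metres of allowed model/measurement error
keep = physics_residual(synthetic[:, 0], synthetic[:, 1]) < tolerance
print(synthetic[keep])  # the implausible sample is filtered out
```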
We have seen use cases for synthetic data in engineering. How do you ensure integrity of AI-generated synthetic data in these scenarios?
Like any tool, you need to use it when it is appropriate, but not overuse it. AI can be useful for generating data that is difficult to collect. Real-world data is better, because you cannot simulate it perfectly, but it is still worth augmenting the data if you are lacking certain cases.
You should keep an eye on how the overall model performs on the real-world portion of the data. That is something that I have done: I use my real-world data, not as a hold-out set, but as a sanity check to compare against the performance on synthetic data.
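A minimal sketch of that workflow, assuming a scikit-learn-style model and hypothetical synthetic and real measurement arrays, might look like this; the real-world set never touches training or model selection.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Hypothetical data: plentiful synthetic samples, a small set of real measurements.
rng = np.random.default_rng(0)
weights = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
X_synth = rng.normal(size=(2000, 5))
y_synth = X_synth @ weights + rng.normal(0, 0.1, 2000)
X_real = rng.normal(size=(100, 5))
y_real = X_real @ weights + rng.normal(0, 0.3, 100)

# Train and validate on synthetic data as usual.
X_tr, X_val, y_tr, y_val = train_test_split(X_synth, y_synth, random_state=0)
model = RandomForestRegressor(random_state=0).fit(X_tr, y_tr)

# The real-world set is not a hold-out used for tuning; it is a sanity check
# that performance on synthetic data actually carries over.
print("synthetic validation MAE:", mean_absolute_error(y_val, model.predict(X_val)))
print("real-world sanity-check MAE:", mean_absolute_error(y_real, model.predict(X_real)))
```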
Does NVIDIA offer tools or services that help with model integrity?
I think that NVIDIA NeMo is an excellent platform for building custom generative AI models with integrity. It offers NeMo Guardrails, which allows developers to add programmable guardrails, ensuring trustworthiness, safety, security, and controlled dialog while protecting against common LLM vulnerabilities.
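For orientation, wiring NeMo Guardrails into an application generally looks something like the sketch below; the configuration directory and the rail definitions inside it are assumed and not shown.

```python
from nemoguardrails import LLMRails, RailsConfig

# Load a guardrails configuration (model settings, prompts, and rails)
# from a local directory; the directory contents are assumed here.
config = RailsConfig.from_path("./guardrails_config")
rails = LLMRails(config)

# Calls now pass through the configured input/output rails, which can run
# consistency and safety checks around the underlying LLM.
response = rails.generate(messages=[
    {"role": "user", "content": "Summarize the fatigue test results."}
])
print(response["content"])
```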
Model collapse is harder, because it is more domain specific. It is more of a data curation problem, where you want to make sure you have a validation set that is representative and may contain some outliers. Using NeMo Curator, you can efficiently handle such data curation challenges by ensuring that your dataset is diverse and representative. It offers various out-of-the-box functionalities such as deduplication, text cleaning, quality filtering, privacy filtering, and domain and toxicity classification.
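Whatever tool performs it, the core of that curation work is deduplication plus quality filtering. The plain-Python sketch below illustrates the idea on hypothetical records; it is not the NeMo Curator API.

```python
import hashlib

records = [
    {"text": "Run 12: peak stress 412 MPa at 140 C."},
    {"text": "Run 12: peak stress 412 MPa at 140 C."},  # exact duplicate
    {"text": "asdf!!!"},                                 # low-quality fragment
    {"text": "Run 13: peak stress 398 MPa at 150 C."},
]

def curate(records, min_words=4):
    seen, kept = set(), []
    for rec in records:
        digest = hashlib.sha256(rec["text"].encode()).hexdigest()
        if digest in seen:
            continue                      # drop exact duplicates
        if len(rec["text"].split()) < min_words:
            continue                      # drop fragments unlikely to be useful
        seen.add(digest)
        kept.append(rec)
    return kept

print(curate(records))  # two unique, well-formed records survive
```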
No model is ever going to be perfect. One of the first things you learn when training models is that if you ever have a perfect score, you probably did something wrong. You are going to miss things. You need that validation data as a sanity check. NeMo offers an end-to-end platform for building custom generative AI, allowing you to continuously customize and evaluate the models to find the best one for your needs.
Conclusion
Engineering organizations are just beginning to grapple with the promise and challenges of adding AI to their simulation toolkits, and data management and curation will play a large role in the success of those efforts. NVIDIA and Dell have partnered to create the Dell AI Factory with NVIDIA to assist enterprises with implementing AI in their workflows, and NVIDIA offers a number of additional AI tools for developers.