Safety guarantees and robustness quantification of generative AI models

Generative AI can improve the performance of many kinds of systems by automating manual tasks. However, generative models such as large language models (LLMs) are not yet ready for safety-critical systems. LLMs are known to “hallucinate”, that is, to produce incorrect responses that sound reasonable. They are also susceptible to attacks based on prompt engineering. Consequently, public trust in AI is low, as there are no guarantees on its performance or failure modes. In many AI applications, situations arise where the model is confident in an incorrect or hazardous prediction. This is unacceptable when people's lives are at stake, as in healthcare, aviation, or self-driving cars. Building responsible, ethical, and trustworthy AI systems is therefore a bottleneck in realizing the full potential of AI. The goal of this project is to use formal methods to characterize the worst-case behavior and safety of AI models. We hope to integrate control theory and formal methods with AI to establish these safety guarantees.

Current participants

  1. None


  1. Gireeja Ranade, UC Berkeley

Selected references for this research area

  1. Elhage, Nelson, et al. "A mathematical framework for transformer circuits." Transformer Circuits Thread 1 (2021): 1.
  2. Ji, Ziwei, et al. "Survey of hallucination in natural language generation." ACM Computing Surveys 55.12 (2023): 1-38.
  3. Olsson, Catherine, et al. "In-context learning and induction heads." arXiv preprint arXiv:2209.11895 (2022).
  4. Jha, Susmit, et al. "Dehallucinating large language models using formal methods guided iterative prompting." 2023 IEEE International Conference on Assured Autonomy (ICAA). IEEE, 2023.
  5. Giraldo, Jairo, and Alvaro A. Cardenas. "A new metric to compare anomaly detection algorithms in cyber-physical systems." Proceedings of the 6th Annual Symposium on Hot Topics in the Science of Security. 2019.
  6. Kumar, Aounon, et al. "Certifying LLM safety against adversarial prompting." arXiv preprint arXiv:2309.02705 (2023).
  7. Li, Zelong, et al. "Formal-LLM: Integrating Formal Language and Natural Language for Controllable LLM-based Agents." arXiv preprint arXiv:2402.00798 (2024).
  8. Jha, Susmit, et al. "Detecting adversarial examples using data manifolds." MILCOM 2018-2018 IEEE Military Communications Conference (MILCOM). IEEE, 2018.


This project is supported by a seed grant from CITRIS and the Banatao Institute (Center for Information Technology Research in the Interest of Society and the Banatao Institute) and by the UC Merced Academic Senate.

Last update: Mar 15, 2024. You can contribute to this page by creating a pull request on GitHub.