Adversarial robustness of large language model (LLM) safety

Developing an efficient automatic attack model to improve the evaluations and training of LLMs, making them safer and more robust.

April 11, 2026

Assessing the vulnerabilities of LLMs has become a key area of AI safety research. Canada CIFAR AI Chair Gauthier Gidel proposes a novel, more efficient, and automated way of finding them. Using optimization and methods borrowed from image-based adversarial attacks, the project aims to deliver an efficient automatic attack model. Model developers could then use it to assess how vulnerable their LLMs are and to improve evaluation and training, making the models safer and more robust.
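
For readers unfamiliar with the image-based attacks mentioned above, the sketch below shows the general idea on a toy example: an optimization loop that nudges an input, within a small budget, to increase a classifier's loss. It is an illustration only and not the project's method, which this summary does not specify. The logistic-regression model, the function names (`pgd_attack`, `loss`), and the values of `eps`, `step`, and `iters` are all assumptions made for the example. Text prompts are made of discrete tokens, so carrying this idea over to LLMs is not a direct translation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(x, y, w, b):
    """Binary cross-entropy of a toy logistic-regression model on one input."""
    p = sigmoid(w @ x + b)
    return -(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

def pgd_attack(x, y, w, b, eps=0.1, step=0.02, iters=20):
    """Projected gradient ascent on the loss inside an L-infinity ball of radius eps.

    All hyperparameters here are illustrative assumptions.
    """
    x_adv = x.copy()
    for _ in range(iters):
        p = sigmoid(w @ x_adv + b)
        grad = (p - y) * w                        # d(loss)/d(input) for this model
        x_adv = x_adv + step * np.sign(grad)      # step in the direction that raises the loss
        x_adv = np.clip(x_adv, x - eps, x + eps)  # project back into the allowed budget
    return x_adv

# Toy setup: random weights and a random "image" flattened to 16 values.
rng = np.random.default_rng(0)
w = rng.normal(size=16)
b = 0.0
x = rng.normal(size=16)
y = 1.0 if w @ x + b > 0 else 0.0  # use the model's own clean prediction as the label

x_adv = pgd_attack(x, y, w, b)
print("clean loss:      ", loss(x, y, w, b))
print("adversarial loss:", loss(x_adv, y, w, b))
print("max perturbation:", np.max(np.abs(x_adv - x)))
```

The perturbation stays within `eps` per coordinate, yet the loss on the perturbed input is higher than on the clean one. An automatic attack model of the kind described above would search for inputs that cause failures in a comparable, optimization-driven way.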

Collaborators

  • Gauthier Gidel

    Canada CIFAR AI Chair, Mila; Université de Montréal