Adversarial robustness of large language model (LLM) safety
Developing an efficient automatic attack model to improve the evaluations and training of LLMs, making them safer and more robust.
Assessing the vulnerabilities of LLMs has become a key area of AI safety research. Canada CIFAR AI Chair Gauthier Gidel proposes a novel, more efficient, and automated way of finding these vulnerabilities. Using optimization and methods borrowed from image-based adversarial attacks, the project aims to build an efficient automatic attack model. This will allow model developers to assess how vulnerable their LLMs are and to improve evaluation and training, making the models safer and more robust.
Collaborators
Gauthier Gidel
Canada CIFAR AI Chair, Mila; Université de Montréal
