Sampling latent explanations from LLMs for safe and interpretable reasoning
Developing a monitoring guardrail for AI agents that can lead to safer AI deployment across many applications.
Ensuring that LLMs produce trustworthy and interpretable results is a major goal of AI safety researchers. Canada CIFAR AI Chair Yoshua Bengio will develop more trustworthy explanations of LLMs by deploying generative flow networks in a novel way. His focus is to train AI to explain what humans say by examining the hidden reasons behind AI decisions and evaluating their accuracy, in order to disentangle the underlying causes of what AI generates. Ultimately, this project aims to develop a monitoring guardrail for AI agents that can lead to safer AI deployment across many applications.
Collaborators
Yoshua Bengio
Canada CIFAR AI Chair, Mila; Université de Montréal

