Trustworthy & Interpretable AI

Catalyst Project

Democratic Alignment of LLMs Through Economic Theory: Relative Preferences and Strategic Coordination

Catalyst Project

Towards Socially Grounded AI Safety: Integrating Causal and Institutional Reasoning in Language Models

Catalyst Project

Performative Empathy and Deceptive Alignment

AI Alignment Project

Game-theoretic safety guarantees for advanced AI systems

AI Alignment Project

Sample-efficient online fine-tuning against resistant behaviors: statistical foundations for post-training alignment

AI Alignment Project

Scaling laws, data distributions, and learning dynamics: simulated high-energy physics data as a benchmark for data in the wild

AI Alignment Project

A unified statistical framework for quantifying rare event risks for language models

Catalyst Project

Advancing AI alignment through debate and shared normative reasoning

Catalyst Project

Adversarial robustness in knowledge graphs

Catalyst Project

Adversarial robustness of large language model (LLM) safety