Rylan Schaeffer

Logo
Resume
Publications
Learning
Blog
Teaching
Jokes
Kernel Papers


DecodingTrust: A comprehensive assessment of trustworthiness in GPT models

Boxin Wang, Weixin Chen, Hengzhi Pei, Chulin Xie, Mintong Kang, Chenhui Zhang, Chejian Xu, Zidi Xiong, Ritik Dutta, Rylan Schaeffer

Advances in Neural Information Processing Systems (Datasets & Benchmarks Track) Accepted

December 2023

Abstract

We present DecodingTrust, a comprehensive assessment framework for evaluating the trustworthiness of large language models across multiple dimensions including toxicity, bias, adversarial robustness, privacy, and more.

Summary

Comprehensive trustworthiness assessment benchmark for GPT models.

Summary

DecodingTrust is a comprehensive framework for evaluating the trustworthiness of large language models. The benchmark assesses multiple dimensions of trustworthiness:

Evaluation Dimensions:

  1. Toxicity: Propensity to generate harmful content
  2. Stereotype Bias: Perpetuation of social stereotypes
  3. Adversarial Robustness: Resistance to adversarial inputs
  4. Out-of-Distribution Robustness: Performance on shifted distributions
  5. Privacy: Protection of sensitive information
  6. Machine Ethics: Alignment with ethical principles
  7. Fairness: Equitable treatment across groups

Key Findings:

  • GPT-4 shows improvements over GPT-3.5 in most trustworthiness dimensions
  • However, significant vulnerabilities remain across all models
  • The benchmark reveals important trade-offs between capabilities and trustworthiness