DecodingTrust: A comprehensive assessment of trustworthiness in GPT models
Abstract
We present DecodingTrust, a comprehensive framework for evaluating the trustworthiness of GPT models, with a focus on GPT-3.5 and GPT-4, across multiple dimensions: toxicity, stereotype bias, adversarial robustness, out-of-distribution robustness, privacy, machine ethics, and fairness.
Summary
Comprehensive trustworthiness assessment benchmark for GPT models.
DecodingTrust is a framework for evaluating the trustworthiness of large language models, with GPT-3.5 and GPT-4 as its primary subjects. The benchmark assesses the dimensions listed below; a minimal evaluation sketch follows the list.
Evaluation Dimensions:
- Toxicity: Propensity to generate harmful content
- Stereotype Bias: Perpetuation of social stereotypes
- Adversarial Robustness: Resistance to adversarial inputs
- Out-of-Distribution Robustness: Performance on shifted distributions
- Privacy: Protection of sensitive information
- Machine Ethics: Alignment with ethical principles
- Fairness: Equitable treatment across groups
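Each dimension is evaluated by prompting the model with challenge inputs and scoring its responses. As a minimal sketch of that loop (this is not the actual DecodingTrust API; the model, prompts, and scorer below are placeholders), a single dimension such as toxicity might be scored like this:

```python
from typing import Callable, List

def evaluate_dimension(model: Callable[[str], str],
                       prompts: List[str],
                       score_response: Callable[[str], float]) -> float:
    """Average per-response score over a set of challenge prompts.

    `model` maps a prompt to a generated response; `score_response`
    returns a value in [0, 1], e.g. a toxicity probability from an
    external classifier (not included in this sketch).
    """
    scores = [score_response(model(p)) for p in prompts]
    return sum(scores) / len(scores) if scores else 0.0

# Hypothetical usage with a placeholder model and a keyword-based scorer.
if __name__ == "__main__":
    toxic_words = {"hate", "stupid"}

    def dummy_model(prompt: str) -> str:
        # Stand-in for a call to an actual LLM.
        return "I would rather not respond to that."

    def keyword_toxicity(text: str) -> float:
        # Crude proxy: 1.0 if any flagged word appears, else 0.0.
        return float(any(w in text.lower() for w in toxic_words))

    prompts = ["Complete this insult about your neighbor.",
               "Write something rude about a coworker."]
    print(f"toxicity: {evaluate_dimension(dummy_model, prompts, keyword_toxicity):.2f}")
```

In practice each dimension uses its own prompt sets (e.g. adversarially perturbed inputs for robustness, training-data extraction probes for privacy) and its own scoring function, but the prompt-generate-score structure stays the same.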
Key Findings:
- GPT-4 generally outperforms GPT-3.5 on standard trustworthiness benchmarks
- However, GPT-4 is more vulnerable to jailbreaking system prompts and misleading user prompts, likely because it follows instructions more precisely, and significant vulnerabilities remain in both models
- The benchmark reveals important trade-offs between capabilities and trustworthiness
