We are excited to share our new paper "Tamper-Resistant Safeguards for Open-Weight LLMs", in collaboration with the Center for AI Safety.
Paper: https://arxiv.org/pdf/2408.00761 (under review)
Open-weight LLMs are often released with safeguards to prevent malicious use, yet existing safeguards can be easily removed with fine-tuning attacks. To address this problem, we develop tamper-resistant safeguards to increase the costs of adversarial behavior.
We introduce the first safeguards for LLMs that resist realistic fine-tuning attacks of up to 5,000 steps, demonstrating the potential of tamper-resistance as a powerful new tool for making open-weight LLMs safer.
Code: https://github.com/rishub-tamirisa/tamper-resistance
This work was also covered by WIRED!
Read the article: https://www.wired.com/story/center-for-ai-safety-open-source-llm-safeguards/