We are excited to share our new paper "Tamper-Resistant Safeguards for Open-Weight LLMs", in collaboration with the Center for AI Safety.
Paper: https://arxiv.org/pdf/2408.00761 (under review)
Open-weight LLMs are often released with safeguards to prevent malicious use, yet existing safeguards can be easily removed with fine-tuning attacks. To address this problem, we develop tamper-resistant safeguards to increase the costs of adversarial behavior.
We introduce the first safeguards for LLMs that resist realistic fine-tuning attacks of up to 5,000 steps, demonstrating the potential of tamper-resistance as a powerful new tool for making open-weight LLMs safer.
Code: https://github.com/rishub-tamirisa/tamper-resistance
This work was also covered by WIRED!
Read the article: https://www.wired.com/story/center-for-ai-safety-open-source-llm-safeguards/