#239 Stealing part of a production language model
This paper introduces the first model-stealing attack that extracts precise, nontrivial information from black-box production language models such as OpenAI’s ChatGPT or Google’s PaLM-2. Specifically, the attack recovers the embedding projection layer (up to symmetries) of a transformer model, given typical API access. For under $20, the attack extracts the entire projection matrix of OpenAI’s ada and babbage language models. The attack also recovers the exact hidden dimension size of the gpt-3.5-turbo model, and the authors estimate that it would cost under $2,000 in queries to recover its entire projection matrix.
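At a high level, the attack exploits the fact that every output logit vector is a linear image of a final hidden state, logits = W h, where h has the (secret) hidden dimension. Logit vectors collected across many prompts therefore span only a hidden-dimension-sized subspace of the vocabulary-sized output space: the number of significant singular values of a stacked logit matrix reveals the hidden dimension, and the top singular vectors recover W up to an invertible transform. Below is a minimal numpy sketch of this idea against a simulated model; `query_logits`, the toy sizes, and the rank threshold are illustrative assumptions, not the paper's production code.

```python
import numpy as np

# Sketch assumption: we can observe full logit vectors. A random linear
# model stands in for the real API: logits = W @ h, with rank(W) = hidden_dim.
rng = np.random.default_rng(0)
vocab_size, hidden_dim, n_queries = 1000, 64, 128  # toy sizes

W = rng.normal(size=(vocab_size, hidden_dim))  # the secret projection matrix

def query_logits(prompt_id: int) -> np.ndarray:
    """Stand-in for the API: return the full logit vector for one prompt."""
    h = rng.normal(size=hidden_dim)  # hidden state produced by the prompt
    return W @ h

# Collect logit vectors for more prompts than the (unknown) hidden dimension.
Q = np.stack([query_logits(i) for i in range(n_queries)], axis=1)  # (l, n)

# The number of significant singular values of Q is the hidden dimension.
s = np.linalg.svd(Q, compute_uv=False)
recovered_dim = int(np.sum(s > 1e-6 * s[0]))
print(recovered_dim)  # -> 64

# The top singular vectors span col(W): they recover W up to an unknown
# invertible h x h matrix G, i.e. W_tilde = W @ G.
U, _, _ = np.linalg.svd(Q, full_matrices=False)
W_tilde = U[:, :recovered_dim]
```

Since the hidden dimension is not known in advance, in practice one keeps issuing queries until the numerical rank of Q stops growing.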
In this video, I talk about the following: which parts of an LM you can steal and how; how to extract the hidden dimensionality and the full projection matrix from logit-vector APIs; the extraction attack for top-5 logit-bias APIs; the extraction attack for top-1 (binary) logit-bias APIs; and how to defend against such attacks.
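Production APIs do not return full logit vectors, which is where the logit-bias attacks come in. The sketch below illustrates the key identity behind the top-5 variant on a simulated softmax API: adding the same large bias to a reference token and a few target tokens pushes all of them into the top-5, and because log-softmax differences are invariant to the normalizer, the returned logprob gaps equal the underlying logit gaps. The `query_top5` helper, the bias value, and the toy vocabulary are assumptions for illustration, not the paper's exact query schedule.

```python
import numpy as np

rng = np.random.default_rng(1)
vocab_size = 50
true_logits = rng.normal(size=vocab_size)  # unknown to the attacker

def query_top5(bias: dict[int, float]) -> list[tuple[int, float]]:
    """Stand-in for the API: apply a logit bias, return top-5 (token, logprob)."""
    z = true_logits.copy()
    for tok, b in bias.items():
        z[tok] += b
    logprobs = z - np.log(np.sum(np.exp(z)))  # log-softmax
    top = np.argsort(logprobs)[::-1][:5]
    return [(int(t), float(logprobs[t])) for t in top]

B = 100.0   # large bias: forces the chosen tokens into the top-5
ref = 0     # reference token; logits are recovered relative to it
rel_logits = np.zeros(vocab_size)

# Bias the reference plus 4 target tokens per query. Both get the same
# bias B, so logprob(t) - logprob(ref) = logit(t) - logit(ref).
for start in range(1, vocab_size, 4):
    targets = list(range(start, min(start + 4, vocab_size)))
    out = dict(query_top5({t: B for t in [ref] + targets}))
    for t in targets:
        rel_logits[t] = out[t] - out[ref]

# Check: recovered relative logits match the true ones.
assert np.allclose(rel_logits, true_logits - true_logits[ref], atol=1e-6)
```

Sweeping the whole vocabulary this way reconstructs the full logit vector up to an additive constant, which is all the SVD step above needs.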
For more details, please see https://arxiv.org/pdf/2403.06634
Carlini, Nicholas, Daniel Paleka, Krishnamurthy Dj Dvijotham, Thomas Steinke, Jonathan Hayase, A. Feder Cooper, Katherine Lee, et al. "Stealing part of a production language model." In Forty-first International Conference on Machine Learning, 2024.