Dont you (forget NLP): Prompt injection with control characters in ChatGPT
摘要
Like many companies, Dropbox has been experimenting with large language models (LLMs) as a potential backend for product and research initiatives. As interest in leveraging LLMs has increased in recent months, the Dropbox Security team has been advising on measures to harden internal Dropbox infrastructure for secure usage in accordance with our AI principles. In particular, we’ve been working to mitigate abuse of potential LLM-powered products and features via user-controlled input.
Injection attacks that manipulate inputs used in LLM queries have been one such focus for Dropbox security engineers. For example, an adversary who is able to modify server-side data can then manipulate the model’s responses to a user query. In another attack path, an abusive user may try to infer information about the application’s instructions in order to circumvent server-side prompt controls for unrestricted access to the underlying model.
As part of this work, we recently observed some unusual behavior with two popular large language models from OpenAI, in which control characters (like backspace) are interpreted as tokens. This can lead to situations where user-controlled input can circumvent system instructions designed to constrain the question and information context. In extreme cases, the models will also hallucinate or respond with an answer to a completely different question.
欢迎在评论区写下你对这篇文章的看法。