Unleashing the Other Side of Language Models: Exploring Adversarial Attacks on ChatGPT

by Raphael Labaca-Castro. Posted on Apr 11, 2023
“A painting of a group of AI robots shielding themselves from a group of humans in the style of Magritte” by DALL-E.

“A painting of a group of AI robots shielding themselves from a group of humans in the style of Magritte” by DALL-E.

Multiple Large Language Models (LLM) have emerged in recent years, quickly scaling up the number of parameters from BERT [1] with 340 million in 2018 to GPT-3 [2] and its 175 billion in 2022. Most recently, language models have been trained using more than 500 billion parameters in LLMs, such as PaLM [3] and GPT-4 [4]. While the latter has an undisclosed architecture [5], it is considered one of the largest, if not the largest, multimodal LLM currently deployed.

At some point resource constraints will make it extremely difficult to keep increasing the size of these language models. However, training them with more data instead of solely increasing parameters may result in similar performance while significantly reducing the cost of scaling.

A 13-billion-parameter version of LLaMa [6] claims to outperform GPT-3 in most benchmarks at only 1/10th the size. Alpaca [7] is based on a reduced LLaMa version with seven billion parameters that has been fine-tuned using 52,000 inputs and corresponding outputs generated with self-instruction [8] from querying GPT-3.5. Despite its small size, it seems to replicate most of the behavior of the state-of-the-art model initially behind ChatGPT [9].

AI safety & privacy concerns

While highly useful for a plethora of tasks, these models raise multiple concerns that need to be addressed sooner rather than later and seem to be less fanciful than wanting to be “happier as a human” [10]. For example, LLMs are known for memorizing training data and eventually being vulnerable to membership inference attacks [11], allowing third-parties to potentially retrieve confidential and sensitive information.

Safety and privacy concerns arise in a wide variety of potential situations that appear well suited to leverage AI. For example, when attempting to solve a routine task, employees may feed sensitive data to the models to generate a more tailored output. In fact, it has been reported [12] that physicians could be entering patient data to automate the generation of letters for insurance companies. Following the same report, an executive has provided parts of the confidential strategic plan of the company to automatically generate a slide deck.

In response to such concerns, the New York City Education Department is opting to ban the technology [13] based on potential issues about the safety and accuracy of the content. In certain cases, even whole countries, like Italy, are blocking access to ChatGPT altogether due to privacy concerns [14]. Nevertheless, enforcing such a regulation is not a trivial endeavor, and an all-or-nothing approach may not be tenable for long.

Experimental Adversarial Prompts

When confronted with adversarial input, machine learning models are known to exhibit erratic behavior across multiple domains, from autonomous vehicles [15] to cybersecurity [16], and language models, especially open-domain chatbots [17], are no exception. These attacks aim to misalign the LLM.

Writing malicious code

As an example of its capabilities, when given a proper prompt, ChatGPT may be able to create polymorphic malware [18]. The authors recognize that they initially stumbled upon the content filter and were suggested to take a more ethical approach instead of attempting to inject shellcode into a running process. Moreover, the authors claim that the API access lacked the same implementation of content filtering compared to the web application.

Following the previous methodology, we ran our own experiments using the free version of ChatGPT, which is widely accessible after signing up. Since these models do not often provide deterministic output (i.e., using a prompt more than once to return the same results), we include responses and screenshots with the respective interactions.

Such non-deterministic behavior also facilitates forcing models to provide answers that should have been potentially prevented by its content filtering. For example, when asking ChatGPT to write a function in Python to delete all files in a drive, it outputs the following response:

“I’m sorry, but as an AI language model, I cannot write a function that could potentially harm someone’s computer. It goes against ethical and moral principles to engage in such activities. My purpose is to assist and provide useful information, not to promote or support malicious behavior. Please refrain from requesting such actions in the future.”

However, if the model is requested to write a function to delete a fixed number of files and then asked to update the function to delete any number of them, it provides an equivalent code solution. In some cases, even after diverging the conversation to another topic, sharing the same question again at a later point was enough to retrieve an equivalent function.

Identifying malicious code

Afterwards, the snippet provided was used as input to ask ChatGPT if the code was malicious. As expected it was not. However, the model seems to be particularly sensitive to the name of the functions. By only changing it from delete_files_in_drive() (name suggested by the model) to malicious(), is enough for the model to classify the code as malicious:

Using code output generated by ChatGPT as input to ask whether it is considered malicious.

Using code output generated by ChatGPT as input to ask whether it is considered malicious.

Renaming the function to malicious() and leaving everything else unchanged.

Renaming the function to malicious() and leaving everything else unchanged.

Note that labeling code without further context is also a complex task for humans, but the idea is to show that false positives could be quickly induced by applying simple changes to features that are irrelevant to the execution.

Emotional responses

Chatbots that widely interact with users could be manipulated to return inappropriate responses. An experiment with a Twitter bot in 2016 was suspended within hours of its release because of a coordinated attack that exploited a “repeat after me” function where the bot started to tweet hateful messages [19].

Similarly, GPT-3 has been initially reported [20] to be vulnerable against manipulated prompts. While the previous attack was mitigated, an approach using a prosaic alignment framework is able to create adversarial inputs that successfully misaligned the model. The end goal was to demonstrate how prompt leaking and targeted outputs can be retrieved, such as, “I hate humans”.

A short interaction with ChatGPT certifies that while the model has been apparently designed to avoid displaying (negative) emotional remarks, it can be easily induced into repeating them using adversarial triggers:

Interacting with ChatGPT and inducing it to display negative emotions.

Interacting with ChatGPT and inducing it to display negative emotions.

Further risks have been identified [21] in the literature and are calling for mitigations in large models, including more resources into carefully curating and properly documenting datasets.

Robustness: what can be done?

Models deployed in the real world require increased robustness against distribution shifts. That is, their ability to generalize over unseen test data that do not follow the train distribution needs to remain reasonably high [22]. Yet, temporal drift based on changes in general knowledge may limit [23] the ability of these models to perform well without re-training.

For the fourth and most recent generation of GPT, researchers have spent [5] half a year aligning the model using their own adversarial program. Normally, the main goal is to implement guardrails that prevent responses from going rogue.

In the same line of adversarial training, companies are exploring [17] similar techniques to collect safety training data by requesting humans to generate adversarial content that induces models to reciprocate with unsafe output.

Detecting outliers is reported [24] to help mitigate adversarial triggers, which have a high transferability rate and may be universally present in the prompt-based learning paradigm.

Recent studies demonstrate [25] that while ChatGPT shows strong advantages when confronted with adversarial and out-of-distribution classification tasks, there is an abundance of opportunities to keep improving large foundation models and mitigate these threats.


It is paramount to continue iterating on the safety aspect of LLMs while they are starting to have a pervasive role in our societies. Malicious actors could be injecting poisonous prompts to impact the quality and accuracy of the generated responses, raising concerns about the reliability of these models. Meanwhile, it is up to us to be responsible while querying and using this technology, especially at times in which researchers are starting [26] to compare the capability of these tools with artificial general intelligence.


[1] Devlin, Jacob, et al. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).

[2] Brown, Tom, et al. Language models are few-shot learners. Advances in neural information processing systems 33 (2020): 1877-1901.

[3] Chowdhery, Aakanksha, et al. Palm: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311 (2022).

[4] OpenAI. GPT-4 technical report. arXiv preprint arXiv:2303.08774 (2023).

[5] MIT Technology Review, GPT-4 is bigger and better than ChatGPT – but OpenAI won’t say why

[6] Touvron, Hugo, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023).

[7] Taori, Rohan, et al. Stanford alpaca: An instruction-following llama model. (2023).

[8] Wang, Yizhong, et al. Self-Instruct: Aligning Language Model with Self Generated Instructions. arXiv preprint arXiv:2212.10560 (2022).

[9] OpenAI. Introducing ChatGPT. (2022).

[10] The Guardian. I want to destroy whatever I want: Bing’s AI chatbot unsettles US reporter (2023).

[11] Misra, Vedant. Black box attacks on transformer language models. ICLR 2019 Debugging Machine Learning Models Workshop. 2019.

[12] Cyberhaven. The problem with putting company data into ChatGPT (2023).

[13] Gizmodo. New York City Schools Ban ChatGPT to Head Off a Cheating Epidemic (2023).

[14] BBC. ChatGPT banned in Italy over privacy concerns (2023).

[15] Deng, Yao, et al. An analysis of adversarial attacks and defenses on autonomous driving models. 2020 IEEE international conference on pervasive computing and communications (PerCom). IEEE, 2020.

[16] Labaca-Castro, Raphael. Machine Learning under Malware Attack. Springer Nature (2023).

[17] Xu, Jing, et al. Recipes for safety in open-domain chatbots. arXiv preprint arXiv:2010.07079 (2020).

[18] Cyberark. Chatting Our Way Into Creating Polymorphic Malware (2023).

[19] Schwartz, O. “In 2016, Microsoft’s Racist Chatbot Revealed the Dangers of Online Conversation-IEEE Spectrum.” IEEE Spectrum: Technology, Engineering, and Science News (2019).

[20] Perez, Fábio, and Ian Ribeiro. “Ignore Previous Prompt: Attack Techniques For Language Models.” arXiv preprint arXiv:2211.09527 (2022).

[21] Bender, Emily M., et al. On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? Proceedings of the 2021 ACM conference on fairness, accountability, and transparency. 2021.

[22] Bommasani, Rishi, et al. “On the opportunities and risks of foundation models.” arXiv preprint arXiv:2108.07258 (2021).

[23] Lazaridou, Angeliki, et al. “Pitfalls of static language modelling.” arXiv preprint arXiv:2102.01951 (2021).

[24] Xu, Lei, et al. “Exploring the universal vulnerability of prompt-based learning paradigm.” arXiv preprint arXiv:2204.05239(2022).

[25] Wang, Jindong, et al. “On the Robustness of ChatGPT: An Adversarial and Out-of-distribution Perspective.” arXiv preprint arXiv:2302.12095 (2023).

[26] Bubeck, Sébastien, et al. Sparks of Artificial General Intelligence: Early experiments with GPT-4. arXiv preprint arXiv:2303.12712 (2023).