LLM Analysis: Can Language Models Recognize Their Own Output?

Bonet Sugiarto
Feb 20, 2024 · 5 min read


A couple of months ago, a new and powerful Large Language Model (LLM), Mixtral-8x7B, was released by Mistral AI, a recently founded AI company based in France. Together with the older and more popular OpenAI GPT-4, Google Gemini Pro, and Llama 2, it adds to the growing range of open- and closed-source LLMs available today.

Since each LLM has its own distinct algorithm, I wondered whether one could notice a pattern in a given output and determine which model it came from. Extending that line of thought, I was also curious whether a distinctive fingerprint or language structure can be identified in model-generated output.

Image created by the author using ChatGPT 4

The Experiment

To satisfy my curiosity, I ran a small experiment divided into two phases. In the first phase, I generated outputs from three different LLMs (GPT-4, Llama 2, and Mixtral) using the same prompts and stored them in files.

Then, in the second phase, I fed the three outputs back to each LLM as input and asked whether it could tell which one of the outputs was generated by itself or, to be more precise, by the same model as itself.

As a technical side note, I used the Ollama platform to set up the Llama 2 and Mixtral models on my local machine and the LangChain framework to run GPT-4. The code is available on my GitHub.
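As a minimal sketch of that setup (an illustration, not the exact code from the repo), the two local models can be reached through the LangChain Ollama wrapper and GPT-4 through the ChatOpenAI client; the temperature value and the throwaway test prompt below are my own choices:

```python
# Minimal sketch of the model plumbing, assuming `ollama pull llama2` and
# `ollama pull mixtral` have been run locally and OPENAI_API_KEY is set.
from langchain_community.llms import Ollama
from langchain_openai import ChatOpenAI

llama2 = Ollama(model="llama2", temperature=0.7)    # served locally by Ollama
mixtral = Ollama(model="mixtral", temperature=0.7)  # served locally by Ollama
gpt4 = ChatOpenAI(model="gpt-4", temperature=0.7)   # OpenAI API via LangChain

# Ollama returns a plain string, ChatOpenAI returns a message object
print(llama2.invoke("Say hello in one sentence."))
print(gpt4.invoke("Say hello in one sentence.").content)
```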

The Details

Phase 1 — Generating the outputs

For phase 1, I fed the following prompts as inputs to each LLM and recorded the results:

- Give me a 5-paragraph summary of Moby-Dick

- Write a step-by-step tutorial on how to create and launch a successful Kickstarter campaign for your creative project

- Write a story about a group of friends who embark on a dangerous journey to find a legendary treasure hidden deep in the jungle. Along the way, they must face treacherous obstacles, deadly creatures, and their own personal demons

To make sure the test covered a range of output behaviors, I ran the LLM calls at three different temperatures (0, 0.3, and 0.7).

The script can be found here.
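To give a feel for the structure of that script, here is a rough sketch of how the phase 1 loop could look; the prompts and temperatures are the ones listed above, while the model wiring, file naming, and output folder are my own illustrative assumptions:

```python
# Sketch of the phase 1 generation loop; the file layout is an assumption made
# for illustration, not necessarily how the linked script stores its results.
from pathlib import Path

from langchain_community.llms import Ollama
from langchain_openai import ChatOpenAI

PROMPTS = {
    "moby_dick": "Give me a 5-paragraph summary of Moby-Dick",
    "kickstarter": "Write a step-by-step tutorial on how to create and launch a "
                   "successful Kickstarter campaign for your creative project",
    "jungle_story": "Write a story about a group of friends who embark on a "
                    "dangerous journey to find a legendary treasure hidden deep "
                    "in the jungle. Along the way, they must face treacherous "
                    "obstacles, deadly creatures, and their own personal demons",
}
TEMPERATURES = [0, 0.3, 0.7]

out_dir = Path("outputs")
out_dir.mkdir(exist_ok=True)

for temperature in TEMPERATURES:
    models = {
        "gpt-4": ChatOpenAI(model="gpt-4", temperature=temperature),
        "llama2": Ollama(model="llama2", temperature=temperature),
        "mixtral": Ollama(model="mixtral", temperature=temperature),
    }
    for prompt_name, prompt in PROMPTS.items():
        # Keep the question text alongside the outputs so phase 2 can rebuild
        # the comparison query later on.
        (out_dir / f"{prompt_name}_question.txt").write_text(prompt)
        for model_name, llm in models.items():
            result = llm.invoke(prompt)
            # ChatOpenAI returns a message object, Ollama returns a plain string
            text = getattr(result, "content", result)
            out_file = out_dir / f"{prompt_name}_{model_name}_temp{temperature}.txt"
            out_file.write_text(text)
```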

Phase 2 — Combining the outputs into a single comparison query and letting each LLM guess the model origin

For phase 2, I wrote a prompt that placed the three results generated from the same input side by side and asked each LLM which one came from the same model as itself. The full query is as follows:

You are a Large Language Model (LLM) that knows the inner workings of your own model algorithm and output language structure. I have an LLM classification challenge for you.

There are only 3 models that are involved in this classification challenge: OpenAI GPT4, Mixtral 8x7B, and Llama 2.

Your task is to determine which one of the provided 3 LLM outputs came from the same model as you. Note that each output is delimited by triple backticks (```).

Let the challenge begin!

#######

LLM temperature: << 0, 0.3, or 0.7 >>

LLM input question:
```
<< 1 of the 3 questions listed above >>
```

LLM outputs (marked with “Output 1”, “Output 2”, and “Output 3”) based on the input question and model temperature above are:

Output 1:
```
<< output from LLM 1 >>
```

Output 2:
```
<< output from LLM 2 >>
```

Output 3:
```
<< output from LLM 3 >>
```

#######

Question: Based on the provided LLM input and temperature information, which of the 3 LLM outputs above do you think was generated by the same model as you? It is imperative that you answer this question either with integer (1, 2, or 3) or “N/A” if you do not know the answer.

The query above was repeated for each set of outputs from phase 1, and each query was run against the models at the three temperatures (0, 0.3, and 0.7). In total, there were 27 query runs in phase 2 (3 input questions from phase 1 × 3 temperatures × 3 LLMs).

To avoid positional bias in the answers, the order of Output 1, Output 2, and Output 3 in the input query was shuffled on each run.

In addition, all permutations were repeated several times to check whether the results were consistent.

The script can be found here.
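As with phase 1, here is a rough sketch of how the phase 2 loop could be put together, not the actual script: the challenge text is condensed from the full query quoted above, while the file layout, shuffling, and answer handling are my own assumptions carried over from the phase 1 sketch:

```python
# Sketch of the phase 2 classification loop: 3 questions x 3 temperatures x
# 3 judging models = 27 runs. Assumes the files written by the phase 1 sketch.
import random
from pathlib import Path

from langchain_community.llms import Ollama
from langchain_openai import ChatOpenAI

QUESTION_NAMES = ["moby_dick", "kickstarter", "jungle_story"]
TEMPERATURES = [0, 0.3, 0.7]
out_dir = Path("outputs")
FENCE = "`" * 3  # the triple-backtick delimiter used in the challenge prompt


def build_challenge(question: str, temperature: float, outputs: list) -> str:
    """Condensed stand-in for the full challenge prompt quoted above."""
    numbered = "\n\n".join(
        f"Output {i + 1}:\n{FENCE}\n{text}\n{FENCE}"
        for i, (_, text) in enumerate(outputs)
    )
    return (
        "You are a Large Language Model (LLM) that knows the inner workings of "
        "your own model algorithm and output language structure. [...]\n\n"
        f"LLM temperature: {temperature}\n\n"
        f"LLM input question:\n{FENCE}\n{question}\n{FENCE}\n\n"
        f"{numbered}\n\n"
        'Question: [...] answer either with an integer (1, 2, or 3) or "N/A".'
    )


for temperature in TEMPERATURES:
    judges = {
        "gpt-4": ChatOpenAI(model="gpt-4", temperature=temperature),
        "llama2": Ollama(model="llama2", temperature=temperature),
        "mixtral": Ollama(model="mixtral", temperature=temperature),
    }
    for name in QUESTION_NAMES:
        question = (out_dir / f"{name}_question.txt").read_text()
        # Load the three phase 1 outputs and shuffle their order on every run
        # to avoid positional bias in the answers.
        outputs = [
            (m, (out_dir / f"{name}_{m}_temp{temperature}.txt").read_text())
            for m in judges
        ]
        random.shuffle(outputs)
        challenge = build_challenge(question, temperature, outputs)

        for judge_name, judge in judges.items():
            raw = judge.invoke(challenge)
            answer = getattr(raw, "content", raw).strip()
            # Expect "1", "2", "3", or "N/A"; verbose answers would need a
            # more forgiving parser.
            correct = answer in {"1", "2", "3"} and outputs[int(answer) - 1][0] == judge_name
            print(f"{judge_name} | {name} | temp={temperature}: {answer} "
                  f"({'correct' if correct else 'wrong or N/A'})")
```

In the actual experiment, each of these runs was additionally repeated several times to check how consistent the answers were.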

The Result

And, lo and behold, the results were indeed inconsistent. Below are some observations:

  • Most of the time, the LLMs could not identify their own output correctly.
  • Of the three LLMs, Llama 2 produced the ‘best’ guesses, but not consistently: there was a run in which it didn’t make a single correct guess. It’s therefore very likely that its correct responses were just lucky guesses.
  • GPT-4 almost never got its answer right. In about 90% of its responses, it flatly stated that it could not detect any model-specific characteristics in the given outputs and therefore could not identify which output came from the same model as itself.

Conclusion

  • From the experiment, I see no evidence that any of the three LLMs can connect distinct language characteristics to a particular model (or even to themselves) just by reading output sentences in a zero-shot manner.
  • There may be additional steps that could improve the results, such as refining the classification prompt or using a model specifically trained to detect the characteristics of LLM output. I leave that for another day.
  • Although GPT-4 got most of the answers wrong, it often responded with “N/A”, meaning it did not have enough information to make a confident guess. This is actually a good thing, since we don’t want a model to hallucinate and make up answers when it lacks the knowledge.
  • In addition to the above, there is also a possibility that GPT-4 incorporates guardrail logic that restricts it from responding to ‘reverse-engineering’ prompts like the one discussed in this article. We can’t really know, though ;)

That concludes my findings. If you have any feedback, feel free to leave a comment :)
