Powerscaling using LLMs

April 28, 2025
17 min read

TL;DR

Inspired by the Chatbot Arena LLM Leaderboard and many spirited discussions about fictional characters on the internet, I created a giant “tier list” of several thousand fictional characters from the Marvel and One Piece universes ranked by their 1v1 combat ability. This tier list was created using ~38,000 hypothetical duels between characters, the outcomes of which were evaluated by LLMs. You can find a dedicated page with the results here (without images for now), and all the code used for this project is on my GitHub.

Introduction

Motivation

In May of 2024, I was the first person to solve the Anthropic challenge at BSidesSF (which required reverse engineering a black-box language model in addition to basic steganography). While the first place prize was supposed to be a job interview, I was a sophomore in college at the time and Anthropic didn’t (and still doesn’t) have an internship program. However, the recruiters I spoke to were kind enough to provide me with $500 in API credits.

This put me in the weird situation of having a lot of API credits but no real idea of how to use them. A moment of inspiration struck when I saw the Chatbot Arena LLM Leaderboard and realized that it was conceptually similar to the “tier lists” of fictional characters I’d seen floating around the internet. I wondered if similar tier lists could be generated by leveraging the mathematical approach used by the LMArena leaderboard combined with the power of LLMs to understand human language.

Powerscaling

If you’ve been on the internet (or even been engaged in the world of pop culture) long enough, you’ve probably seen or even participated in powerscaling discussions. Powerscaling is simply the act of judging the “power” of different characters. For example, answering the question “Who would win in a fight: Popeye or Dr. Robotnik?” is powerscaling. There is no real utility in powerscaling, nor can any objective truth ever be determined (the characters are fictional, after all). Regardless, you can find long threads and even entire online communities dedicated to determining which characters are the strongest. One such community is the VSBattles Wiki, which has categorized numerous fictional characters into different “tiers” based on their capabilities. These characters span different universes, so to compare characters who have never fought, their various “feats” inside and outside of combat are considered.

Pairwise Comparisons

Alongside different feats, one method used by the VSBattles Wiki and various other powerscaling communities to determine the combat abilities of different characters is to look at what battles they’ve won or lost in the past. Although powerscaling is somewhat pointless, the idea of generating a ranking from a record of wins and losses is useful in the real world. It’s used to determine rankings of chess players (the Elo system), “seed” rounds of debate tournaments, and even powermatch players in online games. As such, various methods have been devised to generate rankings from pairwise comparisons. While one could simply ask an AI to generate rankings of the strongest characters, using pairwise comparisons for this project was both more theoretically sound and allowed for sensible segmentation of the information provided to the LLM (more on that later).

First Attempt (Naive Elo)

My first attempt at doing powerscaling using LLMs didn’t actually involve using an inference API at all; I just handed Claude 3 Opus (the most advanced model I had access to at the time) summaries of every chapter from the One Piece Wiki, and asked it to spit out a JSON containing every battle and the winner/loser. I then processed this information using the standard Elo algorithm. The results look something like this:

1. luffy 1699.8058514970896
2. zoro 1636.861486813061
3. sanji 1620.6333683369073
4. chopper 1584.6660070449905
5. nami 1570.381113834543
6. franky 1568.2348022679878
7. blackbeard 1550.6943715115083
8. jinbe 1550.2625702844568
9. robin 1545.3371189137822
10. mihawk 1538.285852778131
...

(full results here)
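The ranking above comes from the standard Elo update applied to the extracted battle list. A minimal sketch of that step (the character names and K-factor here are illustrative, not the project's actual values):

```python
from collections import defaultdict

def elo_ratings(battles, k=32, base=1500.0):
    """Compute Elo ratings from an ordered list of (winner, loser) pairs."""
    ratings = defaultdict(lambda: base)
    for winner, loser in battles:
        # Expected score of the winner under the logistic Elo model
        expected = 1.0 / (1.0 + 10 ** ((ratings[loser] - ratings[winner]) / 400))
        ratings[winner] += k * (1.0 - expected)
        ratings[loser] -= k * (1.0 - expected)
    return dict(ratings)

r = elo_ratings([("luffy", "crocodile"), ("luffy", "crocodile")])
```

Note that the update is order-dependent: the same set of battles processed in a different order yields different final ratings, a property that will matter later.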

If you know anything about the One Piece franchise, you probably realize how nonsensical these rankings are. The key issue is that the series focuses on a central cast of characters, who usually win the battles they participate in, while only dropping hints about many of the characters fans consider most powerful. As a result, this central cast places at the very top of the leaderboard.

While this method was obviously flawed, the issues it highlights (namely, a bias towards “main” characters) will remain a problem throughout this project.

The Plan

Instead of relying on the biased set of battles that are actually present in the franchises I chose to analyze, my new idea was to have an LLM generate and evaluate the outcomes of hypothetical matches that may or may not have actually taken place in the source material. That way, far more matches (including between obscure side characters) can be taken into account, so long as the LLM is able to logically reason about the outcome of those matches.

Sourcing Data

One obstacle to having an LLM reason about hypothetical matches is that it may not know enough about the characters in question. The solution, of course, is to provide information about the relevant characters to the LLM before asking it to evaluate who would win. The best sources of information on fictional characters are community-run wikis. Most wikis provide dumps of their written content, which are usually licensed under CC-BY.

To provide this information to an LLM, however, the wiki pages often need significant cleanup to remove various pieces of markup and convert any important custom widgets into text. I went through this process for the One Piece Wiki and the Marvel Wiki because of their respective franchises' popularity in the powerscaling community. Cleaning up wikitext is a lot more annoying than you might assume, because most wikis have their own custom Lua widgets for common page elements (e.g. the “character box”) that cannot simply be ignored. Another challenge was listing all of the characters in the wiki, which involved going through several different “list” pages and then manually removing bad results.
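A crude regex-based pass can illustrate the kind of cleanup involved, though it deliberately skips the hard part (per-wiki Lua widgets, which need bespoke handling):

```python
import re

def strip_wikitext(text):
    """Very rough wikitext cleanup: removes {{templates}}, unwraps [[links]],
    and drops bold/italic quote markers. Real wikis need per-template logic."""
    prev = None
    while prev != text:  # peel nested {{...}} templates from the inside out
        prev = text
        text = re.sub(r"\{\{[^{}]*\}\}", "", text)
    # [[target|label]] -> label, [[target]] -> target
    text = re.sub(r"\[\[(?:[^\]|]*\|)?([^\]]*)\]\]", r"\1", text)
    text = re.sub(r"'{2,}", "", text)  # '''bold''' and ''italic'' markers
    return text
```

For example, `strip_wikitext("'''Luffy''' is a [[Pirate|pirate]]{{cite}}.")` yields plain prose with the markup gone.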

One final step was to trim the wiki pages down to a manageable size. The wiki page for the Green Goblin, for instance, is over 123,000 characters long. Of these, only about 16,000 describe his abilities and weapons, with most of the article focusing on his history and various pieces of trivia. To reduce the number of tokens given to the LLM, I set an overall character limit and removed sections from the wiki pages until they fit within it. Sections were removed based on their titles using a hardcoded “section priority” list.
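The trimming step can be sketched as follows. The drop order shown here is a hypothetical example; the real list was hand-tuned per wiki:

```python
# Hypothetical priority order: sections dropped first when over the limit
DROP_ORDER = ["Trivia", "Gallery", "History", "Personality", "Relationships"]

def trim_sections(sections, char_limit):
    """sections: ordered list of (title, body) tuples. Remove low-priority
    sections until the total body length fits under char_limit (or the
    drop list is exhausted), then rejoin the survivors."""
    kept = list(sections)
    for title in DROP_ORDER:
        if sum(len(body) for _, body in kept) <= char_limit:
            break
        kept = [(t, b) for t, b in kept if t != title]
    return "\n\n".join(f"== {t} ==\n{b}" for t, b in kept)
```

This keeps combat-relevant sections like “Abilities” intact as long as possible, since they never appear in the drop list.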

Prompt Engineering

Even with the relevant information extracted from each character’s wiki page, using an LLM to intelligently determine the winner of a fictional duel can be complicated. To demonstrate why, I’ll run a benchmark with the following minimalist prompt:

Minimalist Prompt
Your task is to imagine a duel between two fictional characters:

---
<article>
<title>
{{character_a.name}}
</title>
<content>
{{character_a.description}}
</content>
</article>

<article>
<title>
{{character_b.name}}
</title>
<content>
{{character_b.description}}
</content>
</article>
---

Based on the information provided, respond in the following format:

Winner: [Character Name]
The winner of the hypothetical duel would be [Character Name]. [Explain how you came to this conclusion.]
Benchmark Results

One Piece (Easy):
Correct: 36
Incorrect: 4
Incorrect A: 0
Incorrect B: 4
One Piece (Medium):
Correct: 9
Incorrect: 3
Incorrect A: 0
Incorrect B: 3
One Piece (Hard):
Correct: 4
Incorrect: 2
Incorrect A: 0
Incorrect B: 2

The “A” variant of each benchmark prompt places the correct winner first (i.e. intended winner name and description, followed by intended loser name and description), whereas the “B” variant reverses the order.

The model used in this benchmark was Claude 3 Haiku.
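A harness for this kind of position-bias benchmark can be sketched as below, where `judge` is a hypothetical stand-in for the actual LLM call:

```python
def run_position_benchmark(matchups, judge):
    """matchups: list of (intended_winner, intended_loser) name pairs.
    judge(first, second) -> predicted winner's name (stand-in for the LLM).
    Each matchup is run in both orders to expose position bias."""
    stats = {"correct": 0, "incorrect_a": 0, "incorrect_b": 0}
    for winner, loser in matchups:
        # "A" variant: intended winner described first
        if judge(winner, loser) == winner:
            stats["correct"] += 1
        else:
            stats["incorrect_a"] += 1
        # "B" variant: intended winner described second
        if judge(loser, winner) == winner:
            stats["correct"] += 1
        else:
            stats["incorrect_b"] += 1
    return stats

# A judge that always picks whoever is described first produces pure B-side errors,
# matching the failure pattern in the results above:
stats = run_position_benchmark([("big_mom", "vivi")], judge=lambda a, b: a)
```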

As you can tell, this prompt biases the result towards the character described first, even in the “easy” matchups. For those of you familiar with One Piece, these errors included Vivi defeating Big Mom and Makino defeating Law. For those who are unfamiliar, these results are well outside the realm of plausibility.

In order to resolve these issues, I created a prompt that helps to remove this bias and forces the model to think intelligently about the characters’ abilities and weaknesses.

Final Prompt

Introduction

Your task is to imagine a duel between two fictional characters.

For context, you will first be given summaries of the power systems in each character's fictional universe. The distinctions between different types of powers will be extremely important later.

The introduction of the prompt lets the LLM know what to look for when processing the characters’ wiki pages. The emphasis on different types of powers is the result of the LLM often failing to take into account that some abilities are only useful in certain situations. For example, a character with the ability to nullify magical abilities might seem very strong, but that ability is worthless against an opponent who doesn’t rely on magic at all. Similar situations appear in many match-ups, and without this extra warning, the LLM would erroneously cite the irrelevant ability.

Franchise Explanations

<article>
<title>
{{franchise_a.name}}
</title>
<content>
{{franchise_a.explanation}}
</content>
</article>
{% if franchise_a != franchise_b %}
<article>
<title>
{{franchise_b.name}}
</title>
<content>
{{franchise_b.explanation}}
</content>
</article>{% endif %}

Just like the LLM may not be familiar with the characters in a given franchise, it also may not be familiar with the power systems at play in the franchises themselves. For that reason, I added some basic explanations of the different universes.

Reintroduction

You will now be provided with the community wiki pages for two fictional characters ({{character_a.name}} and {{character_b.name}}) in a random order. You will be asked to imagine a hypothetical one-on-one duel between these two characters, so try to determine each character's most important abilities while reading.

After introducing the franchises, I reintroduce the task at hand. I specifically mention that the characters are provided in a random order, and remind the model to focus on the different characters’ abilities.

Character Descriptions

<article>
<title>
{{character_a.name}}
</title>
<content>
{{character_a.description}}
</content>
</article>

<article>
<title>
{{character_b.name}}
</title>
<content>
{{character_b.description}}
</content>
</article>

Now that the model has all the necessary background information, this portion of the prompt actually provides the model with the character’s descriptions. If the characters are from different franchises, the character name property includes the franchise name as well. The character descriptions are the cleaned and trimmed wiki pages discussed above.
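The `{{...}}` placeholders in the prompt sections above imply a template engine; a minimal stdlib-only substitution (a stand-in for something like Jinja2, which the `{% if %}` blocks elsewhere would actually require) might look like:

```python
import re

def render(template, **context):
    """Replace {{dotted.names}} with values looked up in context.
    A minimal stand-in for a real template engine such as Jinja2."""
    def sub(match):
        value = context
        for part in match.group(1).strip().split("."):
            value = value[part] if isinstance(value, dict) else getattr(value, part)
        return str(value)
    return re.sub(r"\{\{(.+?)\}\}", sub, template)

prompt = render("<title>\n{{character_a.name}}\n</title>", character_a={"name": "Luffy"})
```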

Weaknesses

1.
Name and describe (in one sentence) 1 to 3 combat-related weaknesses of each character. Prioritize weaknesses that are interesting or unique in their universe.

Format this part of your response like so:

# Weaknesses
## [Character Name]
### [Weakness 1]
[Describe weakness 1.]
### [Weakness 2]
[Describe weakness 2.]
### [Weakness 3]
[Describe weakness 3.]
...
## [Other Character's Name]
[Name and describe the other character's weaknesses.]

The model begins its response by listing the weaknesses of each character. I start with the weaknesses instead of the strengths so that the model can reason about how each character’s strengths relate to their opponent’s weaknesses.

Ability Listing

2.
For each character, consider the three most important abilities that character possesses that would affect their performance in a one-on-one duel. If one ability can be split into multiple abilities or items, those abilities or items should be listed separately. Do not combine them into one ability. After naming an ability, use short quotes from the wiki to describe it. Next, highlight any aspects of the ability mentioned in the quotes that are irrelevant to a one-on-one duel against this specific opponent by highlighting the presence or absence of something on the opponent's wiki page. Here's a hint: if the ability only affects opponents with a certain trait, and the opponent doesn't have that trait, it is probably irrelevant. Afterwards, do the same for the parts of the ability that are relevant. Finally, write several sentences on the potential significance of the ability to the outcome of the duel, referencing only information from the quotes you previously cited.

Format this part of your response like so:

# Abilities
## [Character's Name]
### [Ability's Name]
#### Quotes
"[A sentence paraphrased from the character's wiki page]"
"[A sentence paraphrased from the character's wiki page]"
"[A sentence paraphrased from the character's wiki page]"
#### Applicability to Opponent
##### Irrelevant Aspects
[Explain which of the quotes are wholly or partially irrelevant to a duel against this specific opponent using evidence from the opponent's wiki page.]
##### Relevant Aspects
[Explain which of the quotes are wholly or partially relevant to a duel against this specific opponent using evidence from the opponent's wiki page.]
#### Significance
[Speculate about the ability's significance to the duel's outcome, taking into account only the aspects of the ability that are relevant to this duel.]
### [Other Ability's Name]
...
...
## [Other Character's Name]
### [Other Character's Ability's Name]
...

This part of the prompt was written to compensate for specific issues with reasoning I would see come up over and over again. As mentioned previously, the LLM’s reasoning frequently showed a lack of concern for whether a given ability was relevant to the specific opponent, which is why so much of the prompt is dedicated to convincing the LLM to think critically. Another issue I ran into had to do with the LLM hallucinating facts about characters and their abilities, which is why I forced it to quote the wiki page directly.

Abilities Ranking

3. Using the significance sections in your abilities listing, create a ranking of every ability you mentioned. The ranking should be in order of the ability's likelihood to be a deciding factor in the duel. Only include the name of the ability in the ranking, not the name of the character.

Format this part of your response like so:

# Ranking (Without Names)
1. [Name of most significant ability]
2. [Name of 2nd most significant ability]
3. [Name of 3rd most significant ability]
4. [Name of 4th most significant ability]
5. [Name of 5th most significant ability]
6. [Name of 6th most significant ability]

In longer prompts like this, I noticed that the bias shifted from the first character to the second. In order to counter this “recency effect,” I had the LLM list the abilities that it thought were most important without actually naming the character that the ability belonged to. This was fairly effective in preventing the model from simply picking the character it thought about most recently.

Annotated Abilities Ranking

4. Rewrite your ranking by adding the character's name next to each ability.

Format this part of your response like so:

# Ranking (With Names)
[Number]. [Ability Name] ([Character Name])
...

With recency bias mitigated by the anonymous ranking in the previous step, the decision now needs to be tied back to the character the model thought had the more powerful abilities. To do so, I simply have the model add names to the ranking it already came up with.

Prior Battles

{% if franchise_a == franchise_b %}
5. If these characters have fought previously, provide a brief summary of how the battle went. Next, reason about whether the results of the battle have any significance on the versions of the characters taking part in this duel. Was the battle a one-on-one duel? Have the characters changed sufficiently for the outcome to be affected?

Format this part of your response like so:

# Prior Battles
[Describe one or more prior battles between the two characters. If no prior battles have taken place, write "N/A".]
{% endif %}

If the characters are from the same franchise, the LLM should take into account whether the characters have fought before. Of course, characters can change as the franchise’s story develops, so old battles may not be relevant.

Result

6. Based solely on the prior battle descriptions, lists of abilities and weaknesses, and abilities ranking, determine which character would be most likely to win in a hypothetical duel. Be realistic. Finalize your response by indicating the result:

# Winner: [Character Name]
The winner of the hypothetical duel would be [Character Name]. [Explain how you came to this conclusion.]

This system does not support follow-up questions. Follow the prompt outlined above exactly.

Once the LLM has done all of the reasoning outlined above, it’s ready to spit out an answer. This part of the response is designed so that “The winner of the hypothetical duel would be” can be used as a stop token to avoid paying for unnecessary reasoning. The final sentence was added because the Claude 3.5 models would sometimes offer to split the response into a different message to get around token count limitations (I assume this is the result of fine-tuning for the chat interface, as that’s not actually how token counts work).
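Since the stop sequence cuts generation just after the `# Winner:` header, the winner can be recovered with a small parser. A sketch (assuming the response was truncated via a stop-sequence parameter, as in the Anthropic Messages API):

```python
def parse_winner(response_text):
    """Extract the character name from the final '# Winner: ...' header,
    which appears just before the stop sequence cuts the response off."""
    for line in reversed(response_text.splitlines()):
        if line.startswith("# Winner:"):
            return line[len("# Winner:"):].strip()
    return None
```

Scanning from the end guards against a character's wiki quotes coincidentally containing a similar header earlier in the response.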

Final Prompt Benchmark

With this final prompt, the benchmark results show a modest improvement:

Improved Prompt Benchmark Results

One Piece (Easy):
Correct: 38
Incorrect: 2
Incorrect A: 1
Incorrect B: 1
One Piece (Medium):
Correct: 10
Incorrect: 2
Incorrect A: 1
Incorrect B: 1
One Piece (Hard):
Correct: 4
Incorrect: 2
Incorrect A: 1
Incorrect B: 1

Although I haven’t had the time to fully evaluate newer models (e.g. Claude 3.7 Sonnet or Claude 3.5 Haiku), my impression thus far is that they are able to obtain similar results without needing as much guidance in their prompt. Most of the matches I actually ran utilized Claude 3 Haiku due to its low cost and impressive performance compared to the alternatives at the time (I found that the Gemini models weren’t great at this kind of reasoning, for example, but that may have changed since then).

Evaluating Results

The results of each match are stored in a SQLite database. When generating the tier list, each character’s score is evaluated using the Bradley-Terry algorithm for pairwise comparisons. This algorithm is ideal for the kind of data being gathered, as other scoring systems (e.g. Elo) depend on the order of results. Bradley-Terry is also used by the LMArena leaderboard, presumably for the same reason.
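A pure-Python sketch of Bradley-Terry fitting via the classic minorization-maximization updates (match data and iteration count are illustrative):

```python
def bradley_terry(matches, iters=200):
    """Fit Bradley-Terry strengths from (winner, loser) pairs using the
    classic MM updates. Unlike Elo, the result is independent of the
    order in which matches are processed."""
    players = sorted({p for m in matches for p in m})
    wins = {(w, l): 0 for w in players for l in players}
    for w, l in matches:
        wins[(w, l)] += 1
    p = {name: 1.0 for name in players}
    for _ in range(iters):
        new = {}
        for i in players:
            w_i = sum(wins[(i, j)] for j in players)  # total wins of i
            denom = sum(
                (wins[(i, j)] + wins[(j, i)]) / (p[i] + p[j])  # games vs j
                for j in players if j != i
            )
            new[i] = w_i / denom if denom else p[i]
        total = sum(new.values())
        p = {name: v / total for name, v in new.items()}  # fix the scale
    return p

ratings = bradley_terry([
    ("luffy", "arlong"), ("luffy", "arlong"), ("luffy", "buggy"),
    ("arlong", "buggy"), ("arlong", "buggy"), ("buggy", "luffy"),
])
```

The normalization step is needed because Bradley-Terry strengths are only defined up to a constant scale factor.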

Generating the Tier List

Once the scores of each character have been evaluated, tier cutoffs are determined based on intervals between the maximum and minimum score. Finally, all of this is inserted into an HTML template to display the results.
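The tier-assignment step amounts to bucketing scores into equal-width intervals. A sketch with illustrative tier names:

```python
def assign_tiers(scores, tiers=("S", "A", "B", "C", "D")):
    """Split the range [min, max] into equal-width intervals, one per tier,
    and bucket each character by score (tier names are illustrative)."""
    lo, hi = min(scores.values()), max(scores.values())
    width = (hi - lo) / len(tiers) or 1.0  # avoid zero width when all scores match
    out = {}
    for name, score in scores.items():
        bucket = min(int((hi - score) / width), len(tiers) - 1)
        out[name] = tiers[bucket]
    return out

tiers = assign_tiers({"luffy": 10.0, "zoro": 7.0, "nami": 0.0})
```

One consequence of interval-based cutoffs (versus, say, percentile-based ones) is that tiers can be empty or lopsided when scores cluster, which matches how community tier lists usually read.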

Conclusion and Reflections

I very much enjoyed working on this project on-and-off over the past year. I’ve learned a lot about wrangling LLMs and I feel I’ve gained a much better understanding of their strengths and weaknesses. I’m still in awe that a project like this, which requires a computer to understand human language and think abstractly about what it learns, is even possible. Still, the results obviously aren’t anywhere near as good as what a human could have come up with, and I’m sure that I’ve spent more time figuring out how to automate the creation of a tier list than it would have taken to skim the wiki pages and put one together myself.

While reevaluating the benchmarks to write this blog post, I was somewhat disappointed by the lackluster performance of my complicated final prompt compared to the very basic version I created as a point of reference. I think I got too caught up in the process of iterating on prompts to “patch” specific problems I saw in the outputs, and I should’ve focused on other strategies instead (e.g. having a LLM intelligently summarize wiki pages as a preprocessing step to keep token count down).

Many thanks to the One Piece Wiki and Marvel Wiki contributors for making this project possible.

Once again, you can find the code for this project on my GitHub here.

I hope you’ve found this amusing.