Full Code Repository is available on my GitHub
Have you ever noticed that in some video games, no matter the genre, characters tend to follow certain "archetypes" in battle? For example, you might have a beefy character who is strong but slow, a precise, high-damage character, or a speedy but weak character. To go one step further, have you ever noticed that when "types" or "elements" are involved, these archetypes are even more common? You would probably imagine a rock character as a big golem: slow and heavy, with lots of health points and high defensive stats. You might imagine a fire character with lots of explosive power but subpar health and defenses. And maybe you have seen lightning-based characters: high speed, high precision, and lots of critical hits.
This leads to my questions for today: are there any archetypes present in the Pokémon games? And if so, are they based on the Pokémon "type"? These two questions build on one another. For those unfamiliar with Pokémon, every Pokémon has a primary type and an optional secondary type, drawn from a selection of 18 different types. These range from simple, elemental types such as "fire" or "water" to more specialized types such as "fighting" or "dragon". With almost one thousand Pokémon spread across all of these types, one might wonder whether they fall into the archetype trope described earlier.
We will be jumping right into the video games for this project and using context and lingo that is easier to follow if you enjoy those games yourself. Either way, let's first take a look at the dataset and understand what the data represents.
The "Pokémon With Stats" data set is a .csv file which contains 1072 records. Each record represents a Pokémon and has 13 columns called number, name, type1, type2, total, hp, attack, defense, sp_attack, sp_defense, speed, generation, legendary. The description of these columns is as follows, as described on the data set's webpage.
Something to note is that there are currently 898 different Pokémon in existence, yet there are 1072 records. This is because some Pokémon have alternate forms that modify their battle statistics. For the purposes of this research, we will ignore alternate forms and stick to just the 898 base Pokémon.
To get started, like with any other great code-based analysis, we must import all of our packages that will be used. This project uses pandas and numpy for data structures, and seaborn/matplotlib for visualizations. For some summary statistics, the scipy.stats package is used. Finally, for our machine learning techniques, scikit-learn is used.
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib import colors
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import calinski_harabasz_score
from scipy import stats
First, we will filter out the alternate-form records. Each individual Pokémon has its own unique ID (the "number" column), and any alternate form of a Pokémon shares the same number, so we can eliminate alternate-form records by simply dropping duplicates based on the "number" column (keeping the first of the duplicates always keeps the original, base Pokémon). After this simple filtering, we see the expected 898 records.
df = pd.read_csv('Pokemon.csv',keep_default_na=False)
df = df.drop_duplicates(subset=['number'])
df = df.reset_index(drop=True)
df
| | number | name | type1 | type2 | total | hp | attack | defense | sp_attack | sp_defense | speed | generation | legendary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Bulbasaur | Grass | Poison | 318 | 45 | 49 | 49 | 65 | 65 | 45 | 1 | False |
| 1 | 2 | Ivysaur | Grass | Poison | 405 | 60 | 62 | 63 | 80 | 80 | 60 | 1 | False |
| 2 | 3 | Venusaur | Grass | Poison | 525 | 80 | 82 | 83 | 100 | 100 | 80 | 1 | False |
| 3 | 4 | Charmander | Fire | | 309 | 39 | 52 | 43 | 60 | 50 | 65 | 1 | False |
| 4 | 5 | Charmeleon | Fire | | 405 | 58 | 64 | 58 | 80 | 65 | 80 | 1 | False |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 893 | 894 | Regieleki | Electric | | 580 | 80 | 100 | 50 | 100 | 50 | 200 | 8 | True |
| 894 | 895 | Regidrago | Dragon | | 580 | 200 | 100 | 50 | 100 | 50 | 80 | 8 | True |
| 895 | 896 | Glastrier | Ice | | 580 | 100 | 145 | 130 | 65 | 110 | 30 | 8 | True |
| 896 | 897 | Spectrier | Ghost | | 580 | 100 | 65 | 60 | 145 | 80 | 130 | 8 | True |
| 897 | 898 | Calyrex | Psychic | Grass | 500 | 100 | 80 | 80 | 80 | 80 | 80 | 8 | True |
898 rows × 13 columns
The data is given as base values of each battle statistic; however, we must consider that some Pokémon are designed to be stronger than others. Imagine a scenario where water type Pokémon tend to have low defense and high attack (assume this archetype is true, just for the sake of argument). We might also find many legendary water types whose stats are all relatively high; compared to normal, non-legendary Pokémon they would have high defense and even higher attack, yet their defense would still be proportionally low, so the archetype would still hold. The lesson is that we should not examine the raw values of the battle statistics, but rather the proportions of the battle statistics within each Pokémon.
When a Pokémon is described as following a "high attack, low defense" archetype, that means the proportion of its points in the attack statistic is large and the proportion in the defense statistic is small. The data set has a calculated attribute, total, which is the sum of the 6 battle stats, and we can use it to convert the points into proportions; for example, Bulbasaur's 45 HP out of a total of 318 becomes 45/318 ≈ 0.1415. After this transformation, we can easily see whether a Pokémon has most of its total points in a certain attribute or is relatively well balanced across the 6 attributes. Then we can examine the distribution of the data.
battle_cols = df.columns[5:11]
for col in battle_cols:
    df[col] = df[col] / df['total']
df
| | number | name | type1 | type2 | total | hp | attack | defense | sp_attack | sp_defense | speed | generation | legendary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Bulbasaur | Grass | Poison | 318 | 0.141509 | 0.154088 | 0.154088 | 0.204403 | 0.204403 | 0.141509 | 1 | False |
| 1 | 2 | Ivysaur | Grass | Poison | 405 | 0.148148 | 0.153086 | 0.155556 | 0.197531 | 0.197531 | 0.148148 | 1 | False |
| 2 | 3 | Venusaur | Grass | Poison | 525 | 0.152381 | 0.156190 | 0.158095 | 0.190476 | 0.190476 | 0.152381 | 1 | False |
| 3 | 4 | Charmander | Fire | | 309 | 0.126214 | 0.168285 | 0.139159 | 0.194175 | 0.161812 | 0.210356 | 1 | False |
| 4 | 5 | Charmeleon | Fire | | 405 | 0.143210 | 0.158025 | 0.143210 | 0.197531 | 0.160494 | 0.197531 | 1 | False |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 893 | 894 | Regieleki | Electric | | 580 | 0.137931 | 0.172414 | 0.086207 | 0.172414 | 0.086207 | 0.344828 | 8 | True |
| 894 | 895 | Regidrago | Dragon | | 580 | 0.344828 | 0.172414 | 0.086207 | 0.172414 | 0.086207 | 0.137931 | 8 | True |
| 895 | 896 | Glastrier | Ice | | 580 | 0.172414 | 0.250000 | 0.224138 | 0.112069 | 0.189655 | 0.051724 | 8 | True |
| 896 | 897 | Spectrier | Ghost | | 580 | 0.172414 | 0.112069 | 0.103448 | 0.250000 | 0.137931 | 0.224138 | 8 | True |
| 897 | 898 | Calyrex | Psychic | Grass | 500 | 0.200000 | 0.160000 | 0.160000 | 0.160000 | 0.160000 | 0.160000 | 8 | True |
898 rows × 13 columns
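As a quick sanity check (a one-liner sketch, not part of the original pipeline), each Pokémon's six proportions should now sum to 1:
# After dividing by 'total', the 6 battle stat proportions should sum to 1 per row
assert np.allclose(df[battle_cols].sum(axis=1), 1.0)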
There are 18 possible Pokémon types. Each Pokémon has a primary type and an optional secondary type, giving 18 possibilities for the first type and 18 for the second slot (one of the 17 remaining types, or no second type at all), for 324 (18 × 18) possible type combinations. Let's take a quick look at the distribution of the primary types.
plt.figure(figsize=(15, 10))
ax = sns.countplot(x='type1', data=df)
ax.set(xlabel='Primary Type', ylabel='Frequency', title="Distribution of Pokemon Types")
plt.show()
Within this data, we can also view how each of the 6 “battle stats” are distributed across all Pokémon by taking a look at some side-by-side boxplots.
cols = list(df.columns[5:11])
battle_stats = df[cols]
plt.figure(figsize=(15, 10))
sns.boxplot(data=battle_stats)
plt.show()
We see that there are a lot of outliers: Pokémon with unusually high or unusually low proportions in certain stats, without enough similar Pokémon to justify an archetype of their own. Since the techniques used in this analysis are sensitive to outliers, it would be advantageous to remove them. That alone is not enough reason, though; I always feel there should be a reason behind removing outliers. In the context of our original question, even if there are archetypes in the Pokémon data, there are bound to be Pokémon that "break the mould". A game designer can give a Pokémon whatever proportions they please, so the existence of a few Pokémon here and there that fit no common archetype should not deny the existence of archetypes altogether (leaving them in might cause us to incorrectly conclude that there is too much variance for the archetypes we are looking for). As such, we will remove these outlier Pokémon.
To accomplish this, we will find the interquartile range (IQR) for each of the 6 battle stats and filter out any Pokémon with at least one stat beyond 1.5 × IQR of the quartiles. That is, to be included in our analysis, a Pokémon must fall within the fences for all 6 stats (i.e., it is not an outlier in any sense).
tempdf = df.copy()  # We'll need a copy of the original data set, outliers included, for later!
for col in cols:
    q25, q75 = np.percentile(df[col], 25), np.percentile(df[col], 75)
    iqr = q75 - q25
    cut_off = iqr * 1.5
    lower, upper = q25 - cut_off, q75 + cut_off
    df = df[df[col] < upper]
    df = df[df[col] > lower]
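As a quick check (a small sketch whose only job is to print the count), we can see how many Pokémon survive the filtering:
# How many Pokemon remain after dropping outliers on any of the 6 stats
print(f"{len(df)} Pokemon remain after outlier removal")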
The resulting distributions show far fewer outliers after the filtering.
cols = list(df.columns[5:11])
battle_stats = df[cols]
plt.figure(figsize=(15, 10))
sns.boxplot(data=battle_stats)
plt.show()
Now we can take a quick look at some summary statistics to get an idea for what the "average" Pokémon looks like (if we don’t take into account any clusters/labels).
battle_stats.describe()
| | hp | attack | defense | sp_attack | sp_defense | speed |
|---|---|---|---|---|---|---|
| count | 779.000000 | 779.000000 | 779.000000 | 779.000000 | 779.000000 | 779.000000 |
| mean | 0.159656 | 0.182843 | 0.163788 | 0.167095 | 0.163029 | 0.163680 |
| std | 0.029690 | 0.044405 | 0.039927 | 0.044262 | 0.036166 | 0.053418 |
| min | 0.083333 | 0.075758 | 0.049180 | 0.047619 | 0.064912 | 0.034483 |
| 25% | 0.140064 | 0.148886 | 0.134237 | 0.132395 | 0.137169 | 0.124307 |
| 50% | 0.156250 | 0.180000 | 0.157182 | 0.166667 | 0.160000 | 0.161290 |
| 75% | 0.176777 | 0.216667 | 0.185567 | 0.200000 | 0.187016 | 0.202532 |
| max | 0.240000 | 0.295082 | 0.280528 | 0.300000 | 0.264151 | 0.314815 |
It seems that most Pokémon tend to have slightly higher attack while having balance in the remaining 5 stats. Let’s take a look at how the stats relate to each other.
sns.pairplot(battle_stats)
The histograms on the main diagonal suggest that each stat is somewhat normally distributed. If each stat is in fact normally distributed and we find very little correlation between the stats, then we can perform MANOVA to see whether Pokémon type has an effect on these 6 stats (which is exactly what we want to figure out!). It is hard to judge the linear dependence of some of these pairs from the scatterplots alone, so we will look at the correlation matrix and the covariance matrix to help decipher what these plots mean.
cov_mat=battle_stats.cov()
cov_mat
| | hp | attack | defense | sp_attack | sp_defense | speed |
|---|---|---|---|---|---|---|
| hp | 0.000882 | 0.000070 | 0.000025 | -0.000313 | -0.000139 | -0.000525 |
| attack | 0.000070 | 0.001972 | 0.000042 | -0.000878 | -0.000860 | -0.000339 |
| defense | 0.000025 | 0.000042 | 0.001594 | -0.000672 | 0.000159 | -0.001149 |
| sp_attack | -0.000313 | -0.000878 | -0.000672 | 0.001959 | 0.000139 | -0.000235 |
| sp_defense | -0.000139 | -0.000860 | 0.000159 | 0.000139 | 0.001308 | -0.000605 |
| speed | -0.000525 | -0.000339 | -0.001149 | -0.000235 | -0.000605 | 0.002854 |
corr_mat = battle_stats.corr()
corr_mat
| | hp | attack | defense | sp_attack | sp_defense | speed |
|---|---|---|---|---|---|---|
| hp | 1.000000 | 0.052878 | 0.021172 | -0.237989 | -0.129617 | -0.331319 |
| attack | 0.052878 | 1.000000 | 0.023764 | -0.446826 | -0.535569 | -0.142778 |
| defense | 0.021172 | 0.023764 | 1.000000 | -0.380275 | 0.109813 | -0.538535 |
| sp_attack | -0.237989 | -0.446826 | -0.380275 | 1.000000 | 0.086744 | -0.099480 |
| sp_defense | -0.129617 | -0.535569 | 0.109813 | 0.086744 | 1.000000 | -0.313125 |
| speed | -0.331319 | -0.142778 | -0.538535 | -0.099480 | -0.313125 | 1.000000 |
The covariances are all tiny in absolute terms (the proportions themselves are small numbers), but the correlation matrix reveals some fairly strong linear relationships, such as between attack and sp_attack (-0.447) or attack and sp_defense (-0.536). These relationships are hard to see visually, but with this added knowledge we can see that some of the scatterplots do in fact roughly form a diagonal line from the top left to the bottom right (a line with negative slope).
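Since some of these relationships are hard to eyeball in the scatterplots, an optional heatmap sketch of the correlation matrix makes the structure easier to scan:
# Visualize the correlation matrix; warm/cool colors encode positive/negative correlations
plt.figure(figsize=(8, 6))
sns.heatmap(corr_mat, annot=True, fmt='.2f', vmin=-1, vmax=1, cmap='coolwarm')
plt.title('Correlation Between Battle Stat Proportions')
plt.show()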
While there is clearly some linear dependence between pairs of variables, it is worth actually testing whether each of the 6 battle stats is normally distributed. To do this, we will perform the Shapiro-Wilk test for normality, which detects significant departures from a normally distributed population.
for col in cols:
    shapiro_test = stats.shapiro(battle_stats[col])
    print(shapiro_test)
ShapiroResult(statistic=0.978453516960144, pvalue=2.6408730757765397e-09)
ShapiroResult(statistic=0.9895996451377869, pvalue=2.5290804842370562e-05)
ShapiroResult(statistic=0.9668951034545898, pvalue=2.7778085821134058e-12)
ShapiroResult(statistic=0.9892361164093018, pvalue=1.7584263332537375e-05)
ShapiroResult(statistic=0.9807162880897522, pvalue=1.3024076039869215e-08)
ShapiroResult(statistic=0.9937264323234558, pvalue=0.00244623189792037)
Interpreting this test: the Shapiro-Wilk test takes normality as its null hypothesis and then looks for evidence against it. The resulting p-value is the probability of seeing data this far from normal if the population really were normally distributed, so a small p-value is evidence against normality. The largest p-value here belongs to the speed stat, at about 0.0024; the rest range from about 2.5e-05 all the way down to 2.8e-12. At any reasonable significance level, we reject normality for all six stats.
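The raw output above doesn't say which result belongs to which stat (they follow the order of cols: hp, attack, defense, sp_attack, sp_defense, speed), so here is a small optional sketch that labels each result:
# Re-run the tests with labels for readability
for col in cols:
    stat, p = stats.shapiro(battle_stats[col])
    print(f"{col}: W = {stat:.4f}, p = {p:.2e}")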
Before we go any further, there is one more thing to sort out: the idea of "hybrid" Pokémon, i.e., Pokémon with a secondary type. If we truly want to examine the effect of a single type on a Pokémon's stats, it is useful to start with the set of Pokémon that have no secondary type (we can refer to those as "pure" Pokémon). If there is a relationship between type and stats, there will be less variance to account for than if our Pokémon had secondary types (for example, a Pokémon that shouldn't have high defense might show unusually high defense because its secondary type is rock, which muddies the signal). If we can find a relationship in the pure Pokémon set, then we can move on to the set of all Pokémon with the types we found in mind.
To perform this filtering, all we must do is filter out any record that has an empty “type2” column. After filtering out non-pure Pokémon, we are left with 456 Pokémon. Let’s now do the same exploration and outlier detection as before to this “pure” Pokémon set.
pure_df=tempdf[tempdf['type2']=='']
pure_df = pure_df.reset_index(drop=True)
pure_df
| | number | name | type1 | type2 | total | hp | attack | defense | sp_attack | sp_defense | speed | generation | legendary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4 | Charmander | Fire | | 309 | 0.126214 | 0.168285 | 0.139159 | 0.194175 | 0.161812 | 0.210356 | 1 | False |
| 1 | 5 | Charmeleon | Fire | | 405 | 0.143210 | 0.158025 | 0.143210 | 0.197531 | 0.160494 | 0.197531 | 1 | False |
| 2 | 7 | Squirtle | Water | | 314 | 0.140127 | 0.152866 | 0.207006 | 0.159236 | 0.203822 | 0.136943 | 1 | False |
| 3 | 8 | Wartortle | Water | | 405 | 0.145679 | 0.155556 | 0.197531 | 0.160494 | 0.197531 | 0.143210 | 1 | False |
| 4 | 9 | Blastoise | Water | | 530 | 0.149057 | 0.156604 | 0.188679 | 0.160377 | 0.198113 | 0.147170 | 1 | False |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 451 | 891 | Kubfu | Fighting | | 385 | 0.155844 | 0.233766 | 0.155844 | 0.137662 | 0.129870 | 0.187013 | 8 | True |
| 452 | 894 | Regieleki | Electric | | 580 | 0.137931 | 0.172414 | 0.086207 | 0.172414 | 0.086207 | 0.344828 | 8 | True |
| 453 | 895 | Regidrago | Dragon | | 580 | 0.344828 | 0.172414 | 0.086207 | 0.172414 | 0.086207 | 0.137931 | 8 | True |
| 454 | 896 | Glastrier | Ice | | 580 | 0.172414 | 0.250000 | 0.224138 | 0.112069 | 0.189655 | 0.051724 | 8 | True |
| 455 | 897 | Spectrier | Ghost | | 580 | 0.172414 | 0.112069 | 0.103448 | 0.250000 | 0.137931 | 0.224138 | 8 | True |
456 rows × 13 columns
plt.figure(figsize=(15, 10))
ax = sns.countplot(x='type1', data=pure_df)
ax.set(xlabel='Primary Type', ylabel='Frequency', title="Distribution of Pure Pokemon Types")
plt.show()
cols = list(pure_df.columns[5:11])
pure_stats = pure_df[cols]
plt.figure(figsize=(15, 10))
sns.boxplot(data=pure_stats)
plt.show()
for col in cols:
    q25, q75 = np.percentile(pure_stats[col], 25), np.percentile(pure_stats[col], 75)
    iqr = q75 - q25
    cut_off = iqr * 1.5
    lower, upper = q25 - cut_off, q75 + cut_off
    pure_df = pure_df[pure_df[col] < upper]
    pure_df = pure_df[pure_df[col] > lower]
pure_stats = pure_df[cols]
plt.figure(figsize=(15, 10))
sns.boxplot(data=pure_stats)
plt.show()
pure_stats.describe()
| | hp | attack | defense | sp_attack | sp_defense | speed |
|---|---|---|---|---|---|---|
| count | 389.000000 | 389.000000 | 389.000000 | 389.000000 | 389.000000 | 389.000000 |
| mean | 0.159900 | 0.183619 | 0.161868 | 0.164946 | 0.163574 | 0.166093 |
| std | 0.030002 | 0.044587 | 0.036445 | 0.043023 | 0.035755 | 0.053912 |
| min | 0.083333 | 0.075758 | 0.073171 | 0.047619 | 0.083333 | 0.034483 |
| 25% | 0.140625 | 0.151976 | 0.134146 | 0.133333 | 0.137931 | 0.127389 |
| 50% | 0.157895 | 0.181416 | 0.157534 | 0.163265 | 0.160377 | 0.164706 |
| 75% | 0.176056 | 0.214433 | 0.181598 | 0.197222 | 0.186335 | 0.205882 |
| max | 0.240964 | 0.291667 | 0.269231 | 0.284314 | 0.265306 | 0.310345 |
We again see that pure Pokémon tend to have about 18% of their total points in attack, with relatively even proportions across the remaining 5 stats.
sns.pairplot(pure_stats)
Based on the histograms, the stats also have similar distributions in this filtered set.
pure_cov_mat = pure_stats.cov()
pure_cov_mat
| | hp | attack | defense | sp_attack | sp_defense | speed |
|---|---|---|---|---|---|---|
| hp | 0.000900 | 0.000070 | 0.000114 | -0.000330 | -0.000109 | -0.000644 |
| attack | 0.000070 | 0.001988 | 0.000061 | -0.000863 | -0.000859 | -0.000397 |
| defense | 0.000114 | 0.000061 | 0.001328 | -0.000533 | 0.000090 | -0.001059 |
| sp_attack | -0.000330 | -0.000863 | -0.000533 | 0.001851 | 0.000140 | -0.000265 |
| sp_defense | -0.000109 | -0.000859 | 0.000090 | 0.000140 | 0.001278 | -0.000541 |
| speed | -0.000644 | -0.000397 | -0.001059 | -0.000265 | -0.000541 | 0.002906 |
pure_corr_mat = pure_stats.corr()
pure_corr_mat
| | hp | attack | defense | sp_attack | sp_defense | speed |
|---|---|---|---|---|---|---|
| hp | 1.000000 | 0.052408 | 0.103809 | -0.255965 | -0.101523 | -0.398419 |
| attack | 0.052408 | 1.000000 | 0.037550 | -0.450034 | -0.538716 | -0.165166 |
| defense | 0.103809 | 0.037550 | 1.000000 | -0.339854 | 0.068756 | -0.539215 |
| sp_attack | -0.255965 | -0.450034 | -0.339854 | 1.000000 | 0.091296 | -0.114191 |
| sp_defense | -0.101523 | -0.538716 | 0.068756 | 0.091296 | 1.000000 | -0.280514 |
| speed | -0.398419 | -0.165166 | -0.539215 | -0.114191 | -0.280514 | 1.000000 |
While some relationships are a little bit weaker/stronger here and there, overall the covariance and correlations between the 6 stats are very similar.
for col in cols:
    shapiro_test = stats.shapiro(pure_stats[col])
    print(shapiro_test)
ShapiroResult(statistic=0.9741064310073853, pvalue=2.031649728451157e-06)
ShapiroResult(statistic=0.9903718829154968, pvalue=0.011999256908893585)
ShapiroResult(statistic=0.9700005054473877, pvalue=3.563248753835069e-07)
ShapiroResult(statistic=0.9924943447113037, pvalue=0.04792152717709541)
ShapiroResult(statistic=0.9727322459220886, pvalue=1.1170828884132789e-06)
ShapiroResult(statistic=0.9910305738449097, pvalue=0.018365293741226196)
Running the Shapiro-Wilk test again, 6 separate times, the p-values come out higher: sp_attack at about 0.048, speed at 0.018, and attack at 0.012. These are borderline rather than decisive, but the remaining stats still have extremely small p-values, so overall we still reject normality.
For our analysis, many techniques are ruled out because the distributions are not normal. Even if we could address the normality problem by resampling and appealing to the central limit theorem, the issue of linear dependence between the stats remains. This means we can't use MANOVA or other "traditional" statistical techniques effectively, so we will turn to machine learning techniques instead.
We will start by setting our X and y variables. Each individual x will be a vector of 6 observations, one for each of the 6 battle stats, and y will simply be the primary type of the Pokémon. Again assume, for the sake of argument, that these archetypes exist and are based on Pokémon type: if rock type Pokémon truly all have high defense and low offense, then given proportions with high defense and low offense, a classifier should tell me I gave it a rock-type Pokémon, and I could then check whether the Pokémon really is rock type. That is essentially how we will test the model's accuracy.
X = pure_df.iloc[:, 5:11].values
y = pure_df.iloc[:, 2].values
The model of choice today is a simple Linear Discriminant Analysis model. It will essentially attempt to "draw a line in the sand" between the different data points to classify them according to type.
Before fitting the model, we will first scale the data. The StandardScaler transforms each feature toward a mean of 0 and a standard deviation of 1, which helps the accuracy of our model and is generally good practice.
Next, we must figure out how many components should be used. One way is to first run LDA with n_components set to "None" and then store the ratio of explained variance from each component (essentially how much of the total variance that each component "explains"). Then we will add the components one by one until we can explain 95% of the total variance (a good standard number, as we would need all components to explain all 100%) and then we will say that is "good enough".
scaler = StandardScaler()
X=scaler.fit_transform(X)
lda = LDA(n_components=None)
X_lda = lda.fit_transform(X, y)
lda_var_ratios = lda.explained_variance_ratio_
# This code to calculate how much explained variance we get per component is in the public domain: https://creativecommons.org/publicdomain/zero/1.0/
def select_n_components(var_ratio, goal_var: float) -> int:
    # Set initial variance explained so far
    total_variance = 0.0
    # Set initial number of components
    n_components = 0
    # For the explained variance of each component:
    for explained_variance in var_ratio:
        # Add the explained variance to the total
        total_variance += explained_variance
        # Add one to the number of components
        n_components += 1
        # If we reach our goal level of explained variance
        if total_variance >= goal_var:
            # End the loop
            break
    # Return the number of components
    return n_components
n=select_n_components(lda_var_ratios, 0.95)
print("Optimal number of components:",n)
df
Optimal number of components: 4
number | name | type1 | type2 | total | hp | attack | defense | sp_attack | sp_defense | speed | generation | legendary | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | Bulbasaur | Grass | Poison | 318 | 0.141509 | 0.154088 | 0.154088 | 0.204403 | 0.204403 | 0.141509 | 1 | False |
1 | 2 | Ivysaur | Grass | Poison | 405 | 0.148148 | 0.153086 | 0.155556 | 0.197531 | 0.197531 | 0.148148 | 1 | False |
2 | 3 | Venusaur | Grass | Poison | 525 | 0.152381 | 0.156190 | 0.158095 | 0.190476 | 0.190476 | 0.152381 | 1 | False |
3 | 4 | Charmander | Fire | 309 | 0.126214 | 0.168285 | 0.139159 | 0.194175 | 0.161812 | 0.210356 | 1 | False | |
4 | 5 | Charmeleon | Fire | 405 | 0.143210 | 0.158025 | 0.143210 | 0.197531 | 0.160494 | 0.197531 | 1 | False | |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
891 | 892 | Urshifu Single Strike Style | Fighting | Dark | 550 | 0.181818 | 0.236364 | 0.181818 | 0.114545 | 0.109091 | 0.176364 | 8 | True |
892 | 893 | Zarude | Dark | Grass | 600 | 0.175000 | 0.200000 | 0.175000 | 0.116667 | 0.158333 | 0.175000 | 8 | True |
895 | 896 | Glastrier | Ice | 580 | 0.172414 | 0.250000 | 0.224138 | 0.112069 | 0.189655 | 0.051724 | 8 | True | |
896 | 897 | Spectrier | Ghost | 580 | 0.172414 | 0.112069 | 0.103448 | 0.250000 | 0.137931 | 0.224138 | 8 | True | |
897 | 898 | Calyrex | Psychic | Grass | 500 | 0.200000 | 0.160000 | 0.160000 | 0.160000 | 0.160000 | 0.160000 | 8 | True |
779 rows × 13 columns
lda = LDA(n_components=n)
X_lda = lda.fit_transform(X, y)
After creating and fitting our model, we will use Repeated Stratified K-Fold Cross-Validation to assess it. This has a lot of steps, so let's break it down from its simplest form first.
K-Fold Cross-Validation is the basis of this testing: it splits the dataset into k "folds", holds one fold out as a test set, trains the model on the remaining folds, and repeats this so that each fold takes a turn as the test set, reporting back the average accuracy across the runs.
Since the data is getting split up, there is a chance some folds will be missing classes (for example, a fold with no Flying types in it). Stratified K-Fold Cross-Validation fixes this by preserving the proportion of each of the strata (our 18 types) in every single fold.
This method can still be noisy, since the result depends on which data lands in which fold. To smooth that out, Repeated K-Fold Cross-Validation, or in our case Repeated Stratified K-Fold Cross-Validation, runs the whole process however many times we choose, randomizing the folds each time.
cross_validator = RepeatedStratifiedKFold(n_splits=3, n_repeats=3, random_state=1)
scores = cross_val_score(lda, X, y, scoring='accuracy', cv=cross_validator, n_jobs=-1)
print(np.mean(scores))
0.28532432253362483
After this validation, the model's accuracy is below 30%. Since 1/18 is only about 5.6%, it beats a model that guesses a type uniformly at random, but it is still a very inaccurate model; a slightly stronger baseline to compare against is one that always guesses the most common type.
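Here is a quick sketch of that majority-class baseline check (all it does is compute the share of the most common primary type in the pure set):
# Accuracy of a model that always predicts the most common primary type
majority_baseline = pure_df['type1'].value_counts(normalize=True).max()
print(f"Majority-class baseline accuracy: {majority_baseline:.3f}")
With that baseline in mind, let's take a look at how the model classifies the data across its first two components.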
lda = LDA(n_components=n)
X_lda = lda.fit_transform(X, y)
color_list = ["red", "violet", "blue", "g", "c", "m", "y", "b", "bisque", "darkorange", "lime", "crimson", "lightslategrey", "saddlebrown", "seashell", "turquoise", "khaki", "darkolivegreen"]
# Map each of the 18 primary types to its own color
color_dict = {ptype: color_list[i] for i, ptype in enumerate(pure_df["type1"].unique())}
plt.xlabel('LD1')
plt.ylabel('LD2')
plt.scatter(
    X_lda[:, 0],
    X_lda[:, 1],
    c=pure_df["type1"].map(color_dict),
)
plt.show()
As we can see, when we plot on the first two axes created by the LDA process, the model doesn’t do a great job at separating the data points.
While it seems the given labels (the Pokémon types) do not accurately indicate what Pokémon’s battle stat proportions will be, it is still possible that we need to look at it through a broader lens.
We will now try Principal Component Analysis (PCA). Where LDA explicitly uses the type labels to find projections that separate the classes, PCA ignores the labels entirely: it builds "Principal Components", linear combinations of the 6 stats that capture as much of the total variance as possible, and we then train a classifier on those components. Because we are no longer asking the projection itself to separate the given types, this also sidesteps our issue with "hybrid" Pokémon; rather than forcing each Pokémon to be explained by a single pure type, we let the stat patterns speak for themselves. So we will go back and use the full dataset that includes both pure and hybrid Pokémon, and see whether a classifier trained on the principal components can recover the primary type.
We will again use the same technique of finding how many components we should break the labels into as we did with finding the number of components for LDA. After scaling the data, we can split the data into training and testing data, then take a look at the resulting confusion matrix and accuracy.
X = df.iloc[:, 5:11].values
y = df.iloc[:, 2].values
scaler = StandardScaler()
X = scaler.fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
pca = PCA(n_components=None)
pca.fit(X)
pca_var_ratios = pca.explained_variance_ratio_
n = select_n_components(pca_var_ratios, 0.95)
pca = PCA(n_components=n)
X_train = pca.fit_transform(X_train)
X_test = pca.transform(X_test)  # transform only: refitting PCA on the test set would leak information
X_pca = PCA(n_components=n).fit_transform(X)  # a separate fit on the full set, used for plotting later
classifier = RandomForestClassifier(max_depth=7)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
conf_mat = confusion_matrix(y_test, y_pred)
print(conf_mat)
print("Accuracy: " + str(accuracy_score(y_test, y_pred)))
[[0 0 0 0 0 1 0 0 0 3 0 0 1 0 1 0 0 5]
 [0 0 0 0 1 0 0 0 0 1 0 0 2 0 0 0 0 0]
 [0 0 0 0 0 0 1 0 0 1 0 0 0 0 1 0 0 0]
 [0 0 0 1 0 0 0 0 0 1 1 0 2 0 0 0 0 7]
 [1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1]
 [0 0 0 0 0 0 2 0 0 1 0 0 0 0 1 0 0 2]
 [0 0 0 0 0 0 0 0 0 1 0 0 4 0 0 0 0 8]
 [0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [3 1 0 0 0 0 0 0 0 1 1 0 2 0 0 0 0 0]
 [1 0 1 1 0 0 0 0 0 1 0 0 7 0 0 0 0 8]
 [2 0 0 1 0 0 0 0 0 0 0 0 1 0 1 0 0 2]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 3]
 [1 0 0 0 1 0 4 0 0 4 0 0 0 0 2 0 0 7]
 [1 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 1]
 [0 0 0 0 0 1 0 0 0 0 2 0 3 0 0 0 0 4]
 [0 0 0 1 0 0 2 0 0 0 0 0 0 0 1 0 0 2]
 [0 0 0 2 0 0 0 0 0 0 0 0 3 0 1 0 0 0]
 [1 0 0 0 0 0 2 0 1 6 0 0 4 0 1 0 0 6]]
Accuracy: 0.05128205128205128
While this matrix might be hard to interpret, the main diagonal holds the correct classifications and everything off the diagonal is a misclassification. This yields a very low accuracy rate once again.
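As an optional readability aid (a small sketch, not part of the original pipeline), a labeled heatmap is much easier to read than the raw matrix; scikit-learn orders the rows and columns by the sorted union of the true and predicted labels:
# Label the confusion matrix with type names in scikit-learn's sorted label order
type_labels = np.unique(np.concatenate([y_test, y_pred]))
plt.figure(figsize=(10, 8))
sns.heatmap(conf_mat, annot=True, fmt='d', cmap='Blues', xticklabels=type_labels, yticklabels=type_labels)
plt.xlabel('Predicted type')
plt.ylabel('True type')
plt.show()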
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.scatter(
    X_pca[:, 0],
    X_pca[:, 1],
    c=df["type1"].map(color_dict)
)
plt.show()
Looking at how this model classifies the data across the first two principal components, the coloring appears essentially random.
With all the analysis thus far, I think it is safe to throw away the labels. That is, we have found strong evidence that the types do not dictate the proportions of the 6 battle stats in a Pokémon.
While this may sound disappointing, all we have found is that the types do not dictate the hidden archetypes we are looking for. It does not disprove the existence of those archetypes altogether. To continue our search, we will turn to unsupervised learning, where we discard the labels entirely and try to create our own.
What we will do is a very simple k-means analysis, where we try to create clusters of data points that are "close" to each other. You can imagine points on a graph where we try to draw k circles around the points so that each circle, or "cluster", is as dense with points as possible (the points are not sparsely spread around the circle).
X = df.iloc[:, 5:11].values  # k-means uses the full set here; no train/test split is needed
First, we need to figure out how many clusters to draw (i.e., the value of k). Since k-means is such a fast algorithm, a common way to do this is the "Elbow Method": we run the algorithm starting with just 2 clusters and go up to a specified amount (in this analysis I believe 20 is sufficient), then calculate the Calinski-Harabasz (CH) index for each run. This index is essentially the ratio of between-cluster dispersion to within-cluster dispersion; the higher the score, the better that k value performed.
So, for example, imagine we had a piece of paper with data points tightly drawn in each of the four corners. If we attempted to enclose these points with only 2 circles, on average a lot of points within a circle would be very far apart (there would be points on opposite ends of the paper), which corresponds to a low CH index. But with four circles, we could draw small circles in each of the four corners where all of the points are packed together closely, which corresponds to a high CH index.
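To make that intuition concrete, here is a tiny synthetic sketch (make_blobs and the corner positions are purely illustrative, not part of the Pokémon analysis): four tight blobs in the corners should score far higher with k=4 than with k=2.
# Toy version of the four-corners example using synthetic blobs
from sklearn.datasets import make_blobs
corners, _ = make_blobs(n_samples=200, centers=[(-5, -5), (-5, 5), (5, -5), (5, 5)], cluster_std=0.5, random_state=0)
for k in (2, 4):
    toy_labels = KMeans(n_clusters=k, random_state=0).fit_predict(corners)
    print(f"k = {k}: CH index = {calinski_harabasz_score(corners, toy_labels):.1f}")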
Going back to the Elbow Method, what we will do is just run the algorithm across these different values for k and then use the k value that resulted in the highest CH index.
scores = {}
scale = StandardScaler()
StdScale = scale.fit_transform(X)
max_clusters = 20
for i in range(2, max_clusters + 1):
    kmeans = KMeans(n_clusters=i)
    labels = kmeans.fit_predict(StdScale)  # clusters are fit on the scaled data
    ch_index = calinski_harabasz_score(X, labels)  # but scored on the raw proportions
    scores.update({i: ch_index})
n=max(scores, key=scores.get)
plt.plot(list(scores.keys()), list(scores.values()))
plt.xlabel("Number of clusters")
plt.ylabel("Calinski-Harabasz Index")
plt.annotate(f"Number of clusters = {n}", xy=(n, scores[n]), xytext=(n+1, scores[n]*.95), arrowprops=dict(arrowstyle="->"))
plt.xticks(range(2,22,2))
plt.show()
kmeans = KMeans(n_clusters = n, init="k-means++",random_state=1)
kmeans.fit(StdScale)
df["cluster"] = kmeans.labels_
We have found that the optimal number of clusters is 3. These three clusters are the best way to group together the different proportions of Pokémon battle stats, which means we have found exactly what we were looking for: these clusters represent the three main archetypes present in the Pokémon games! Looking at the same pair plots as before, we can see how the clusters separate the points.
sns.pairplot(df[['hp', 'attack', 'defense','sp_attack', 'sp_defense', 'speed', "cluster"]], hue = "cluster",palette="flare",)
While the clusters do not always perfectly separate the points into three groups, more often than not they do a good job. To get a better overall idea, let's look at that same graph of the first two principal components, except this time colored by cluster.
X = df.iloc[:, 5:11].values
scaler = StandardScaler()
X = scaler.fit_transform(X)
pca = PCA(n_components=2)  # we only plot the first two principal components
X_pca = pca.fit_transform(X)
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.scatter(
    X_pca[:, 0],
    X_pca[:, 1],
    c=df["cluster"],
    cmap="flare"
)
plt.show()
This looks a lot better! As a reminder, we used the entire dataset of both hybrid and pure Pokémon here. We can investigate how each stat is distributed within each cluster by looking at some boxplots, this time split by cluster rather than lumped into a single boxplot per stat as before.
cluster_box_plots = pd.melt(df, id_vars = [
"number",
"name",
"type1",
"type2",
"generation",
"cluster"
], value_vars = [
'hp', 'attack', 'defense','sp_attack', 'sp_defense', 'speed'
])
plt.figure(figsize=(12,5))
ax = sns.boxplot(x="variable", y="value", hue = "cluster", data=cluster_box_plots,palette="flare")
plt.title("Battle Stats Boxplots by Cluster")
plt.xlabel("Skills")
plt.ylabel("Proportion")
Here we can see many differences: for example, Pokémon in the "1" cluster tend to have a lot more attack than the other two, while Pokémon in the "2" cluster have a lot more speed than the other two. We can sum up each cluster by simply looking at the average battle stat proportions in each cluster.
col_means = df.groupby('cluster')[cols].mean()  # select the 6 stat columns so non-numeric columns are excluded
col_means
| cluster | hp | attack | defense | sp_attack | sp_defense | speed |
|---|---|---|---|---|---|---|
| 0 | 0.160290 | 0.148680 | 0.174808 | 0.183738 | 0.196262 | 0.136221 |
| 1 | 0.171538 | 0.221638 | 0.183640 | 0.132283 | 0.147277 | 0.143912 |
| 2 | 0.147464 | 0.181961 | 0.132717 | 0.182942 | 0.142590 | 0.212327 |
As a point of reference, if a Pokémon had perfectly even stats across the board, we could call it a "balanced" archetype: each of the six proportions would be equal, namely 1/6 or 0.1666…. With this in mind, whenever we see a proportion near 16.67%, we can call that battle stat "average" or "baseline". To better visualize these clusters, we can create some pie charts, each representing an archetype!
pie_colors = sns.color_palette('pastel')[0:6]  # renamed to avoid shadowing the matplotlib "colors" import
for index, row in col_means.iterrows():
    plt.pie(row, labels=cols, colors=pie_colors, autopct='%.0f%%')
    plt.show()
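To quantify how each archetype departs from perfect balance, here is a small sketch subtracting the 1/6 baseline from each cluster's mean proportions (positive values are above baseline, in percentage points):
# Deviation of each cluster's mean proportions from the 1/6 "balanced" baseline
baseline = 1 / 6
print(((col_means - baseline) * 100).round(1))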
The first cluster ("0") sacrifices a bit of attack and speed to focus on somewhat high sp_attack and defense, with high sp_defense.
The second cluster ("1") is defined by high attack, with above-average defense and around-average hp. To compensate, it has low sp_attack and speed and relatively low sp_defense.
The third and final cluster ("2") has somewhat high attack and sp_attack along with high speed, but sacrifices the remaining three stats to compensate.
We first found that Pokémon battle stats are definitely not based on their types. This makes a lot of sense, because Pokémon has a very competitive side, and from a game-balancing perspective type-locked stats would not be healthy for the game. Type already plays a big part in Pokémon battles: types have weaknesses to other types (fire is weak to water, water is weak to grass, etc.). That means a player building a well-balanced, powerful team must keep types and archetypes in mind at the same time. For example, say you find that to round out your team you need a "tank" with high defenses and health to endure a lot of damage, but based on the rest of your team you don't want to be forced to choose, say, a rock type. The fact that Pokémon type has nothing to do with archetypes is good news in a situation like that!
Secondly, we managed to find what we were really looking for: the hidden archetypes in Pokémon, and how they are generally defined. One thing to note is that we used 3 clusters because that was the "best" we could do; it is possible that there is simply no good way to cluster this data, or that a different method for choosing the number of clusters would give a different answer. However, based on the scores and how the data visually clustered together, it seems we did as good a job as possible. Another consideration is that the battle stats used here include no special bonuses or other in-game stat modifiers, so this is not entirely reflective of actual gameplay.
With all that being said, we can say that generally when a Pokémon is designed, usually they tend to fall into one of our three archetypes. In the future as the newest set of Pokémon games are about to be released, it will be interesting to see if the new set of Pokémon will reinforce these clusters, or if they will end up redefining our clusters.