How we created opal v2, improved explainability through Positive Weight constraints, and measured reliability through PageRank
Hello friends, thanks for checking out this post! This is still a work in progress, so the writing will be rough and information will be lacking.
Introduction
This article assumes you have some knowledge of rhythm games.
Rating Estimation
As we know, the fundamental components of Rhythm Games are the:
- Players
- Beatmaps (in short: maps)
- Results
A player would play a map, which would yield a result. Maps and Players are usually given some rating value that estimates the expected result of the play. For example, in osu!mania:
- maps are given a difficulty rating (Star Rating)
- players are given a performance rating (Performance Points)
The fundamental motivation for an accurate rating is that players can find maps that suit their current goal. For example, players of different ratings perceive the same map rating differently: the table below shows the map rating a player of each rating might find Easy, Medium, Hard, or Impossible. If a player wants a casual playthrough, they may choose maps that are Easy or Medium relative to their rating.
Player Rating | Easy | Medium | Hard | Impossible |
---|---|---|---|---|
5000 | 3 | 4 | 5 | 6 |
10000 | 5 | 7 | 8.5 | 11 |
15000 | 6 | 7.5 | 9 | 12 |
However, evaluating the rating on both players and maps is nothing short of difficult, which explains why most rhythm games still struggle to nail this. Rating algorithms such as:
- osu!mania’s
- Etterna’s
- Quaver’s
are some variants of automated rating algorithms, yet none of them seem to be unanimously agreed upon as the best. This is largely because human evaluation of them is far from objective. Who’s to say who’s more right? As hinted, the solution is something closer to objectivity, something that detaches itself from human biases.
Though, how do we achieve this? To answer this, we need to understand how players attach a rating to a map. There are 2 ways that are best detached from human bias:
Map Analysis: By looking at the map objectively, we can estimate how difficult it is by observing its patterns.
There’s a major difference between the rating systems available out there and what we want to achieve. Because these rating systems are designed by humans, they are still opinionated by the algorithm designer. For example, `if map = mine then rating = 1000 else 500` is a possible algorithm, but it’s clearly far from objective. Fortunately, most advanced rating systems out there are more nuanced than this, removing as much bias as possible.
Referencing Player Scores: By referencing scores of other players, and their known ratings, we can infer how difficult the map is. This is surprisingly simple, as we do not need to understand the intricacies of complex patterns; we just need to know the ratings of the players who played the map.
For example, the following score table:
Player Rating | Map A Score | Map B Score |
---|---|---|
1500 | 95% | 90% |
1000 | 90% | 85% |
500 | 85% | 80% |
If we use the rule that a map’s rating equals the rating of the player who achieves 90% on it, then Map A is rated 1000 and Map B is rated 1500.
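To make this concrete, here’s a minimal sketch (the variable names and helper are ours, not part of any opal code) that recovers both map ratings from the table above by interpolating the player rating that achieves 90%:

```python
import numpy as np

# Score table from above: player rating -> accuracy on each map.
ratings = np.array([1500, 1000, 500])
scores = {
    "Map A": np.array([0.95, 0.90, 0.85]),
    "Map B": np.array([0.90, 0.85, 0.80]),
}

def map_rating(map_scores: np.ndarray, target: float = 0.90) -> float:
    """Rating of the (interpolated) player that achieves `target` accuracy."""
    order = np.argsort(map_scores)  # np.interp needs increasing x values
    return float(np.interp(target, map_scores[order], ratings[order]))

for name, s in scores.items():
    print(name, map_rating(s))  # Map A -> 1000.0, Map B -> 1500.0
```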
Simple enough! However, this is a recursive optimization problem: to know the Player Rating, we need to know the Map Rating, and vice versa! This is what we want to tackle in this article.
We’ll propose an algorithm that uses player scores to optimize Player and Map Rating.
Our Contributions
We will demonstrate a simple custom network architecture that not only achieves low error in score prediction, but also provides valuable insights into map difficulty and player skill estimation through its learnt embeddings.
Structure
In this paper, the important components are described at a high level, supplemented with low-level detail in their subsections.
Opal
Unlike many Neural Network design approaches, opal prioritizes explainability over raw accuracy. Therefore, parameter choices in opal are carefully selected, and we’ll explain all design decisions in the following sections.
Related Work
The landscape of VSRG (vertical scrolling rhythm game) ML is scarce, as admittedly, not much value is gained from uncovering this mystery.
Some common solutions in untangling this mess:
- AlphaOsu: A recommendation system that can be considered an adversarial approach against the current algorithm: it suggests maps to players that will provide the largest gain in measured “Skill”. We say “adversarial” because the target game (osu!mania) doesn’t have an optimal difficulty calculation algorithm, which means the best recommendations are often due to flaws in the difficulty calculation.
- MugDiffusion: A map generation algorithm based on Stable Diffusion. It learns from existing maps by encoding them with a Variational AutoEncoder (VAE), trains a denoising model on these embeddings, and generates maps with a similar approach to many Diffusion networks. The key takeaway we found from this is that VAE is a potential target for encoding the complex high-dimensional map, and can be considered as an additional input to opal in further research.
- ELO-Based Approach: (We do not have an exact source) The ELO approach is based heavily on Chess, where an ELO rating is updated based on the results of tournament matches. The dataset, when scraped from tournaments, is often reliable, as players can be assumed with high confidence to be playing at a consistent skill level, an important assumption for Collaborative Filtering (CF). However, this method falls short in extrapolating to unseen maps and players.
- Opal (V1): A score recommendation system, which this paper supersedes. This approach naively implements NeuMF, a class of Neural Collaborative Filtering networks, which we found only worked when we heavily over-parameterized the embedding space. The fundamental flaw was that player and map embeddings were compared with a dot product, losing the directional component; we hypothesize that our extremely large embeddings made up for this error.
Building upon the lessons learned from opal v1, we not only want to provide a score-recommendation system, we also want to provide explainable introspection into the embeddings.
Background
We first define the high-level problem we’re tackling:
- We want to find the best model such that it minimizes the prediction error: $\arg\min_f[\text{error}]$.
- The error is defined by the difference between the predicted accuracy and actual accuracy: $\text{error}=a_\text{pred} - a_\text{actual}$
- Since we know how to get the actual accuracy, that leaves us with how to predict it. We define this predicting operation as $f$, which takes in the player $p$ and map $m$ and outputs a prediction $a$.
- We average the error over all $P$ players and $M$ maps with $\frac{1}{MP}\sum_i^P\sum_j^M$
Putting it all together (and taking the error’s magnitude, so over- and under-predictions don’t cancel out), we get this optimization:

\[\arg\min_f \left[ \frac{1}{MP}\sum_i^P\sum_j^M \left| \underbrace{f(p_i, m_j)}_{a_\text{pred}} - a_\text{actual} \right| \right]\]

Let’s take a look at what computing this minimization looks like.
Actual | Map A | Map B |
---|---|---|
Player A | 100% | 90% |
Player B | 80% | 70% |
Predict | Map A | Map B |
---|---|---|
Player A | $f(p_A, m_A) =$ 90% | $f(p_A, m_B)=$ 90% |
Player B | $f(p_B, m_A) =$ 80% | $f(p_B, m_B)=$ 80% |
Difference | Map A | Map B |
---|---|---|
Player A | -10% | 0% |
Player B | 0% | 10% |
As shown, $f$ is simply a way to deduce the true accuracy from the player $p$ and map $m$. The smaller the differences, the better the model $f$.
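As a minimal sketch of the objective itself (assuming, for simplicity, a fully populated score table like the one above), the quantity we want $f$ to minimize is just the average magnitude of those differences:

```python
import numpy as np

actual = np.array([[1.00, 0.90],   # Player A on Map A, Map B
                   [0.80, 0.70]])  # Player B on Map A, Map B
pred   = np.array([[0.90, 0.90],
                   [0.80, 0.80]])

# Mean absolute difference between predicted and actual accuracy,
# averaged over all (player, map) pairs.
error = np.abs(pred - actual).mean()
print(error)  # 0.05
```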
At this point, we’re hit with various questions:
- How does $f(p, m)$ even work? Neither $p$ nor $m$ are numbers
- Not all players $p$ have played all maps $m$, what do we do about them?
- Players change and improve across days, how do we deal with that?
(1) Embedding
How does $f(p, m)$ even work? Neither $p$ nor $m$ are numbers
To understand this, imagine $m$ and $p$ can be described by a list of values. For example, I could describe my abilities with numbers:
- Jack: 10/100
- Speed: 70/100
- Stamina: 50/100
- … and so on
Next, we visualize this as a radar chart below:
{
"type": "radar",
"data": {
"labels": [
"Speed",
"Jack",
"Long Notes",
"Stamina",
"Coordination"
],
"datasets": [{
"label": "Evening",
"data": [70, 10, 30, 50, 35],
"borderColor": "#F00"
}]
},
"options": {
"aspectRatio": 2,
"scale": {
"r": {
"min": 0,
"max": 100
}
}
}
}
As shown, we can represent a non-numerical player numerically! But what’s more interesting is when you also place maps on the same chart:
{
"type": "radar",
"data": {
"labels": [
"Speed",
"Jack",
"Long Notes",
"Stamina",
"Coordination"
],
"datasets": [
{
"label": "Evening",
"data": [70, 10, 30, 50, 35],
"borderColor": "#F00"
},
{
"label": "Map A",
"data": [30, 10, 10, 40, 10],
"borderColor": "#0F0"
},
{
"label": "Map B",
"data": [80, 50, 50, 80, 60],
"borderColor": "#00F"
}
]
},
"options": {
"aspectRatio": 2,
"scale": {
"r": {
"min": 0,
"max": 100
}
}
}
}
Intuitively, if a map’s statistics are mostly smaller than mine, I’ll do well (see Map A). Conversely, if the map’s statistics are larger, I’ll do worse (see Map B).
This hints on something amazing…
Let’s say we take the values of the player, then subtract the values of the map: we get a set of values that describe the score in some way! Take for example something simple: an LN player playing a simple LN map.
 | Rice | LN |
---|---|---|
Player | 30 | 100 |
Map | 10 | 60 |
Difference (Player - Map) | 20 | 40 |
We expect a high score here because the player is strong at LNs. We then need to describe how this difference is mapped to a score.
Linear Contribution
The simplest model is to assume that each difference contributes linearly to the resulting accuracy, in other words:

\[w_1\text{Diff}_1 + w_2\text{Diff}_2 + \dots + w_n\text{Diff}_n + b = a_\text{pred}\]

Where $w$ is the multiplicative weight of each $\text{Diff}$ component, and $b$ is the additive bias of the entire equation.
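As a sketch with made-up skill dimensions, weights, and bias (none of these numbers come from opal), the linear contribution is just a dot product between the difference vector and a weight vector, plus a bias:

```python
import numpy as np

player = np.array([30.0, 100.0])  # [Rice, LN] strengths of the player
map_   = np.array([10.0,  60.0])  # [Rice, LN] demands of the map
diff = player - map_              # [20, 40]

w = np.array([0.001, 0.002])      # hypothetical learned weights
b = 0.86                          # hypothetical learned bias

a_pred = w @ diff + b             # 0.001*20 + 0.002*40 + 0.86 = 0.96
print(a_pred)
```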
(2) Missing Scores and Reliability
Not all players $p$ have played all maps $m$, what do we do about them?
As mentioned, many scores are missing; it’s not uncommon to see the table be far from fully populated:
Actual | Map A | Map B |
---|---|---|
Player A | 100% | |
Player B | | 70% |
What this means is that some player and map ratings can be wildly inaccurate when data is scarce. Take for example an extreme case:
Actual | Map A | Map B | … | Map Z (Gimmick) |
---|---|---|---|---|
Rank 1 Player | 100% | 99.95% | … | 80% |
Rank 100K Player | | | | 96% |
We have a Rank 1 Player who has played 100s of maps, and a new Rank 100K Player who has played only 1. Coincidentally, they both played a Gimmick Map: a map so severely out of the norm that it might as well have its own skill set. If we optimized this as-is, we’d infer that the Rank 100K Player would score higher than the Rank 1 Player on all other maps, which is egregiously wrong.
This leads us to the discussion of reliability, which is a measure of how out-of-the-norm a player or map is.
Take for example the graph below:
mindmap
  Player A
    Map A
      Player B
      Player G
      Player H
      Player I
      Player J
    Map B
      Player F
      Player E
      Player D
    Map C
      Player C
        Map E
- High Reliability: Player A has played 5 separate maps, and Map A has been played by 6 different players.
- Low Reliability: However, Player C rarely played any maps and Maps C and E have rarely been played.
One way to measure reliability, as in the bullet points above, is to count the number of immediate associations of each player or map. In Graph Theory, this is simply the degree of each node, and it is indeed an easy way to evaluate reliability. However, the degree only considers the immediate neighbours. To improve on this, we opted for PageRank, which considers the global context.
To simplify this explanation:
- Degree: Reliability is relative to the number of associated nodes
- PageRank: Reliability is relative to the number of reliable nodes
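Here’s a minimal sketch of both measures, computed with networkx; the edge list mirrors the graph above, and an edge simply means “this player has a score on this map”:

```python
import networkx as nx

edges = [
    ("Player A", "Map A"), ("Player A", "Map B"), ("Player A", "Map C"),
    ("Player B", "Map A"), ("Player G", "Map A"), ("Player H", "Map A"),
    ("Player I", "Map A"), ("Player J", "Map A"),
    ("Player D", "Map B"), ("Player E", "Map B"), ("Player F", "Map B"),
    ("Player C", "Map C"), ("Player C", "Map E"),
]
G = nx.Graph(edges)

degree = dict(G.degree())    # reliability ~ number of neighbours
pagerank = nx.pagerank(G)    # reliability ~ number of *reliable* neighbours

print(degree["Map A"], degree["Map E"])      # 6 vs 1
print(pagerank["Map A"], pagerank["Map E"])  # Map A is much larger
```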
Now that we can quantify the reliability of each player and map, what do we do with it?
There are 3 choices:
- We train the model as is, then report the reliability of each Player and Map
- We use the reliability as an importance weight when training
- We drop samples that are non-reliable, then train
We can’t choose (1.), because low-reliability players or maps will strongly affect the resulting embeddings during training. (3.) is a safe choice; however, finding the threshold for dropping samples can be troublesome. That leaves us with (2.).
We’ll discuss more on how we used PageRank to weigh the samples later.
(3) Player Rating Variability
Players change and improve across days, how do we deal with that?
Spline fitting is not implemented yet due to complexity.
Player Ratings naturally change over time, so we cannot pin a single rating to a player that spans across dates. Thus, we need to express the natural change in Player Rating through a function:

\[f(p_i, t, \theta_i) = \text{Rating}\]

For example, we could use functions like these:
{
"type": "line",
"data": {
"labels": ["Year 1", "Year 2", "Year 3", "Year 4", "Year 5", "Year 6"],
"datasets": [
{
"label": "Rating Estimatino (Stepped)",
"data": [1000, 1200, 1150, 1200, 1250, 1200],
"borderColor": "#F00",
"fill": "false",
"stepped": "true"
},
{
"label": "Rating Estimation (Smooth)",
"data": [1000, 1200, 1150, 1200, 1250, 1200],
"borderColor": "#0F0",
"fill": "false",
"tension": 0.4
}
]
}
}
A stepped function would evaluate the Rating as a constant within each year, while a smoothed function would attempt to interpolate smoothly between the yearly estimates.
Current versions of opal rely on the first method, which estimates a single constant Player Rating for each year. We decided on this largely because it’s simpler to understand and experiment with. However, there are 2 major caveats:
- As the function is constant through the year, if the player improves significantly during it, the rating will be inaccurate.
- Furthermore, since the step function “jumps” between years, it’s not an accurate representation of how players improve.
We largely accepted these caveats in exchange for simplicity; ideally, however, we would want a smooth function as depicted above. This is left for future work!
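To illustrate the difference between the two, here’s a small sketch (the years and ratings are made up; opal currently uses only the stepped variant, with one constant rating per year):

```python
import numpy as np

years   = np.array([2017, 2018, 2019, 2020])  # yearly rating estimates
ratings = np.array([1000, 1200, 1150, 1200])

def rating_stepped(t: float) -> float:
    # Constant within each year: take the rating of the latest year <= t.
    return float(ratings[np.searchsorted(years, t, side="right") - 1])

def rating_smooth(t: float) -> float:
    # Linear interpolation between yearly estimates (a stand-in for a spline).
    return float(np.interp(t, years, ratings))

print(rating_stepped(2018.5))  # 1200.0 (constant through 2018)
print(rating_smooth(2018.5))   # 1175.0 (halfway between 1200 and 1150)
```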
Methodology
Architecture
Following all this background, we first show the architecture:
flowchart TD
    P[Player] --> PE[Player Embedding]
    M[Map] --> ME[Map Embedding]
    PE --> DE["Delta Embedding (Player - Map)"]
    ME --> DE
    DE --> PL[Positive Network]
    PL --> A[Accuracy]
    PL --> AV[Accuracy Variance]
    A --> L[Loss]
    AV --> L
    P --> PRW[PageRank Weight]
    M --> PRW
    PRW --> L
There are a few additional components we haven’t discussed here:
- What is the Positive Linear Network?
- Why do we output Accuracy, and its Accuracy Variance?
Positive Weight Constraint for Embedding Guidance
The Positive Network is simply a stacked, linear, feed-forward network whose weights are constrained to be positive. What this means is that the function it learns will always be monotonically increasing in its inputs. However, it’s crucial that this constraint does not impede learning. To achieve this, we apply $\text{Softplus}(w)$ to the weights before they are used in the Linear layer.
We have tried other approaches, such as:
- $\exp(w)$: This had the side-effect of discouraging low weights due to weight decay. It means the model will attempt to avoid low accuracies, which is not holistic.
- $\text{ReLU}(w)$: While we found that this worked better than $\exp$, we didn’t want to encourage dead neurons.
- $\Phi(w)$: The CDF of a normal distribution; while it works, it felt more complicated than necessary.
We implement this new Linear Layer like so, overriding the `forward` of `nn.Linear` in PyTorch.
from torch import nn
import torch
import torch.nn.functional as F

class PositiveLinear(nn.Linear):
    def __init__(self, in_features, out_features, bias=True):
        super().__init__(in_features, out_features, bias)
        # Softplus maps any real-valued weight to a strictly positive value.
        self.fn = torch.nn.Softplus()

    def forward(self, x):
        # Identical to nn.Linear, except the stored weights pass through
        # Softplus first, so the effective weights are always positive.
        return F.linear(x, self.fn(self.weight), self.bias)
It’s just that simple.
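Building on PositiveLinear, here is a rough sketch of how such layers could be stacked into the Positive Network described above; the layer sizes and the activation are illustrative, not opal’s exact configuration:

```python
import torch
from torch import nn

class PositiveNetwork(nn.Module):
    """Stacked PositiveLinear layers: monotonically increasing in every input."""
    def __init__(self, in_features: int, hidden: int = 8):
        super().__init__()
        self.net = nn.Sequential(
            PositiveLinear(in_features, hidden),
            nn.ReLU(),                   # a monotone activation preserves monotonicity
            PositiveLinear(hidden, 2),   # e.g. accuracy mean and a raw variance term
        )

    def forward(self, delta: torch.Tensor) -> torch.Tensor:
        return self.net(delta)

delta = torch.randn(4, 3)                # 4 plays, 3-dim (player - map) delta embedding
print(PositiveNetwork(3)(delta).shape)   # torch.Size([4, 2])
```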
Reducing Outlier Impact
Sorry this is a bit mathy
One of the biggest issues we found when training opal was outlier maps. What usually happens is that because gimmick maps are “hard”, but also “easy” for some players who memorize them, the model doesn’t know where to put their embeddings.
A common mitigation is to simply clean the data; however, we find that it’s not that simple to find a common criterion that works for removing them:
- Remove all less-played maps -> Maps that are very difficult will not be in the pool -> Gimmick maps are not necessarily less played
- Remove maps with SVs (scroll velocity changes) -> Some maps have non-intrusive SVs -> Some gimmicks don’t even use SVs
- Blacklist maps -> We will need to constantly update the blacklist -> We also introduce bias into the model.
So, instead of filtering them out, we simply accept them, but reduce their contribution to training. As mentioned previously, we used PageRank to reduce the impact of less reliable maps; this is another, similar layer, but one that instead accepts that some maps can be outliers.
Take for example, if we simply used Mean Squared Error and had a gimmick map: during each epoch, the embeddings of players would be heavily affected by this outlier, which hurts convergence stability. To fix this, we instead model each player’s predicted accuracy as a Laplace distribution; what this means is that we accept that each player has a decent amount of variation when exposed to different maps. Due to the longer tail of the Laplace, we find that losses calculated with respect to this distribution through the Negative Log Likelihood are far more lenient towards outliers!
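Here’s a minimal sketch of such a loss, assuming the network outputs a mean accuracy and a Laplace scale per play, and that each sample carries a PageRank-derived reliability weight (the variable names are ours):

```python
import torch

def weighted_laplace_nll(pred_mean: torch.Tensor,
                         pred_scale: torch.Tensor,
                         actual: torch.Tensor,
                         pagerank_weight: torch.Tensor) -> torch.Tensor:
    # Negative log likelihood of the actual accuracy under a Laplace distribution
    # centred on the prediction; its heavier tails penalize outlier scores
    # (e.g. gimmick maps) less harshly than MSE would.
    nll = -torch.distributions.Laplace(pred_mean, pred_scale).log_prob(actual)
    # Down-weight samples from less reliable players/maps.
    return (pagerank_weight * nll).mean()

pred_mean  = torch.tensor([0.95, 0.80])
pred_scale = torch.tensor([0.02, 0.05])
actual     = torch.tensor([0.93, 0.60])  # the second score is an outlier
weights    = torch.tensor([1.00, 0.30])  # the outlier sample is less reliable
print(weighted_laplace_nll(pred_mean, pred_scale, actual, weights))
```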
Results
Currently, opal v2 is a significant improvement over v1:
- It has a 1-dimensional embedding, compared to the thousands of dimensions in v1. The model is literally 2MB~4MB for each key mode!
- Due to that, it’s also much easier to explain. We found that even though v2’s numbers were worse than v1’s, a black-box model is significantly worse in practice; trustworthiness is crucial.
- opal v2 also uses our new osu-data project which is much easier to train with.
- opal v2 has no limitations on the required training size, which means that players who’ve played fewer than 50 maps that year can still be included; we now just treat them as less reliable data.
Drawbacks
- As mentioned, because we don’t place limitations on the number of maps a player must have played, or the number of unique players a map must have, some data may be less reliable.
- Gimmick maps still exist and can clearly be seen dominating ratings. While this is unintentional, there’s no easy catch-all way to dampen this effect.
- This only uses 1 dimension, which is great for explainability; however, if we could separate it into RC and LN, it could be better. Though, there’s no dataset that can “guide” the model to these dimensions.