osu! Dataset: data.ppy.sh osu!mania datasets

đź”— Repository
Status: Completed

Auto-exporting osu!mania datasets to GitHub Releases with GitHub Actions, using osu-data

Introduction

If you’ve attempted to use osu!’s database dumps from data.ppy.sh, you may know that it takes some elbow grease and sweaty hands to load the data locally. That’s why I created osu-data, a Dockerized solution that loads your dataset with two commands:

pip install osu-data
osu-data -m mania -v top_1000 -ym YYYY_MM

Take a look at the article linked above if you want to know how osu-data works. In this article, we’ll discuss how we automated exporting the dataset to GitHub Releases.

As an overview, here’s our GitHub Actions job at a high level:

sequenceDiagram
  GitHub Actions ->>+ osu-data: Start osu-data (nohup)
  GitHub Actions ->>+ osu-data: Health Check
  osu-data --x- GitHub Actions: Not Ready
  GitHub Actions ->>+ osu-data: Health Check
  osu-data ->>- GitHub Actions: Ready
  GitHub Actions ->>+ Python: Create Dataset Now
  Python ->>+ osu-data: Retrieve Dataset
  osu-data ->>- Python: Return Data
  Python ->>- GitHub Actions: Return Dataset
  GitHub Actions ->>+ GitHub Releases: Create Release
  GitHub Releases ->>- GitHub Actions: Done
  osu-data ->>- GitHub Actions: Close
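
The health check above doesn’t need anything fancy: poll until the database accepts connections. Here’s a minimal sketch in Python, assuming osu-data exposes its MySQL service on localhost:3306; the port, timeout, and interval are my assumptions, not osu-data’s documented interface:

import socket
import time

def wait_for_osu_data(host="localhost", port=3306, timeout_s=600, interval_s=10):
    """Poll the osu-data MySQL port until it accepts TCP connections."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            # A successful TCP connection is our "Ready" signal.
            with socket.create_connection((host, port), timeout=5):
                return
        except OSError:
            time.sleep(interval_s)  # "Not Ready": back off and retry
    raise TimeoutError(f"osu-data not ready after {timeout_s}s")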

Memory-Limited GitHub Actions Runner

One of the surprising revelations was that this works on the default free GitHub Actions runner. Try to run too big a workload, though, and the job dies with exit code 143, i.e. it was terminated with SIGTERM, indicating that the job was too “heavy” (the error really is as vague as that).

I ran into that problem once during this project: I had to cut out many unnecessary columns and reduce merging. We discuss how we reduced the computational load, and the resulting artifact size, in the next section.
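
Concretely, cutting columns means never pulling them out of the database in the first place. Below is a minimal pandas sketch of the idea; the connection string, table, and column names are illustrative assumptions, not osu-data’s actual schema:

import pandas as pd
import sqlalchemy as sa

# Hypothetical connection details; osu-data's real credentials and schema may differ.
engine = sa.create_engine("mysql+pymysql://user:pass@localhost:3306/osu")

# Select only the columns the dataset needs: every column skipped here
# is memory the runner never has to hold.
scores = pd.read_sql("SELECT score_id, beatmap_id, user_id, accuracy FROM scores", engine)

# Downcast to compact dtypes before any merging to shrink the working set.
scores = scores.astype({"beatmap_id": "int32", "user_id": "int32", "accuracy": "float32"})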

Optimizing Artifact Sizes

Never thought I’d be using something I learnt from database theory a few years ago.

One of the key considerations when creating the dataset was whether to structure it like this:

Schematic A

score.csv

| Map ID | Map Name  | Map Speed | Player ID | Player Name | Player Year | Accuracy |
|--------|-----------|-----------|-----------|-------------|-------------|----------|
| 1234   | A - B (C) | DT        | 2345      | Alice       | 2000        | 99.83    |

Or…

Schematic B

score.csv

| Map ID | Map Speed | Player ID | Player Year | Accuracy |
|--------|-----------|-----------|-------------|----------|
| 1234   | DT        | 2345      | 2000        | 99.83    |

map_metadata.csv

| Map ID | Map Name  |
|--------|-----------|
| 1234   | A - B (C) |

player_metadata.csv

| Player ID | Player Name |
|-----------|-------------|
| 2345      | Alice       |

If our goal is to

  • reduce computation, we choose A
  • reduce size, we choose B

Notice that if we had two scores on the same map,

  • Schematic A’s score.csv would have to repeat both the Map ID and the Map Name
  • Schematic B’s score.csv would only repeat the Map ID

This is because Map ID and Map Name are coupled one-to-one; in other words, the ID implies the Map Name, and vice versa (usually). Thus, it’s redundant to store both columns in every score row! In database theory, we call this optimization normalization, and this phenomenon, data redundancy.
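
Here’s what that split looks like as a pandas sketch, using the column names from the schematics above (the real export pipeline may differ):

import pandas as pd

# Start from the denormalized Schematic A table.
scores = pd.read_csv("score.csv")

# Factor each one-to-one column pair out into its own lookup table (Schematic B).
map_metadata = scores[["Map ID", "Map Name"]].drop_duplicates()
player_metadata = scores[["Player ID", "Player Name"]].drop_duplicates()
slim_scores = scores.drop(columns=["Map Name", "Player Name"])

map_metadata.to_csv("map_metadata.csv", index=False)
player_metadata.to_csv("player_metadata.csv", index=False)
slim_scores.to_csv("score.csv", index=False)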

This doesn’t mean Schematic A is useless, though: if the redundancy is rare in your data, splitting the tables out may not be worth it at all.
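
The flip side, and the computation Schematic A saves, is that a consumer of Schematic B has to pay for two merges to get the names back (again a hypothetical sketch):

import pandas as pd

slim_scores = pd.read_csv("score.csv")
map_metadata = pd.read_csv("map_metadata.csv")
player_metadata = pd.read_csv("player_metadata.csv")

# Two joins reconstruct Schematic A from Schematic B.
full = (slim_scores
        .merge(map_metadata, on="Map ID")
        .merge(player_metadata, on="Player ID"))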