Music Recommendation Datasets
While I was in the funtastic world of research, doing my PhD, I created a couple of datasets using the Last.fm API.
I had in mind that "size matters", so I tried to make them big. There are two datasets. After two years (!!!), the first one is available here:
- lastfm-dataset-360K contains the total playcount per artist for each user (~360,000 users). It also includes some basic user profile information (age, gender, country and signup date).
- lastfm-dataset-1K contains the whole listening history of 1,000 users.
lastfm-dataset-360K
Some basic stats for this dataset:
- Total Lines: 17,562,018
- Unique Users: 359,347
- Unique Artists (MBID): 159,732
- Unique Artists (string): 291,595
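Counts like these can be obtained with a few standard tools. The following is only a sketch, not necessarily how the numbers above were produced: the function name dataset_stats is made up for illustration, and it assumes the plays file is tab-separated with the user hash, artist MBID and artist name in fields 1, 2 and 3. Whether empty MBIDs were excluded from the count above is not stated; this sketch skips them.

```shell
# dataset_stats PLAYS_TSV  (hypothetical helper, for illustration only)
# Assumes tab-separated fields: 1 = user sha1, 2 = artist MBID, 3 = artist name.
dataset_stats() {
  # Total number of (user, artist) lines.
  printf 'Total lines: %s\n' "$(wc -l < "$1" | tr -d ' ')"
  # Distinct user hashes.
  printf 'Unique users: %s\n' "$(cut -f 1 "$1" | sort -u | wc -l | tr -d ' ')"
  # Distinct MBIDs, skipping lines where the MBID is empty (an assumption).
  printf 'Unique artists (MBID): %s\n' "$(cut -f 2 "$1" | grep -v '^$' | sort -u | wc -l | tr -d ' ')"
  # Distinct artist name strings.
  printf 'Unique artists (string): %s\n' "$(cut -f 3 "$1" | sort -u | wc -l | tr -d ' ')"
}

# Example: dataset_stats usersha1-artmbid-artname-plays.tsv
```

On a file of this size, sort may need a lot of temporary space; passing -T . to sort keeps its temporary files in the current directory.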
Now, let's remember some useful bash commands to get some more stats (frequencies, counts, etc.) from the dataset.
Top-artists
$ cut -f 2,3 usersha1-artmbid-artname-plays.tsv | sort -T . | uniq -c | sort -rn > artists.freq
77348 a74b1b7f-71a5-4011-9441-d0b5e4122711 radiohead
76339 b10bbbfc-cf9e-42e0-be17-e2c3e1d2600d the beatles
66738 cc197bad-dc9c-440d-a5b5-d52ba2e14234 coldplay
48989 8bfac288-ccc5-448d-9573-c33ea2aa5c30 red hot chili peppers
47015 9c9f1380-2516-4fc9-a3e6-f9f61941d090 muse
45301 65f4f0c5-ef9e-490c-aee3-909e7ae6b2ab metallica
44506 83d91898-7763-47d7-b03b-b92132375c47 pink floyd
41280 95e1ead9-4d31-4808-a7ac-32c3614c116b the killers
39833 f59c5520-5f46-4d2c-b2c4-822eabf53419 linkin park
37324 cc0b7089-c08d-4c10-b6b0-873582c17fd6 system of a down
34215 0383dadf-2a4e-4d10-a46a-e9e041da8eb3 queen
33247 a3cb23fc-acd3-4ce0-8f36-1e5aa6a18432 u2
33040 056e4f3e-d505-4dad-8ec1-d04f521cbb56 daft punk
32673 69ee3720-a7cb-4402-b48d-a02c366f2bcf the cure
32341 678d88b2-87b0-403b-b63d-5da7465aecc3 led zeppelin
...
This chart resembles the actual Last.fm charts. Notice the absence of Lady Gaga, though (the dataset is from 2008). Looking at the charts from December 2008, we can see that the correlation is pretty high.
Nice: we have a big dataset, and it also seems to be a representative sample of the whole Last.fm user base (30 million users).
Top-artists in Mexico
1581 a74b1b7f-71a5-4011-9441-d0b5e4122711 radiohead
1414 b10bbbfc-cf9e-42e0-be17-e2c3e1d2600d the beatles
1321 c2b37a39-c66a-44b2-b190-a69485ae5d95 café tacuba
1103 cc197bad-dc9c-440d-a5b5-d52ba2e14234 coldplay
1021 9c9f1380-2516-4fc9-a3e6-f9f61941d090 muse
981 95e1ead9-4d31-4808-a7ac-32c3614c116b the killers
974 83d91898-7763-47d7-b03b-b92132375c47 pink floyd
896 056e4f3e-d505-4dad-8ec1-d04f521cbb56 daft punk
803 e6e1e76f-afee-4990-aad0-056199d94918 babasónicos
778 8538e728-ca0b-4321-b7e5-cff6565dd4c0 depeche mode
769 69ee3720-a7cb-4402-b48d-a02c366f2bcf the cure
760 87c5dedd-371d-4a53-9f7f-80522fb7f3cb björk
757 65f4f0c5-ef9e-490c-aee3-909e7ae6b2ab metallica
733 3f8a5e5b-c24b-4068-9f1c-afad8829e06b soda stereo
683 b23e8a63-8f47-4882-b55b-df2c92ef400e interpol
...
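The command behind this list is not shown. One way to get a similar chart is sketched here, under some assumptions: the function name top_artists_by_country is invented for this example, the country is taken from field 4 of the profile file (as in the country count further down), and both files are assumed to share the user hash in field 1.

```shell
# top_artists_by_country COUNTRY PROFILE_TSV PLAYS_TSV  (hypothetical helper)
# Assumes tab-separated files:
#   profile: field 1 = user sha1, field 4 = country
#   plays:   field 1 = user sha1, field 2 = artist MBID, field 3 = artist name
top_artists_by_country() {
  awk -F '\t' -v country="$1" '
    # First file (profiles): remember users from the requested country.
    NR == FNR { if ($4 == country) keep[$1] = 1; next }
    # Second file (plays): count one listener per (MBID, name) pair.
    ($1 in keep) { count[$2 "\t" $3]++ }
    END { for (a in count) print count[a] "\t" a }
  ' "$2" "$3" | sort -rn
}

# Example:
#   top_artists_by_country Mexico usersha1-profile.tsv \
#       usersha1-artmbid-artname-plays.tsv > artists-mexico.freq
```

The output is tab-separated (count, MBID, name) rather than in the uniq -c layout, but the ranking is the same kind of "number of users per artist" count as in the global chart.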
Number of users that have played The Dogs d'Amour
62
Total playcounts for The Dogs d'Amour
15994
Plays for user edb46e3e7f368a380bdeadf611585a98cab19704
2450
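The commands for these three numbers are not shown either. A possible sketch with awk follows. Both function names are made up, the plays file is assumed to hold one line per (user, artist) pair with the playcount in field 4, and the artist is matched by its name string in field 3 (names appear in lowercase in the chart output, so the lookup string should be too). It is not clear from the text whether the per-user figure counts all artists or only one, so the second helper takes an optional artist name.

```shell
# artist_stats ARTIST_NAME PLAYS_TSV  (hypothetical helper)
# Prints "<number of users><TAB><total plays>" for one artist name.
# Assumes tab-separated fields: 1 = user sha1, 3 = artist name, 4 = plays.
artist_stats() {
  awk -F '\t' -v name="$1" '
    $3 == name { users++; plays += $4 }
    END { printf "%d\t%d\n", users, plays }
  ' "$2"
}

# user_plays USER_SHA1 PLAYS_TSV [ARTIST_NAME]  (hypothetical helper)
# Sums plays for one user, optionally restricted to a single artist name.
user_plays() {
  awk -F '\t' -v user="$1" -v name="${3:-}" '
    $1 == user && (name == "" || $3 == name) { plays += $4 }
    END { printf "%d\n", plays }
  ' "$2"
}

# Examples:
#   artist_stats "the dogs d'amour" usersha1-artmbid-artname-plays.tsv
#   user_plays edb46e3e7f368a380bdeadf611585a98cab19704 usersha1-artmbid-artname-plays.tsv
```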
Where do users come from?
$ cut -f 4 usersha1-profile.tsv | sort | uniq -c | sort -rn > user-countries.freq
67044 United States
31651 Germany
29902 United Kingdom
20987 Poland
19833 Russian Federation
14534 Brazil
13122 Sweden
13051 Spain
11579 Finland
9650 Netherlands
8679 Canada
7529 France
7525 Italy
7135 Australia
6637 Japan
...
When did I sign up?
120808 2008
99963 2007
67951 2006
32845 2009
29933 2005
6919 2004
889 2003
39 2002
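A count per signup year can be sketched in the same style as the country count. Two assumptions here are mine, not stated in the text: the signup date is field 5 of the profile file, and that field ends with a four-digit year. The function name signup_years is also just for illustration.

```shell
# signup_years PROFILE_TSV  (hypothetical helper)
# Assumes field 5 is the signup date and that it ends in a 4-digit year.
signup_years() {
  cut -f 5 "$1" |
    grep -oE '[0-9]{4}$' |   # keep only the trailing year
    sort | uniq -c | sort -rn
}

# Example: signup_years usersha1-profile.tsv > user-signup-years.freq
```

If the date turns out to be stored in another layout, only the grep pattern needs to change.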
How old are you?
freq age
74900 <- empty string (users who did not set their age in the profile)
24054 21
24037 20
22261 19
22177 22
19890 23
17568 24
17014 18
15369 25
13294 26
...
It's clear that we have some funny data here that needs cleaning. So be careful when using the age attribute:
freq age
564 109
390 1
261 2
59 102
57 -1337
48 99
...
Looks like there are a few grandmas, grandpas and newborns on Last.fm! :-)
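A simple way to clean the age attribute is to keep only values inside a plausible range before counting. This is just a sketch: the function name clean_age_freq is invented, the age is assumed to be field 3 of the profile file, and the default bounds of 10 and 90 are arbitrary choices that you should adapt to your own study.

```shell
# clean_age_freq PROFILE_TSV [MIN] [MAX]  (hypothetical helper)
# Assumes field 3 is the age. Drops empty, negative, non-numeric and
# out-of-range values, then prints a frequency table ("freq age").
clean_age_freq() {
  awk -F '\t' -v min="${2:-10}" -v max="${3:-90}" '
    # Keep plain non-negative integers within [min, max].
    $3 ~ /^[0-9]+$/ && $3 + 0 >= min && $3 + 0 <= max { print $3 }
  ' "$1" | sort -n | uniq -c | sort -rn
}

# Example: clean_age_freq usersha1-profile.tsv 13 80 > user-ages.freq
```

Note that this only removes the obviously wrong values; an age that is wrong but plausible (say, 25 instead of 52) cannot be detected this way.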
--
That's all for now. I hope you can use this dataset for your research!