Music Recommendation Datasets

While I was in the funtastic world of research, doing my PhD, I created a couple of datasets using the Last.fm API.

I had in mind that "size matters", so I tried to make them big. I created two datasets. After two years (!!!), the first one is available here

  1. lastfm-dataset-360K contains the total playcount per artist for each user (~360,000 users). It also includes some basic user profile information (age, gender, country and signup date).
  2. lastfm-dataset-1K contains the whole listening history of 1,000 users.


lastfm-dataset-360K

Some basic stats for this dataset:

  • Total lines:               17,562,018
  • Unique users:                 359,347
  • Unique artists (MBID):        159,732
  • Unique artists (string):      291,595
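
A rough sketch of how totals like these could be recomputed from usersha1-artmbid-artname-plays.tsv. The function name dataset_stats is made up for illustration, and the column order (user, artist MBID, artist name, plays) is the one the commands in this post rely on. Skipping rows with an empty MBID is an assumption, so the result may not match the figures above exactly.

```shell
# Hypothetical helper (not shipped with the dataset): prints line, user and artist counts.
# Assumed tab-separated columns: user sha1, artist MBID, artist name, plays.
dataset_stats() {
    awk -F'\t' '
        { users[$1]; names[$3] }        # referencing an element is enough to create it
        $2 != "" { mbids[$2] }          # assumption: rows without an MBID are skipped
        END {
            for (k in users) nu++
            for (k in mbids) nm++
            for (k in names) nn++
            printf "lines\t%d\nusers\t%d\nartist_mbids\t%d\nartist_names\t%d\n", NR, nu, nm, nn
        }
    ' "$1"
}
```

Usage would be: dataset_stats usersha1-artmbid-artname-plays.tsv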

Now, let's remember some useful bash commands to get some more stats (frequencies, counts, etc.) from the dataset.

Top-artists

$ cut -f 2,3 usersha1-artmbid-artname-plays.tsv | sort -T . | uniq -c | sort -rn > artists.freq

  77348 a74b1b7f-71a5-4011-9441-d0b5e4122711   radiohead
  76339 b10bbbfc-cf9e-42e0-be17-e2c3e1d2600d   the beatles
  66738 cc197bad-dc9c-440d-a5b5-d52ba2e14234   coldplay
  48989 8bfac288-ccc5-448d-9573-c33ea2aa5c30   red hot chili peppers
  47015 9c9f1380-2516-4fc9-a3e6-f9f61941d090   muse
  45301 65f4f0c5-ef9e-490c-aee3-909e7ae6b2ab   metallica
  44506 83d91898-7763-47d7-b03b-b92132375c47   pink floyd
  41280 95e1ead9-4d31-4808-a7ac-32c3614c116b   the killers
  39833 f59c5520-5f46-4d2c-b2c4-822eabf53419   linkin park
  37324 cc0b7089-c08d-4c10-b6b0-873582c17fd6   system of a down
  34215 0383dadf-2a4e-4d10-a46a-e9e041da8eb3   queen
  33247 a3cb23fc-acd3-4ce0-8f36-1e5aa6a18432   u2
  33040 056e4f3e-d505-4dad-8ec1-d04f521cbb56   daft punk
  32673 69ee3720-a7cb-4402-b48d-a02c366f2bcf   the cure
  32341 678d88b2-87b0-403b-b63d-5da7465aecc3   led zeppelin
  ...

This chart resembles the actual Last.fm charts. Notice, though, the absence of Lady Gaga (the dataset is from 2008). Looking at the charts from December 2008, we can see that the correlation is pretty high.

Nice: we have a big dataset, and it also seems to be a representative sample of the whole Last.fm user base (30 million users).
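
The top-artists pipeline sorts all 17.5 million lines before counting, which is presumably why it needs -T . for temporary files. A possible alternative, sketched here under the same column assumptions (MBID in column 2, artist name in column 3), is to count in awk and sort only the distinct artists. The name artist_listeners is hypothetical, and the output format (count, MBID, name separated by tabs) differs slightly from uniq -c.

```shell
# Hypothetical helper: listeners per artist, highest first.
# Keeps one counter per distinct (MBID, name) pair in memory instead of sorting every line.
artist_listeners() {
    awk -F'\t' '
        { count[$2 "\t" $3]++ }                        # one row = one user/artist pair
        END { for (k in count) print count[k] "\t" k }
    ' "$1" | sort -rn
}
```

For example: artist_listeners usersha1-artmbid-artname-plays.tsv > artists.freq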


Top-artists in Mexico

$ join -t $'\t' usersha1-profile.tsv usersha1-artmbid-artname-plays.tsv | grep $'\tMexico\t' | cut -f 6,7 | sort | uniq -c | sort -rn > artists-mexico.freq # $'\t' is bash syntax for a tab character; join needs both files sorted on the user id

   1581 a74b1b7f-71a5-4011-9441-d0b5e4122711   radiohead
   1414 b10bbbfc-cf9e-42e0-be17-e2c3e1d2600d   the beatles
   1321 c2b37a39-c66a-44b2-b190-a69485ae5d95   café tacuba
   1103 cc197bad-dc9c-440d-a5b5-d52ba2e14234   coldplay
   1021 9c9f1380-2516-4fc9-a3e6-f9f61941d090   muse
    981 95e1ead9-4d31-4808-a7ac-32c3614c116b   the killers
    974 83d91898-7763-47d7-b03b-b92132375c47   pink floyd
    896 056e4f3e-d505-4dad-8ec1-d04f521cbb56   daft punk
    803 e6e1e76f-afee-4990-aad0-056199d94918   babasónicos
    778 8538e728-ca0b-4321-b7e5-cff6565dd4c0   depeche mode
    769 69ee3720-a7cb-4402-b48d-a02c366f2bcf   the cure
    760 87c5dedd-371d-4a53-9f7f-80522fb7f3cb   björk
    757 65f4f0c5-ef9e-490c-aee3-909e7ae6b2ab   metallica
    733 3f8a5e5b-c24b-4068-9f1c-afad8829e06b   soda stereo
    683 b23e8a63-8f47-4882-b55b-df2c92ef400e   interpol
    ...
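
join only works when both files are sorted on the user id, and grep matches "Mexico" wherever it appears between two tabs. A sketch of an awk-based alternative that needs no sorted input and compares the country column exactly. The name top_artists_in is made up, and the profile column order (user, gender, age, country, signup date) is inferred from the cut commands in this post.

```shell
# Hypothetical helper: top_artists_in COUNTRY PROFILE_TSV PLAYS_TSV
# Assumed profile columns: user sha1, gender, age, country, signup date.
top_artists_in() {
    awk -F'\t' -v country="$1" '
        NR == FNR { if ($4 == country) keep[$1]; next }   # first file: remember matching users
        ($1 in keep) { print $2 "\t" $3 }                  # second file: emit MBID and name
    ' "$2" "$3" | sort | uniq -c | sort -rn
}
```

For example: top_artists_in Mexico usersha1-profile.tsv usersha1-artmbid-artname-plays.tsv > artists-mexico.freq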


Number of users that have played The Dogs d'Amour

$ grep "the dogs d'amour" usersha1-artmbid-artname-plays.tsv | cut -f 1 | sort -u | wc -l

  62


Total playcounts for The Dogs d'Amour

$ grep "the dogs d'amour" usersha1-artmbid-artname-plays.tsv  | cut -f 4 | awk '{s+=$1} END {print s}'

  15994
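
Note that grep matches the string anywhere in the line, so an artist whose name merely contains "the dogs d'amour" would be counted too. A sketch that compares the artist-name column exactly and returns both numbers in one pass. The name artist_summary is hypothetical and the column order is assumed as before.

```shell
# Hypothetical helper: artist_summary NAME PLAYS_TSV
# Prints "<distinct users><TAB><total plays>" for rows whose artist name equals NAME.
artist_summary() {
    awk -F'\t' -v name="$1" '
        ($3 "") == (name "") {          # appending "" forces a string comparison
            if (!($1 in seen)) { seen[$1]; listeners++ }
            plays += $4
        }
        END { printf "%d\t%d\n", listeners, plays }
    ' "$2"
}
```

For example: artist_summary "the dogs d'amour" usersha1-artmbid-artname-plays.tsv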


Plays for user edb46e3e7f368a380bdeadf611585a98cab19704

$ grep edb46e3e7f368a380bdeadf611585a98cab19704 usersha1-artmbid-artname-plays.tsv | cut -f 4 | awk '{s+=$1} END {print s}'

   2450

Where do users come from?

$ cut -f 4 usersha1-profile.tsv | sort | uniq -c | sort -rn > user-countries.freq

  67044 United States
  31651 Germany
  29902 United Kingdom
  20987 Poland
  19833 Russian Federation
  14534 Brazil
  13122 Sweden
  13051 Spain
  11579 Finland
   9650 Netherlands
   8679 Canada
   7529 France
   7525 Italy
   7135 Australia
   6637 Japan
    ...


When did users sign up?

$ cut -f 5 usersha1-profile.tsv | cut -d, -f 2 | sort | uniq -c | sort -rn

 120808  2008
  99963  2007
  67951  2006
  32845  2009
  29933  2005
   6919  2004
    889  2003
     39  2002


How old are you?

$ cut -f 3 usersha1-profile.tsv | sort | uniq -c | sort -rn > user-age.freq

   freq age
  74900      <- empty string (users that did not set their age in the profile)
  24054 21
  24037 20
  22261 19
  22177 22
  19890 23
  17568 24
  17014 18
  15369 25
  13294 26
    ...

It's clear that we have some funny data here that needs cleaning, so be careful when using the age attribute:

   freq age 
    564 109
    390 1
    261 2
     59 102
     57 -1337
     48 99
    ...

Looks like there are a few grandmas, grandpas and newborns on Last.fm! :-)
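
One possible way to clean the age attribute is to keep only non-negative integers inside a plausible range before building the histogram. This is just a sketch: the name age_histogram is made up, and the default bounds of 10 and 90 are arbitrary assumptions that you should adjust to your own needs.

```shell
# Hypothetical helper: age_histogram PROFILE_TSV [MIN] [MAX]
# Drops empty, non-numeric and out-of-range ages (column 3), then counts the rest.
# The default bounds 10..90 are an arbitrary assumption, not part of the dataset.
age_histogram() {
    awk -F'\t' -v min="${2:-10}" -v max="${3:-90}" '
        $3 ~ /^[0-9]+$/ && $3 + 0 >= min + 0 && $3 + 0 <= max + 0 { print $3 }
    ' "$1" | sort -n | uniq -c
}
```

For example: age_histogram usersha1-profile.tsv 13 80 > user-age-clean.freq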

--

That's all for now. I hope you can use this dataset for your research!