imovi.es: yet another mp3 search engine? ...or an angry developer that changed his mind?

[UPDATE: If the example links below do not work, try with http://dmnic.me/ instead of http://imovi.es ]

Today, I just came across this website named imovi.es, via @dmnic_mp3.
The whole story is bittersweet, as well as weird, and fascinating. You should read this Torrent Freak post to get a glimpse of what I'm talking about. Anyway, it goes -more or less- like this:

Dmnic (aka Dominic) gets angry because no one wants to buy the (patent-pending) anti-piracy software he has created, and because he has found that the main companies serving ads on pirate sites are content owners.
According to his research, the ads served on pirate sites come mostly from "InterActiveCorp which is run by the ex CEO (Barry Diller) of both Fox, and Paramount films."
So, he changes his mind, and converts the anti-piracy platform into an mp3 web search and downloader.

Now, let's just focus on the site and forget about the rest. First of all, the interface is simple and neat. Second, I'm quite impressed by the results, as it (nearly) passed all my usual tests.

But then, I thought that this quote (taken from the site) was somewhat strange, so to say:

"I do not index any mp3 song or information whatsoever, as you search I locate public mp3 files and streams and share them with you." 


And then there's also this quote from the TorrentFreak blog post:

“Each and every time a search is run, or [a] download [is] initiated, our servers locate a match on the web and serve it back. Hence results could change every time you ran a search or downloaded a file.”


Wow! Wait. What? So I can type in anything, it does a web search (in "public mp3 files"), and gets back with a list of results... all in a few milliseconds. Sounds like magic to me!

As someone who has done some basic work on (Web)IR, crawling, indexing, etc., I thought this would be quite difficult to achieve, unless you have an index, or even a plain .csv file that has some basic info about the tracks: at least the title, the artist and the location (URL).

So I did a few more tests. As it's quite common nowadays to use a "Lucene-like" syntax in the search box, I tried some other searches, and it happens to work well:

  1. http://imovi.es/?q=artist:u2
  2. http://imovi.es/?q=artist:u2+album:joshua
  3. http://imovi.es/?q=artist:u2+album:joshua+title:with+or+without

Incidentally, all these metadata fields (artist:, album:, and title:) are exactly the ones one can find in the ID3 tag of the mp3. 
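
To make the point concrete, here is a minimal sketch of how such fielded queries could be answered from even a tiny index. I obviously can't know how imovi.es does it; everything below (the TRACKS records, the example.org URLs, and the parse_query and search names) is made up for illustration. In the URLs above, "+" simply stands for a space.

```python
import re

# Hypothetical in-memory "index": one record per crawled mp3 (made-up data).
TRACKS = [
    {"artist": "U2", "album": "The Joshua Tree",
     "title": "With or Without You", "url": "http://example.org/a.mp3"},
    {"artist": "U2", "album": "Achtung Baby",
     "title": "One", "url": "http://example.org/b.mp3"},
    {"artist": "Oasis", "album": "Definitely Maybe",
     "title": "Supersonic", "url": "http://example.org/c.mp3"},
]

FIELDS = ("artist", "album", "title")
_NAMES = "|".join(FIELDS)
# A field's value runs until the next "field:" marker or the end of the query.
_PATTERN = re.compile(r"(%s):(.*?)(?=\s+(?:%s):|$)" % (_NAMES, _NAMES))


def parse_query(query):
    """Turn 'artist:u2 album:joshua' into {'artist': 'u2', 'album': 'joshua'}."""
    return {field: value.strip().lower()
            for field, value in _PATTERN.findall(query.strip())}


def search(query):
    """Return the tracks whose fields contain every requested value."""
    wanted = parse_query(query)
    return [track for track in TRACKS
            if all(value in track[field].lower()
                   for field, value in wanted.items())]


for hit in search("artist:u2 album:joshua title:with or without"):
    print(hit["artist"], "-", hit["title"], hit["url"])
```

Even this toy version needs the artist, album, title and URL of every track stored somewhere before the query arrives.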

Now, my question to you lot (in case anyone reads this, which I highly doubt!) is:

How come any software can do a search "and locate public mp3 files" for the song With or Without You, from the Joshua Tree album by U2, without having any information whatsoever about the tracks?

I'm not a lawyer, so I don't really care whether there's an index that stores some metadata about the tracks, or not. But it's too obvious that the platform has lots of information about the crawled tracks and their locations (URLs). Otherwise, it would be the first system I'm aware of that crawls the web but does not store the results anywhere!
Indeed, that's the same type of information Google has been collecting for years, but -as of today- they do not put it in the search results...

Another interesting point is that -it seems to me- this anti-piracy platform has an Audio Fingerprinting technique to identify the tracks (???). And this makes sense.
Yet, that's just my guess, based on the clean results I always get. In my tests, I never got any result like this, and the metadata (artist name and song title) reported in the results is also crystal clear. This does not happen very often in the real, dirty, music metadata world.

Last but not least, wouldn't it be nice to have:

  • an integration with my last.fm profile. That is, whenever I scrobble a track -say on any streaming service such as Spotify, or Grooveshark- I can go there, and download the full tracks I've been listening to.
  • an XSPF or M3U playlist of the returned results. This way, I can listen to the whole page without having to click each song, one by one.
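
The playlist part would be cheap to build. As a rough sketch (assuming the results come as a list of dicts with artist, title and url keys, which is my own invention, as are the example.org URLs), an extended M3U file is just a header plus two lines per track:

```python
def to_m3u(results):
    """Build an extended M3U playlist from a list of search results."""
    lines = ["#EXTM3U"]
    for track in results:
        # -1 means "duration unknown" in extended M3U.
        lines.append("#EXTINF:-1,%s - %s" % (track["artist"], track["title"]))
        lines.append(track["url"])
    return "\n".join(lines) + "\n"


# Made-up results, standing in for one page of search hits.
results = [
    {"artist": "U2", "title": "With or Without You",
     "url": "http://example.org/a.mp3"},
    {"artist": "U2", "title": "One", "url": "http://example.org/b.mp3"},
]
print(to_m3u(results))
```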

As long as it's still alive, and no one complains -or tries to shut down the service-, I'll continue using it. At least to listen to those songs I can't find with my (premium subscriber) Spotify account...

Python wrapper for 7 Digital API

Hi,
during the last London Music Hack Day I tried to do a remote hack (I was in Mexico City at the time!).

I thought that a Python wrapper for the 7digital API could be useful, in order to build some more stuff on top of it. So, that's what I did.

You can download the code here: http://github.com/ocelma/7-digital/downloads
Of course, you'll need an APIKEY from them.

And here are some usage examples:

import py7digital

#Search artist
results = py7digital.search_artist('stones')
print results.get_total_result_count()
for artist in results.get_next_page():
  print artist.get_name()
  print '\tTop tracks:'
  for top_track in artist.get_top_tracks():
    print '\t\t', top_track.get_title(), top_track.get_version()
  print '\tRec. Albums:'
  for rec_album in artist.get_recommended_albums():
    print '\t\t', rec_album, rec_album.get_year(), rec_album.get_label()
  for album in artist.get_albums(5):
    print '\t', album, album.get_year(), album.get_barcode()
    for sim_album in album.get_similar():
      print '\t\tSimilar:', sim_album, sim_album.get_artist()
    for track in album.get_tracks():
      print '\t\t', track, track.get_isrc(), track.get_audio()

#Browse artists starting with letter 'J'
results = py7digital.browse_artists('j')
print results.get_total_result_count()
for artist in results.get_next_page():
  print artist.get_name(), artist.get_image(), artist.get_tags()
  for album in artist.get_albums(2):
    print '\t', album, album.get_tags(), album.get_label()
    for track in album.get_tracks():
      print '\t\t', track.get_title(), track.get_audio()

#Search albums
searcher = py7digital.search_album('u2')
print searcher.get_total_result_count()
while searcher.has_results():
  for album in searcher.get_next_page():
    print album.get_title(), album.get_similar()

#Search tracks
searcher = py7digital.search_track('u2 one')
print searcher.get_total_result_count()
while searcher.has_results():
  for track in searcher.get_next_page():
    print track.get_title(), track.get_album()

# New releases in a given period of time
results = py7digital.album_releases('20100901', '20100924')
for album in results.get_next_page():
  print album.get_title(), album.get_artist(), album.get_tags()
  for track in album.get_tracks():
    print '\t', track, track.get_url(), track.get_audio()

# Album charts in a given period of time
results = py7digital.album_charts('month', '20100901')
for album in results.get_next_page():
  print album, album.get_release_date(), album.get_added_date()
  for track in album.get_tracks():
    print '\t', track, track.get_isrc()

Enjoy!

 

Groovify your music tweets

In May 2010 I traveled to San Francisco to attend both SF - MusicHackDay and SF - MusicTech.

During that intense weekend at MusicHackDay, I created a little toy named Groovify. After not much sleep, Groovify was born on May 16th.
Groovify (aka yet another twitter bot) replies with a tweet each time someone tweets a song.
That's when a user:

  • is listening to a song @grooveshark
  • tweets a spotify link
  • #loves a #lastfm song
  • is #nowplaying a track, or
  • just loved a song @hypem

Then Groovify replies to the user (with an unsolicited tweet!), sending a playlist based on the tweeted song. At this point, some users start following @groovify. From then on, each music tweet by a follower gets a reply from @groovify with a playlist.

Most of the time I got very good feedback on the playlists:

Groovify-feedback

This week Groovify reached 500 followers (many more than I ever expected!), so I thought it would be nice to explain how this little bot works.

HOW DOES THIS WORK?

Groovify uses:

  • BMAT API to generate the playlist, given a seed song
  • Twitter API to get tweets about tracks played (or loved) in @grooveshark, @spotify, @lastfm or @hypem
  • Spotify Metadata API to gather the spotify track URI, and generate a full track playlist
  • Grooveshark API to generate a full track playlist
  • Amazon API for the album covers

Once a song tweet has been detected, it queries our BMAT API to get a playlist given that seed song.
Then, it creates a page like this, with a playlist (only 30 secs. sample per song). Each song has also some links to Amazon, iTunes, Spotify, Grooveshark, and Musicbrainz.
To listen to full tracks it uses Spotify and Grooveshark APIs, and it allows the user to open the playlist there.
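
The detection step can be sketched roughly as follows. This is a simplification written for this post, not the bot's actual code: the music_source name and the exact string patterns are assumptions derived from the five cases listed earlier.

```python
def music_source(tweet):
    """Guess which service a music tweet comes from, or None if it isn't one."""
    text = tweet.lower()
    if "@grooveshark" in text:
        return "grooveshark"
    # Assumed link formats for Spotify tracks.
    if "open.spotify.com/" in text or "spotify:track:" in text:
        return "spotify"
    if "#lastfm" in text and "#loves" in text:
        return "lastfm"
    if "#nowplaying" in text:
        return "nowplaying"
    if "@hypem" in text:
        return "hypem"
    return None


for tweet in ["I'm listening to Creep by Radiohead @grooveshark",
              "#NowPlaying U2 - One",
              "Good morning everyone"]:
    print(music_source(tweet), "<-", tweet)
```

Only tweets for which this returns something would go on to the playlist generation step.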

SOME STATS

Since June 6th I have been logging all the tweets from @groovify. This way, I can compute some basic stats about which songs are tweeted the most:

40 Alejandro               Lady GaGa
37 Hey, Soul Sister        Train
29 Not Afraid [Explicit]   Eminem
22 Burn It To The Ground   Nickelback
19 Bad Romance             Lady GaGa
18 Let's Go Surfing        the drums
18 Creep                   Radiohead
18 Billionaire (ft. Bruno Mars)  Travie McCoy
17 Your Love Is My Drug    Ke$ha
17 Viva La Vida            Coldplay
17 Sex On Fire             Kings Of Leon
17 Need You Now            Lady Antebellum
16 Wavin' Flag             k'naan
16 I'm Yours               Jason Mraz
16 Going Down              Jeff Beck
16 Find Your Love          Drake
16 Breathe Me              Sia
15 Uprising                Muse
15 Just Breathe            Pearl Jam
14 Karma Police            Radiohead
(...)

And aggregate this data at artist level:

112 Lady GaGa
109 Michael Jackson
106 Muse
 87 Radiohead
 72 David Bowie
 71 Green Day
 70 The Rolling Stones
 66 Paramore
 65 Coldplay
 64 Guns N' Roses
 63 The Killers
 62 Duran Duran
 54 Eminem
 53 Oasis
 52 Nirvana
 51 Metallica
 50 Train
 49 Daft Punk
 48 U2
 48 The Smiths
 (...)

Notice that some artists have one song that's repeatedly tweeted, yet they do not appear in the list of top artists (e.g. Nickelback, the drums or Travie McCoy).
On the other hand, there are artists that appear in the top-artist list without having a clear one-hit song (Michael Jackson, David Bowie or Green Day). In this case, users tweet different songs from these artists.
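
Both rankings come from the same log, just counted with a different key. A minimal sketch with a made-up log of (song, artist) pairs (not the real Groovify data):

```python
from collections import Counter

# Made-up log: one (song, artist) pair per logged tweet.
log = [
    ("Burn It To The Ground", "Nickelback"),
    ("Burn It To The Ground", "Nickelback"),
    ("Burn It To The Ground", "Nickelback"),
    ("Thriller", "Michael Jackson"),
    ("Billie Jean", "Michael Jackson"),
    ("Beat It", "Michael Jackson"),
    ("Bad", "Michael Jackson"),
]

# Song level: the key is the (song, artist) pair.
song_counts = Counter(log)
# Artist level: the key is the artist alone.
artist_counts = Counter(artist for _song, artist in log)

print(song_counts.most_common(1))    # the one-hit song wins here
print(artist_counts.most_common(1))  # the artist with many songs wins here
```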

Then, I also aggregated all Groovify tweets on a weekday scale.

Days-inverted
Here we can see that #MusicMonday and Saturday are clearly the days on which users tweet the most songs.
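
The weekday aggregation itself is a one-liner once each logged tweet has a date. A small sketch, with made-up dates standing in for the log:

```python
from collections import Counter
from datetime import date

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]


def tweets_per_weekday(dates):
    """Count tweets per day of the week (date.weekday() returns 0 for Monday)."""
    return Counter(DAY_NAMES[d.weekday()] for d in dates)


# Made-up tweet dates, not the real log.
log_dates = [date(2010, 6, 7), date(2010, 6, 9),
             date(2010, 6, 12), date(2010, 6, 14)]
for day, count in tweets_per_weekday(log_dates).most_common():
    print(count, day)
```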

Well, that's all for now!
I already have some ideas for my next music hack. So, I hope to see you at the upcoming MusicHackDay in Barcelona!

Spotify: When did Oasis record a Caribbean dance song?

The other day I was trying Spotify Radio. Whenever I have to test anything I usually type 'U2' (yes, sometimes I'm lazy).

So there I was, listening to Spotify's U2 radio: I got a couple of U2 songs, Bruce Springsteen, U2 again, Queen. So far, so good (or so boring?). 

Spotify-u2-radio

Anyway, all of a sudden I got this...

"All right, put your hands up in the air. Everybody's having fun!. Con las maaanos pa' arriba!"

Con las manos pa arriba by Oasis Canoa

Oops!

Incorrect metadata resolution (in this case, wrongly assigning a track to an artist) is a common problem in music recommendation.

Indeed, wrong metadata resolution gives you, well, just horrible recommendations.

You can have the best recommendation/similarity algorithm in the world, but if your data is not clean enough (or you didn't correctly match the data record labels provide you), you end up with "serendipitous" -so to say- recommendations.

The problem is even worse when the data to gather comes from >30M users. Paul once asked in his (old) blog: "How many ways there are to spell 'Guns N Roses'?" (here's the answer, BTW).
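
A first line of defence against spelling variants is to compare normalized keys instead of raw strings. The sketch below is only an illustration (artist_key is a made-up helper, and real systems need far more than this): it strips accents, lowercases, drops punctuation and maps "&" and "n" to "and".

```python
import re
import unicodedata


def artist_key(name):
    """Return a crude normalized key for matching artist name variants."""
    # Decompose accented characters and drop the combining marks.
    text = unicodedata.normalize("NFKD", name)
    text = "".join(c for c in text if not unicodedata.combining(c)).lower()
    text = text.replace("&", " and ")
    # Drop everything that is not a letter, digit or space (quotes, dots...).
    text = re.sub(r"[^a-z0-9 ]", "", text)
    # Treat a standalone "n" as "and" (as in "Guns N' Roses").
    return " ".join("and" if word == "n" else word for word in text.split())


variants = ["Guns N' Roses", "Guns N Roses", "guns & roses",
            "GUNS AND ROSES", "Guns 'N' Roses"]
print({artist_key(v) for v in variants})
```

Note that it still misses run-together spellings such as "Guns'n'Roses", and it does nothing for the Oasis case above, where two different artists share exactly the same name.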

 

Back to Spotify, I assume they do not have this problem. Instead, they provide a nifty XML that record labels and aggregators have to fill in. The problem comes when trying to resolve this data with the information they already have.

There are some other wrongly assigned albums such as Los Campesinos! (Luz y Colores, Vuelta por el Jefe), or Chenoa (Road of Life, Spirit of Salishan). Are you aware of any other examples?

Using MusicBrainz or Discogs information might help to avoid (or maybe detect) these problems.

(To be fair, according to The Pansentient League, the band Oasis was removed from Spotify UK. However, it seems Spotify still uses the band Oasis internally when trying to match information?)

 

Last but not least, I hope Spotify is learning from (or making any use of) user skipping behaviour.  I for one will continue skipping 'Caribbean Oasis songs' while listening to U2 Radio. I know that eventually these tracks will disappear from the earth...I mean, from Spotify's U2 radio.

Stereotyping the users

Is there any way we can stereotype users' listening habits?

(download)

Here's a very simple example with the user's age attribute, using this last.fm dataset (~360,000 users).

The following lists (A, B and C) contain the most frequent user ages for three artists: Bonnie Raitt, Radiohead and Miley Cyrus.

Can you guess which list belongs to each artist?

A     B     C
==========
21    19    24
22    17    48
20    18    20
23    20    39
24    21    56
19    16    53
25    22    49
26    15    50
27    23    42
28    24    27

Pretty simple, right?

  • A: Radiohead, with an average top-10 age (non weighted) of 23.5
  • B: Miley Cyrus, with an average top-10 age (non weighted) of 19.5
  • C: Bonnie Raitt, with an average top-10 age (non weighted) of 40.8
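
The non-weighted averages are simply the mean of the ten ages in each column of the table. A quick way to recompute them (the top_ages and mean names are mine):

```python
# The three columns of the table above, copied as-is.
top_ages = {
    "A": [21, 22, 20, 23, 24, 19, 25, 26, 27, 28],
    "B": [19, 17, 18, 20, 21, 16, 22, 15, 23, 24],
    "C": [24, 48, 20, 39, 56, 53, 49, 50, 42, 27],
}


def mean(values):
    """Plain (non weighted) arithmetic mean."""
    return sum(values) / len(values)


averages = {name: round(mean(ages), 1) for name, ages in top_ages.items()}
for name in sorted(averages):
    print(name, averages[name])
```

A weighted version would multiply each age by the number of listeners of that age, which needs the frequencies and not just the ranking.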

Imagine that your cool collaborative filtering $1M worth algorithm tells you that Madonna and Miley are similar.
Now, we have a user, ImStillYoung, who is 45 years old. He likes Madonna, and other 80s pop music. Would you recommend Miley to that user?

 

Music Recommendation Datasets

While I was in the funtastic world of research, doing my PhD, I created a couple of datasets using the Last.fm API.

I had in mind that "size matters", so I tried to create big datasets. I created two datasets. After two years (!!!) the first dataset is available here

  1. lastfm-dataset-360K contains information about total artist playcount per user (for ~360,000 users). It includes some basic user profile information (age, gender, country and signup date).
  2. lastfm-dataset-1K contains information of the whole listening history for 1,000 users.


lastfm-dataset-360K

Some basic stats for this dataset:

  • Total Lines:            17,562,018
  • Unique Users:             359,347
  • Unique Artist (MBID):  159,732
  • Unique Artist (string):  291,595

Now, let's remember some useful bash commands to get some more stats (frequencies, counts, etc.) from the dataset.

Top-artists

$ cut -f 2,3 usersha1-artmbid-artname-plays.tsv | sort -T . | uniq -c | sort -rn > artists.freq

77348 a74b1b7f-71a5-4011-9441-d0b5e4122711   radiohead
  76339 b10bbbfc-cf9e-42e0-be17-e2c3e1d2600d   the beatles
  66738 cc197bad-dc9c-440d-a5b5-d52ba2e14234   coldplay
  48989 8bfac288-ccc5-448d-9573-c33ea2aa5c30   red hot chili peppers
  47015 9c9f1380-2516-4fc9-a3e6-f9f61941d090   muse
  45301 65f4f0c5-ef9e-490c-aee3-909e7ae6b2ab   metallica
  44506 83d91898-7763-47d7-b03b-b92132375c47   pink floyd
  41280 95e1ead9-4d31-4808-a7ac-32c3614c116b   the killers
  39833 f59c5520-5f46-4d2c-b2c4-822eabf53419   linkin park
  37324 cc0b7089-c08d-4c10-b6b0-873582c17fd6   system of a down
  34215 0383dadf-2a4e-4d10-a46a-e9e041da8eb3   queen
  33247 a3cb23fc-acd3-4ce0-8f36-1e5aa6a18432   u2
  33040 056e4f3e-d505-4dad-8ec1-d04f521cbb56   daft punk
  32673 69ee3720-a7cb-4402-b48d-a02c366f2bcf   the cure
  32341 678d88b2-87b0-403b-b63d-5da7465aecc3   led zeppelin
  ...

This chart resembles the actual last.fm charts. Notice, though, the absence here of Lady Gaga (the dataset is from 2008). Looking at the charts from December 2008, we can see that the correlation is pretty high.

Nice, we have a big dataset, and it seems to be a representative sample of the whole last.fm user base (30 million users).
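
If you prefer Python to bash, the same top-artists count can be sketched with a Counter. The column order (user, MBID, artist name, plays) follows the cut commands used here; the sample rows are made up, and for the real file you would pass open("usersha1-artmbid-artname-plays.tsv", encoding="utf-8") instead.

```python
from collections import Counter


def top_artists(lines, n=10):
    """Count how many users have each (MBID, artist name) pair in their profile."""
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 4:
            continue  # skip malformed rows
        _user, mbid, name, _plays = fields[:4]
        counts[(mbid, name)] += 1
    return counts.most_common(n)


# Made-up sample rows in the same four-column layout.
sample = [
    "user1\tmbid-r\tradiohead\t120\n",
    "user1\tmbid-b\tthe beatles\t45\n",
    "user2\tmbid-r\tradiohead\t300\n",
    "user3\tmbid-r\tradiohead\t12\n",
    "user3\tmbid-b\tthe beatles\t80\n",
    "user3\tmbid-m\tmuse\t7\n",
]
for (mbid, name), users in top_artists(sample):
    print(users, mbid, name)
```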


Top-artists in Mexico

$ join -t'      ' usersha1-profile.tsv usersha1-artmbid-artname-plays.tsv | grep "   Mexico  " | cut -f 6,7 | sort | uniq -c | sort -rn > artists-mexico.freq # that is: join -t'\t' and grep "\tMexico\t"

1581 a74b1b7f-71a5-4011-9441-d0b5e4122711   radiohead
   1414 b10bbbfc-cf9e-42e0-be17-e2c3e1d2600d   the beatles
   1321 c2b37a39-c66a-44b2-b190-a69485ae5d95   café tacuba
   1103 cc197bad-dc9c-440d-a5b5-d52ba2e14234   coldplay
   1021 9c9f1380-2516-4fc9-a3e6-f9f61941d090   muse
    981 95e1ead9-4d31-4808-a7ac-32c3614c116b   the killers
    974 83d91898-7763-47d7-b03b-b92132375c47   pink floyd
    896 056e4f3e-d505-4dad-8ec1-d04f521cbb56   daft punk
    803 e6e1e76f-afee-4990-aad0-056199d94918   babasónicos
    778 8538e728-ca0b-4321-b7e5-cff6565dd4c0   depeche mode
    769 69ee3720-a7cb-4402-b48d-a02c366f2bcf   the cure
    760 87c5dedd-371d-4a53-9f7f-80522fb7f3cb   björk
    757 65f4f0c5-ef9e-490c-aee3-909e7ae6b2ab   metallica
    733 3f8a5e5b-c24b-4068-9f1c-afad8829e06b   soda stereo
    683 b23e8a63-8f47-4882-b55b-df2c92ef400e   interpol
    ...
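
The join + grep combination translates to Python as "collect the users from that country first, then count only their rows". A sketch under the same assumptions as the bash version: country is the 4th column of the profile file and the user hash is the 1st column of both files. The sample rows (including the gender and signup values) are made up.

```python
from collections import Counter


def top_artists_by_country(profile_lines, play_lines, country, n=10):
    """Top (MBID, artist name) pairs among users from a given country."""
    # Step 1: users whose profile has the requested country (4th column).
    users = set()
    for line in profile_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 4 and fields[3] == country:
            users.add(fields[0])
    # Step 2: count artists only for those users.
    counts = Counter()
    for line in play_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 4 and fields[0] in users:
            counts[(fields[1], fields[2])] += 1
    return counts.most_common(n)


# Made-up sample data in the layouts described above.
profiles = [
    "user1\tf\t25\tMexico\tJan 1, 2007\n",
    "user2\tm\t31\tSpain\tFeb 2, 2008\n",
    "user3\tm\t19\tMexico\tMar 3, 2008\n",
]
plays = [
    "user1\tmbid-c\tcafé tacuba\t50\n",
    "user2\tmbid-r\tradiohead\t10\n",
    "user3\tmbid-c\tcafé tacuba\t70\n",
    "user3\tmbid-r\tradiohead\t5\n",
]
print(top_artists_by_country(profiles, plays, "Mexico"))
```

Matching the country on its own column also avoids the accidental matches that a plain grep over the joined lines could produce.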


#Users that have played The Dogs d'Amour

$ grep "the dogs d'amour" usersha1-artmbid-artname-plays.tsv | cut -f 1 | uniq -c -i | wc -l

  62


Total playcounts for The Dogs d'Amour

$ grep "the dogs d'amour" usersha1-artmbid-artname-plays.tsv  | cut -f 4 | awk '{s+=$1} END {print s}'

  15994


Plays for user edb46e3e7f368a380bdeadf611585a98cab19704

$ grep edb46e3e7f368a380bdeadf611585a98cab19704 usersha1-artmbid-artname-plays.tsv | cut -f 4 | awk '{s+=$1} END {print s}'

   2450

Where do users come from?

$ cut -f 4 usersha1-profile.tsv | sort | uniq -c | sort -rn > user-countries.freq

67044 United States
  31651 Germany
  29902 United Kingdom
  20987 Poland
  19833 Russian Federation
  14534 Brazil
  13122 Sweden
  13051 Spain
  11579 Finland
   9650 Netherlands
   8679 Canada
   7529 France
   7525 Italy
   7135 Australia
   6637 Japan
    ...


When did I sign up?

$ cut -f 5 usersha1-profile.tsv | cut -d, -f 2 | sort | uniq -c | sort -rn

120808  2008
  99963  2007
  67951  2006
  32845  2009
  29933  2005
   6919  2004
    889  2003
     39  2002


How old are you?

$ cut -f 3 usersha1-profile.tsv | sort | uniq -c | sort -rn > user-age.freq

   freq age
  74900      <- empty string (users that did not set their age in the profile)
  24054 21
  24037 20
  22261 19
  22177 22
  19890 23
  17568 24
  17014 18
  15369 25
  13294 26
    ...

It's clear that we have some funny data here, that needs some cleaning. So, be careful when using the age attribute:

   freq age 
    564 109
    390 1
    261 2
     59 102
     57 -1337
     48 99
    ...

Looks like there are a few grandmas, grandpas and new-borns in last.fm! :-)
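
One simple way to clean the age attribute is to treat empty, non-numeric and implausible values as missing. A sketch, where the clean_age name and the accepted range (10 to 90) are arbitrary assumptions you should adapt to your own needs:

```python
def clean_age(raw, low=10, high=90):
    """Return the age as an int, or None if it is missing or implausible."""
    raw = raw.strip()
    if not raw:
        return None  # user did not set an age
    try:
        age = int(raw)
    except ValueError:
        return None  # not a number at all
    return age if low <= age <= high else None


for value in ["21", "", "109", "1", "-1337", "48"]:
    print(repr(value), "->", clean_age(value))
```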

--

That's all for now. I hope you can use this dataset for your research!