AOL Releases (and Quickly Removes) Search Records of 0.5m Users
[EDIT] You can find some gems mined from this data over at the plentyoffish blog (and while you're there, learn about a guy who makes >$10k PER DAY from AdSense on his free dating site).
According to this post, AOL released, and very, very promptly removed, the entire search records of 500,000 users collected over a three-month period.
Apart from the obvious privacy concerns (most likely the reason for the removal), this data represents a unique opportunity to research what people search for and the iterative approach they take within their searches. You can see the initial queries people use and how they refine those queries to find the results they want.
It is also interesting because, to the best of my knowledge, AOL Search repackages Google’s search, so in essence this is really Google data (Google also recently announced its intention to release 30GB of word/phrase data).
From the ReadMe.txt :
This collection consists of ~20M web queries collected from ~650k users over three months.
The data set includes {AnonID, Query, QueryTime, ItemRank, ClickURL}.
AnonID – an anonymous user ID number.
Query – the query issued by the user, case shifted with most punctuation removed.
QueryTime – the time at which the query was submitted for search.
ItemRank – if the user clicked on a search result, the rank of the item on which they clicked is listed.
ClickURL – if the user clicked on a search result, the domain portion of the URL in the clicked result is listed.
Each line in the data represents one of two types of events:
1. A query that was NOT followed by the user clicking on a result item.
2. A click through on an item in the result list returned from a query.
Normalized queries:
36,389,567 lines of data
21,011,340 instances of new queries (w/ or w/o click-through)
7,887,022 requests for “next page” of results
19,442,629 user click-through events
16,946,938 queries w/o user click-through
10,154,742 unique (normalized) queries
657,426 unique user ID’s
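To make the ReadMe's description concrete, here is a rough Python sketch of how one might read such a file, separate the two event types, and follow how each user refines their queries. It rests on assumptions the ReadMe excerpt does not confirm: that the file is tab-separated, that it may start with a header row naming the five fields, and that lines are ordered by time within each user. The names Event, parse_events, tally and query_chains are hypothetical, and the sample rows are invented for illustration, not taken from the AOL data.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Iterator, List, NamedTuple, Optional

# Field names as listed in the ReadMe.
FIELDS = ("AnonID", "Query", "QueryTime", "ItemRank", "ClickURL")


class Event(NamedTuple):
    anon_id: str
    query: str
    query_time: str
    item_rank: Optional[int]  # None when no result was clicked
    click_url: Optional[str]  # None when no result was clicked


def parse_events(lines: Iterable[str]) -> Iterator[Event]:
    """Yield one Event per data line (ASSUMED tab-separated, optional header)."""
    for line in lines:
        parts = line.rstrip("\r\n").split("\t")
        if not parts[0] or parts[0] == FIELDS[0]:
            continue  # skip blank lines and the assumed header row
        parts += [""] * (len(FIELDS) - len(parts))  # pad missing trailing fields
        anon_id, query, query_time, rank, url = parts[:5]
        yield Event(anon_id, query, query_time,
                    int(rank) if rank else None, url or None)


def tally(events: Iterable[Event]) -> Counter:
    """Count the two event types the ReadMe describes."""
    counts: Counter = Counter()
    for e in events:
        counts["click" if e.click_url is not None else "no_click"] += 1
    return counts


def query_chains(events: Iterable[Event]) -> Dict[str, List[str]]:
    """Per user, list the successive distinct queries (ASSUMES time order)."""
    chains: Dict[str, List[str]] = defaultdict(list)
    for e in events:
        chain = chains[e.anon_id]
        if not chain or chain[-1] != e.query:
            chain.append(e.query)  # a changed query suggests a refinement
    return dict(chains)


# Invented sample rows, for illustration only.
SAMPLE = [
    "AnonID\tQuery\tQueryTime\tItemRank\tClickURL",
    "1\tcheap flights\t2006-03-01 10:00:00\t\t",
    "1\tcheap flights to rome\t2006-03-01 10:01:30\t2\thttp://example.com",
    "1\tcheap flights to rome\t2006-03-01 10:01:30\t5\thttp://example.org",
    "2\tweather\t2006-03-02 08:15:00\t\t",
]

if __name__ == "__main__":
    events = list(parse_events(SAMPLE))
    print(tally(events))
    for user, chain in query_chains(events).items():
        print(user, " -> ".join(chain))
```

For the real files you would pass an open file handle (for example from gzip.open(path, "rt")) to parse_events instead of SAMPLE. Since the full set runs to tens of millions of lines, streaming it line by line like this is probably wiser than loading everything into memory at once.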
And this being the wonderful Internet, the 439MB compressed file is still floating around with the filename AOL-data.tgz. Here are some mirrors I know of:
http://www.yousendit.com/transfer.php?…BB5BE
http://rapidshare.de/files/2848….01.txt.gz
http://rapidshare.de/files/2848….02.txt.gz
http://rapidshare.de/files/2848….03.txt.gz
http://rapidshare.de/files/2848….04.txt.gz
http://rapidshare.de/files/2848….05.txt.gz
http://rapidshare.de/files/2848….06.txt.gz
http://rapidshare.de/files/2848….07.txt.gz
http://rapidshare.de/files/2848….08.txt.gz
http://rapidshare.de/files/2848….09.txt.gz
http://rapidshare.de/files/2848….10.txt.gz
[...] During the summer of 2006, the American company AOL gave access to the browsing data of nearly 500,000 users. They soon realized it was a big strategic mistake (was it?) and removed that access. As usually happens in these cases, someone very quick and generous put the data back up (it can still be downloaded here) [...]
Pingback by Como estimar el tráfico de una palabra clave | Isaac Sunyer — November 4, 2008 @ 8:16 pm
These downloads no longer work. Any other mirrors?
Comment by Bill Allen — April 27, 2013 @ 9:36 am