Still wondering how likely it is to land on a drive-by download page when doing a (Google) search, I analyzed the infamous AOL search data to try to answer this question.
Conclusion: for every 2800 click throughs, 1 landed on a spamdexing site. About 1% of the AOL users who clicked through landed on a spamdexing site.
The AOL search data was collected over a period of 3 months (01 March, 2006 – 31 May, 2006) and contains 19,442,629 user click through events. A click through event is an entry in the database indicating that the AOL user clicked on a link presented in a Search Engine Result Page (SERP). These are the fields of a click through event entry:
- AnonID – an anonymous user ID number.
- Query – the query issued by the user, case shifted with most punctuation removed.
- QueryTime – the time at which the query was submitted for search.
- ItemRank – if the user clicked on a search result, the rank of the item on which they clicked is listed.
- ClickURL – if the user clicked on a search result, the domain portion of the URL in the clicked result is listed.
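As an illustration, the five fields above could be loaded as follows. This is a minimal sketch, assuming the data files are tab-separated text with a header row naming the fields; the names ClickEvent and read_click_events are hypothetical and not part of the original analysis.

```python
import csv
from typing import Iterator, NamedTuple, TextIO


class ClickEvent(NamedTuple):
    anon_id: str
    query: str
    query_time: str
    item_rank: int
    click_url: str


def read_click_events(handle: TextIO) -> Iterator[ClickEvent]:
    """Yield only the rows that record a click through.

    Assumption: tab-separated input with a header line
    AnonID, Query, QueryTime, ItemRank, ClickURL.
    """
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        # Queries without a click leave ItemRank and ClickURL empty; skip them.
        if not row.get("ClickURL"):
            continue
        yield ClickEvent(
            anon_id=row["AnonID"],
            query=row["Query"],
            query_time=row["QueryTime"],
            item_rank=int(row["ItemRank"]),
            click_url=row["ClickURL"],
        )
```

Rows for queries that did not lead to a click are dropped here, since only click through events matter for the counts that follow.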
Of the 19,442,629 user click through events, 15,066 events have a URL of the following format: digits.alphanums.info.
Expressed as a Perl regular expression, this format is: %\d+\.\w+\.info$ (\w is not strictly limited to alphanumeric characters; it also matches the underscore character).
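The same test can be sketched in Python. The pattern below is an approximation, not the exact expression used for the analysis: it assumes the digits label starts either at the beginning of the string or right after a slash (so that values with an http:// prefix also match). The function name and the sample URLs are made up for illustration.

```python
import re

# Assumption: the host starts at the beginning of the value or after a "/".
CANDIDATE = re.compile(r"(?:^|/)\d+\.\w+\.info$")


def is_candidate(click_url: str) -> bool:
    """Return True for URLs of the form digits.alphanums.info."""
    return CANDIDATE.search(click_url) is not None


# Hypothetical examples:
print(is_candidate("http://12345.some_site.info"))  # underscore is accepted by \w
print(is_candidate("http://www.google.com"))
```

As in Perl, \w in Python also matches the underscore, so some_site counts as an "alphanums" label.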
I searched for URLs of this format because it was used by the original drive-by download I discovered.
Extracting the main domain (alphanums.info) of the URL of these 15,066 click through events produces a list of 1099 unique domains.
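Reducing the matching URLs to their main domains could look like the sketch below. It assumes the same approximate pattern as described above (digits label at the start of the value or after a slash); main_domain and unique_domains are hypothetical helper names, and the sample URLs are invented.

```python
import re

# Capture the alphanums.info part that follows the leading digits label.
PATTERN = re.compile(r"(?:^|/)\d+\.(\w+\.info)$")


def main_domain(click_url):
    """Return alphanums.info for a matching URL, else None."""
    match = PATTERN.search(click_url)
    return match.group(1) if match else None


def unique_domains(click_urls):
    """Collect the distinct main domains over all matching URLs."""
    domains = set()
    for url in click_urls:
        domain = main_domain(url)
        if domain is not None:
            domains.add(domain)
    return domains


urls = ["http://1.foo.info", "http://22.foo.info", "http://3.bar.info"]
print(sorted(unique_domains(urls)))
```

Many different digits labels collapse onto the same main domain, which is why 15,066 events reduce to a much shorter list of domains.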
I wrote a script to retrieve and analyze one page for each of these 1099 domains. 874 of these pages have the same look and feel as the original drive-by download page:
- they contain an iframe to one of these URLs:
- they contain lots of keywords
- they look like Google SERPs
- they contain lots of links to other digits.alphanums.info pages
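Two of the checks in the list above (an iframe, and many links to other digits.alphanums.info pages) can be approximated with the standard library HTML parser. This is only a sketch and not the script used for the analysis: the class and function names are hypothetical, the iframe targets are collected rather than compared against a known list, and deciding how many links count as "lots" is left to the caller.

```python
import re
from html.parser import HTMLParser

# Links whose host has the form digits.alphanums.info
NETWORK_LINK = re.compile(r"^https?://\d+\.\w+\.info(?:/|$)")


class PageFeatures(HTMLParser):
    """Collect iframe sources and count links into the network."""

    def __init__(self):
        super().__init__()
        self.iframe_srcs = []
        self.network_links = 0

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "iframe" and attributes.get("src"):
            self.iframe_srcs.append(attributes["src"])
        elif tag == "a" and NETWORK_LINK.match(attributes.get("href") or ""):
            self.network_links += 1


def extract_features(html):
    """Return (list of iframe src values, number of network links)."""
    parser = PageFeatures()
    parser.feed(html)
    parser.close()
    return parser.iframe_srcs, parser.network_links
```

The keyword stuffing and the "looks like a Google SERP" checks are not covered here; they need a judgment on the page's text and layout that this sketch does not attempt.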
This leads me to believe that these 874 domains form a network of sites that use spamdexing techniques to rank high in SERPs. They use
- lots of keywords
- lots of links to different domains (874 domains)
- lots of different IP addresses (352 unique IP addresses)
From now on, I’ll refer to these sites as Spamdexing “R” Us.
These domains are used now (October 2006) for spamdexing, and I assume they were also used for spamdexing 6 months ago (time frame of the AOL data).
Of the 19,442,629 user click through events, 6,988 events landed on a Spamdexing “R” Us site (i.e. one of the 874 domains I identified). This is 0.04%, or around 1 hit per 2800 SERP click throughs! According to some people I talked with, this is an excellent result for Spamdexing “R” Us: for every 2800 SERP click throughs the AOL users executed, 1 landed in their spider web.
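The arithmetic behind these two figures can be checked directly from the counts quoted above; only the variable names are invented here.

```python
TOTAL_CLICK_THROUGHS = 19_442_629
NETWORK_CLICK_THROUGHS = 6_988

# Share of all click throughs that landed on one of the 874 domains.
share_percent = 100 * NETWORK_CLICK_THROUGHS / TOTAL_CLICK_THROUGHS

# Number of click throughs per hit on the network.
click_throughs_per_hit = TOTAL_CLICK_THROUGHS / NETWORK_CLICK_THROUGHS

print(f"{share_percent:.2f}% of click throughs")
print(f"1 hit per {click_throughs_per_hit:.0f} click throughs")
```

The unrounded share is just under 0.036%, and the ratio comes out a little below 2800, which is the rounded value used in the text.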
Spamdexing “R” Us rank high on the SERPs:
41% of the traffic comes from the 3 highest ranking click throughs.
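A figure like this can be derived from the ItemRank field of the events that landed on the network. The helper below is a hypothetical sketch; the sample ranks are made up and are not taken from the AOL data.

```python
def top_rank_share(item_ranks, cutoff=3):
    """Fraction of click throughs whose ItemRank is within the top `cutoff`."""
    ranks = list(item_ranks)
    if not ranks:
        return 0.0
    in_top = sum(1 for rank in ranks if rank <= cutoff)
    return in_top / len(ranks)


# Invented sample: 3 of 5 clicks were on results ranked 1 to 3.
print(top_rank_share([1, 2, 3, 4, 10]))
```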
How do Spamdexing “R” Us sites compare to the other click through sites in the AOL search data? Ranking all the click throughs per URL shows that Spamdexing “R” Us sites rank high: 142nd place. As a side note, it’s interesting to mention that the number 1 in the ranking is http://www.google.com, with 366,623 click throughs.
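Such a ranking can be sketched by counting click throughs per URL while pooling all network domains under a single label. The function names, the label, and the way the main domain is derived (last two labels of the host) are assumptions made for illustration; the sample data is invented.

```python
from collections import Counter


def main_domain(url):
    """Last two labels of the host, e.g. foo.info for http://1.foo.info."""
    host = url.split("//", 1)[-1].split("/", 1)[0]
    return ".".join(host.split(".")[-2:])


def rank_of_network(click_urls, network_domains, label="Spamdexing R Us"):
    """Return (rank, click count) of the pooled network, or None if absent."""
    counts = Counter()
    for url in click_urls:
        key = label if main_domain(url) in network_domains else url
        counts[key] += 1
    for position, (name, clicks) in enumerate(counts.most_common(), start=1):
        if name == label:
            return position, clicks
    return None


clicks = ["http://www.google.com"] * 3 + ["http://1.foo.info", "http://2.bar.info"]
print(rank_of_network(clicks, {"foo.info", "bar.info"}))
```

Pooling matters here: each individual domain receives few click throughs, and only the network taken as a whole reaches a high position.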
Here’s a selection of some well-known sites that are in the same click through range as the Spamdexing “R” Us sites:
|142||Spamdexing “R” Us||6,988|
The AOL search data contains 657,426 unique user IDs. 521,694 users clicked on links in the SERPs, and 4,952 users landed on Spamdexing “R” Us sites. That’s about 1 AOL user per 100 (0.95%) in a 3-month period.
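The per-user figure counts distinct AnonID values rather than events. A sketch of that count, assuming the events are available as (AnonID, ClickURL) pairs and the network is given as a set of main domains; the function name is hypothetical, and the last lines simply redo the division with the numbers quoted above.

```python
def user_exposure(events, network_domains):
    """Return (users who landed on the network, users who clicked, percent)."""
    clicked, landed = set(), set()
    for anon_id, click_url in events:
        clicked.add(anon_id)
        host = click_url.split("//", 1)[-1].split("/", 1)[0]
        if ".".join(host.split(".")[-2:]) in network_domains:
            landed.add(anon_id)
    percent = 100 * len(landed) / len(clicked) if clicked else 0.0
    return len(landed), len(clicked), percent


# Check of the quoted percentage, using the counts from the text.
quoted_percent = 100 * 4_952 / 521_694
print(f"{quoted_percent:.2f}%")
```

A user who lands on the network several times is counted once, which is why 6,988 events correspond to only 4,952 users.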
Some caveats / remarks concerning this research:
- I don’t feel I’m prying into AOL users’ private lives: the URLs I analyzed are meaningless and I didn’t analyze the queries.
- The published AOL search data is only a fraction of the AOL search data for that time period. I don’t know how the selection was made.
- My research is post factum. I assume that the Spamdexing “R” Us sites were already spamdexing sites since 01 March, 2006.
- There can be other spamdexing sites in the AOL search data that don’t use digits.alphanums.info URLs.
- I crawled the Spamdexing “R” Us sites over a period of a couple of weeks, during which the iframe to the drive-by download site disappeared.
- The size of the Spamdexing “R” Us network is probably larger than I mention (874 domains, 352 IP addresses). I only looked at the part of the spider web that trapped AOL users.
- I talk about AOL users, but more precisely, I should talk about AOL search users. I suppose not all AOL users use AOL search.
- I did not analyze the Query and QueryTime fields
- joy thinks AOL search is powered by Google
- The WHOIS data for the Spamdexing “R” Us sites is complete nonsense
- I don’t know what the relationship is between cleansearch.info, http://www.cucush.info, http://www.veryfastsearch.info and the Spamdexing “R” Us sites
- I’ve found Spamdexing “R” Us pages in English, French, German, Spanish and Italian
- The Spamdexing “R” Us sites use DNS wildcards
- It’s difficult to judge the success of Spamdexing “R” Us without knowing their business model, costs and revenues. If it’s pay per click (0.04%), I don’t know. If it’s installing a bot on the computers of AOL search users, it’s successful (1%).