You Say Toe-mato, I Say Tomar-to
If you want to make search sound bad, call it scraping. If you want to make it sound good, call it crawling.
The interesting nuance has become even more pronounced during the furor over Craigslist not allowing Oodle to crawl its listings.
In the comments of a Silicon Beat post, Dave McClure of Simply Hired articulates a great definition of each term, which I unequivocally agree with. But people are using the two words in a tabloid news way to further a point of view.
Tom Foremski, a former Financial Times journalist, says that crawlers add little value and take up lots of resources. After analyzing his log files he comes to the conclusion that “The search-and-scrapers sucked out one-third of my bandwidth and provided just 3.7 percent of the traffic!”
Tom makes the horrible mistake of thinking that one page of crawling equals one page of viewing. Firstly, and most obviously, a human referred by the ‘search-and-scrape’ sites is likely to look at more than one page.
Is it more than 10 pages? That is also a futile argument. The cost to serve a page is minuscule compared to the monetization rate of each page. Even with the crappiest of techniques people should be able to get a $2 CPM, which equates to 0.2 cents per page. It does not cost anywhere near 0.2 cents to serve a page. Bandwidth costs are also declining at a Moore’s-law-like rate.
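To put rough numbers on that claim, here is a minimal Python sketch. The $2 CPM comes from the paragraph above; the 50 KB page size and the $2 per GB bandwidth price are illustrative assumptions, not measured figures, so swap in your own.

```python
def revenue_per_page_cents(cpm_dollars):
    """CPM is dollars per 1,000 page views; return cents earned per single view."""
    return cpm_dollars / 1000 * 100


def bandwidth_cost_per_page_cents(page_kb, dollars_per_gb):
    """Cents to ship one page of page_kb kilobytes (taking 1 GB = 1,000,000 KB)."""
    return page_kb / 1_000_000 * dollars_per_gb * 100


revenue = revenue_per_page_cents(2.0)            # $2 CPM, from the post
cost = bandwidth_cost_per_page_cents(50, 2.0)    # ASSUMED: 50 KB page, $2 per GB

print(f"revenue per page:   {revenue:.2f} cents")
print(f"bandwidth per page: {cost:.2f} cents")
print(f"revenue is roughly {revenue / cost:.0f}x the bandwidth cost")
```

Under those assumptions a page earns on the order of twenty times what it costs to ship, so a crawler would have to fetch many pages per referred reader before the sums turn negative.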
So even with a 3.7% referral volume and one third of his bandwidth going to crawlers, Tom still comes out on top by a mile. The fact that he only has 3.7% referral volume from the search engines should also worry Tom, as that is markedly below what you would normally expect.
There is also the delicious irony that Tom searches and scrapes his own site. Everyone who has an RSS feed does. A feed essentially creates a set of summaries that link back to the main site. Tom even includes the full text in his RSS feed. Don’t get me wrong, I read and enjoy his musings every day.
RSS has the potential to radically change the crawling world. Crawlers can essentially subscribe to feeds (Oodle should do this now that it is not allowed to crawl Craigslist). A better pinging and notification mechanism is needed to cut down on bandwidth, so that crawlers don’t have to dumbly guess when to come back, but I have faith that will be solved.
In the meantime, ignorant arguments over scraping and stealing will continue to be made, however much in jest they may originally have been intended.
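A minimal sketch of what subscribing could look like, in Python. It assumes a plain RSS 2.0 feed; the function names, the sample feed, and the example.com URLs are hypothetical. The conditional headers (If-None-Match, If-Modified-Since) are standard HTTP and let a server answer 304 Not Modified instead of resending an unchanged feed, which trims some of the wasted bandwidth even without a ping mechanism.

```python
import urllib.request
import xml.etree.ElementTree as ET


def conditional_request(url, etag=None, last_modified=None):
    """Build a GET request carrying the validators saved from the last fetch."""
    req = urllib.request.Request(url)
    if etag:
        req.add_header("If-None-Match", etag)
    if last_modified:
        req.add_header("If-Modified-Since", last_modified)
    return req


def new_items(feed_xml, seen_links):
    """Return (title, link) pairs for RSS 2.0 items not already in seen_links."""
    root = ET.fromstring(feed_xml)
    fresh = []
    for item in root.iter("item"):
        link = item.findtext("link", default="").strip()
        title = item.findtext("title", default="").strip()
        if link and link not in seen_links:
            fresh.append((title, link))
    return fresh


# Hypothetical feed, inlined so the example runs without a network connection.
SAMPLE_FEED = """<rss version="2.0"><channel>
<title>Example listings</title>
<item><title>Bike for sale</title><link>http://example.com/1</link></item>
<item><title>Sofa, free</title><link>http://example.com/2</link></item>
</channel></rss>"""

seen = {"http://example.com/1"}
print(new_items(SAMPLE_FEED, seen))
```

In a real crawler the request from conditional_request would be passed to urllib.request.urlopen, and the ETag and Last-Modified values from each response would be stored for the next poll.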

Niki, enjoying the issues you tackle and the logic behind your arguments.
+ + +
Indeed, Tom is probably only acknowledging the tip of the iceberg in terms of search. The 3.7% number is actually from the “hits” column. (Hits?) A more accurate analysis, for the purposes of judging the true value of search, would involve tagging every visitor who was introduced to the site via a search engine, and all subsequent visits, as search traffic, IMHO. Then there are people who found your site via a search engine, then told a friend, and so on…
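As a rough illustration of that first-touch idea, here is a minimal Python sketch. The log format (a visitor id plus a referrer) and the list of search hosts are assumptions made up for the example, not anything taken from Tom’s logs.

```python
SEARCH_HOSTS = ("google.", "yahoo.")  # ASSUMED list of search referrers


def is_search(referrer):
    """True if the referrer string mentions one of the assumed search hosts."""
    return any(host in referrer for host in SEARCH_HOSTS)


def search_share(visits):
    """visits: (visitor_id, referrer) pairs in time order.

    A visitor whose first visit came from a search engine has every
    later visit counted as search traffic too.
    """
    first_touch = {}
    search_visits = 0
    for visitor_id, referrer in visits:
        if visitor_id not in first_touch:
            first_touch[visitor_id] = is_search(referrer)
        if first_touch[visitor_id]:
            search_visits += 1
    return search_visits / len(visits) if visits else 0.0


log = [
    ("a", "http://www.google.com/search?q=crawling"),
    ("a", ""),  # direct return visit, still credited to search
    ("b", ""),  # arrived directly, never credited to search
    ("a", ""),
]
print(search_share(log))
```

Counting only the referrer on each hit would credit search with one visit in four here; counting by first touch credits it with three.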
Niki, thank you for running through the numbers, I am always pleased to be educated by my readers! However, there is a problem with your suggestion of using RSS to scrape: not all sites put out all their content in RSS, and Craigslist users delete posts, so Oodle is frantically trying to update to keep fresh.
I still think it is a high resource cost compared with what comes back. It isn’t so much the cost for the bandwidth ($2 per GB) but the fact that the internet must be groaning under the weight of the search and scraper crawlers, especially as ever more come online.
My point is simple: scrape and crawl is cheap, anyone can do it; content such as mine and that of other journalists is not cheap to produce. They need to offer something of greater value, otherwise they are vulnerable to being turned away at the door…
“groaning”? c’mon tom… the reason these sites are in business is to serve up millions of pages & handle huge #’s of views/users.
i’m not sure what the story with oodle’s crawling was, and perhaps there was a throttling issue to be corrected, however in general, i seriously doubt the % of traffic from search engine crawlers is anything higher than low single-digits (if that) for any of the major portals / search engines out there.
re: content being valuable — i absolutely agree with you. however, being FINDABLE is also pretty important. and furthermore, building applications on top of structured data *does* deliver significant value, across a wide variety of verticals.
you (and craigslist and everyone) always have the right to opt-out of being indexed, but i would argue it’s likely not in most folks’ best interests to not be found on Google or Yahoo… similarly, it’s likely not in most classified listing publishers’ best interests to not be found on search engines that provide info on that domain (or in our case, job listings & job search engines).
like Google & Yahoo, vertical search engine sites like Simply Hired and Oodle (and others in real estate, travel, shopping, etc) deliver millions of users / page views in traffic to data publishers by sending them to sites that might not otherwise be found. if those sites / publishers don’t think that’s of value then it’s certainly their prerogative to remain hidden in the “dark web”, however it seems like a rather Luddite position to take on the matter.
walled garden approaches to data access either by data publishers or portals that host that data, will only result in their marginalization and accelerated irrelevance. aside from the original content owner & data publishers, there are very few of those sources that can be kept proprietary, or can remain so in the future with an open & competitive market for data access & transparency.
for more on the subject of “walled garden” vs “open access”, see our recent blog post:
http://blog.simplyhired.com/archives/2005/10/google_there_ca_1.php
- dave mcclure
http://www.simplyhired.com
[…] Jim Buckmaster, CEO of Craigslist and star of the Oodle crawler controversy, should take a leaf out of Hilton Hotels’ book. […]
[…] The meme of search engines preying upon content sites has been gaining steam, which first started with Tom Foremski and Craigslist CEO Jim Buckmaster and the decision by Craigslist to ban Oodle, a classifieds aggregator. […]