Protecting Content And Handling Robots
Your customer has been building up his database for years. He has made arrangements with suppliers and dealers, and acquired a reputation and a customer base. He has paid full-time staff to fill the tables, write descriptions, verify prices, and scan and process images, and a database administrator to maintain the database. He hired you to develop his new search-engine-optimized, database-driven e-commerce site. You have spent three or four months working hard to program the application and attract visitors. You did a good SEO job, got the site indexed and ranked, and now a few hundred unique visitors arrive from search engines every day.
Here comes the trouble. Besides intelligent and welcome robots like Google, Yahoo, and MSN, dozens of others, unknown or nameless, flood your site, ignoring robots.txt, eating your bandwidth, and slowing down your server. Some users try to download the entire site with off-the-shelf ripper software. Here are a few examples:
- Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)
- Teleport Pro/1.29
- WebCopier v4.1
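One simple first-line defense is to compare the request's User-Agent header against a blocklist of known ripper signatures like those above. A minimal sketch in Python; the signature list and function name are illustrative, and a real blocklist would be longer and kept in configuration:

```python
# Hypothetical sketch: flag requests whose User-Agent matches a known ripper.
RIPPER_SIGNATURES = ("httrack", "teleport", "webcopier")

def is_ripper(user_agent):
    """Return True when the User-Agent contains a known ripper signature."""
    ua = (user_agent or "").lower()
    return any(sig in ua for sig in RIPPER_SIGNATURES)
```

Keep in mind that the User-Agent header is trivially forged, so this only stops software left at its default settings; the rate-based detection discussed later catches the rest.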
Some steal your content with dedicated, custom-made software. For instance, your customer's competitor may prefer not to repeat the hard work your customer has done, and instead hires a programmer to write automated software that downloads and structures content from competing projects. And finally, a malicious hacker spends hours analyzing your site for bugs and ways to break in, or scans it with security scanners like XSpider looking for security holes.
As a developer on such a project, you have to deal with these issues. Some of the above has happened to the author of this tutorial and led to the working solution described here.
There are two big problems you can expect with a large (100,000-page or bigger) project:
1. Robots
Robots never stop crawling such a web site. Their hits can exceed the number of human visitors a hundredfold. An intelligent robot normally crawls from a hundred to two thousand pages a day, so if only three robots (Google, Yahoo!, and MSN) are crawling your site simultaneously (and they are), you may see some 5,000 hits a day.
But actually, you may have about 10 robots simultaneously, and not all of them follow an intelligent crawling pattern. My customer’s server stopped serving pages when BigmirSpider had crawled 25,000 pages within a few hours.
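To see which robots are actually hammering the server, it helps to count hits per User-Agent in the web server's access log. A rough sketch, assuming the Apache combined log format where the User-Agent is the last double-quoted field on each line:

```python
from collections import Counter

def hits_per_user_agent(log_lines):
    """Count requests per User-Agent in combined-log-format lines."""
    counts = Counter()
    for line in log_lines:
        # Split at the last two quotes: ... "<referer>" "<user-agent>"
        parts = line.rsplit('"', 2)
        if len(parts) == 3:
            counts[parts[1]] += 1
    return counts
```

Sorting the resulting counts (`counts.most_common()`) immediately shows whether a single robot is responsible for a disproportionate share of the load.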
Nowadays many Internet projects and programmers develop their own crawlers for one purpose or another. Many of these only consume your site and will never bring a significant number of visitors, for various reasons: they look for images or email addresses, their main audience speaks a different language, or they do not have adequate facilities to serve the pages they index. In any case, when you analyze your web traffic you find that 80 to 90% of search-engine-referred visitors come from just a few major search engines, while dozens of other robots merely use up your server resources. You will need to prevent those from indexing your site.
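A cooperative first step is a robots.txt that explicitly allows the few major engines and disallows everyone else. Well-behaved robots honor it; the rippers and rogue crawlers discussed here typically do not, which is why the server-side measures below are still needed. A sketch (the allowed robot names reflect the major engines of the day: Googlebot for Google, Slurp for Yahoo!, msnbot for MSN):

```
User-agent: Googlebot
Disallow:

User-agent: Slurp
Disallow:

User-agent: msnbot
Disallow:

User-agent: *
Disallow: /
```

An empty Disallow line means "nothing is off limits" for that robot, while the final wildcard record tells every other robot to stay out entirely.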
2. Content rippers
Remember that any information retrieved from a database and included in web pages is uniform and structured, and can just as well undergo the reverse process: reading, filtering, stripping HTML tags, and inserting into someone else's database (I may write a tutorial on how to do that one day :). There is standard software for this, custom software developed for special purposes, and even software customized for a particular web site. You have to detect, intercept, and limit attempts to use such software.
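Detection usually comes down to counting requests per client over a sliding time window and refusing service past a threshold, since no human browses thousands of pages an hour. A minimal in-memory sketch; the class name, limit, and window are illustrative, and a real site would key on the client IP and persist counters in the database rather than in process memory:

```python
import time
from collections import defaultdict, deque

class RateLimiter:
    """Allow at most `limit` requests per client within `window` seconds."""

    def __init__(self, limit=100, window=3600):
        self.limit = limit
        self.window = window
        self.hits = defaultdict(deque)  # client id -> request timestamps

    def allow(self, client, now=None):
        """Record one request; return False once the client exceeds the limit."""
        now = time.time() if now is None else now
        q = self.hits[client]
        while q and now - q[0] > self.window:
            q.popleft()  # forget requests that fell outside the window
        if len(q) >= self.limit:
            return False  # too many requests: likely a ripper or rogue robot
        q.append(now)
        return True
```

Clients that exceed the limit can be served an error page or a CAPTCHA instead of content; legitimate robots you want to keep should be exempted from the check.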