Protecting Content And Handling Robots
Solving The Problem
We will develop the solution step by step, from a set of simple scripts to an intelligent system.
In the first stage we will look at protection against mass-download software that works within a single session: we will stop such software mid-session while still allowing a human visitor to browse the site. We will set a limit on the number of pages a visitor can browse; when the limit is reached, the visitor is taken to a form and asked to enter a generated validation code. This filters out most robots that operate within one session, because a robot cannot crawl pages beyond the limit.
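The session logic described above can be sketched as follows. This is a minimal illustration, not the final script: the `PAGE_LIMIT` value, the session dictionary, and the function names are all assumptions for the example.

```python
# Minimal sketch of a per-session page counter with a validation step.
# PAGE_LIMIT is an assumed value; a real site would tune it.
PAGE_LIMIT = 50

def check_visitor(session):
    """Increment the session's page count and decide what to do.
    `session` stands in for whatever per-session storage the site uses."""
    session["pages"] = session.get("pages", 0) + 1
    if session.get("validated"):
        return "serve"                   # visitor already passed the form
    if session["pages"] > PAGE_LIMIT:
        return "show_validation_form"    # over the limit: ask for the code
    return "serve"

def validate(session, entered_code, expected_code):
    """Mark the session as human if the entered code matches."""
    if entered_code == expected_code:
        session["validated"] = True
        session["pages"] = 0             # reset the counter after validation
        return True
    return False
```

A robot that never fills in the form keeps receiving the validation page once it exceeds the limit, while a human enters the code once and continues browsing.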
Then we will consider how to allow the robots we do want, such as search engine crawlers, to index the site. Finally, we will look at more advanced ways to handle visitors based on other parameters.
To test the scripts and model various situations, we will write a little web-site simulator that generates hundreds of thousands of pages, or even millions. (I actually work for a spare-parts vendor with some 8 million items in a Firebird database and about 15 million generated pages.)
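One simple way to build such a simulator is to generate each page deterministically from an item number, so millions of distinct pages exist without storing any of them. The sketch below assumes fabricated part numbers and prices derived from a hash; it is only a stand-in for the real catalog.

```python
import hashlib

def page_for_item(item_id: int) -> str:
    """Return a deterministic HTML page for a given item id.
    The part number and price are fabricated from an MD5 hash, so the
    same id always yields the same page and different ids differ."""
    digest = hashlib.md5(str(item_id).encode()).hexdigest()
    part_no = digest[:8].upper()
    price = int(digest[8:12], 16) / 100.0
    return (
        f"<html><head><title>Part {part_no}</title></head>"
        f"<body><h1>Part {part_no}</h1>"
        f"<p>Item #{item_id}, price ${price:.2f}</p>"
        f"</body></html>"
    )
```

Hooked up behind a single URL handler (e.g. `/item/12345`), this gives a test site of arbitrary size against which the protection scripts can be exercised.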