Protecting Content And Handling Robots
Common Ways To Prevent Content Ripping
The most common ways to prevent undesired activity boil down to blocking a visitor by user agent or IP address, checked against a hand-made blacklist built from web traffic analysis. On webmasters’ forums you can often read something like ‘While I was doing statistics analysis on my logs, I noticed …’. Another way is to disallow a robot in the robots.txt file. Finally, using forms instead of links for navigation can stop (though not always) automated software.
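The manual approach above can be sketched as a hand-made blacklist lookup. This is a minimal illustration, not a recommended design; the agent names and the IP (an RFC 5737 example address) are assumptions.

```python
# Hand-made blacklist built from log analysis, as described above.
# Entries here are illustrative examples only.
BLACKLISTED_AGENTS = ["WebRipper", "SiteSucker"]
BLACKLISTED_IPS = {"203.0.113.7"}

def is_blocked(user_agent: str, ip: str) -> bool:
    """Return True if the visitor matches the manual blacklist."""
    if ip in BLACKLISTED_IPS:
        return True
    return any(bad in user_agent for bad in BLACKLISTED_AGENTS)
```

The equivalent robots.txt rule (`User-agent: WebRipper` / `Disallow: /`) only works for robots that choose to obey it, which is exactly the weakness discussed below.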
These approaches fall short because:
- Web statistics analysis is done AFTER something bad has already happened
- Blacklisting IPs and robots only protects the site from repeated crawling, so it does not prevent the first incident
- A blocked IP may belong to an ISP, cutting off all of that ISP’s users from your web site
- Everything is done manually
- Not all robots follow robots.txt instructions
- Using forms instead of links stops all robots, so search engines will not index the pages
Moreover, some software does not identify itself: you cannot tell from the user agent string that it is a robot, because it looks like an ordinary browser.
What do we really need, ideally?
We need a system that will:
- Allow authorized robots to crawl the site without any obstacles
- Prevent unauthorized robots automatically
- Detect and block unauthorized activity (mass downloads) and notify us
- Track suspicious behaviors or patterns and notify us
- Be able to analyze behavior and distinguish real users from robots
- Enable us to configure limits based on a visitor’s parameters and the specifics of an individual project
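The last requirement, per-project configuration, might look like the following sketch. Every field name, default value, and address here is a hypothetical assumption, not an existing system’s API.

```python
from dataclasses import dataclass

@dataclass
class CrawlPolicy:
    """Per-project settings for an anti-ripping system (illustrative)."""
    max_pages_per_session: int = 100        # tuned from site statistics
    min_seconds_between_hits: float = 0.5   # faster than this looks automated
    allowed_agents: tuple = ("Googlebot", "bingbot")  # authorized robots
    notify_email: str = "admin@example.com"           # where alerts go

# A project with heavy legitimate browsing can raise its own limit.
policy = CrawlPolicy(max_pages_per_session=200)
```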
A human visitor views an average number of pages during a session; you can determine this number by analyzing your web site statistics. A search robot or content-ripping program, by contrast, can crawl thousands of pages within a session or a day. We have to let users view as many pages as they want, while stopping automated software from downloading our pages or robots from crawling them. Some robots scan a site within one session; others pause between requests, making only one hit per session.
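The baseline number of pages per session can be estimated from the access log. A minimal sketch, assuming log records reduced to `(session_id, url)` pairs:

```python
from collections import Counter

def avg_pages_per_session(hits):
    """Average number of pages viewed per session.

    hits: iterable of (session_id, url) pairs from the access log.
    """
    counts = Counter(session_id for session_id, _ in hits)
    return sum(counts.values()) / len(counts)

# Toy log: 6 hits spread over 3 sessions -> average of 2.0 pages.
hits = [("s1", "/a"), ("s1", "/b"),
        ("s2", "/a"), ("s2", "/b"), ("s2", "/c"),
        ("s3", "/a")]
```

A session that exceeds this average by an order of magnitude is a candidate for the limits discussed next.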
We can allow visitors through or limit their hits by setting thresholds on various parameters: session ID, IP address, or user agent (browser or crawler name).
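Such a limit can be kept per key, where the key is whichever parameter you choose (session ID, IP address, or user agent). A sliding-window sketch, with the limit and window sizes as assumptions to be tuned per project:

```python
import time
from collections import defaultdict

class HitLimiter:
    """Allow at most `limit` hits per key within a sliding time window."""

    def __init__(self, limit: int, window_seconds: float):
        self.limit = limit
        self.window = window_seconds
        self.hits = defaultdict(list)  # key -> list of hit timestamps

    def allow(self, key: str, now: float = None) -> bool:
        """Record a hit for `key` and return whether it is within the limit."""
        now = time.monotonic() if now is None else now
        # Keep only hits that are still inside the window.
        recent = [t for t in self.hits[key] if now - t < self.window]
        recent.append(now)
        self.hits[key] = recent
        return len(recent) <= self.limit
```

The same class works for any of the three parameters; only the key changes (e.g. `"ip:203.0.113.7"` versus `"session:abc123"`).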
We can distinguish human visitors from robots by the number of visits within a certain period, the elapsed time between hits, and actions that a user can perform but that are beyond a robot’s capabilities.
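The elapsed-time signal can be checked directly: if every gap between consecutive hits is shorter than a human could plausibly read a page, the visitor is likely automated. The one-second threshold below is an assumption, not a measured value.

```python
def looks_automated(hit_times, min_gap: float = 1.0) -> bool:
    """Heuristic: True if every inter-hit gap is below `min_gap` seconds.

    hit_times: sorted timestamps of a visitor's hits within a session.
    """
    gaps = [b - a for a, b in zip(hit_times, hit_times[1:])]
    return bool(gaps) and all(g < min_gap for g in gaps)
```

This is only one signal; combining it with hit counts and human-only actions (as the text suggests) reduces false positives such as a user rapidly clicking through a photo gallery.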