Jerry Wayne Odom Jr.

Internet Spider Agents and Bots

too aggressive, too damn hungry

Aggressive Internet spiders (web bots) have become a real pain in the ass for me and anyone else who writes software that handles Internet traffic. Programmers and developers: please control your programs.

They come tearing through machines, sucking down everything they can find, ignoring robots.txt files, and generally throwing network administrators into a panic. They seem to get trapped in my machines and request far more than they actually should, looping through sites over and over again. So if you're writing an Internet agent, for whatever reason, please control how much it grabs from a particular machine and how often.
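One simple way to control how often a bot hits a particular machine is a per-host throttle that enforces a minimum delay between requests to the same host. The sketch below is illustrative, not any particular library's API; the class name and delay value are my own choices (Python shown here, though the same idea works in any language):

```python
import time
from urllib.parse import urlparse

class PoliteThrottle:
    """Enforce a minimum delay between requests to the same host.

    Call wait(url) immediately before each request; it sleeps just long
    enough that the same host is never hit more often than min_delay allows.
    Different hosts are tracked independently.
    """

    def __init__(self, min_delay=2.0):
        self.min_delay = min_delay
        self.last_hit = {}  # host -> monotonic timestamp of last request

    def wait(self, url):
        host = urlparse(url).netloc
        now = time.monotonic()
        last = self.last_hit.get(host)
        if last is not None:
            remaining = self.min_delay - (now - last)
            if remaining > 0:
                time.sleep(remaining)
        self.last_hit[host] = time.monotonic()
```

A bot would call `throttle.wait(url)` before every fetch; a couple of seconds between requests to one host keeps you well clear of the thousands-per-minute behavior complained about above.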

Rules for Internet Spider Agents

  • Obey robots.txt files
  • Be polite: don't run amok through other people's sites, making thousands of requests in a few minutes.
  • Don't set your bot to identify itself as a web browser (Mozilla). We don't want spiders here for a reason, and covering that up with a browser disguise will only work until we discover you requesting 1,000 pages every 10 minutes.
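The first rule above is easy to follow in practice: parse the site's robots.txt before fetching anything and skip disallowed paths. A minimal sketch using Python's standard-library `urllib.robotparser` (the bot name and rules here are made up for illustration; a real bot would fetch the live robots.txt with `set_url()` and `read()`):

```python
from urllib.robotparser import RobotFileParser

# Hypothetical, honest bot name -- the opposite of a Mozilla disguise.
BOT_NAME = "ExampleBot"

rp = RobotFileParser()
# Normally: rp.set_url("http://example.com/robots.txt"); rp.read()
# Here we parse an inline copy so the sketch is self-contained.
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

def may_fetch(url):
    """Return True only if robots.txt permits this bot to fetch the URL."""
    return rp.can_fetch(BOT_NAME, url)
```

Checking `may_fetch()` before every request, and sending `BOT_NAME` as the User-Agent so administrators can identify you, covers the first and third rules in a few lines.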

Basically, have your web bots behave the way you would want others to act on your own sites. I think many programmers simply don't think when they write their bots, whatever the task. When I first began writing bots, I got my systems banned by other administrators plenty of times. If everyone writes more polite bots, we won't have this problem, and I won't have to spend a weekend figuring out how to block dipshit spider v1.0 because it got caught up scraping through one of my machines! Thank you and happy programming.

Useful links

libwww-perl-5.800 - for Perl, this is the best way to write basic but effective agents. Please program responsibly.