  1. #1
    Grandma broke her coccyx! Uncle Rico's Avatar
    Join Date
    May 8th, 2007
    Location
    North Carolina
    Posts
    2,238
    List of Robots To Disallow?
    Is there a list of nasty robots that it would be a good idea to disallow altogether? Obviously, there are a ton of robots out there running around. I would like to find the ones that most sites would not want around.

  3. #3
    Full Member
    Join Date
    October 22nd, 2006
    Posts
    200
    The robots you want to disallow are the ones that don't take any notice of robots.txt.

  4. #4
    ABW Ambassador 2busy's Avatar
    Join Date
    January 17th, 2005
    Location
    Tropical Mountaintop
    Posts
    5,636
    When you decide which ones to disallow, it is better to deny access via .htaccess. I have seen so many lists with no idea of how old, current, or valid any of them are, but there are several posted if you look around. A definitive, current list would be valuable.
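    For clarity: a robots.txt rule only asks a bot to stay away, while an .htaccess rule enforces the block at the server. A minimal sketch of each ("BadBot" is a placeholder name):

    # robots.txt - a polite request; only well-behaved crawlers honor it
    User-agent: BadBot
    Disallow: /

    # .htaccess (Apache) - an enforced refusal; flagged agents get a 403
    SetEnvIfNoCase User-Agent "BadBot" bad_bot
    Order Allow,Deny
    Allow from all
    Deny from env=bad_bot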

  5. #5
    Full Member
    Join Date
    January 18th, 2005
    Posts
    396
    If you are starting your list, put Twiceler.com and sitesell.com (or SBIder) right near the top.

    There are a couple of robots, one from Russia and one from India, that have a really nasty trick: they took over (via a "wonderful, neat toolbar") several bazillion computers into their badbot network. Now they have decided to work over one of my domains and somehow coordinate the attack from MANY IPs. Each IP usually hits me for an hour or so at a rate of several targeted hits per second, and as soon as I block a few, more take over. Unfortunately I can't see any user-agent pattern to block them by name, and to get a good IP block I'd have to knock out half the Internet, including most of my wife's email friends.

  6. #6
    http and a telephoto
    Join Date
    January 18th, 2005
    Location
    NYC
    Posts
    17,708
    There are some older robots.txt files that have been posted here, but an up-to-date one would be useful!
    Deborah Carney
    TeamLoxly.com BookGoodies.com ABCsPlus.com

  7. #7
    general fuq mrbshouse's Avatar
    Join Date
    January 18th, 2005
    Location
    Argieville
    Posts
    1,381
    Be really careful when you edit .htaccess: one space in the wrong place and you're totally offline. You will need to update the .htaccess for each FOLDER, not just the domain. For sitesell, they have since changed their bot name at least once, so I just deny the IP block.

    The way I see it, regardless of the agent name, if I have a server hitting me that isn't G, MSN, or Y, it is suspect, and killing an IP block in the form of "deny from 205.205.34.0/24" might block a lesser bot, but it sure keeps the scrapers out too.
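    For reference, Apache accepts a deny range in several equivalent forms (a sketch using the same example block):

    deny from 205.205.34                     # partial IP: everything in 205.205.34.*
    deny from 205.205.34.0/24                # the same block in CIDR notation
    deny from 205.205.34.0/255.255.255.0     # the same block as network/netmask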



    Here is my running list; I'm sure there are many that should be added to it.

    # Keep the custom 403 page itself open, so blocked bots still
    # get the error page instead of looping on more 403s
    <Files 403.shtml>
    order allow,deny
    allow from all
    </Files>

    # E-mail harvesters and offline rippers
    SetEnvIfNoCase User-Agent "^EmailSiphon" bad_bot
    SetEnvIfNoCase User-Agent "^EmailWolf" bad_bot
    SetEnvIfNoCase User-Agent "^ExtractorPro" bad_bot
    SetEnvIfNoCase User-Agent "^CherryPicker" bad_bot
    SetEnvIfNoCase User-Agent "^NICErsPRO" bad_bot
    SetEnvIfNoCase User-Agent "^Teleport" bad_bot
    SetEnvIfNoCase User-Agent "^EmailCollector" bad_bot

    # Generic and unidentified crawlers
    SetEnvIfNoCase User-Agent "^bot/1.0" bad_bot
    SetEnvIfNoCase User-Agent "^LinkWalker" bad_bot
    SetEnvIfNoCase User-Agent "^Zeus" bad_bot
    SetEnvIfNoCase User-Agent "^cfetch/1.0" bad_bot
    SetEnvIfNoCase User-Agent "^TMCrawler" bad_bot

    # Scripts announcing themselves with the stock Java library agent string
    SetEnvIfNoCase User-Agent "^Java/1.4.1-p3" bad_bot
    SetEnvIfNoCase User-Agent "^Java/1.4.1_04" bad_bot
    SetEnvIfNoCase User-Agent "^Java/1.4.1_01" bad_bot
    SetEnvIfNoCase User-Agent "^Java1.3.1" bad_bot
    SetEnvIfNoCase User-Agent "^Java/1.4.1" bad_bot
    SetEnvIfNoCase User-Agent "^Java/1.6.0" bad_bot
    SetEnvIfNoCase User-Agent "^Java/1.6.0_01-ea" bad_bot

    # Blank or "-" user agents
    SetEnvIfNoCase User-Agent "^ " bad_bot
    SetEnvIfNoCase User-Agent "^-" bad_bot

    # libwww-perl scripts, version by version
    SetEnvIfNoCase User-Agent "^libwww-perl/5.47" bad_bot
    SetEnvIfNoCase User-Agent "^libwww-perl/5.53" bad_bot
    SetEnvIfNoCase User-Agent "^libwww-perl/5.63" bad_bot
    SetEnvIfNoCase User-Agent "^libwww-perl/5.64" bad_bot
    SetEnvIfNoCase User-Agent "^libwww-perl/5.65" bad_bot
    SetEnvIfNoCase User-Agent "^libwww-perl/5.69" bad_bot
    SetEnvIfNoCase User-Agent "^libwww-perl/5.73" bad_bot
    SetEnvIfNoCase User-Agent "^libwww-perl/5.76" bad_bot
    SetEnvIfNoCase User-Agent "^libwww-perl/5.79" bad_bot
    SetEnvIfNoCase User-Agent "^libwww-perl/5.801" bad_bot
    SetEnvIfNoCase User-Agent "^libwww-perl/5.803" bad_bot
    SetEnvIfNoCase User-Agent "^libwww-perl/5.805" bad_bot
    SetEnvIfNoCase User-Agent "^libwww-perl/5.808" bad_bot

    # Assorted named crawlers
    SetEnvIfNoCase User-Agent "^ISC Systems iRc Search 2.1" bad_bot
    SetEnvIfNoCase User-Agent "^sitesell" bad_bot
    SetEnvIfNoCase User-Agent "^ia_archiver" bad_bot
    SetEnvIfNoCase User-Agent "^psbot/0.1" bad_bot
    SetEnvIfNoCase User-Agent "^IRLbot/3.0" bad_bot
    SetEnvIfNoCase User-Agent "^panscient.com" bad_bot
    SetEnvIfNoCase User-Agent "^twiceler/3.0" bad_bot

    # Let everyone through except agents flagged above
    <Limit GET POST>
    Order Allow,Deny
    Allow from all
    Deny from env=bad_bot
    </Limit>

  8. #8
    Comfortably Numb John Powell's Avatar
    Join Date
    October 17th, 2005
    Location
    Bayou Country, LA
    Posts
    3,432
    Quote Originally Posted by mrbshouse
    Be really careful when you edit .htaccess: one space in the wrong place and you're totally offline.
    Man, is that ever true. Late the other night I was doing some IP blocking and put "deny form" instead of "deny from". It was the next day before I discovered that all attempts to access the site just gave a nice 500 server error. Ohwhatagooseiam!
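    One cheap safeguard after any .htaccess edit (a habit, not something from the thread): immediately request a page and read the status line before walking away:

    curl -I http://www.example.com/
    # HTTP/1.1 200 OK - the edit is fine
    # HTTP/1.1 500 Internal Server Error - roll it back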


  9. #9
    ABW Ambassador meadowmufn's Avatar
    Join Date
    January 18th, 2005
    Location
    Seattle
    Posts
    2,587
    Quote Originally Posted by loxly
    There are some older robots.txt files that have been posted here, but an up-to-date one would be useful!
    *sniff sniff* I smell an OFFICIAL thread in the Affiliate Academy section. Haiko, would you want to set one up that folks can add bad bots to when they run across them?
    -Don't criticize anyone til you've walked a mile in their shoes. Then when you do criticize them, you'll be a mile away and have their shoes.
    - Silence is golden. Duct Tape is silver.

  10. #10
    Grandma broke her coccyx! Uncle Rico's Avatar
    Join Date
    May 8th, 2007
    Location
    North Carolina
    Posts
    2,238
    Quote Originally Posted by bumpaw
    Man, is that ever true. Late the other night I was doing some IP blocking and put "deny form" instead of "deny from". It was the next day before I discovered that all attempts to access the site just gave a nice 500 server error. Ohwhatagooseiam!
    Probably a good idea to run your robots.txt file through one of the online robots.txt syntax checkers after an edit.

  11. #11
    Newbie greenehawke's Avatar
    Join Date
    January 26th, 2008
    Location
    Alabama
    Posts
    48
    Whoa. And wow. I never thought I needed to block robots!

    Well, I just learned about spies and cloaking affiliate IDs, so I will keep it in perspective.

    One step at a time.

  12. #12
    Comfortably Numb John Powell's Avatar
    Join Date
    October 17th, 2005
    Location
    Bayou Country, LA
    Posts
    3,432
    Quote Originally Posted by Uncle Rico
    Probably a good idea to run your robots.txt file through one of the online robots.txt syntax checkers after an edit.
    True, but we had drifted into .htaccess, where "deny from" is used.


  13. #13
    Grandma broke her coccyx! Uncle Rico's Avatar
    Join Date
    May 8th, 2007
    Location
    North Carolina
    Posts
    2,238
    Quote Originally Posted by bumpaw
    True, but we had drifted into .htaccess, where "deny from" is used.
    Oh yes, that's right. My mistake.

  14. #14
    ABW Ambassador 2busy's Avatar
    Join Date
    January 17th, 2005
    Location
    Tropical Mountaintop
    Posts
    5,636
    I have many lists that were clipped from various places, but I never got around to consolidating them, partly because they are in different formats using different methods, and many 'bots are repeated over and over. If anyone wants to try their hand at it, here is my collection. I had to split it up; it was too big to upload.
    Attached Files

  15. #15
    What's the word? Rhia7's Avatar
    Join Date
    January 13th, 2006
    Posts
    9,578
    Wow! 2busy, that's quite a number. Is there one command that will disallow all of them, or must each one from the list be enumerated?
    ~Rhia7 -- Remember the 7
    Twitter me
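    For what it's worth, there is no single catch-all command; each pattern needs its own line. The patterns are regular expressions, though, so several names can be folded into one line with alternation. A sketch (the three names are just examples from the list above):

    SetEnvIfNoCase User-Agent "^(EmailSiphon|EmailWolf|ExtractorPro)" bad_bot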

  16. #16
    Comfortably Numb John Powell's Avatar
    Join Date
    October 17th, 2005
    Location
    Bayou Country, LA
    Posts
    3,432
    Quote Originally Posted by meadowmufn
    *sniff sniff* I smell an OFFICIAL thread in the Affiliate Academy section. Haiko, would you want to set one up that folks can add bad bots to when they run across them?
    I like that idea, and we could stick some discussion of automated bot blockers in there, which is what I mostly use.


  17. #17
    ABW Ambassador writerguy's Avatar
    Join Date
    January 17th, 2005
    Location
    Springfield, Missouri, USA
    Posts
    3,248
    Quote Originally Posted by Rhia7
    Wow! 2busy, that's quite a number. Is there one command that will disallow all of them, or must each one from the list be enumerated?
    I second Rhia's comments. Wow!

    So, how would I go about consolidating these two lists and putting them into my .htaccess file?
    Generate more fake news.

  18. #18
    ABW Ambassador 2busy's Avatar
    Join Date
    January 17th, 2005
    Location
    Tropical Mountaintop
    Posts
    5,636
    The bots' names need to be separated from the formatting; then choose the format that is best for your server and add the bots' names back in. It can be done in PSPad by converting to columns or by removing X number of characters from each line. I just haven't done it because I haven't had two straight hours free yet, and (for me anyway) it would be mostly manual. Once the bots' names are separated, they need to be sorted, like you would in Excel, so that duplicates can be removed. That would give you a column of bot names that could be added to. Merge that column with the formatting that works best in your site's environment and voila - or something.
    Many of the bots on that list may not even exist anymore(?), as I have no way of knowing how old some of that information is, or was when I came across it. I just thought it might do more good in here than in a moldy folder that I may never get around to.
    I was hoping some energetic person with the free time could jump in and find a way to do it. If the formatting were all the same, it would have been done before I uploaded it.
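    The consolidation 2busy describes can also be scripted instead of done by hand. A rough sketch in Python, assuming the lists hold either bare bot names or SetEnvIfNoCase lines (the file names are placeholders):

    import re

    names = set()
    for path in ("list1.txt", "list2.txt"):
        with open(path) as f:
            for line in f:
                # pull the quoted agent pattern out of SetEnvIf-style lines;
                # otherwise treat the whole trimmed line as a bot name
                m = re.search(r'User-Agent\s+"([^"]+)"', line)
                token = m.group(1) if m else line.strip()
                if token:
                    names.add(token.lstrip("^"))

    # re-emit the deduplicated names in one chosen format
    for name in sorted(names, key=str.lower):
        print('SetEnvIfNoCase User-Agent "^%s" bad_bot' % name)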

  19. #19
    Comfortably Numb John Powell's Avatar
    Join Date
    October 17th, 2005
    Location
    Bayou Country, LA
    Posts
    3,432
    From what I read, you can experience a page-load slowdown if .htaccess gets too full of entries. Somewhere I saw that Yahoo-hosted sites are allowed 100 blocked IPs.


  20. #20
    Comfortably Numb John Powell's Avatar
    Join Date
    October 17th, 2005
    Location
    Bayou Country, LA
    Posts
    3,432
    Here is the source of the bot blocker I use. I found it on WMW where it was posted back in June 2004.

    It's important to add the PHP file that makes it work to robots.txt a few days prior to going live with it. That keeps it from blocking G, Y, Live, or any other bots that obey robots.txt.

    I added a couple of lines to the bottom of the script to make it email me when an IP is blocked. Those are fun to get and also alert you to what's happening.
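    For the record, the robots.txt entry for the trap would look something like this (the script name is hypothetical; use whatever the trap file is actually called):

    User-agent: *
    Disallow: /bot-trap.php

    Well-behaved crawlers read that and never fetch the file; anything that requests it anyway is ignoring robots.txt and earns its block.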


  21. #21
    general fuq mrbshouse's Avatar
    Join Date
    January 18th, 2005
    Location
    Argieville
    Posts
    1,381
    I added mine to your two lists, 2busy, cut the code, sorted, and here is the list ;-)

    Be sure to go over this list before using it; as I mentioned earlier, even one extra space and you have screwed yourself. I'd also check each name, as I saw a few lesser search engine bot names in here, like Snap, Giga, etc. Heck, I could have even added googlebot, for those of you who just want to feed off the work of others.
    Attached Files

  22. #22
    ABW Ambassador 2busy's Avatar
    Join Date
    January 17th, 2005
    Location
    Tropical Mountaintop
    Posts
    5,636
    That is great, mrbshouse! Anyone can use that list to keep 294 bots out, at least. There are a lot more on the full list though; are all those others now out of service, or did they get lost in the process? The first time I tried to do that, I quit in frustration because all the different formatting made 'sort' impossible.
    I am afraid that by the time a definitive list is available, it may prove to be waaay too long to be useful. By the time the server determines that your visitor isn't one of them, the visitor has given up, gone away, married, had children, and retired.

  23. #23
    general fuq mrbshouse's Avatar
    Join Date
    January 18th, 2005
    Location
    Argieville
    Posts
    1,381
    2busy,

    I pulled your lists together in Excel; find and replace did a bunch of the work, but I had to remove the dups manually, and in some cases there were four listings of the same bot once compiled with my list.

    My list is much shorter and is a specific listing of agent strings I've seen in my logs over about a year's worth of culling.

    I had a question regarding your listings for agents like libwww-perl and Java: is your syntax correct for keeping out all agents regardless of version? The damn Perl crap just keeps rolling up and up on the versions, i.e.:

    Java/ (yours)
    Java/1.4.1 (mine, a specific listing)

    or

    libwww-perl (yours)
    libwww-perl/5.47 (mine)


    Someone was talking about doing a config at root level to keep the bots at bay before they even hit a page/folder, and I could not find the post after searching. Any code gurus know how to put a list like this at root level?
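    On the version question above: SetEnvIf patterns are regular expressions, so a pattern anchored only at the front matches every version that follows it. A sketch (worth testing against your own logs before trusting it):

    SetEnvIfNoCase User-Agent "^Java" bad_bot
    SetEnvIfNoCase User-Agent "^libwww-perl" bad_bot

    The first line catches Java/1.4.1, Java/1.6.0, and any future Java/x.y; the second catches every libwww-perl/5.x. As for root level, the same SetEnvIf and Deny directives can live in the server config (httpd.conf) inside the main <Directory> section, which covers every folder at once without per-directory .htaccess files.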

  24. #24
    ABW Ambassador 2busy's Avatar
    Join Date
    January 17th, 2005
    Location
    Tropical Mountaintop
    Posts
    5,636
    Quote Originally Posted by mrbshouse
    is your syntax correct for keeping out all agents regardless of version?
    I am not currently using all these lists anywhere. I picked the list that had the most bots I have seen in my access logs and worked that into what I am using, but versions? Syntax? Your guess is probably better than mine.
    They are lists I found when I was looking up information about this on several occasions. The syntax is not mine, and I do not know if any bots on those lists are still crawling around, sorry.

  25. #25
    general fuq mrbshouse's Avatar
    Join Date
    January 18th, 2005
    Location
    Argieville
    Posts
    1,381
    2busy,

    The lists were great, no need to say sorry ;-) I'm just looking for more specifics, like how to kill all Java bots in one line.


