  1. #1
    Full Member
    Join Date
    January 18th, 2005
    Posts
    396
    Does this do what I hope in robots.txt?
    I am still fighting weirdbots that change shape, IP and name faster than I can keep up. Some of them do LOOK at robots.txt, and I am looking for a general solution to keep them at a lesser roar. Will a robots.txt file like the one below allow Google and the rest of the good guys in, and tell the BELIEVERS among the bad bots to please go away?



    User-agent: *
    Disallow: /cgi-bin/

    User-agent: Googlebot
    Disallow:

    User-agent: msnbot
    Disallow:

    User-agent: Teoma
    Disallow:

    User-agent: Slurp
    Disallow:

    User-agent: Alexa
    Disallow:

    User-agent: *
    Disallow: /


    Thanks - Charles
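
A file like the one above can be checked before it goes live. The sketch below feeds it to Python's standard-library `urllib.robotparser` and asks whether a listed crawler and an unlisted one may fetch a page. The agent name `SomeUnknownBot`, the test paths, and the `SINGLE_CATCH_ALL` variant are assumptions made for illustration. It only shows how Python's parser reads the file; real crawlers may treat two `User-agent: *` groups differently.

```python
from urllib.robotparser import RobotFileParser

# The file from the post, including both "User-agent: *" groups.
ORIGINAL = """\
User-agent: *
Disallow: /cgi-bin/

User-agent: Googlebot
Disallow:

User-agent: msnbot
Disallow:

User-agent: Teoma
Disallow:

User-agent: Slurp
Disallow:

User-agent: Alexa
Disallow:

User-agent: *
Disallow: /
"""

# Hypothetical variant with a single catch-all group (an assumption about the intent).
SINGLE_CATCH_ALL = """\
User-agent: Googlebot
Disallow:

User-agent: msnbot
Disallow:

User-agent: *
Disallow: /
"""


def allowed(robots_txt: str, agent: str, path: str) -> bool:
    """Return True if `agent` may fetch `path` according to `robots_txt`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)


if __name__ == "__main__":
    for name, text in (("original", ORIGINAL), ("single catch-all", SINGLE_CATCH_ALL)):
        for agent in ("Googlebot", "SomeUnknownBot"):
            print(name, agent, allowed(text, agent, "/index.html"))
```

Python's parser keeps only the first `User-agent: *` group it sees, so under the original file an unlisted bot is still allowed everywhere except /cgi-bin/. With a single catch-all group the unlisted bot is refused while Googlebot is still let in.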

  2. #2
    Affiliate Manager adambha's Avatar
    Join Date
    October 20th, 2006
    Posts
    301
    Quote Originally Posted by micheck
    Will a robots.txt file like the one below allow Google and the rest of the good guys in, and tell the BELIEVERS among the bad bots to please go away?
    Try using .htaccess instead.

    Code:
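    # Tag requests whose User-Agent matches a known bad bot (case-insensitive)
    SetEnvIfNoCase User-Agent .*BadSpider.* bad_bot
    SetEnvIfNoCase User-Agent .*ScraperBot.* bad_bot

    # Apache 2.2-style access control: under "Order Allow,Deny" a Deny match wins
    Order Allow,Deny
    Deny from env=bad_bot
    # Placeholder address (documentation range); substitute the IP to block
    Deny from 192.0.2.10
    Allow from all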
    You can have as many SetEnvIfNoCase lines as you'd like, one for each bot. And you can also deny by IP, as shown.

    Anything you list here will get a 403 Forbidden response when attempting to access any page.

    Robots.txt is for the good guys who 1) look at it and 2) choose to obey it.

    Using .htaccess doesn't give them any choice.
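
For anyone who wants to see the decision the rules above make, here is a minimal Python sketch of the same logic: a case-insensitive match on the User-Agent header, plus an IP check, leading to a 403. It is not how Apache implements it. The function name `status_for`, the bot names and the placeholder address are assumptions for illustration.

```python
import re

# Hypothetical bot names, matched anywhere in the User-Agent, ignoring case.
BAD_BOT_PATTERNS = [re.compile(p, re.IGNORECASE) for p in ("BadSpider", "ScraperBot")]

# Hypothetical placeholder address from a documentation range.
DENIED_IPS = {"192.0.2.10"}


def status_for(user_agent: str, remote_ip: str) -> int:
    """Return 403 for a tagged bot or a denied IP, otherwise 200."""
    if any(p.search(user_agent) for p in BAD_BOT_PATTERNS):
        return 403
    if remote_ip in DENIED_IPS:
        return 403
    return 200
```

The match is on a substring, so a bot that changes only its version number is still caught, while one that changes its whole name is not.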

    Enjoy!

  3. #3
    Full Member
    Join Date
    January 18th, 2005
    Posts
    396
    Thank you for your response!

    I don't use .htaccess because of the performance impact on my servers, but I do use a version of your script in my Apache configuration files. The problem is that the bad bots change their names so fast that I can't keep up with banning them. It works great when they're slow to change.

    Your way of 'stringing the names together' is very interesting - I'll try it

    I was trying to kill the FEW bad bots that DO look at robots.txt

    Charles

  4. #4
    general fuq mrbshouse's Avatar
    Join Date
    January 18th, 2005
    Location
    Argieville
    Posts
    1,381
    Charles,

    Seems this is the same problem you had before. The way you wrote it above, you're disallowing the SE spiders, I think.

    What negative impact does an .htaccess file have on your server? BTW, bad bots don't really care what you put in robots.txt.

    Look at the User-Agent strings and you might see a pattern evolving around the Java bots. It's kinda funny: this is the second time I've seen you post on this issue, and you're getting the same advice...hmmm

    http://forum.abestweb.com/showthread...ighlight=agent
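
One way to look for a pattern in the User-Agent strings is to count the leading token of each agent in the access log. This is a rough sketch that assumes Apache's Combined Log Format, where the User-Agent is the last quoted field. The function name `agent_families` and the sample lines are made up for illustration.

```python
import re
from collections import Counter

# In Combined Log Format each line ends with: "referer" "user-agent"
UA_RE = re.compile(r'"[^"]*" "(?P<ua>[^"]*)"\s*$')


def agent_families(log_lines):
    """Count requests per leading User-Agent token (text before the first '/' or space)."""
    counts = Counter()
    for line in log_lines:
        match = UA_RE.search(line)
        if not match:
            continue  # skip lines that are not in the expected format
        ua = match.group("ua")
        family = re.split(r"[/\s]", ua, maxsplit=1)[0] or "(empty)"
        counts[family.lower()] += 1
    return counts


# Made-up sample lines; addresses come from documentation ranges.
SAMPLE = [
    '198.51.100.7 - - [10/Oct/2006:13:55:36 -0700] "GET / HTTP/1.1" 200 2326 "-" "Java/1.5.0_06"',
    '198.51.100.8 - - [10/Oct/2006:13:56:01 -0700] "GET /a HTTP/1.1" 200 512 "-" "Java/1.4.1_04"',
    '203.0.113.5 - - [10/Oct/2006:13:57:12 -0700] "GET /b HTTP/1.1" 200 812 "-" "Mozilla/4.0 (compatible; MSIE 6.0)"',
]

if __name__ == "__main__":
    for family, hits in agent_families(SAMPLE).most_common():
        print(family, hits)
```

If one family such as `java` dominates the counts, a single pattern on that token may cover many name changes at once.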

  5. #5
    Full Member
    Join Date
    January 18th, 2005
    Posts
    396
    Well - yes - sort of. Previously I was looking for a general blocking system, and I ended up using the ones you all suggested, in my Apache config file rather than in .htaccess files. It works fine as long as the bad bots keep using the same name (or name pattern): I put in their name and they get 403s, just like they're supposed to.

    Now I am hitting a bunch that change their name pattern and/or their IP, sometimes hourly, so I am looking for other ways to block them.

    Surprisingly, some semi-badbots (10% or so) do actually read the robots.txt and then go away.

    RE: performance knocks from .htaccess - I haven't measured it. I'm relying on some of the Apache folks who say that, to avoid the extra lookups that come with .htaccess, you should put your configuration in Apache's httpd.conf file if you are allowed to.

    I reread Google's information on robots.txt, and their explanation, as I understood it, was that they read from the top and, once they find the lines that pertain to them, they don't read any more of that file. So since today is kind of slow, I put the new robots.txt file into production. It hasn't had any bad effect on Google... and (knock on wood) my badbot problem is fairly low - go figure.

    Charles

  6. #6
    Full Member
    Join Date
    January 18th, 2005
    Posts
    396
    A second, related question: if I successfully block a badbot 'signature' like, say, FunWebProducts, and it also blocks an IE customer who has been captured by FunWebProducts, is that OK? My understanding is that even if they click my link, Fun... is going to overwrite me and get the credit. Thoughts???

    Charles

  7. #7
    general fuq mrbshouse's Avatar
    Join Date
    January 18th, 2005
    Location
    Argieville
    Posts
    1,381
    Charles,

    I read with open eyes, and now I need to smarten up about the config stuff. One main advantage off the top is that you only make the change in one place, if I'm guessing correctly (not per .htaccess folder).

    You might try doing a search for honeypot+bot. I saw some guys at WMW hooking that up: as the bot hits the honey page, it is added to the bad bot list ;-)
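
The honeypot idea can be sketched in a few lines of Python. This is only an outline of the bookkeeping, not what the WMW folks built. The class name `HoneypotBlocklist` and the trap path `/trap/` are assumptions: the idea is that the trap URL is linked where people won't click it and is disallowed in robots.txt, so only bots that ignore the rules ever request it.

```python
class HoneypotBlocklist:
    """Track clients that request a trap URL and refuse them afterwards."""

    def __init__(self, trap_prefix="/trap/"):
        # Hypothetical path; well-behaved crawlers and human visitors
        # should never have a reason to request anything under it.
        self.trap_prefix = trap_prefix
        self.blocked_ips = set()

    def handle(self, remote_ip, path):
        """Return the HTTP status this request would get."""
        if path.startswith(self.trap_prefix):
            # Any hit on the honey page adds the client to the bad bot list.
            self.blocked_ips.add(remote_ip)
        return 403 if remote_ip in self.blocked_ips else 200


if __name__ == "__main__":
    blocklist = HoneypotBlocklist()
    print(blocklist.handle("198.51.100.7", "/index.html"))
    print(blocklist.handle("198.51.100.7", "/trap/page.html"))
    print(blocklist.handle("198.51.100.7", "/index.html"))
```

Because the list is keyed on IP rather than on name, it keeps working when a bot changes its User-Agent, though not when it also changes its address.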

    Hey...pass the salt for my crow would you

  8. #8
    Full Member
    Join Date
    January 18th, 2005
    Posts
    396
    Soon as I finish with the salt for all of the crow I've got piled up - I've found pepper helps also 8-)

  9. #9
    general fuq mrbshouse's Avatar
    Join Date
    January 18th, 2005
    Location
    Argieville
    Posts
    1,381
    pass the Tabasco :-)
