  1. #1
    ABW Ambassador writerguy's Avatar
    Join Date
    January 17th, 2005
    Location
    Springfield, Missouri, USA
    Posts
    3,248
    Struggling to kill the "Robots.txt From Hell"
    Hey, I may have messed things up on one of my sites trying to use robots.txt while I was building it. But maybe not.

    I started putting this site together three or four days ago. I put up a robots.txt file in the domain root to disallow all bots for the whole site.

    I finished the site earlier today (about 6 hours ago) and deleted the robots.txt.

    About half an hour ago, I submitted a sitemap.xml at Google Webmaster Tools. Or, I should say, I TRIED to submit it. No matter what I do, I get this message from Google when I hit the "Submit" button:

    "URL restricted by robots.txt
    We encountered an error while trying to access your Sitemap. Please ensure your Sitemap follows our guidelines and can be accessed at the location you provided and then resubmit."

    How long does it take before Google will quit obeying the robots.txt that I have deleted from the site?? I put up a new robots.txt just a half hour ago or so that has "User-agent: Googlebot" and "Allow: /" -- will that help?

    Any suggestions?
    Generate more fake news.

  2. #2
    notary sojac Herb ԿԬ's Avatar
    Join Date
    January 18th, 2005
    Location
    Central/Western NY State
    Posts
    7,741
    you can also put a line in the robots.txt file to tell the SEs where to find the sitemap:

    Sitemap: http://www.yoursite.com/sitemap.xml

  3. #3
    ABW Ambassador writerguy's Avatar
    Join Date
    January 17th, 2005
    Location
    Springfield, Missouri, USA
    Posts
    3,248
    Quote Originally Posted by Herb ԿԬ
    you can also put a line in the robots.txt file to tell the SEs where to find the sitemap:

    Sitemap: http://www.yoursite.com/sitemap.xml
    Hey, I'll do that.

    So -- I can assume that sooner or later Google will wake up and figure out the change?

  4. #4
    ABW Ambassador 2busy's Avatar
    Join Date
    January 17th, 2005
    Location
    Tropical Mountaintop
    Posts
    5,636
    There's a robots.txt analyzer in your GWT panel to help you see what they're seeing - and troubleshoot it. Download the one you just uploaded to be certain that your new robots.txt overwrote the old one.
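
    Another way to double-check what a given robots.txt permits, without waiting on Google, is Python's standard urllib.robotparser module. This is only a minimal sketch: the thread says the old file disallowed all bots but never shows it, so the "User-agent: * / Disallow: /" content below is an assumption, the may_fetch helper is a hypothetical name, and the sitemap URL is the placeholder from earlier in the thread. It reports what Python's parser concludes from the text, not what Google's cached copy contains.

```python
from urllib.robotparser import RobotFileParser


def may_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# ASSUMPTION: the old blocking file is not quoted in the thread.
OLD_ROBOTS = "User-agent: *\nDisallow: /\n"
# The replacement file as described in post #1.
NEW_ROBOTS = "User-agent: Googlebot\nAllow: /\n"
# Placeholder URL, not a real site.
SITEMAP_URL = "http://www.yoursite.com/sitemap.xml"

if __name__ == "__main__":
    print("old file allows sitemap:", may_fetch(OLD_ROBOTS, "Googlebot", SITEMAP_URL))
    print("new file allows sitemap:", may_fetch(NEW_ROBOTS, "Googlebot", SITEMAP_URL))
```

    If the new file checks out locally but Google still reports the URL as restricted, that points to a stale copy on Google's side rather than a problem with the file itself.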

  5. #5
    ABW Ambassador Boom or Bust's Avatar
    Join Date
    February 3rd, 2008
    Posts
    3,955
    Hey, can I join this meeting of the old-timers club? [Added]Oops, got 2busy!

    Bet ya somewhere stuck deep in a Google database there's an entry that says "Gary has a robots.txt file on his site disallowing us from crawling, so don't bother". I'm certain that will dissipate before too long but I wouldn't even guess how long...




  6. #6
    ABW Ambassador writerguy's Avatar
    Join Date
    January 17th, 2005
    Location
    Springfield, Missouri, USA
    Posts
    3,248
    Okay. I used the robots.txt analysis tool in GWT. It showed I had a blank robots.txt.

    I clicked on the URL to that "blank" robots.txt as it showed in that same GWT analysis screen -- and it showed the exact content I cited in the post above -- allowing Googlebot to the entire site.

    I did see some mention, in a Google discussion group about robots.txt, that the file might stay in a Google cache somewhere for 24 hours, but the person suggesting that wasn't sure.

    The really irritating thing is that I never decided to use the robots.txt until late last evening, to block the site until I could get it built. Then I deleted the robots.txt file early this afternoon when I had the site ready.

    In that small space of time, the Google gods found the robots.txt in a pretty big hurry -- and now they can't remember to forget it anywhere near as quickly?? Grrrrrr.

  7. #7
    ABW Ambassador Boom or Bust's Avatar
    Join Date
    February 3rd, 2008
    Posts
    3,955
    A technique that I use is to create a dummy home page for the SEs that goes nowhere. Temporarily name your new home page something that Google would never guess and test by addressing the page specifically. Then when you've completed your work, remove the dummy page and rename your new one.




  8. #8
    Full Member Code Monkey's Avatar
    Join Date
    June 11th, 2007
    Posts
    337
    Google reads my robots.txt file about once a week.

  9. #9
    Full Member Tech Evangelist's Avatar
    Join Date
    March 16th, 2005
    Location
    Mesa, AZ
    Posts
    374
    Hi writerguy

    The first thing to understand is that spiders do not read the robots.txt file every time they visit. Google only reads it "periodically", which means after a random number of spider visits. You are getting the message because Google is responding to their cached version of the robots.txt file. That cached version will not be cleared until Google finds a new robots.txt file.

    The best thing to do is to put a valid robots.txt file in the root directory. Every site should have one; otherwise you will generate 404 errors every time a spider requests it.

    Just put a basic robots.txt file in place that doesn't disallow anything, or just disallows the directories you do not want spiders to read. It would also be a good idea to add the sitemap link as Herb suggested.


    User-agent: *
    Disallow:

    Sitemap: http://www.yourdomain.com/sitemap.xml
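
    A file like the one above can be sanity-checked locally before it is uploaded. A minimal sketch, again using Python's standard urllib.robotparser (site_maps() needs Python 3.8 or later); the summarize helper and the test page path are hypothetical, and the domain is the same placeholder as in the example.

```python
from urllib.robotparser import RobotFileParser

# The basic allow-everything file suggested above (placeholder domain).
ROBOTS_TXT = """\
User-agent: *
Disallow:

Sitemap: http://www.yourdomain.com/sitemap.xml
"""


def summarize(robots_txt: str, user_agent: str, url: str):
    """Return (allowed, sitemaps) for the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    allowed = parser.can_fetch(user_agent, url)
    # site_maps() returns None when no Sitemap lines are present.
    sitemaps = parser.site_maps() or []
    return allowed, sitemaps


if __name__ == "__main__":
    # Hypothetical page path, used only to exercise the rules.
    page = "http://www.yourdomain.com/some-page.html"
    allowed, sitemaps = summarize(ROBOTS_TXT, "Googlebot", page)
    print("Googlebot allowed:", allowed)
    print("Sitemaps listed:", sitemaps)
```

    An empty Disallow: line is read as "nothing is disallowed", so the check should report the page as fetchable and list the one sitemap URL.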


    If this is a new site, it is hard to tell when GoogleBot will read it again. I suspect the issue will be resolved in less than a week or two.

    Get a few links pointed to the site from high traffic sites. That will prompt GoogleBot to visit more often.
    There's good, fast and cheap. Pick any two.
    [url=http://www.topranksolutions.com]Phoenix SEO[/url] :: [url=http://www.tech-evangelist.com/category/affiliate-marketing/]Affiliate Marketing Tutorials[/url]

  10. #10
    ABW Ambassador 2busy's Avatar
    Join Date
    January 17th, 2005
    Location
    Tropical Mountaintop
    Posts
    5,636
    Tech Evangelist is right, though, Gary. Google understands
    Disallow:
    (meaning disallow nothing) but doesn't do as well with
    Allow: /

  11. #11
    Full Member Tech Evangelist's Avatar
    Join Date
    March 16th, 2005
    Location
    Mesa, AZ
    Posts
    374
    I have seen the same situation several times.

    I have also seen Google and Yahoo spiders start to index a site while it is being installed and before any links to the site exist. I don't know how they become aware of a new site, but I suspect that they are tapped into DNS changes. If it is going to take a few days to get a site and its content set up, I sometimes block spider access and then just plan on waiting a few weeks for the spiders to come a calling again after I change the robots.txt file.

    It is better if you can get everything set up all at once and not block the spiders, but with some sites that is not feasible.
