  1. #1
    Comfortably Numb John Powell's Avatar
    Join Date
    October 17th, 2005
    Location
    Bayou Country, LA
    Posts
    3,432
    Google and robots.txt character limit
    I was at Google sitemap looking at their reports for one of my sites. I noticed they had "Test your robots.txt file and try out changes" under the robots.txt tab.

    When I tried it, it came up with an error that said "Must be at most 2000 characters". Now that was a surprise, and I can't seem to find more info anywhere.
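    For anyone who wants to check a file before pasting it into the tester, here is a minimal sketch. It assumes the limit is a plain character count and uses the 2000 figure from the error message; whether the live crawler applies the same limit is not known. The sample file and its paths are made up.

```python
# Limit taken from the tester's error message quoted above. Whether it
# also applies to the live crawler is unknown (assumption: it is a
# plain character count).
TESTER_CHAR_LIMIT = 2000


def chars_over_limit(robots_txt: str, limit: int = TESTER_CHAR_LIMIT) -> int:
    """Return how many characters the text exceeds the limit by (0 if it fits)."""
    return max(0, len(robots_txt) - limit)


# Hypothetical file: 150 Disallow lines for made-up paths.
sample = "User-agent: *\n" + "".join(
    f"Disallow: /private/page-{i}.php\n" for i in range(150)
)
print(len(sample), chars_over_limit(sample))
```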


  2. #2
    Lite On The Do, Heavy On The Nuts Donuts's Avatar
    Join Date
    January 18th, 2005
    Location
    Winter Park, FL
    Posts
    6,930
    I haven't heard of the specific limit you ran into, but I don't think it has much to do with the SEO aspects of crawling. More likely it's about having their crawlers avoid spider traps (a scripted never-ending file or a slow-replying robots.txt file, designed to snarl spiders). I think they've decided, for the robot's speed and safety (and to avoid inadvertently causing DDoS attacks when a spider chokes), to choose a maximum file size to spider.

  3. #3
    Lite On The Do, Heavy On The Nuts Donuts's Avatar
    Join Date
    January 18th, 2005
    Location
    Winter Park, FL
    Posts
    6,930
    Some people use CGI scripts to generate their robots.txt file, so that they can detect which spider is asking for it and then morph the file (as the webmaster sees fit) for that particular one. This can keep your file small, but if I were an algorithm (that could see it morph), I'd seriously wonder about the "why" behind your dynamic file activity.
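    A rough sketch of the idea, not anyone's actual script: pick a rule set based on the User-Agent header and send back only the lines meant for that spider. The crawler tokens, paths, and the function name `robots_txt_for` are all made up for illustration.

```python
# Hypothetical per-crawler rule sets; tokens and paths are illustrative only.
RULES_BY_CRAWLER = {
    "googlebot": ["/search-results/"],
    "slurp": ["/search-results/", "/archive/"],
}
# Served to any crawler that is not recognized above.
DEFAULT_RULES = ["/search-results/", "/archive/", "/cgi-bin/"]


def robots_txt_for(user_agent_header: str) -> str:
    """Build a robots.txt body for the crawler named in the User-Agent header."""
    ua = user_agent_header.lower()
    for token, paths in RULES_BY_CRAWLER.items():
        if token in ua:
            break  # first matching crawler token wins
    else:
        token, paths = "*", DEFAULT_RULES
    lines = [f"User-agent: {token}"] + [f"Disallow: {p}" for p in paths]
    return "\n".join(lines) + "\n"


print(robots_txt_for("Mozilla/5.0 (compatible; Googlebot/2.1)"))
```

    Each spider only ever sees its own handful of lines, which is how the file stays short. It is also exactly the kind of cloaking an engine could notice.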

  4. #4
    Comfortably Numb John Powell's Avatar
    Join Date
    October 17th, 2005
    Location
    Bayou Country, LA
    Posts
    3,432
    Donuts, some of that heavy technology you speak of must be behind the mysterious setup they currently have over at WmW. They have cloaked their robots.txt and have "honeypots" to trap bad bots. It's really interesting. You might want to have a look.


  5. #5
    Fear and Arrogance jrrl's Avatar
    Join Date
    January 18th, 2005
    Location
    Pittsburgh
    Posts
    485
    Quote Originally Posted by bumpaw
    I was at Google sitemap looking at their reports for one of my sites. I noticed they had "Test your robots.txt file and try out changes" under the robots.txt tab.

    When I tried it, it came up with an error that said "Must be at most 2000 characters". Now that was a surprise, and I can't seem to find more info anywhere.
    Not sure what Google's deal is, but I can assure you when we came up with the robots.txt thing waaaaaaay back when (at a bar in Darmstadt, Germany), nobody was talking about a maximum size. No offense to the big G (please, oh great Google, let manna and position rain down upon my links), but this smacks of lazy programming.

    -John.
    There's a reason armies wear uniforms even though it makes them easier to spot. Sometimes that's what you want. Uniforms suggest organization, power, and numbers. These, in turn, inspire fear. And, as any good operative knows, there is no more effective weapon than fear.

    Hosting Comparison - HostScope - jrrl.com

  6. #6
    Comfortably Numb John Powell's Avatar
    Join Date
    October 17th, 2005
    Location
    Bayou Country, LA
    Posts
    3,432
    Googlebot is now plowing through some files that I thought I had blocked in robots.txt. Maybe someone can double check this for me.

    Given: mydomain.com/stuff/detail-stuff.php

    Does this block detail-stuff.php?

    User-agent: *
    Disallow: /detail-stuff.php


    Or do I need to show it as:

    User-agent: *
    Disallow: /stuff/detail-stuff.php


    I've been using the first example, and Googlebot is grabbing /stuff/detail-stuff.php


  7. #7
    Lite On The Do, Heavy On The Nuts Donuts's Avatar
    Join Date
    January 18th, 2005
    Location
    Winter Park, FL
    Posts
    6,930
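    Use the second. A Disallow value is matched against the start of the URL path, counted from the site root, so /detail-stuff.php only blocks a file sitting at the root and never matches /stuff/detail-stuff.php.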
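    If you want to double check a rule yourself, here is a small sketch using Python's standard-library `urllib.robotparser`. It is not Googlebot's own parser, so treat it as an approximation of the basic prefix matching; the helper name `is_allowed` is made up, and the URL is the one from the question.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(disallow_path: str, url: str, agent: str = "Googlebot") -> bool:
    """Check a URL against a one-rule robots.txt that applies to all agents."""
    parser = RobotFileParser()
    # parse() accepts the robots.txt content as an iterable of lines.
    parser.parse(["User-agent: *", f"Disallow: {disallow_path}"])
    return parser.can_fetch(agent, url)


url = "http://mydomain.com/stuff/detail-stuff.php"
print(is_allowed("/detail-stuff.php", url))        # first version: True (not blocked)
print(is_allowed("/stuff/detail-stuff.php", url))  # second version: False (blocked)
```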
