  1. #1
    Member
    Join Date
    January 18th, 2005
    Location
    Las Vegas
    Posts
    61
I have been using a robots.txt file in all my sites that looks like this:

    User-agent: *
    Disallow: /cgi-bin/
    Disallow: /logs/

But when I started looking at some of my logs, it appears that sites using the store templates (which are built almost entirely around cgi calls to amazon.pl) aren't being spidered correctly.

    Am I blocking the search engines from spidering my sites? What concerns me is that I might be blocking links like this:

    http://www.priceviews.com/cgi-bin/am...wse&mode=books

I am a little confused about the way the search engines spider sites like these.

  2. #2
    ABW Ambassador
    Join Date
    January 18th, 2005
    Location
    United Kingdom
    Posts
    1,797
    Yes, you are excluding the cgi-bin in your robots.txt file, so they are not spidering those pages.

    Change your robots.txt to this:

User-agent: *
Disallow: /logs/

    And the next time the spiders come by, they will spider all those pages in your cgi-bin.

    Search Engine Positioning - 1 Design 4 Life

  3. #3
ABW Ambassador cusimano
    Join Date
    January 18th, 2005
    Location
    Toronto, Canada
    Posts
    1,369
    If you use those "disallow" lines in your robots.txt file then spiders will not spider amazon.pl since it is in your /cgi-bin/ directory.

    Without those "disallow" lines, spiders (such as Google) will spider your cgi-bin links. If you want spiders to spider your cgi-bin links, remove the "Disallow: /cgi-bin/" line. Note that this would allow spiders to spider all your other cgi-bin links too.

If you want some things in your /cgi-bin/ to be disallowed, try moving them into a subdirectory of /cgi-bin/ such as /cgi-bin/other/ (use whatever subdirectory name you like) and then use "Disallow: /cgi-bin/other/" -- this will still allow the rest of "/cgi-bin/" to be spiderable.
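For example (the /cgi-bin/other/ name is just a placeholder), the whole robots.txt would then look like this:

User-agent: *
Disallow: /cgi-bin/other/
Disallow: /logs/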

    Yours truly,
    Cusimano.Com Corporation
    per: David Cusimano

  4. #4
    "An Englishman In New York" TJ's Avatar
    Join Date
    January 18th, 2005
    Posts
    3,282
David, were there not some issues where servers crashed if you allowed the SEs to spider the cgi-bin folder?

I seem to remember an issue... was it Crazy-Guy that posted about this in the past?

  5. #5
    Member
    Join Date
    January 18th, 2005
    Location
    Las Vegas
    Posts
    61
Originally posted by cusimano:
If you use those "disallow" lines in your robots.txt file then spiders will not spider amazon.pl since it is in your /cgi-bin/ directory.

Without those "disallow" lines, spiders (such as Google) will spider your cgi-bin links. If you want spiders to spider your cgi-bin links, remove the "Disallow: /cgi-bin/" line. Note that this would allow spiders to spider all your other cgi-bin links too.

If you want some things in your /cgi-bin/ to be disallowed, try moving them into a subdirectory of /cgi-bin/ such as /cgi-bin/other/ (use whatever subdirectory name you like) and then use "Disallow: /cgi-bin/other/" -- this will still allow the rest of "/cgi-bin/" to be spiderable.

    So let me just clarify this. If I remove that line from my robots.txt, then a link like this will get spidered properly:

    http://www.priceviews.com/cgi-bin/am...wse&mode=books

Since my cgi-bin has no other executable links to it, the actual content of my cgi-bin shouldn't be spidered, right? I mean, things like the store templates and cache files aren't linked to anything directly, so they should be ignored, right? Many thanks, David.

  6. #6
ABW Ambassador cusimano
    Join Date
    January 18th, 2005
    Location
    Toronto, Canada
    Posts
    1,369
Originally posted by TJ:
David, were there not some issues where servers crashed if you allowed the SEs to spider the cgi-bin folder?

To prevent server overload (typically caused by excessive spidering), amazon.pl (v2.11.25 and higher) monitors the server's 1-minute load average (as reported by the linux/unix uptime command). The maxload configuration variable acts like a fuse and specifies the maximum load allowed before amazon.pl shuts down and simply reports a "Server is too busy" error message. The default maxload is 50. (Note: maxload only works on linux/unix servers.)
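In rough outline, that fuse works like the sketch below. This is only an illustration of the mechanism described above, not the actual amazon.pl source, and it assumes a Linux/Unix host where the 1-minute load average can be read from /proc/loadavg (the same figure the uptime command reports):

#!/usr/bin/perl
# Sketch of a maxload-style fuse for a CGI script (illustrative only).
use strict;
use warnings;

my $maxload = 50;    # maximum allowed load; 50 is the default mentioned above

# Read the 1-minute load average: the first field of /proc/loadavg.
sub load_average {
    open my $fh, '<', '/proc/loadavg' or return 0;    # fail open if unreadable
    my ($one_minute) = split ' ', scalar <$fh>;
    close $fh;
    return $one_minute;
}

# If the box is overloaded, refuse the request instead of adding to the pile.
if (load_average() > $maxload) {
    print "Content-type: text/html\n\n";
    print "Server is too busy\n";
    exit 0;
}

# ...normal request handling would continue here...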

    Yours truly,
    Cusimano.Com Corporation
    per: David Cusimano

  7. #7
ABW Ambassador cusimano
    Join Date
    January 18th, 2005
    Location
    Toronto, Canada
    Posts
    1,369
Originally posted by beggers:
So let me just clarify this. If I remove that line from my robots.txt, then a link like this will get spidered properly:

http://www.priceviews.com/cgi-bin/am...wse&mode=books

Since my cgi-bin has no other executable links to it, the actual content of my cgi-bin shouldn't be spidered, right? I mean, things like the store templates and cache files aren't linked to anything directly, so they should be ignored, right? Many thanks, David.

Remove that "disallow" line to allow spidering of amazon.pl links (and any other /cgi-bin/ link you may add in the future).

    To prevent spidering of HTML files in your /cgi-bin/ directory, you might need to create the following .htaccess file in your /cgi-bin/ directory:

<Files ~ "\.(htm|html|shtml|tmp)$">
Deny from all
</Files>

    You may or may not need this depending upon how your web server is configured to handle /cgi-bin/ requests. Try accessing your amazon.html results template file via www.mydomain.com/cgi-bin/amazon.html (replace the domain name with yours) and see what happens. Adjust this URL if your results template is called something else or is in a subdirectory.
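If that "Files ~" regex form ever gives your server trouble, the same rule can be written with FilesMatch, which behaves identically here (just an equivalent sketch, not something your setup necessarily requires):

<FilesMatch "\.(htm|html|shtml|tmp)$">
Deny from all
</FilesMatch>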

    Yours truly,
    Cusimano.Com Corporation
    per: David Cusimano

  8. #8
    Member
    Join Date
    January 18th, 2005
    Location
    Las Vegas
    Posts
    61
Originally posted by cusimano:
Remove that "disallow" line to allow spidering of amazon.pl links (and any other /cgi-bin/ link you may add in the future).

To prevent spidering of HTML files in your /cgi-bin/ directory, you might need to create the following .htaccess file in your /cgi-bin/ directory:

<Files ~ "\.(htm|html|shtml|tmp)$">
Deny from all
</Files>

You may or may not need this depending upon how your web server is configured to handle /cgi-bin/ requests. Try accessing your amazon.html results template file via http://www.mydomain.com/cgi-bin/amazon.html (replace the domain name with yours) and see what happens.

    I get "Internal Server Error" when I try that.

  9. #9
ABW Ambassador cusimano
    Join Date
    January 18th, 2005
    Location
    Toronto, Canada
    Posts
    1,369
Originally posted by beggers:
I get "Internal Server Error" when I try that.

Look to see what error shows up in the server's error_log file when you try the access. The error_log should have more information than what is displayed in the web browser. Ask your hosting company about this issue. Perhaps something in the server's configuration needs to be set to allow this command in an .htaccess file in your cgi-bin directory.
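One common cause worth asking them about (an assumption about a typical Apache setup, not a diagnosis of your particular host): access-control directives such as "Deny" are only honored in an .htaccess file if the main server config grants AllowOverride Limit for that directory, and many hosts disallow overrides inside cgi-bin entirely. Something along these lines in httpd.conf would permit it (the directory path is a placeholder):

<Directory "/path/to/your/cgi-bin">
AllowOverride Limit
</Directory>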

    Yours truly,
    Cusimano.Com Corporation
    per: David Cusimano

  10. #10
    Member
    Join Date
    January 18th, 2005
    Posts
    84
I am using an NT server and it kept going down when using the amazon.pl script. Is there any way to prevent server overload on NT servers like the maxload check you have for Linux/Unix servers?

  11. #11
ABW Ambassador cusimano
    Join Date
    January 18th, 2005
    Location
    Toronto, Canada
    Posts
    1,369
The load checking (set via the maxload configuration variable) is designed to work only on Linux/Unix servers.

What do you mean by "going down"? Does the server completely crash and have to be rebooted? Or do you mean that the server just slows down to a crawl?

What version of the NT server do you have? Does it have the latest service packs? What version of IIS is running? What version of ActivePerl is running? (I assume ActivePerl is what your server uses to run Perl.)

    Yours truly,
    Cusimano.Com Corporation
    per: David Cusimano

  12. #12
    Member
    Join Date
    January 18th, 2005
    Posts
    84
Something going on in the cgi-bin/amazon-cache directory is eating up CPU. CPU usage goes up to 100%. Is it due to compression?

Using Windows 2000 Advanced Server with .Net enabled and the latest patches, and the latest version of ActivePerl.

  13. #13
ABW Ambassador cusimano
    Join Date
    January 18th, 2005
    Location
    Toronto, Canada
    Posts
    1,369
I do not know why the CPU usage on your NT server goes up to 100% (on a Linux/Unix server, the load from amazon.pl XML is typically very low). How often is this occurring? Does it happen as soon as one user accesses amazon.pl on your server, or when a lot of accesses happen simultaneously (such as when a search engine spider visits)?

    Try reducing the size of the caches by setting the following two configuration variables in your amazon.ini:

    cache.size 1mb
    imgCache.size 250kb

    You can try turning off compression by setting:

    cache.compress no
    imgCache.compress no

    If you have any further details, send them using the Support form.

    Yours truly,
    Cusimano.Com Corporation
    per: David Cusimano

