  1. #1
    Member
    Join Date
    January 18th, 2005
    Posts
    56
    I'm trying to understand how a search engine spider would see the amazon.pl results. So here are some questions:

    1. I realize bots don't read script, so I should use HTML links. The spider would follow those, but exactly where _are_ they? I "disallowed" my cgi-bin in my robots.txt file, but if that's where the bot thinks the html results pages are, maybe I _shouldn't_ disallow it?

    2. Since many people may be generating the same html results page at once on my server, is there a danger Google will see those pages at some moment in time (wherever they are) as identical and flag them as spam?

    3. Or, in the best of all worlds, would Google see all html-generated results pages as separate and give my site credit for having lots and lots of pages linking to my home page (probably helping Page Rank)?

    Thanks,
    Tom

  2. #2
    ABW Ambassador cusimano's Avatar
    Join Date
    January 18th, 2005
    Location
    Toronto, Canada
    Posts
    1,369
    Based on what I see in my server logs and what other users have reported and what appears when I search at google.com, Googlebot does follow amazon.pl links (unless robots.txt specifies disallow cgi-bin).

    There are no HTML pages created when amazon.pl is run. The web page address (URL) of the results page shown to the user (or indexed by a spider) is the full URL, including the parameters after the question mark. For example, amazon.pl?asinsearch=B00006LHYC is different from amazon.pl?asinsearch=B00005YW4H, and spiders see them as two different pages.
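
    To make the "same script, different pages" point concrete, here is a minimal Python sketch. The host www.example.com and the page_key helper are made up for illustration; only the script name and the two asinsearch values come from the post above. It is not a description of how Googlebot works internally, just a way to see that the query string is part of a page's identity.

```python
from urllib.parse import urlsplit, parse_qs

# Hypothetical host; only amazon.pl and the asinsearch values come from the thread.
BASE = "http://www.example.com/cgi-bin/amazon.pl"

def page_key(url):
    """Return (path, sorted query pairs) as a simple identity for a page."""
    parts = urlsplit(url)
    query = parse_qs(parts.query)
    return parts.path, tuple(sorted((k, tuple(v)) for k, v in query.items()))

url_a = BASE + "?asinsearch=B00006LHYC"
url_b = BASE + "?asinsearch=B00005YW4H"

# Both URLs hit the same script on disk...
same_script = urlsplit(url_a).path == urlsplit(url_b).path
# ...but the differing query strings make them distinct pages.
distinct_pages = page_key(url_a) != page_key(url_b)

print(same_script, distinct_pages)
```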

    Typically, you should allow the cgi-bin directory to be read by spiders, that is, do _not_ put "disallow cgi-bin" in your robots.txt file.

    With amazon.pl, you do effectively have "lots and lots" of pages on your website.

    Yours truly,
    Cusimano.Com Corporation
    per: David Cusimano

  3. #3
    Member
    Join Date
    January 18th, 2005
    Posts
    56
    Thanks, David...just so I'm clear...EACH click on the SAME html amazon.pl link generates a unique URL (like in your example, B00006LHYC as opposed to B00005YW4H) ?

    Sorry to belabor the point (and probably your patience) but I do worry about the possibility of being flagged for spamming with identical pages. Sounds like that's not going to happen because there's not an actual page generated (a new concept to me).

    Thanks, and I'll remove the "disallow" from my robots.txt as you suggest.

    I really appreciate your having this discussion forum -- it's great customer service!

    Tom

  4. #4
    Member
    Join Date
    January 18th, 2005
    Posts
    56
    Forgot my other question:

    If the spider only sees a URL, and there's no actual html page generated by amazon.pl, then page rank isn't helped because the spider doesn't see any backward links (to the home page)...is that a correct interpretation?

    Thanks again,
    Tom

  5. #5
    Newbie
    Join Date
    January 18th, 2005
    Posts
    43
    I found a neat software program that will call every page on your web server and leave it in your cache directory. I was using it to speed up results. Is it possible (me = newbie) that the spider would include all those cached results as hits in its searches?

  6. #6
    ABW Ambassador cusimano's Avatar
    Join Date
    January 18th, 2005
    Location
    Toronto, Canada
    Posts
    1,369
    Search engines look at the entire URL (that includes the part following the question mark). Thus http://www.bime.com/cgi-bin/amazon.p...rch=B00006LHYC and http://www.bime.com/cgi-bin/amazon.p...rch=B00005YW4H are completely different pages, since the two URLs are different. For example, http://www.bime.com/cgi-bin/amazon.pl shows 25 products (numbered 1 through 25) and has 25 different amazon.pl links (each with a different value of the asinsearch= parameter).

    To have your results pages link back to your home page, edit your results template (amazon.html) and add a link that goes to your home page. If you are using any of the supplied templates, they all include a "Home" link that links to / (i.e.: to the home page of your website).

    I would recommend against using a program (such as a link checker) to pre-fetch every link on your website. The cache is a fixed size, so when it eventually fills up, the oldest data is simply flushed. You would also be generating a lot of unnecessary traffic fetching data for pages that may never be displayed to a user. The most efficient use of traffic and cache is to just let users view pages. Popular pages will then automatically be in the cache (except the first time they are viewed, or when their cached copy has gone stale).
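
    A toy Python sketch of why pre-fetching does not help with a fixed-size cache. Everything here is an assumption for illustration: the FixedSizeCache class, the limit of 3 entries, and the ITEM1..ITEM5 values are invented, and the real amazon.pl cache may measure its size and pick what to flush differently. The only behavior taken from the post is "fixed size, oldest data flushed first".

```python
from collections import OrderedDict

class FixedSizeCache:
    """Toy cache that flushes the oldest entry once max_entries is exceeded."""

    def __init__(self, max_entries):
        self.max_entries = max_entries
        self._data = OrderedDict()

    def put(self, url, body):
        # Re-inserting a URL refreshes its age.
        if url in self._data:
            self._data.pop(url)
        self._data[url] = body
        while len(self._data) > self.max_entries:
            self._data.popitem(last=False)  # drop the oldest entry

    def get(self, url):
        return self._data.get(url)

cache = FixedSizeCache(max_entries=3)

# A pre-fetcher walks five pages in order; only the last three survive.
for n in range(1, 6):
    cache.put(f"amazon.pl?asinsearch=ITEM{n}", f"page {n}")

surviving = list(cache._data)
print(surviving)
```

    The first two pages were fetched (costing traffic) and then flushed before any visitor could benefit from them.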

    Yours truly,
    Cusimano.Com Corporation
    per: David Cusimano

  7. #7
    Member
    Join Date
    January 18th, 2005
    Location
    Las Vegas
    Posts
    61
    Two things that can get you banned or have your Google PageRank greatly reduced are:

    1. Creating a bunch of web sites by simply duplicating the original site under different domain names.

    2. Excessive cross-linking between your sites.

  8. #8
    Newbie
    Join Date
    January 18th, 2005
    Posts
    43
    I disallowed some of the folders in the cgi-bin (the format folder). Can I do this without disallowing the entire cgi-bin?

  9. #9
    ABW Ambassador phillyburbs's Avatar
    Join Date
    January 18th, 2005
    Location
    in the PhillyBurbs!
    Posts
    3,097
    Beggers: Define "excessive"


    Karl Smith
    phillyBurbs - Your Internet Starts Here

  10. #10
    ABW Ambassador cusimano's Avatar
    Join Date
    January 18th, 2005
    Location
    Toronto, Canada
    Posts
    1,369
    foofighter,

    You can say "Disallow: /cgi-bin/amazon-format" in your /robots.txt file. When you specify a directory path, spiders are supposed to skip any URL that starts with that directory path.
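
    If you want to check a rule like this before uploading it, Python's standard urllib.robotparser module applies the same "starts with" matching. This is only a sketch: the host www.example.com and the "User-agent: *" line are assumptions added for the example, and a real spider may interpret robots.txt a little differently.

```python
from urllib.robotparser import RobotFileParser

# Assumed robots.txt content; the Disallow path is the one discussed above.
robots_txt = """\
User-agent: *
Disallow: /cgi-bin/amazon-format
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# Hypothetical host; the paths follow the ones mentioned in this thread.
format_ok = parser.can_fetch(
    "*", "http://www.example.com/cgi-bin/amazon-format/details6.html")
script_ok = parser.can_fetch(
    "*", "http://www.example.com/cgi-bin/amazon.pl?asinsearch=B00006LHYC")

print(format_ok, script_ok)
```

    Here format_ok comes out False (the template directory is skipped) while script_ok stays True, so the amazon.pl results pages remain crawlable.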

    Note that robots.txt is only a "suggestion" to spiders/visitors. If you want to actually block access, you have to create an .htaccess file in the appropriate directory. I have the following in my cgi-bin/.htaccess file:

    <Files ~ "\.(htm|html|shtml|tmp)$">
    Deny from all
    </Files>

    This blocks web browser access to any *.htm, *.html, *.shtml, and *.tmp files located in cgi-bin (the directory where this .htaccess file is) and in any subdirectories.
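
    A quick way to sanity-check a filename pattern of this kind is to try it against a few names. The sketch below uses Python's re module as a stand-in for Apache's regex engine (for a pattern this simple the two should behave the same, but that is an assumption), with the alternation written as htm|html|shtml|tmp. The is_blocked helper and the list of file names are invented for the example.

```python
import re

# Python's re is used here as a stand-in for Apache's regex matching.
BLOCKED = re.compile(r"\.(htm|html|shtml|tmp)$")

def is_blocked(filename):
    """True if the file name ends in one of the blocked extensions."""
    return BLOCKED.search(filename) is not None

# Hypothetical file names of the kind found in a cgi-bin directory.
names = ["details6.html", "index.htm", "page.shtml",
         "cache.tmp", "amazon.pl", "amazon.ini"]
results = {name: is_blocked(name) for name in names}

for name, blocked in results.items():
    print(f"{name}: {'blocked' if blocked else 'allowed'}")
```

    Note that an alternative written as ".tmp" inside the parentheses would require an extra character between the dot and "tmp", so a plain cache.tmp would slip through.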

    Try http://www.bime.com/cgi-bin/amazon-format/details6.html and you'll see "403 Forbidden" error which indicates that the .htaccess file successfully blocked web browser access (the file can still be read by amazon.pl).

    Yours truly,
    Cusimano.Com Corporation
    per: David Cusimano

  11. #11
    Newbie
    Join Date
    January 18th, 2005
    Posts
    43
    Thanks, David. I have one other question: how do I change the font size on the results page if I am not using the images in the results? I have seen the color options, but I do not see a size option in the ini file. By the way, this is a great product and hopefully I will be purchasing another script soon. I am very impressed.

  12. #12
    ABW Ambassador cusimano's Avatar
    Join Date
    January 18th, 2005
    Location
    Toronto, Canada
    Posts
    1,369
    There are no configuration variables to set the font and size in results. If you are talking about the list of results such as at http://www.bime.com/cgi-bin/amazon.pl then you have to modify the formatting library. If you are talking about the details pages such as at http://www.bime.com/cgi-bin/amazon.p...rch=0767908155 then edit the formatting file that you are using (I'm using amazon-format/details6.html).

    To use your own custom formatting library, make a copy of amazon-format/library.fmt to amazon-format/mylibrary.fmt and then set (in amazon.ini):

    format.library mylibrary

    The list of results is formatted using the format.list1.* formatting codes (or format.list2.* if you have list.style set to 2). Open mylibrary.fmt in Notepad or some other simple text editor (not Word), and edit the
    <!--format.list1.cell--> .... <!--/format.list1.cell-->
    block. You'll see <font> tags in there. Note: We update the library from time to time, so you'll have to redo your changes at that time.

    Yours truly,
    Cusimano.Com Corporation
    per: David Cusimano

  13. #13
    Newbie
    Join Date
    January 18th, 2005
    Posts
    43
    googlebot only indexed 2 links from my frontpage. I have 256 links on it. What is that all about?

  14. #14
    ABW Ambassador
    Join Date
    January 18th, 2005
    Posts
    532
    Originally posted by foofighter:
    "googlebot only indexed 2 links from my frontpage. I have 256 links on it. What is that all about?"

    (I'm basing these comments on what I have heard on various forums and what I recall off-hand...these comments are not necessarily true or accurate.)
    First...I believe that Google does not necessarily index every link within a site on its initial visit to that site. Google is supposed to be somewhat server friendly and supposedly often nibbles at the links...repeatedly visiting a site to expand the site index.
    Secondly...Google may not index all links on a site, depending on the site's popularity level and how much fluff it encounters between links. Too little fluff and it looks like just a links list. Too much fluff and Googlebot may not read down far enough to pick up the links.

    -------
    Stanley
