  1. #1
    Member
    Join Date
    January 18th, 2005
    Location
    Las Vegas
    Posts
    61
    Blocking bots from spidering AE.pl
    I have one of my AE.pl sites set for "details" and it's great for users. But the fact that Google follows all the links and ends up spidering 36,000+ pages doesn't help me. In fact, it just puts a big load on my server. AE.pl essentially runs continuously as the bots follow all the related product links.

    I would like to block all search engine bots from spidering any AE.pl links. Is this possible through robots.txt?

  2. #2
    Full Member
    Join Date
    December 20th, 2005
    Posts
    413
    By crawling your site, Google and other search engines will later bring you customers. That's the beauty of AE: you can set up a site devoted to a particular topic, but still sell items that you never even knew Amazon carried.

    How is this not a good thing?

  3. #3
    Member
    Join Date
    January 18th, 2005
    Location
    Las Vegas
    Posts
    61
    Quote Originally Posted by DoctorMike
    By crawling your site, Google and other search engines will later bring you customers. That's the beauty of AE: you can set up a site devoted to a particular topic, but still sell items that you never even knew Amazon carried.

    How is this not a good thing?
    As someone who used to make over $7000/qtr from Amazon and who now makes less than $100, I can tell you that this was true a few years ago (when I had 65 web sites using AE.pl), but it isn't anymore. The problem is that there are hundreds of thousands of sites dumping exactly the same Amazon content into Google. The odds of anyone actually finding you in the search engine results are now virtually zero. Also, all these spidered pages are ranked PR0 by Google, which can hurt your overall site ranking. So the bottom line is that Google spiders these Amazon pages constantly, which does nothing but waste server resources and produces no traffic or sales.

    Actually, Ask.com is worse than Google. Ask.com has wasted over 6 Gigabytes of my bandwidth this month doing nothing but spidering all the Amazon pages on one site. Over and over and over. No sales or traffic, just endless spidering.

  4. #4
    Full Member
    Join Date
    December 20th, 2005
    Posts
    413
    Well, I'm going to guess that with 65 sites, you may not have had time to add a lot of unique content to each. (No offense intended - I haven't seen your sites; I'm just speaking in general here.) I've found that more and more, this is a necessity, because as you say, otherwise Google sees your site as identical to all the others.

    There are two ways to do this (I mean for real, not counting what some of the bad guys do out there by copying content, etc.). One is to modify your AE templates, so that the results are not the same as everybody else's. That used to be enough, at least in my experience. The other, which seems to matter more now based on what I see on my sites, is to actually have good old-fashioned content that gets updated regularly.

    In other words, what Google seems to want these days is an informative site that happens also to have links that can include AE.

  5. #5
    ABW Ambassador cusimano's Avatar
    Join Date
    January 18th, 2005
    Location
    Toronto, Canada
    Posts
    1,369

  6. #6
    Member
    Join Date
    January 18th, 2005
    Location
    Las Vegas
    Posts
    61
    I've tried using this .htaccess code to prevent bots from spidering one of my stores but it's not working (Google picks up 129,000 AE pages). Do you see any mistakes?

    Code:
    RewriteEngine on
    
    RewriteBase /
    RewriteCond %{HTTP_USER_AGENT}  (bot|Teoma|Jeeves|Google|Yahoo)
    RewriteRule ^amazon/asinsearch_([^,/]+)/?$  http://www.amazon.com/exec/obidos/ASIN/$1/ref=nosim/entertaisiteforw?dev-t=D2Y5TUCCVJ7DGE [L,R=301]
    
    RewriteBase /
    RewriteRule ^(amazon)$ $1/ [R]
    RewriteRule ^(amazon)/(.*)(\.[a-z]+)$ cgi-bin/amazon.pl?virtual=$2&virtual.dir=$1 [L]
    RewriteRule ^(amazon)/(.*)$ cgi-bin/amazon.pl?virtual=$2&virtual.dir=$1 [L]
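
One thing worth checking in rules like the ones above is which visitors the RewriteCond actually matches. The sketch below replays that user-agent pattern in Python as an unanchored, case-sensitive search, which is how a RewriteCond pattern behaves without the [NC] flag. The sample user-agent strings are illustrative assumptions, not entries from the poster's logs.

```python
import re

# Same alternation as the RewriteCond above; case-sensitive, unanchored.
BOT_PATTERN = re.compile(r"(bot|Teoma|Jeeves|Google|Yahoo)")

def is_bot(user_agent: str) -> bool:
    """Return True if the user agent would satisfy the RewriteCond."""
    return BOT_PATTERN.search(user_agent) is not None

# Illustrative user-agent strings (assumptions for this sketch).
googlebot = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
slurp = "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)"
browser = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36"
shouting = "TEOMA crawler"  # hypothetical agent with different letter case

for ua in (googlebot, slurp, browser, shouting):
    print(is_bot(ua), ua)
```

An agent that spells its name in a different case slips through, so adding [NC] to the RewriteCond may be worth considering. Note also that the condition only guards the asinsearch rule directly below it; the other /amazon/ rules still serve pages to every visitor.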

  7. #7
    Affiliate Manager MINDsprinter's Avatar
    Join Date
    August 18th, 2006
    Location
    Washington, DC
    Posts
    1,436
    What you want is a robots.txt file or a Google sitemap.

    Robots.txt is easier: http://www.robotstxt.org/wc/exclusion-admin.html

    Google sitemaps might help you with SEO: www.google.com/webmasters/sitemaps/

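    Robots.txt lets you tell any compliant spider to stay out of any page or section of your website. A sitemap works the other way: it lists the pages you do want crawled, so it won't block anything by itself.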

    --Jason
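
For the robots.txt route, here is a minimal sketch of what the file might contain and a quick way to check it before uploading, using Python's standard urllib.robotparser. The /amazon/ and /cgi-bin/amazon.pl paths are assumptions taken from the rewrite rules posted in #6; adjust them to wherever the store actually lives. The ASIN in the test path is made up.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for the site root. The two paths are assumptions
# based on the .htaccess rules earlier in this thread.
ROBOTS_TXT = """\
User-agent: *
Disallow: /amazon/
Disallow: /cgi-bin/amazon.pl
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

def allowed(agent: str, path: str) -> bool:
    """Ask the parser whether `agent` may fetch `path` under these rules."""
    return parser.can_fetch(agent, path)

# Store pages and the script itself should be off limits to every agent.
store_blocked = not allowed("Googlebot", "/amazon/asinsearch_B000123456/")
script_blocked = not allowed("Teoma", "/cgi-bin/amazon.pl")
# Pages outside the store should remain crawlable.
home_allowed = allowed("Googlebot", "/index.html")

print(store_blocked, script_blocked, home_allowed)
```

Keep in mind that robots.txt is advisory: well-behaved spiders honor it, but nothing forces a bot to comply, and pages already in an index can take a while to drop out.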

  8. #8
    ABW Ambassador cusimano's Avatar
    Join Date
    January 18th, 2005
    Location
    Toronto, Canada
    Posts
    1,369
    In the RewriteRule statement, replace:

    ^amazon/asinsearch_([^,/]+)/?$

    with:

    ^amazon/asinsearch_([a-zA-Z0-9]+).*$

    The capture group and everything after it changed: ([^,/]+)/? became ([a-zA-Z0-9]+).* (shown in red in the original post). This pattern is more general and allows for product names in URLs (added in v6.07.26-beta).
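
To see why the old pattern lets most product links through, the two patterns can be compared side by side. This is a rough sketch in Python, whose regex syntax agrees with mod_rewrite's for these simple patterns. The sample paths are hypothetical: the ASIN is made up, and the exact way AE.pl appends the product name to the URL is an assumption.

```python
import re

OLD = re.compile(r"^amazon/asinsearch_([^,/]+)/?$")
NEW = re.compile(r"^amazon/asinsearch_([a-zA-Z0-9]+).*$")

# Hypothetical request paths as a .htaccess RewriteRule sees them
# (no leading slash). The product-name layouts are assumptions.
PATHS = [
    "amazon/asinsearch_B000123456",
    "amazon/asinsearch_B000123456/",
    "amazon/asinsearch_B000123456/Example_Product_Name.html",
    "amazon/asinsearch_B000123456,Example_Product_Name/",
]

def captured(pattern, path):
    """Return the ASIN captured as $1, or None if the rule does not fire."""
    match = pattern.match(path)
    return match.group(1) if match else None

old_results = [captured(OLD, p) for p in PATHS]
new_results = [captured(NEW, p) for p in PATHS]

for path, old, new in zip(PATHS, old_results, new_results):
    print(f"{path}\n  old: {old}\n  new: {new}")
```

With these sample paths, the old pattern fires only on the two bare-ASIN forms, while the new one captures the ASIN from all four and ignores whatever follows it.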

    Yours truly,
    Cusimano.Com Corporation
    per: David Cusimano
