  1. #1
    ABW Founder Haiko de Poel, Jr.
    Join Date
    January 18th, 2005
    Location
    New York
    Posts
    21,609
    From PHPWizard.net By: Tobias Ratschiller

    <BLOCKQUOTE class="ip-ubbcode-quote"><font size="-1">quote:</font><HR>You know all about the advantages dynamically generated web sites offer - but if you want your site to be indexed by search engines, you have to keep in mind how search engines work. This article shows some search engine basics and provides you with guidelines on making your dynamic web sites search-engine-friendly.

    By Tobias Ratschiller on September 28th, 2001.

    The problem
    ==============
    Whether e-commerce applications, web-based schedule planners, or personalized portals: dynamic sites are often generated for one specific user. Web applications, for example, often assign a session ID to identify a user unambiguously. Such a URL might look like this: http://www.foo.com/script.php?ID=b6a...6078abf044cdb5
    This makes it possible to recognize a user across separate pages and, for instance, to show their shopping cart in an online shop. For a search engine it makes little sense to index such a page: the session usually expires after a certain time span, or the content can no longer be retrieved under that URL.

    For this reason many search engines, as a matter of principle, do not index pages whose address (URL) looks dynamic. These include, for example, addresses that contain "cgi-bin", "pl", "?" or "&". A few search engines simply drop the parameters ("?ID") and request the bare page ("script.php").

    This perfectly understandable behaviour leads to one problem, though: many larger sites generate their pages dynamically, for example from a database. These pages should obviously be indexed by search engines. But, as described above, URLs like http://www.foo.com/script.php?category=PHP cause problems.

    Fooling robots
    ==============
    Search engine robots, however, are ordinary HTTP clients and cannot see how a page is created on the server side. And PHP can generate almost anything that a web server can send to a client. To get a robot to index a dynamically generated page, it is enough to make it believe that the page is static. Instead of the ending "php" for a PHP-generated page, you use an ending like "html", for example (the web server must then be configured to run such files through PHP). The URL of the example script now looks like this: http://www.foo.com/script.html?category=php. If a search engine requests the page without these parameters, a default page should come up. This works well for pages that do not need any parameters. Sometimes, though, the parameters really do determine the content: an article from the category "PHP" is completely different from an article from the category "Perl", so the parameter "category" is essential.

    The developer therefore has to find another way to pass parameters. The following URL, for example, simulates a static HTML page: http://www.foo.com/script.html/PHP/. To the robot this looks like a normal directory structure: the path component of this URL is /script.html/PHP/. The web server, however, executes "script.html". The parameter "PHP" is then extracted manually from the path information ($PATH_INFO). A more elegant way: Apache can assign a MIME type to the file directly. You simply name the file "script" (without an ending) and use Apache's ForceType directive to assign it the type application/x-httpd-php. The URL of the script is now http://www.foo.com/script/PHP/, and the parameter is again taken from the path. All search engines index such a page without problems, because it can no longer be told apart from a static HTML page.
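
A minimal sketch of the manual extraction described above, not taken from the original article. It assumes a current PHP version, where the path information is read from `$_SERVER['PATH_INFO']` rather than from a global `$PATH_INFO`; the helper name `category_from_path_info()` and the `'default'` fallback are made up for illustration.

```php
<?php
// For http://www.foo.com/script/PHP/ the server passes "/PHP/" as PATH_INFO.

/**
 * Returns the first non-empty path segment, or null if there is none
 * or if it contains unexpected characters. (Hypothetical helper.)
 */
function category_from_path_info(string $pathInfo): ?string
{
    // Split on "/" and drop the empty pieces left by leading/trailing slashes.
    $segments = array_values(array_filter(explode('/', $pathInfo), 'strlen'));
    if (count($segments) === 0) {
        return null;
    }
    // Accept only simple names so the value is safe to use as a lookup key.
    return preg_match('/\A[A-Za-z0-9_-]+\z/', $segments[0]) ? $segments[0] : null;
}

// PATH_INFO is not set when the URL carries no extra path.
$pathInfo = $_SERVER['PATH_INFO'] ?? '';
$category = category_from_path_info($pathInfo);
if ($category === null) {
    $category = 'default'; // assumed fallback: show the standard page
}
```

Rejecting anything but plain names is a design choice of this sketch; a value taken from the URL is user input like any query parameter.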

    Making magic with Mod_Rewrite
    ==============
    With mod_rewrite you can do without handling the path information manually. Ralf S. Engelschall's mod_rewrite rewrites URLs on the fly; because the rewrite rules (that is, the instructions that describe how URLs are to be rewritten) can use regular expressions, almost anything imaginable can be done. Further information can be found in the documentation at http://www.apache.org/docs/mod/mod_rewrite.html. Please note that this module is not compiled into Apache by default; you have to pass the following option to the configure script to compile mod_rewrite as well: --activate-module=src/modules/standard/mod_rewrite.o

    For our purposes a few simple rewrite rules are sufficient. First the rewrite engine has to be switched on. To do so, put the following configuration directive into a .htaccess file: RewriteEngine on

    The following rule transforms all URLs of the form news<id>.html into shownews.php?id=<id>, so news01.html becomes shownews.php?id=01:

    RewriteRule ^news(.*)\.html$ shownews.php?id=$1

    Your script can access the variable $id as usual. The user's browser does not notice the change; for the browser the file is still called news01.html.
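
A note for current PHP versions, as a sketch and not part of the original article: a bare `$id` was only filled in automatically under `register_globals`, which has since been removed from PHP, so the rewritten parameter has to be read from `$_GET`. The helper name `news_id_from_query()` is hypothetical, and the digits-only check assumes numeric ids such as "01".

```php
<?php
// Possible top of shownews.php (the script name comes from the rule above).

/**
 * Returns the article id from the query parameters if it consists only
 * of digits, otherwise null. (Hypothetical helper.)
 */
function news_id_from_query(array $query): ?string
{
    $id = $query['id'] ?? null;
    // Only non-empty strings made of 0-9 are accepted.
    return (is_string($id) && preg_match('/\A[0-9]+\z/', $id)) ? $id : null;
}

$id = news_id_from_query($_GET);
// $id is now a digit string such as "01", or null if the parameter
// was missing or malformed and a default page should be shown.
```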

    Another example:

    RewriteRule ^(.*)\.html$ shownews.php?id=$1

    This line transforms URLs like foo.html into shownews.php?id=foo.
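
To see what the two patterns actually match, they can be tried out with `preg_replace()`. This is only a sketch that imitates the pattern matching in PHP; it is not how mod_rewrite works internally, and `apply_rules()` is a hypothetical helper.

```php
<?php
/**
 * Returns the rewritten target for $path under the first matching rule,
 * or null if no rule matches. (Hypothetical helper.)
 */
function apply_rules(string $path, array $rules): ?string
{
    foreach ($rules as $pattern => $target) {
        if (preg_match($pattern, $path)) {
            // $1 in the target is replaced by the first captured group.
            return preg_replace($pattern, $target, $path);
        }
    }
    return null;
}

// The two patterns from the article, wrapped in "#" delimiters for PCRE.
$newsRule = ['#^news(.*)\.html$#' => 'shownews.php?id=$1'];
$catchAll = ['#^(.*)\.html$#'     => 'shownews.php?id=$1'];

$a = apply_rules('news01.html', $newsRule);     // "shownews.php?id=01"
$b = apply_rules('foo.html', $catchAll);        // "shownews.php?id=foo"
$c = apply_rules('newsletter.html', $newsRule); // "shownews.php?id=letter"
$d = apply_rules('index.php', $catchAll);       // null, no rule matches
```

The third call shows that `(.*)` accepts any text after "news". If the ids are known to be numeric (an assumption), a pattern such as `^news([0-9]+)\.html$` would leave newsletter.html alone.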

    Conclusion
    ==============
    With a few tricks it is possible to make spiders and robots believe they have found static pages, which they then index in the usual way. The methods presented in this article are easy to integrate into your own scripts and, suitably adapted, also work with other server-side scripting languages.
    <HR></BLOCKQUOTE>

  2. #2
    ABW Ambassador
    Join Date
    January 18th, 2005
    Posts
    1,086
    I think the search engines have pretty much ironed out the problems with query strings. A lot of major content sites include parameters on their pages.

    I have several sites where the articles have the form article.html?article_id=xxx . The major search engines have no problem finding and indexing these pages. I haven't seen any indication that Google gives demerits for the query string in the URL, and my logs show Google diligently reading every article.

    They might assign different weights to articles referred to with parameters, but I doubt that...since so many of the highest quality content sites use query strings.

    The one area of concern is cookies: You need to make sure the bots can crawl your site without cookies.

  3. #3
    Member
    Join Date
    January 18th, 2005
    Posts
    161
    Yes. Currently both Inktomi and Google will spider dynamic urls. It's a snap to do the site.com/articles/4/ thing.

    The biggest pitfall is duplicate content issues with google. If you start a session, google picks it up as site.com?PHPSESSID=sflksfdlksdlfjdsfsdflkjsdklfj

    Which means if you carry that session on other pages google will get several sets of what it thinks are unique pages but are in fact the same. The PHPSESSID changes each time it visits. Bad news!

    Here's something I put together to stop this:

    <pre class="ip-ubbcode-code-pre">
    &lt;?php
    // searchengine-detect.inc
    /* Use this to start a session only if the UA is *not* a search engine,
    to avoid duplicate content issues with URL propagation of SIDs */
    $searchengines = array("Google", "Fast", "Slurp", "Ink", "ia_archiver", "Atomz", "Scooter");

    // Read the user agent from $_SERVER; a bare $HTTP_USER_AGENT global is
    // only filled in when register_globals is on.
    $user_agent = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';

    $is_search_engine = 0;
    foreach ($searchengines as $val) {
        // Case-sensitive substring match against each known robot name
        if (strstr($user_agent, $val) !== false) {
            $is_search_engine++;
        }
    }

    if ($is_search_engine == 0) { // Not a search engine

        /* You can put anything in here that needs to be
        hidden from searchengines */
        session_start();

    } else { // Is a search engine

        /* Put anything you want only for searchengines in here,
        e.g. $foo = $bar; */

    }
    ?&gt;
    </pre>
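
A possible alternative to keeping a list of robot names, sketched under the assumption that visitors without cookies (robots included) do not need session state: tell PHP never to put the session ID into URLs. `session.use_trans_sid` and `session.use_only_cookies` are PHP's own configuration settings; the helper name `use_cookie_only_sessions()` is made up here.

```php
<?php
/**
 * Tells PHP to carry the session ID in a cookie only.
 * Must run before session_start(). (Hypothetical helper.)
 */
function use_cookie_only_sessions(): void
{
    // Do not rewrite links and forms to append the session ID.
    ini_set('session.use_trans_sid', '0');
    // Ignore session IDs that arrive in the URL.
    ini_set('session.use_only_cookies', '1');
}

use_cookie_only_sessions();
// session_start();  // call afterwards, wherever the script needs a session
```

The trade-off is that a visitor who refuses cookies gets a fresh session on every request, so this only fits sites where that is acceptable.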

  4. #4
    Newbie
    Join Date
    January 18th, 2005
    Posts
    15
    Personally, my solution to this has always been to just create a .html page consisting of one line -- a Server Side Include to the script destination.
