URLs Friendly to Search Engines - PHP

Search engines such as Google and Alta Vista attempt to explore the entire Web. They have become an essential resource for Internet users, and anyone who maintains a public site benefits from being listed. Search engines use robots, or spiders, to explore pages in a Web site, and they index PHP scripts the same way they index HTML files. When links appear in a page, they are followed. Consequently, the entire site becomes searchable.

Unfortunately, robots will not follow links that appear to contain form variables. Links containing question marks may lead a robot into an endless loop, so robots are programmed to avoid them. This presents a problem for sites that use form variables in links. Passing form variables in anchor tags is a natural way for PHP scripts to communicate, but it can keep your pages out of the search engines. To overcome this problem, data must be passed in a format that resembles an ordinary URL path.

First, consider how a Web server accepts a URI and matches it to a file. The URI is a virtual path, the part of the URL that comes after the host name. It begins with a slash and may be followed by a directory, another slash, and so forth. One by one, the Web server matches directories in the URI to directories in the filesystem. A script is executed when it matches part of the URI, even when more path information follows. Ordinarily this extra path information is thrown away, but you can capture it.
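As a minimal sketch of capturing that extra path information, assuming PHP runs under a server that populates PATH_INFO, a script can read it from the $_SERVER array. The helper name getPathInfo and the sample URL in the comment are illustrative assumptions.

```php
<?php
// Hypothetical helper: returns the extra path information that follows
// the script name, or an empty string when the server does not supply it.
function getPathInfo(array $server): string
{
    return isset($server['PATH_INFO']) ? (string) $server['PATH_INFO'] : '';
}

// For a request such as /corephp/figures/21-5.php/1234.html (an assumed
// example URL), Apache would typically set PATH_INFO to "/1234.html".
echo htmlspecialchars(getPathInfo($_SERVER));
```

Passing the server array in as a parameter, rather than reading the superglobal inside the function, makes the missing-variable case easy to check on servers that do not set PATH_INFO.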

This script works with Apache compiled for UNIX but may not work with other Web servers. It relies on the PATH_INFO environment variable, which may not be present in a different context. Each Web server creates a unique set of environment variables, although there is overlap.

In the example, the browser connects to a local server that contains a directory named corephp/figures in its document root. A default installation of Apache might place this in /usr/local/apache/htdocs. The name of the script is 21-5.php, and everything after the script name is placed in the PATH_INFO variable. No file named 1234.html exists, but to the Web browser it appears to be an ordinary HTML document. It appears that way to a spider as well.

Using Path Info

The code doesn't really do much. It uses a regular expression to extract the digits between the last slash and the .html extension. The script pretends this number is an identifier, as if it referenced a record in a relational database. I've added some code that uses a random number to create a link to another imaginary record. The same method could be applied to something like a message board, where each message would appear to be a single HTML file.
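A minimal sketch of a script along these lines, assuming PHP 7 or later, might look like the following. The function names extractRecordId and recordLink and the record range of 1 to 9999 are illustrative assumptions, not the original listing.

```php
<?php
// Hypothetical helper: pulls the digits between the last slash and ".html"
// out of the path information, or returns null when the pattern is absent.
function extractRecordId(string $pathInfo): ?int
{
    if (preg_match('#/(\d+)\.html$#', $pathInfo, $match)) {
        return (int) $match[1];
    }
    return null;
}

// Hypothetical helper: builds a link to a record that looks like a static
// HTML file. $scriptName is the URI of the script itself.
function recordLink(string $scriptName, int $id): string
{
    return $scriptName . '/' . $id . '.html';
}

$pathInfo = $_SERVER['PATH_INFO'] ?? '';
$scriptName = $_SERVER['SCRIPT_NAME'] ?? '';
$id = extractRecordId($pathInfo);

if ($id === null) {
    echo "No record requested.<br>\n";
} else {
    echo "You are viewing record $id.<br>\n";
}

// Link to a randomly chosen imaginary record (the range is an assumption).
$next = recordLink($scriptName, random_int(1, 9999));
echo '<a href="' . htmlspecialchars($next) . '">Another record</a>' . "\n";
```

Because the generated link contains no question mark, a spider that follows it sees what looks like one more static page.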

I've introduced only the essential principles of this method. There are a few pitfalls and a few enhancements to be pursued. Keep in mind that Web browsers do their best to fill in relative URLs, and using path information this way may foil their attempts to request images that appear in your scripts, because the browser treats the script name as a directory. Therefore, you must use absolute paths. You might also wish to name your PHP script so that it doesn't contain an extension. With older versions of Apache this is possible by setting the default document type with the DefaultType configuration directive; in Apache 2.4 and later, DefaultType no longer has any effect, and ForceType is the usual substitute. You can also use Apache's mod_rewrite.
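To illustrate the point about absolute paths, here is a small sketch of a helper that roots every asset reference at the document root. The function name absolutePath, the directory /corephp/figures as an asset location, and the file name logo.png are assumptions made for illustration.

```php
<?php
// Hypothetical helper: joins a directory and a file name into a path that
// starts at the document root, so it resolves the same way no matter how
// much path information follows the script name in the requested URL.
function absolutePath(string $baseDir, string $file): string
{
    $dir = trim($baseDir, '/');
    return ($dir === '' ? '' : '/' . $dir) . '/' . ltrim($file, '/');
}

// Assumed names for illustration only.
$src = absolutePath('/corephp/figures', 'logo.png');
echo '<img src="' . htmlspecialchars($src) . '" alt="">' . "\n";
```

With a bare relative reference such as logo.png, a browser viewing a path-info URL would resolve the image against the script name and request the script again instead of the image.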
