March 11, 2010
How Google Indexes the Web

Google runs a crawler program named Googlebot, a robot that indexes Web pages (and now other types of documents). Its principle is simple (though its implementation is not!): when it reads a page, it adds every page linked from that page to its list of pages to visit.
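The principle can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not Googlebot's actual code: the names `LinkExtractor`, `crawl` and `TOY_WEB` are made up for the example, and the "Web" is a small in-memory dictionary rather than real HTTP requests.

```python
from collections import deque
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag found in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(fetch, seed):
    """Read pages breadth-first, starting from seed.

    fetch(url) returns the page's HTML, or None if the page is missing.
    Returns the URLs in the order they were read.
    """
    to_visit = deque([seed])   # the "list of pages to visit"
    seen = {seed}              # avoids queuing the same page twice
    visited = []
    while to_visit:
        url = to_visit.popleft()
        html = fetch(url)
        if html is None:
            continue
        visited.append(url)
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:
            if link not in seen:
                seen.add(link)
                to_visit.append(link)
    return visited


# A tiny hypothetical "Web" held in memory (assumption for the example).
TOY_WEB = {
    "/home": '<a href="/about">About</a> <a href="/news">News</a>',
    "/about": '<a href="/home">Home</a>',
    "/news": '<a href="/about">About</a>',
    "/orphan": '<a href="/home">Home</a>',
}

order = crawl(TOY_WEB.get, "/home")
print(order)
```

The `seen` set is what keeps the crawler from looping forever on pages that link back to each other. Note that `/orphan` is never read, because no page links to it.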

Theoretically, it should thus be able to discover the majority of the pages on the Web, i.e. all those that are not orphans (a page is called an orphan if no other page links to it). Because the volume of data to process is so large, this robot is a program distributed across hundreds of servers.
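To make the notion of an orphan page concrete, here is a minimal sketch that finds the orphans in a small link graph. The function `find_orphans` and the sample `LINK_GRAPH` are hypothetical, chosen only for illustration.

```python
def find_orphans(link_graph):
    """Return the pages that no other page links to.

    link_graph maps each page to the list of pages it links to.
    """
    linked = set()
    for source, targets in link_graph.items():
        for target in targets:
            if target != source:  # a page linking to itself does not count
                linked.add(target)
    return sorted(page for page in link_graph if page not in linked)


# Hypothetical link graph (assumption for the example).
LINK_GRAPH = {
    "/home": ["/about", "/news"],
    "/about": ["/home"],
    "/news": ["/about"],
    "/orphan": ["/home"],
}

orphans = find_orphans(LINK_GRAPH)
print(orphans)
```

An orphan can link out to other pages, as `/orphan` does here, but a crawler that only follows links has no way of reaching it.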

Besides knowing about as many pages as possible, Google also wants to re-index them regularly, because many pages are updated from time to time. The frequency of Googlebot's visits to a Web page depends on the page's PageRank: the higher the PageRank, the more often the page is indexed. From one visit to the next, Googlebot can also detect that a page no longer exists ("error 404").
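The sketch below illustrates the idea of PageRank-dependent revisits and of spotting a vanished page between two passes. The exact rule Google uses is not given here, so the formula in `revisit_interval_days`, the 0 to 10 score scale, and all names are assumptions made for the example.

```python
import heapq


def revisit_interval_days(pagerank, base_days=30.0):
    """Hypothetical rule: the higher the PageRank, the shorter the gap."""
    return base_days / (1.0 + pagerank)


def schedule(pages, horizon_days):
    """Return (day, url) visits within the horizon, in chronological order.

    pages maps url -> PageRank score (assumed to be on a 0..10 scale).
    """
    heap = [(revisit_interval_days(pr), url) for url, pr in pages.items()]
    heapq.heapify(heap)
    visits = []
    while heap and heap[0][0] <= horizon_days:
        day, url = heapq.heappop(heap)
        visits.append((day, url))
        # Queue the next visit to the same page.
        heapq.heappush(heap, (day + revisit_interval_days(pages[url]), url))
    return visits


def detect_missing(previous_status, current_status):
    """URLs that answered 200 on the last pass but 404 on this one."""
    return sorted(
        url
        for url, code in current_status.items()
        if code == 404 and previous_status.get(url) == 200
    )


# Hypothetical pages and scores (assumption for the example).
PAGES = {"/popular": 9.0, "/average": 4.0, "/obscure": 0.0}
visits = schedule(PAGES, horizon_days=30)
counts = {url: sum(1 for _, u in visits if u == url) for url in PAGES}
print(counts)

gone = detect_missing({"/a": 200, "/b": 200}, {"/a": 200, "/b": 404})
print(gone)
```

A priority queue keyed on the next visit date keeps the most urgent page at the front, which is one common way to organize this kind of recurring schedule.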

Google then analyzes this colossal mass of information in full detail. Each word or phrase is associated with a type, based on the HTML tags that surround it. Thus a word in the title is considered more significant than the same word in the body text. These types can be ranked by importance (page title, headings H1 to H6, bold, italic, etc.). This preprocessing, combined with other criteria including PageRank, makes it possible to show the most relevant results first.
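As a rough illustration of tag-based weighting, the following sketch scores each word of a page according to the most important tag enclosing it. The weights in `TAG_WEIGHTS` are invented for the example (the real values are not public), and the names `WeightedWords` and `score_words` are hypothetical.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Hypothetical weights, for illustration only.
TAG_WEIGHTS = {
    "title": 10,
    "h1": 8, "h2": 6, "h3": 5, "h4": 4, "h5": 3, "h6": 3,
    "b": 2, "strong": 2, "i": 2, "em": 2,
}
BODY_WEIGHT = 1  # plain body text


class WeightedWords(HTMLParser):
    """Adds up a score per word, weighted by the enclosing HTML tags."""

    def __init__(self):
        super().__init__()
        self.open_tags = []            # weighted tags currently open
        self.scores = defaultdict(int)

    def handle_starttag(self, tag, attrs):
        if tag in TAG_WEIGHTS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in self.open_tags:
            # Close the most recently opened tag of this name.
            idx = len(self.open_tags) - 1 - self.open_tags[::-1].index(tag)
            del self.open_tags[idx]

    def handle_data(self, data):
        # Use the heaviest enclosing tag, or the body weight if none is open.
        weight = max((TAG_WEIGHTS[t] for t in self.open_tags),
                     default=BODY_WEIGHT)
        for word in data.lower().split():
            word = word.strip(".,;:!?()\"'")
            if word:
                self.scores[word] += weight


def score_words(html):
    parser = WeightedWords()
    parser.feed(html)
    return dict(parser.scores)


PAGE = (
    "<html><head><title>Houston hosting</title></head>"
    "<body><h1>Web hosting</h1>"
    "<p>We offer <b>fast</b> hosting in Houston.</p></body></html>"
)

scores = score_words(PAGE)
print(sorted(scores.items(), key=lambda item: -item[1]))
```

Here "hosting" appears in the title, in a heading and in the body, so it ends up with the highest score, while words that appear only in plain body text get the minimum weight.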








© Copyright 1994 - 2010 WebWize, Inc. All Rights Reserved