Thursday 04th of February 2010 09:38:52 PM

Why Complete Indexing of the Site takes time and how to expedite it

Site optimization as a whole revolves around getting the maximum number of links to the site and getting all, or as many as possible, of its pages indexed by the search engines. Each search engine has its own proprietary algorithm, which it keeps very close to its chest, revealing only as much as it wants. The only clues about the spidering and indexing process come from deciphering the site's server logs. Any theory about how a site is indexed is therefore speculation, based on experience and knowledge of the spidering process. Google being the most popular search engine, most webmasters are keen to understand its indexing process. Experience shows that if you program your site well enough to ensure a deep crawl by Googlebot, it also gets the maximum number of pages listed on MSN and Yahoo, which suggests that the other search engines follow a similar indexing process.

Google has multiple Internet data centres, numbering about 12. Anybody searching through Google is directed to the nearest Google data centre. Though these are meant to be synchronized with each other, their databases are sometimes out of sync, and hence a site may show a different number of indexed pages in different data centres. Similarly, Google has more than 1000 crawlers, and these too are independent of each other. Multiple Google crawlers, possibly from different data centres, may therefore visit your site.
Each crawler could have a different objective: some may only grab URLs, while others detect URLs of your site on other sites and URLs of other sites in your site's data. These crawlers date-stamp each crawl so that on subsequent visits they index only updated content. You can stop them from visiting private areas of your web site, such as the back office or administration area, by configuring a robots.txt file.
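As a rough illustration of the robots.txt point above, the sketch below uses Python's standard urllib.robotparser module to check which URLs a well-behaved crawler would be allowed to fetch. The paths /admin/ and /backoffice/ and the helper name is_allowed are assumptions made for the example, not taken from any particular site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt; /admin/ and /backoffice/ are assumed private paths.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Disallow: /backoffice/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for path in ("/index.html", "/admin/login.php"):
        url = "http://www.domain.com" + path
        print(path, is_allowed(ROBOTS_TXT, "Googlebot", url))
```

Note that robots.txt only asks compliant crawlers to stay away; it does not protect a private area from anyone who chooses to ignore it.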
These crawlers precede the visit of special crawlers, which process the information into various classifications: newly detected URLs, error pages, 301 or otherwise redirected URLs, old URLs, and other URLs. The last step of the crawl is done by Deep Crawlers, which record the URLs and deep crawl each one to index all the text, HTML, images, Flash etc.
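The classification step described above can be pictured with a toy sketch. This is not Google's actual logic, which is not public; the function name classify_url, the category labels, and the rule of treating any 4xx or 5xx status as an error are all assumptions made purely for illustration.

```python
def classify_url(url: str, status: int, known_urls: set) -> str:
    """Toy classification of a crawled URL (assumed rules, not Google's).

    status is the HTTP status code returned for the URL, and known_urls
    is the set of URLs seen on earlier crawls.
    """
    if status in (301, 302, 307, 308):
        return "redirected"
    if status >= 400:
        return "error"
    if url in known_urls:
        return "old"
    return "new"


if __name__ == "__main__":
    seen = {"http://www.domain.com/about.html"}
    print(classify_url("http://www.domain.com/about.html", 200, seen))
    print(classify_url("http://www.domain.com/news.html", 200, seen))
    print(classify_url("http://www.domain.com/moved.html", 301, seen))
    print(classify_url("http://www.domain.com/gone.html", 404, seen))
```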
One very important point emerges from this: if your earlier pages were somehow not updated with the right keywords (this happened with some clients of mine, who had not programmed their sites to include keywords in meta tags), it is advisable to change and update the content, as these crawlers will index the updated content. Next begins the indexing of new URLs, with preference given to URLs that also figure on other sites. Thus popular sites that are referred to by many other sites are indexed on an hourly basis, and some are indexed even minute by minute.
Old URLs that are not updated are ignored. If such URLs stop getting traffic, they may over time be dropped too, especially on low-traffic sites that are not regularly updated. Dynamic URLs, URLs with session IDs, PDF documents, Word documents, PowerPoint presentations, multimedia files etc. are visited by another set of crawlers, which assess whether they are worth indexing and to what depth. You can find the pages in the queue to be indexed by typing "site:www.domain.com" into Google. URLs that appear in the results with no description are the ones the Deep Crawlers will index soon. Google arranges sites in order of priority depending on how frequently the content is updated, how much traffic the site gets, the uniqueness of the content etc., and sites with a poor record are sometimes indexed only after as many as 4 to 8 weeks.
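If you check the "site:" query for several domains regularly, it may be convenient to build the search URL in code. The sketch below is a minimal example using only the Python standard library; the helper name site_query_url is hypothetical, and the resulting link is meant to be opened in a browser by hand.

```python
from urllib.parse import urlencode


def site_query_url(domain: str) -> str:
    """Build a Google search URL for the query "site:<domain>"."""
    return "https://www.google.com/search?" + urlencode({"q": "site:" + domain})


if __name__ == "__main__":
    print(site_query_url("www.domain.com"))
```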
After indexing, Deep Crawlers record the content in the data centre they come from, and the data is then synchronized with the other data centres. When different data centres report different indexing statistics, this is called the Google Dance. Earlier the data centres used to synchronize once in 10 days or so; now it seems they are synchronized on an hourly basis.
If you are a growing webmaster, your pages could take up to 10 weeks to get indexed. The best way to shorten this period is to get as many natural (not reciprocal) links as possible, submit your articles to a few sites that accept links back to them, and keep regularly posting new and unique content. This will enhance your influence on search engine spiders. Having a site map on your site also helps crawlers deep crawl it. You can use Google Sitemaps too.
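For the sitemap suggestion, a minimal sketch of generating an XML sitemap in the sitemaps.org format (the format Google Sitemaps accepts) might look like the following. The function name build_sitemap, the page URLs, and the dates are made up for illustration; a real site would list its own pages.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(pages) -> str:
    """Return sitemap XML for an iterable of (url, lastmod) pairs.

    lastmod is a date string in YYYY-MM-DD form.
    """
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, lastmod in pages:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        ET.SubElement(url, "lastmod").text = lastmod
    return ET.tostring(urlset, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical pages, for illustration only.
    example_pages = [
        ("http://www.domain.com/", "2010-02-04"),
        ("http://www.domain.com/articles/indexing.html", "2010-02-01"),
    ]
    print(build_sitemap(example_pages))
```

The output would typically be saved as sitemap.xml in the site root and then submitted through Google Sitemaps.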