Saturday 06th of February 2010 12:59:07 AM

Falling prey to rogue proxies

To return meaningful results for a search query, any search engine has to filter out duplicate content. This filtering is not easy: a poor filter may weed out genuine results as well, leaving the visitor with a limited choice of results.

The biggest problem in this filtering process comes from the plagiarism tools available on the net. These tools work in different ways: some replace characters such as a, e, i, o with look-alikes such as ế, ề, ỉ, ặ, ẹ, while others change the wording or syntax of a sentence. Text length, common sentences and keyword density are other signals that could help a search engine identify a duplicate result. Rephrased sentences are harder for search engines to identify; however, Google seems to have perfected this art. Whether Google filters through some algorithm or a manual process (which seems unlikely considering the magnitude involved) is not known.
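As a rough illustration of why the character-swapping trick is weak on its own, here is a minimal Python sketch that folds accented letters back to their base form and then compares two texts by their shared word sequences. It is only an assumption about how such a check could be built, not a description of what Google actually does; the function names `fold_accents`, `shingles` and `similarity` are made up for this example.

```python
import unicodedata


def fold_accents(text):
    """Strip combining marks so that e.g. 'ế' and 'ẹ' both become 'e'."""
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.lower()


def shingles(text, size=3):
    """Return the set of overlapping word sequences of length `size`."""
    words = fold_accents(text).split()
    if not words:
        return set()
    if len(words) < size:
        return {tuple(words)}
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def similarity(a, b):
    """Jaccard similarity of the two shingle sets, between 0.0 and 1.0."""
    sa, sb = shingles(a), shingles(b)
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)
```

A copy that only swaps in accented characters scores the same as a verbatim copy once the accents are folded away. Rephrased sentences are a different matter: they share few word sequences with the original, which is why they are so much harder to catch.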
However, this filtering knocks some sites out of the main results shown for a query and into the supplemental index. A visitor would rarely turn to the supplemental index for the information he is looking for. So the issue that needs to be addressed is which page goes into the main index and which one into the supplemental index. At least Google seems to have perfected the art of identifying this, but some glitches still seem to remain.
Of course, if you feel that your page has been wrongly classified as duplicate content, you have the option of asking Google to re-index the page, but this is not simple for any webmaster, whether small or big.
Now, coming to the problem. Suppose a con artist, or say your competitor, sets up a proxy and uses spiders to maintain a copy of your site that is always as up to date as yours, and makes it faster for a search engine to access. Would the search engine classify that proxy as the duplicate, or your own site? In the absence of any word from Google we can at best guess, but in my perception Google is rarely more than two months behind in catching these tricks.
Some of these proxy sites successfully spoof their origin to masquerade as Google spiders.
A successful webmaster whose site has grown to the point of attracting a competitor's evil intentions would do well to have some software installed on his server to identify such rogue spiders. It would be worth it.
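One check such software could run against a visitor claiming to be a Google spider is a reverse DNS lookup followed by a forward confirmation, which is the method Google itself recommends for verifying its crawlers. The Python sketch below is a minimal version of that idea, not a complete defence. The function name `is_genuine_googlebot` and the two lookup parameters are made up for this example, and the list of hostname suffixes is an assumption that should be checked against Google's current documentation.

```python
import socket

# Assumption: genuine Google crawlers resolve to hosts under these domains.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")


def is_genuine_googlebot(ip,
                         reverse_lookup=socket.gethostbyaddr,
                         forward_lookup=socket.gethostbyname_ex):
    """Return True only if `ip` reverse-resolves to a Google host
    and that host resolves forward to the same `ip`.

    The lookup functions are parameters so they can be swapped out
    for testing. gethostbyname_ex handles IPv4 addresses only.
    """
    try:
        hostname = reverse_lookup(ip)[0]
    except OSError:  # no reverse DNS record, or the lookup failed
        return False

    if not hostname.lower().rstrip(".").endswith(GOOGLE_SUFFIXES):
        return False

    try:
        addresses = forward_lookup(hostname)[2]
    except OSError:  # hostname does not resolve
        return False

    # A spoofer can fake the reverse record for an address he controls,
    # but not the forward record of a domain he does not own.
    return ip in addresses
```

A spider that sends a Google user-agent string but fails this check is a good candidate for blocking or closer inspection. Since DNS lookups are slow, the result for each IP address is worth caching rather than repeating on every request.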