To avoid undesirable content in the search indexes, webmasters can instruct spiders not to crawl certain files or directories through the standard robots.txt file in the root directory of the domain. Additionally, a page can be explicitly excluded from a search engine's database by using a meta tag specific to robots (usually ). When a search engine visits a site, the robots.txt located in the root directory is the first file crawled. The robots.txt file is then parsed and will instruct the robot as to which pages are not to be crawled. As a search engine crawler may keep a cached copy of this file, it may on occasion crawl pages a webmaster does not wish crawled. Pages typically prevented from being crawled include login specific pages such as shopping carts and user-specific content such as search results from internal searches. In March 2007, Google warned webmasters that they should prevent indexing of internal search results because those pages are considered search spam.
Google knows who links to you, the “quality” of those links, and whom you link to. These – and other factors – help ultimately determine where a page on your site ranks. To make it more confusing – the page that ranks on your site might not be the page you want to rank, or even the page that determines your rankings for this term. Once Google has worked out your domain authority – sometimes it seems that the most relevant page on your site Google HAS NO ISSUE with will rank.
Think about how Google can algorithmically and manually determine the commercial intent of your website – think about the signals that differentiate a real small business website from a website created JUST to send visitors to another website with affiliate links, on every page, for instance; or adverts on your site, above the fold, etc, can be a clear indicator of a webmaster’s particular commercial intent – hence why Google has a Top Heavy Algorithm.
QUOTE: “Returning a code other than 404 or 410 for a non-existent page (or redirecting users to another page, such as the homepage, instead of returning a 404) can be problematic. Firstly, it tells search engines that there’s a real page at that URL. As a result, that URL may be crawled and its content indexed. Because of the time Googlebot spends on non-existent pages, your unique URLs may not be discovered as quickly or visited as frequently and your site’s crawl coverage may be impacted (also, you probably don’t want your site to rank well for the search query” GOOGLE
While most of the links to your site will be added gradually, as people discover your content through search or other ways and link to it, Google understands that you'd like to let others know about the hard work you've put into your content. Effectively promoting your new content will lead to faster discovery by those who are interested in the same subject. As with most points covered in this document, taking these recommendations to an extreme could actually harm the reputation of your site.
To prevent users from linking to one version of a URL and others linking to a different version (this could split the reputation of that content between the URLs), focus on using and referring to one URL in the structure and internal linking of your pages. If you do find that people are accessing the same content through multiple URLs, setting up a 301 redirect32 from non-preferred URLs to the dominant URL is a good solution for this. You may also use canonical URL or use the rel="canonical"33 link element if you cannot redirect.