To avoid undesirable content in the search indexes, webmasters can instruct spiders not to crawl certain files or directories through the standard robots.txt file in the root directory of the domain. Additionally, a page can be explicitly excluded from a search engine's database by using a meta tag specific to robots (usually <meta name="robots" content="noindex">). When a search engine visits a site, the robots.txt file located in the root directory is the first file crawled. The file is then parsed and tells the robot which pages are not to be crawled. Because a search engine crawler may keep a cached copy of this file, it may on occasion crawl pages a webmaster does not wish crawled. Pages typically prevented from being crawled include login-specific pages such as shopping carts and user-specific content such as search results from internal searches. In March 2007, Google warned webmasters that they should prevent indexing of internal search results because those pages are considered search spam.
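As a rough illustration of how a crawler might interpret these rules, here is a minimal sketch using Python's standard urllib.robotparser module. The robots.txt content, the "ExampleBot" user agent and the example.com URLs are all made up for the example; real crawlers may apply their own matching rules.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that keeps all crawlers out of the shopping cart
# and the internal search results (paths are assumptions for illustration).
ROBOTS_TXT = """\
User-agent: *
Disallow: /cart/
Disallow: /search
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text the way a polite crawler would before fetching pages."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# "ExampleBot" is a made-up user agent; it falls under the "*" rules above.
allowed_home = parser.can_fetch("ExampleBot", "https://example.com/")
allowed_cart = parser.can_fetch("ExampleBot", "https://example.com/cart/checkout")
allowed_search = parser.can_fetch("ExampleBot", "https://example.com/search?q=shoes")

print(allowed_home, allowed_cart, allowed_search)
```

Note that this only covers crawling. Keeping an already-discovered page out of the index is the job of the robots meta tag rather than robots.txt.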
I do not obsess about site architecture as much as I used to, but I always ensure the pages I want indexed are all reachable by a crawl from the home page, and I still emphasise important pages by linking to them where relevant. I always aim to get THE most important exact match anchor text pointing to the page from internal links, but I avoid abusing internal links and avoid overtly manipulative ones that are not grammatically correct, for instance.
You can confer some of your site's reputation to another site when your site links to it. Sometimes users can take advantage of this by adding links to their own site in your comment sections or message boards. Or sometimes you might mention a site in a negative way and don't want to confer any of your reputation upon it. For example, imagine that you're writing a blog post on the topic of comment spamming and you want to call out a site that recently comment spammed your blog. You want to warn others of the site, so you include the link to it in your content; however, you certainly don't want to give the site some of your reputation from your link. This would be a good time to use nofollow.
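A minimal sketch of what this could look like when rendering user-submitted links, for example in a comment section. The function name comment_link and the example URL are hypothetical; the only part taken from the text above is the rel="nofollow" attribute itself.

```python
from html import escape


def comment_link(url: str, text: str) -> str:
    """Render a link with rel="nofollow" so it is marked as not endorsed.

    Both the URL and the anchor text are escaped, since they come from users.
    """
    safe_url = escape(url, quote=True)
    safe_text = escape(text)
    return f'<a href="{safe_url}" rel="nofollow">{safe_text}</a>'


# Hypothetical link to the site being called out in the blog post.
html_link = comment_link("http://www.example.com/", "Comment spammer")
print(html_link)
```

Applying the attribute at render time, rather than trusting users to add it, means every link in comments or message boards is covered by default.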
QUOTE: “So it’s not something where we’d say, if your website was previously affected, then it will always be affected. Or if it wasn’t previously affected, it will never be affected … sometimes we do change the criteria … category pages … (I) wouldn’t see that as something where Panda would say, this looks bad … Ask them the questions from the Panda blog post … usability, you need to work on.” John Mueller, Google.
You may not want certain pages of your site crawled because they might not be useful to users if found in a search engine's search results. If you do want to prevent search engines from crawling your pages, Google Search Console has a friendly robots.txt generator to help you create this file. Note that if your site uses subdomains and you wish to have certain pages not crawled on a particular subdomain, you'll have to create a separate robots.txt file for that subdomain. For more information on robots.txt, we suggest this Webmaster Help Center guide on using robots.txt files.
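To make the subdomain point concrete, here is a small sketch that works out which robots.txt file governs a given page. The helper name robots_txt_url and the example.com hosts are assumptions for illustration only.

```python
from urllib.parse import urlsplit


def robots_txt_url(page_url: str) -> str:
    """Return the robots.txt location for the host that serves page_url.

    The file always sits at the root of the exact host, so each
    subdomain resolves to its own, separate robots.txt.
    """
    parts = urlsplit(page_url)
    return f"{parts.scheme}://{parts.netloc}/robots.txt"


# Hypothetical pages on the main site and on a blog subdomain.
main_robots = robots_txt_url("https://www.example.com/products/shoes")
blog_robots = robots_txt_url("https://blog.example.com/2019/post")

print(main_robots)
print(blog_robots)
```

Rules placed in the first file would not be consulted for pages on the second host, which is why a separate file is needed for each subdomain.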
The reality in 2019 is that if Google classifies your duplicate content as THIN content, or MANIPULATIVE BOILER-PLATE or NEAR DUPLICATE ‘SPUN’ content, then you probably DO have a severe problem that violates Google’s website performance recommendations, and this ‘violation’ will need to be ‘cleaned’ up if, of course, you intend to rank high in Google.
Website owners recognized the value of a high ranking and visibility in search engine results, creating an opportunity for both white hat and black hat SEO practitioners. According to industry analyst Danny Sullivan, the phrase "search engine optimization" probably came into use in 1997. Sullivan credits Bruce Clay as one of the first people to popularize the term. On May 2, 2007, Jason Gambert attempted to trademark the term SEO by convincing the Trademark Office in Arizona that SEO is a "process" involving manipulation of keywords and not a "marketing service."
QUOTE: “If you want to stop spam, the most straight forward way to do it is to deny people money because they care about the money and that should be their end goal. But if you really want to stop spam, it is a little bit mean, but what you want to do, is sort of break their spirits. There are lots of Google algorithms specifically designed to frustrate spammers. Some of the things we do is give people a hint their site will drop and then a week or two later, their site actually does drop. So they get a little bit more frustrated. So hopefully, and we’ve seen this happen, people step away from the dark side and say, you know what, that was so much pain and anguish and frustration, let’s just stay on the high road from now on.” Matt Cutts, Google 2013
Google engineers are building an AI – but it’s all based on simple human desires to make something happen or indeed to prevent something. You can work with Google engineers or against them. Engineers need to make money for Google but unfortunately for them, they need to make the best search engine in the world for us humans as part of the deal. Build a site that takes advantage of this. What is a Google engineer trying to do with an algorithm? I always remember it was an idea first before it was an algorithm. What was that idea? Think “like” a Google search engineer when making a website and give Google what it wants. What is Google trying to give its users? Align with that. What does Google not want to give its users? Don’t look anything like that. THINK LIKE A GOOGLE ENGINEER & BUILD A SITE THEY WANT TO GIVE TOP RANKINGS.
QUOTE: “I think that’s always an option. Yeah. That’s something that–I’ve seen sites do that across the board, not specifically for blogs, but for content in general, where they would regularly go through all of their content and see, well, this content doesn’t get any clicks, or everyone who goes there kind of runs off screaming.” John Mueller, Google
That content CAN be on links to your own content on other pages, but if you are really helping a user understand a topic, you should be LINKING OUT to other helpful resources, e.g. other websites. A website that does not link out to ANY other website could accurately be interpreted as, at the very least, self-serving. I can’t think of a website that is the true end-point of the web.