31 July 2012

Duplicate Content & SEO

Understanding SEO issues related to Duplicate Content

Duplicate content is very common on the web. It comes in the form of content syndication, mirror sites, content scraping, quoting, reused content, data-fed sites, etc.

Duplicate content is a challenge for the search engines for at least the following 3 reasons:

  • Duplicate content drains search engine resources (disk space, CPU power and manpower) that could be used to process other content
  • When users submit a query to a search engine, most do not want links to (and descriptions of) web pages that carry duplicate information
  • Duplicate content can be exploited by web spammers and content thieves

There are essentially 2 types of duplicate content: duplicate documents and query-specific duplicate documents.

Duplicate Documents

Duplicate documents are web pages that are the same or almost the same. The duplication here operates on the whole page, unlike query-specific duplicates, which share only parts of a page.

Query-Specific Duplicate Documents

Sometimes documents can be quite different on the whole, but share common paragraphs. Examples:

  • one document quoting paragraphs of another document
  • two pages fed the same product information from a data feed, but with different navigation around the duplicate text
  • link directories with common website listings (anchor text and link descriptions)

In essence, when two documents are different on the whole but are detected to share common parts, they are called query-specific duplicate documents. Why query-specific? Because this duplicate detection is based on the user-supplied query (more on this later).

At the moment all major search engines are good at detecting whole duplicate documents, but Google is the only one that reliably filters documents that merely share duplicate parts (query-specific duplicates).

This currently leaves Yahoo and MSN open to content spamming (for example, SERPs dominated by sites built on the same data feed). I expect Yahoo and MSN to follow Google's aggressive duplicate content filtering, so the future isn't bright for any form of duplicated content.

Google has two patents on duplicate content - "Detecting Duplicate and Near Duplicate Files" and "Detecting Query-Specific Duplicate Documents".

I will focus my article on Google's patents because Google is the most open about its algorithms and currently has the best duplicate content filtering technology.

Before I continue, let's restate the major difference between duplicate documents and query-specific duplicate documents. Duplicate documents are essentially the same or almost the same page. Query-specific duplicate documents can be quite different overall but share duplicate parts.

Detecting Duplicate Content

Google detects the different types of duplicate content at different stages of the search engine's operation.

Whole duplicate pages are detected after a page is crawled. Google generates fingerprints that are compared to the fingerprints of the other pages in the repository. At this "after crawling / before indexing" phase, whole duplicate content pages are discovered and labeled.
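
The patents do not disclose the exact fingerprint function, but a common way to detect whole-page (near-)duplicates is shingling: hash overlapping runs of words and compare the resulting fingerprint sets. Here is a minimal Python sketch of that idea; the shingle size and the similarity threshold mentioned in the comments are my illustrative assumptions, not Google's actual values:

```python
import hashlib
import re

def shingle_fingerprints(text, shingle_size=5):
    """Hash every run of shingle_size consecutive words into a fingerprint."""
    words = re.findall(r"\w+", text.lower())
    shingles = (" ".join(words[i:i + shingle_size])
                for i in range(max(len(words) - shingle_size + 1, 1)))
    return {hashlib.md5(s.encode("utf-8")).hexdigest() for s in shingles}

def similarity(page_a, page_b):
    """Jaccard overlap of the two fingerprint sets (1.0 means identical)."""
    fp_a, fp_b = shingle_fingerprints(page_a), shingle_fingerprints(page_b)
    return len(fp_a & fp_b) / len(fp_a | fp_b)

# Two pages whose overlap exceeds some threshold (say 0.9) would be
# labeled as whole duplicates at this "after crawling" stage.
page_a = "Buy blue widgets online today with free shipping on every blue widget order."
page_b = "Buy blue widgets online today with free shipping on every single blue widget order."
print(similarity(page_a, page_b))
```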

Detecting duplicate parts of pages is a tricky business because the number of combinations of page snippets to compare is astronomical. Google detects query-specific duplicate pages at query time.

Here's an overview:

A visitor places a search query. Google generates the top 1000 most relevant pages according to its current algorithm.

For every page in the final set of 1000 pages, Google fetches the raw HTML from the data repository (not the index) and strips the tags and possibly the stop words. After that, Google scans every page in this final candidate set for the query terms and pulls the parts of each page that contain the most query keywords.

Finally, Google compares these query-specific snippets with the snippets pulled for the other pages in the final set and if there is a match (exact or close), Google will not show the page with the lower relevancy score.

Within this final set of 1000 pages, Google tries to filter out pages that offer the same content related to the query. If query-specific duplicate pages are detected, Google shows only the page with the highest relevancy score. Basically, all query-specific duplicates fight for one place and only one page gets it. All other pages are omitted from the results (note: omitted does not mean penalized!).
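
To make the query-time flow above concrete, here is a toy Python sketch. It assumes a fixed word window for snippet extraction and exact snippet matching; the real system works over the top 1000 candidates and also treats close (not just exact) matches as duplicates:

```python
import re

def query_snippet(page_text, query_terms, window=30):
    """Return the window of words containing the most query terms
    (a rough stand-in for query-biased snippet extraction)."""
    words = re.findall(r"\w+", page_text.lower())
    terms = {t.lower() for t in query_terms}
    best_start, best_hits = 0, -1
    for start in range(max(len(words) - window + 1, 1)):
        hits = sum(1 for w in words[start:start + window] if w in terms)
        if hits > best_hits:
            best_start, best_hits = start, hits
    return " ".join(words[best_start:best_start + window])

def filter_duplicates(ranked_pages, query_terms):
    """Keep only the highest-ranked page for each snippet; ranked_pages
    is ordered best-first, so lower-ranked duplicates are simply dropped."""
    seen_snippets = set()
    kept = []
    for page in ranked_pages:
        snippet = query_snippet(page, query_terms)
        if snippet in seen_snippets:
            continue  # omitted from the results, not penalized
        seen_snippets.add(snippet)
        kept.append(page)
    return kept
```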

Let me give an example. You have an affiliate data fed widget store. You pull widget product information from an xml feed. This same feed is used by thousands of other affiliates.

Your site shows the widget descriptions taken from the feed, which are essentially the same as all other affiliate sites. The product description is not unique, but your text around it (your navigation and other texts) is unique.

Google crawls your widget site. After crawling the pages, Google does not detect your pages as duplicates of other pages (thanks to your unique navigation).

At query time a potential customer issues a query related to widgets. This query uses keywords that are found in your duplicate product descriptions. Your page gets in the final 1000 top ranked pages. However, let's say 100 other affiliates get in the top 1000.

Now, Google scans all pages for query-related snippets. Because the query uses keywords that appear heavily in the data-fed descriptions, the generated snippets for your store and the 100 other affiliates are the same. Google will show only the top ranked page and all the rest will be omitted from the results.

If you want to rank such a site, your pages need to be the top ranked out of all other affiliates.

This final query-specific document detection is what currently separates Google from Yahoo and MSN in terms of duplicate content removal from the SERPs.

Google also uses the final query-specific dup detection phase to generate the actual snippets of text shown within the SERPs.

As you see from the above example, it is very difficult to rank high when the target keywords are used within duplicate parts of your pages.

I want to make a point very clear. Google does not penalize duplicate content. Google simply shows only one page and filters the other duplicates.

Let's go back to the 2 types of duplicate content and their implications.

Whole duplicate documents are detected between the crawling and the indexing phase. The purpose of this detection is to:

  • Detect mirror sites. When a site is down, Google may redirect the searcher to the mirror site.
  • Reduce storage space by not indexing duplicates or keeping them in the repository. Only the oldest / most established / highest PR page gets indexed.
  • Decrease the frequency with which Google crawls duplicate pages.

Query-specific duplicates are detected at query time. The purpose of their detection is:

  • Filter out duplicate pages from the SERPs. Show only one page from the duplicates (the page that is scored highest) and omit the rest.
  • Mark pages / sites that trigger query-specific duplicate detection. If a site triggers too many detections, add it to the list of sites that need inspection by the spam team.

My Duplicate Content Recommendations

  • Don't build duplicate content sites (sites containing primarily duplicate content: articles, feeds, etc.). They have no chance of ranking at Google (and probably not at Yahoo and MSN in the future).
  • Pages built around small snippets of duplicate content in a unique order (such as link directories) have a decent chance of ranking well, but will still sometimes get filtered for certain queries.
  • The bigger the duplicate chunks, the smaller your chances of avoiding duplicate content detection, and the more queries it will affect.
  • It is perfectly safe to use small duplicated snippets of text (such as quotes from other articles). There is NO penalty. The only risk is not being shown in the SERPs for queries whose keywords fall primarily within the quoted duplicate snippets.
  • If your site cannot avoid some duplicate content that is overused by other affiliate sites (for example, you run a pharmacy store and cannot realistically rewrite the drug descriptions), then place the duplicate content on pages that are blocked from crawlers (with robots.txt, or with form buttons disguised as links); see the robots.txt sketch after this list.
  • Avoiding duplicate detection at the whole page level is very easy (change the navigation and some text). Avoiding query-specific duplicate detection when you don't have unique content is difficult.
  • Use common sense. Duplicate content has its place on the web. Penalties are issued only against sites built primarily on scraped or free duplicated content.
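
For the robots.txt option mentioned in the list above, here is a minimal sketch. The /feed-products/ path is a hypothetical directory holding the data-fed pages; adjust it to wherever your duplicate content actually lives:

```
User-agent: *
Disallow: /feed-products/
```

This keeps crawlers away from the duplicate descriptions while the rest of the site remains crawlable.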

Here's a real example.

Search Google for "diprolene af" without the quotes (that is a cream sold at many duplicate content online pharmacies). Instead of showing 1000 results, Google (at the moment) shows 89.

There is a text at the bottom of the last page that says: "In order to show you the most relevant results, we have omitted some entries very similar to the 89 already displayed. If you like, you can repeat the search with the omitted results included." Click the link to search with the omitted results included, and you will see 966 (at the moment) results (they are not 1000 because other pages are filtered - probably pages from the same domain as other listed pages).

In this extreme example, Google lists about 9% of all the final 1000 page candidates.

I think we will eventually see duplicate link detection (detecting and devaluing duplicate anchor text and link descriptions for incoming links), but that is a topic for a future article.

Remember the golden rule of duplicate content: make it unique or hide it from search engines.
