Monday, April 16, 2012

SharePoint Search: How to exclude part of a page / partial page from being indexed by SharePoint Search?

What?

How to exclude part of a page / partial page from being indexed by SharePoint Search?
Out of the box, SharePoint provides a mechanism for excluding publishing pages from the search index by any number of criteria, but what if you want to exclude only parts of a page? This becomes useful when you have content on numerous pages that contains common keywords. 

Why?

Scenario 1:

For example, say you have a Search Results Web Part in your page layout that displays your latest press release: "Product A just announced". Every page created from that page layout will then contain "Product A". When someone searches for "Product A", they get back every page in your site that uses that page layout instead of just the press release and related product pages. To avoid this, we need to stop SharePoint from indexing that content when it performs a crawl.
OR
Scenario 2:

You want to index a regular non-SharePoint web site, usually your company’s public-facing web site. That site has common navigation on every page with terms such as Contact, Locations, Privacy Policy, About, etc. that you don’t want indexed. If they are indexed, every search for "contact" returns every page on the site.

How?

Scenario 1:

Ref: My colleague Scott Tindall has built this control and blogged about it here

Web Control:

The System.Web.UI.WebControls.Panel control is a good model for the search crawl exclusion control. It can easily be dropped into a page layout using SharePoint Designer, and it can contain other HTML and controls. Inheriting from Panel directly is not a good idea, though, because it adds unwanted ‘div’ tags to the rendered output. The key to the Panel control’s behavior is the following pair of attributes on the class: [ParseChildren(false), PersistChildren(true)]. These attributes make the content within the control persist as child controls rather than as properties of the control.

User Agent:

The second part of the equation is knowing when to show or hide the contents of the web control. SharePoint lets us detect a crawl through the UserAgent property of the HTTP request: the crawler’s user agent string contains "ms search" (ignoring case).

Code:

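using System;
using System.Web.UI;
using System.Web.UI.WebControls;

// Lets nested markup persist as child controls, the same way Panel does.
// Note: WebControl wraps its output in a <span> by default.
[ParseChildren(false), PersistChildren(true)]
public class SearchCrawlExclusionControl : WebControl
{
   private string userAgentToExclude;

   // Substring that identifies the crawler; defaults to "ms search".
   public string UserAgentToExclude
   {
      get
      {
         return string.IsNullOrEmpty(userAgentToExclude) ? "ms search" : userAgentToExclude;
      }
      set
      {
         userAgentToExclude = value;
      }
   }

   protected override void CreateChildControls()
   {
      string userAgent = this.Context.Request.UserAgent;

      // Hide the control (and its children) when the request comes from the crawler.
      // The comparison ignores case, so a mixed-case UserAgentToExclude still matches.
      this.Visible = string.IsNullOrEmpty(userAgent)
         || userAgent.IndexOf(UserAgentToExclude, StringComparison.OrdinalIgnoreCase) < 0;
      base.CreateChildControls();
   }
}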
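
To try the user agent check in isolation, outside a page request, the same idea can be pulled into a plain helper. This is only a minimal sketch: CrawlerDetection and IsSearchCrawler are hypothetical names that are not part of Scott's control, and the sample crawler string mentioned below is an assumption about what a SharePoint 2010 crawler sends by default, so check the value configured in your own farm.

```csharp
using System;

// Hypothetical helper for illustration only; not part of the original control.
public static class CrawlerDetection
{
    // Returns true when userAgent contains marker, ignoring case.
    // A missing user agent or an empty marker is treated as "not a crawler".
    public static bool IsSearchCrawler(string userAgent, string marker)
    {
        if (string.IsNullOrEmpty(userAgent) || string.IsNullOrEmpty(marker))
        {
            return false;
        }

        return userAgent.IndexOf(marker, StringComparison.OrdinalIgnoreCase) >= 0;
    }
}
```

With this helper, a user agent such as "Mozilla/4.0 (compatible; MSIE 4.01; Windows NT; MS Search 6.0 Robot)" is reported as a crawler for the marker "ms search", while an ordinary browser string or a null user agent is not.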

Using It:

<SearchUtil:SearchCrawlExclusionControl ID="SearchCrawlExclusionControl1" runat="server">
    <div>Some Content To Exclude</div>
</SearchUtil:SearchCrawlExclusionControl>
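
For the SearchUtil tag prefix to resolve, the page layout also needs a Register directive that points at the assembly containing the control. The line below is only a sketch: the namespace, assembly name, version and public key token are placeholders, so substitute the values of your own assembly.

```aspx
<%-- Placeholder namespace, assembly name, version and token: replace with your own --%>
<%@ Register TagPrefix="SearchUtil" Namespace="MyCompany.SearchUtil" Assembly="MyCompany.SearchUtil, Version=1.0.0.0, Culture=neutral, PublicKeyToken=0000000000000000" %>
```

In SharePoint, the assembly usually also has to be listed as a SafeControl in the web application's web.config before the control will render in a page layout.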

Scenario 2:

Ref: Corey Roth has blogged about this trick here

Let's assume that we have the links "Contact Us" and "Privacy Policy" in the footer of the site.

Our goal here is to exclude the Contact Us and Privacy Policy links in the navigation from our search results. How do we do that? It’s pretty simple, actually: put the content that you do not want indexed in a div tag with a class of noindex. Let’s look at the complete HTML of the home page:

<html>
<head>
    <title>Super Neat Home Page</title>
</head>
<body>
    <div>
        Welcome to our awesome site. We are the best! <a href="test.html">Awesome Stuff</a>
        If you need to get a hold of us, click <a href="contactus.html">here</a>. Worried,
        we'll <a href="privacy.html">sell you out?</a>
    </div>
    <div class="noindex">
        <a href="contactus.html">Contact Us</a> <a href="privacy.html">Privacy Policy</a>
    </div>
</body>
</html>

You can see that the Contact Us and Privacy Policy links are inside <div class="noindex">. You might have noticed that the body of the page also has links to these two pages. We included these so that those pages would still get indexed: since the common navigation is excluded, the crawler has no way to follow the links in it. This is something to consider when you are designing master pages, because you will need at least one link to each page of the site somewhere outside the excluded sections.

The noindex class works great with FAST Search for SharePoint as well as with SharePoint 2010 Search.

Using the noindex class is highly recommended any time you want to index a non-SharePoint site, such as your public-facing company web site. By excluding redundant sections of the page, you make your search results much more usable.

1 comment:

  1. Can this be used in MOSS 2007?
