In the first two parts of this series, I introduced Context and Content Acquisition as two of the “four pillars” involved in successfully creating a search-driven killer app. In this blog, I’ll introduce and discuss the Information Enrichment pillar. A common misconception is that simply crawling and indexing the content in your content sources will be enough to provide a good search experience.
Acquiring the content and its associated metadata is only the first step. Quite often the metadata in the content source is simply not enough to find the content, drive relevancy or provide the proper context for the search results. This is where information enrichment comes into play. I’ll explain two methods of information enrichment: autoclassification and metadata expansion. The Information Enrichment pillar is a lengthy topic to cover in a single blog, so I’m going to break the exploration of this pillar into two parts. I’ll begin with autoclassification in this part and then cover metadata expansion in the next blog.
Classification is the process of tagging or categorizing information based on shared qualities or characteristics. Indexing the body of a document will certainly give you hits on keywords, but without tagging from categorization, it’s just that, a list of keyword hits. Having the right tags associated with your content is paramount to your users having a great search experience because it gives the content general context and provides relevancy. For example, imagine you have a lot of documents with the term “shuttle” in the title or body. Without classification to add the proper tags, there is no way to tell if a document is about the space program as in the space shuttle, or if the document is about the shuttle bus to the airport. The tags provide that extra level of context that your users can use to find results that are more relevant to them than just a simple keyword-based hit in the index. In addition, the tags can be used to drive refinement of the search results. Classification generally relies on a taxonomy to supply the tags used to categorize content. SharePoint 2013’s Managed Metadata Service and the taxonomies available via the Term Sets are powerful classification tools that can be utilized to tag content before it’s indexed.
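To make the refinement idea a little more concrete, here is a minimal Python sketch. Everything in it is made up for illustration: the sample documents, their tags, and the function names `search`, `refiner_counts` and `refine` are assumptions, not part of SharePoint or any search product. It simply shows how tags let you turn a flat list of keyword hits into counts per category that a user can click on to narrow the results.

```python
from collections import Counter

# Hypothetical, already-tagged documents (illustration only).
DOCS = [
    {"title": "Space shuttle launch schedule", "tags": ["NASA"]},
    {"title": "Airport shuttle bus timetable", "tags": ["Airport"]},
    {"title": "Shuttle program history", "tags": ["NASA"]},
]


def search(docs, keyword):
    """Plain keyword search: case-insensitive substring match on the title."""
    return [d for d in docs if keyword.lower() in d["title"].lower()]


def refiner_counts(results):
    """Count how many results carry each tag (what a refinement panel displays)."""
    return Counter(tag for d in results for tag in d["tags"])


def refine(results, tag):
    """Narrow the result set to the documents carrying the chosen tag."""
    return [d for d in results if tag in d["tags"]]


hits = search(DOCS, "shuttle")
print(refiner_counts(hits))
print([d["title"] for d in refine(hits, "Airport")])
```

Without the tags, all three documents are just hits on “shuttle”. With them, the user can pick the “Airport” refiner and drop the space program documents from the list.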
One of the big challenges with enterprise search is that the content is often missing meaningful tags in the content metadata. The content management application will add some metadata like author and date by default, but this is simply not enough. The content sources rely on the people creating the content to tag it, and quite often people are not interested in or motivated to tag their content. They simply want to add their content and go about their business. One of the reasons that Google is so successful in finding the content you are looking for is that content owners (site owners) are motivated to tag their pages with the keywords that will allow Google to surface their pages in the search results. A page that lacks keywords or tags will be treated as less relevant than pages that have them and will be pushed further down in the search results. Therefore, most content owners on the web are very motivated to curate and tag their pages. Most enterprise content owners don’t have the same motivation because they know where to find their own content and are not thinking about making it easy for others to find it.
Organizations have traditionally tried two methods of getting their users to tag content. They either make it mandatory for the users to provide the tags before the content is added, or they make it optional and hope that the users will add the tags. Neither method works. When tagging is mandatory, users find it a chore and often enter or pick the wrong tags just to get through the process of adding the content. When tagging is optional, they simply don’t bother adding anything. Autoclassification attempts to resolve this problem by removing the burden of tagging from the end users and applying a systematic approach to tagging content instead.
Autoclassification employs a taxonomy to tag content, just like the manual, user-driven process of tagging content. Unlike the manual process, autoclassification uses rules to determine how the content should be tagged. Using the earlier example of “shuttle”, tagging the content will be based on a rule that determines whether the shuttle refers to NASA or the airport. The rule can be as simple as this: If the content contains “space” AND “shuttle”, then tag it with the term “NASA”. If the content contains “airport” AND “shuttle”, then tag it with the term “Airport”. I’m oversimplifying for the purpose of illustrating how the rules can provide a systematic method of tagging content without user intervention.
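The two “shuttle” rules above can be sketched in a few lines of Python. This is only an illustration of the idea, not how any particular product implements it: the `Rule` class, the `RULES` list and the `classify` function are hypothetical names, and the matching is deliberately naive (every word in a rule must appear somewhere in the text).

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """A term from the taxonomy plus the words that must ALL be present."""
    tag: str
    required_words: frozenset


# Hypothetical rules mirroring the "shuttle" example.
RULES = [
    Rule("NASA", frozenset({"space", "shuttle"})),
    Rule("Airport", frozenset({"airport", "shuttle"})),
]


def classify(text, rules=RULES):
    """Return the tag of every rule whose required words all occur in the text."""
    # Lowercase and split into whole words so "Shuttle" and "shuttle." both match.
    words = set(re.findall(r"[a-z0-9]+", text.lower()))
    return [rule.tag for rule in rules if rule.required_words <= words]


print(classify("The space shuttle launched today"))
print(classify("Take the shuttle to the airport"))
print(classify("Shuttle schedule for Monday"))
```

The first text matches only the “NASA” rule, the second only the “Airport” rule, and the third matches neither, so it gets no tag. A document that mentions all three words would receive both tags, which is one reason real rule sets need more than simple AND conditions.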
BA Insight has an autoclassification product called the AutoClassifier. As its name implies, it automates classification, working in conjunction with the terms within the term sets in the Managed Metadata Service and a set of rules associated with each term. As the content is crawled and processed through content processing, the AutoClassifier will inspect the content for matches against the rules defined for each term in the term sets. If there is a match on a rule, then that term is added as a tag. It’s a pretty straightforward and effective way to tag content without asking the content creators to do it.
Below are some screenshots of how rules-based autoclassification can be set up, and an example of using the autoclassified tags to drive refinement.
Figure 1 – BA Insight AutoClassifier
Figure 2 – Refinement based on Autoclassified Tags
As you can see, classified content can provide a meaningful and relevant search experience for your users. I’ll continue the exploration of the Information Enrichment pillar in the next blog, with the topic of metadata expansion.