In part one of this series, I introduced Context as one of the “four pillars” necessary to successfully create a search-driven killer app. In this blog, I’ll introduce and discuss the Content Acquisition pillar. In its most basic form, content acquisition is a no-brainer requirement for search: you obviously need content in the index before anyone can find anything. However, content acquisition needs to be carefully thought out, planned, and curated in order to provide your users with a meaningful and relevant search experience.
One of the biggest missteps I’ve repeatedly seen is an overabundance of content being crawled and indexed. I refer to this phenomenon as ocean boiling: the belief that for a search implementation to be successful, every piece of content must be indexed and made available for search. I’ve heard all sorts of justifications for ocean boiling, but the most common goes something like this: “We can’t find anything now, and we just got this super-duper search technology. Why not throw everything at it, since it’s really good at finding things, right?” No, not exactly. I think one contributing factor is that most people successfully find just about everything on the web through Google. The perception is that because Google can find just about everything, it must crawl and index everything out on the web. This is simply not true. There are varying theories about how much web content Google actually indexes, and most of the people debating this believe it’s less than 10%. Only Google knows, and I’m sure they’re not going to publish this information.
I’ll give you two real reasons why you shouldn’t index every piece of content in your organization. The first is that there is a lot of “garbage” content sitting in file shares, team sites, and various other dark corners of your organization’s storage infrastructure: everything from team outing memos to holiday party pictures to those jokes everyone likes to email around, and the list goes on. There is more of this stuff than you realize. Adding all of this meaningless content to the index introduces a significant amount of noise into your search results, which degrades their relevancy. Poor relevancy is one of the main complaints people have with most implementations of enterprise search. The correct strategy is to identify the authoritative sources of content in your organization, crawl those content sources, and consider the non-authoritative ones only on an as-needed basis. Examples of authoritative content sources include document management systems with documents that are profiled and managed by context or subject, accounting or financial systems for financial information, and HR systems for information about your people. These content sources contain precise and pertinent information on a particular topic or subject and will provide the most consistent and relevant search results. Authoritative sources will also help drive the context-driven search experience I discussed in my first blog in this series. If you want a real-world example of authoritative sources, you don’t have to look any further than Google. Google has designated certain sources as authoritative: Wikipedia for most things related to a proper noun, TripAdvisor for travel-related information, and Yelp for restaurant reviews, among others. Remember the discussion about how much Google indexes? There is a reason why Google doesn’t have to index the entire web: they’ve determined that the authoritative content sources will give the vast majority of their users the most relevant search results, without having to index the outliers.
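The authoritative-source strategy above boils down to an allowlist: inventory your candidate sources, flag the authoritative ones, and crawl only those. A minimal sketch (the source names and the `authoritative` flag are hypothetical, not from any particular product):

```python
# Hypothetical inventory of crawl candidates. Only sources flagged as
# authoritative make it into the crawl plan; everything else is deferred
# to an as-needed review rather than crawled by default.
sources = [
    {"name": "Document management system", "authoritative": True},
    {"name": "HR system", "authoritative": True},
    {"name": "Shared drive: misc/party-photos", "authoritative": False},
]

crawl_plan = [s["name"] for s in sources if s["authoritative"]]
print(crawl_plan)  # ['Document management system', 'HR system']
```

The point is that the default answer for a source is “no” until someone makes the case that it is authoritative, which is the opposite of the ocean-boiling default.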
The second reason for not crawling and indexing everything in your organization is that there are direct infrastructure and performance costs associated with indexing all of this extra, meaningless content. SharePoint, like most search engines, stores the index on disk. The more content you crawl and index, the more server and disk resources will be required. In addition to the material cost of servers and disks, the amount of content crawled also has a direct impact on crawling and indexing performance. You certainly don’t want to make your users wait for that important piece of content to be available for search because it’s stuck in a batch of documents that are of no use to the vast majority of your users. So, carefully consider the content sources you want to crawl. Once you’ve identified them, a good strategy is to develop a timeline for when each content source will become available in search. Group the content sources by order of importance, and introduce the groups in phases. This is a much more manageable process than trying to crawl and index all of your sources on the first go.
Once you’ve identified your authoritative sources, the acquisition part of content acquisition needs to be considered. SharePoint 2013 can crawl only a limited set of content sources out of the box, such as SharePoint sites, file shares, and websites. Most organizations keep their authoritative content in external line-of-business systems like a DMS, CRM, ERP, or even relational databases, and there is no out-of-the-box connectivity to these systems. You could write a connector to them with BCS (Business Connectivity Services), but that requires a level of technical expertise that most organizations don’t have readily available. The other challenge is ensuring that the security and permissions assigned to the content in these external sources are respected when your users search it. Exposing content that was secured at the source to everyone in your search results is not a desirable outcome.
Luckily, there are third-party solutions to help with content acquisition. BA Insight, for instance, offers the Content Connectivity Engine, which provides connectivity to securely crawl content from a variety of line-of-business systems, including DMS (Documentum, Livelink, FileNet), ECM, ERP, relational databases, and more. One of the defining features of the Content Connectivity Engine is the ability to map (or convert) the native security permissions of the content to an equivalent Active Directory permission. This ensures that the permissions from the original content sources are respected by SharePoint Search and that your users will only be presented with search results they have been given access to in the original content source.
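To make the permission-mapping idea concrete, here is a minimal sketch of what happens at index time: each document’s native ACL principals are translated to equivalent Active Directory groups before the document and its ACL are written to the index. The mapping table, principal names, and document shape are all hypothetical, for illustration only, and are not BA Insight’s actual API:

```python
# Hypothetical mapping from native DMS principals to AD groups.
NATIVE_TO_AD = {
    "dms:legal-team": "CONTOSO\\Legal-Users",
    "dms:finance-team": "CONTOSO\\Finance-Users",
}


def map_acl(native_principals):
    """Translate native ACL principals to AD groups.

    Unmapped principals are dropped (fail closed), so a gap in the mapping
    table can never over-expose content to unintended users.
    """
    return [NATIVE_TO_AD[p] for p in native_principals if p in NATIVE_TO_AD]


doc = {"id": "contract-42.pdf", "acl": ["dms:legal-team", "dms:interns"]}
doc["acl"] = map_acl(doc["acl"])
# doc["acl"] is now ["CONTOSO\\Legal-Users"]. At query time, security
# trimming compares this list against the searching user's AD group
# memberships, so only authorized users see the document in results.
```

The key design choice is failing closed: an unknown principal results in less access, never more, which is exactly the behavior you want from security trimming.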
To summarize, content acquisition comes down to two basic principles: identifying the authoritative content sources, and choosing the proper connectivity technology to acquire that content with its native security permissions intact.
In the next blog of this series, I’ll cover information enrichment and the crucial role of metadata and classification in driving findability.