March 2, 2011
Q: What makes relevance such a challenge in the enterprise? IT pros are often shocked when a search proof of concept (POC) appears highly successful, yet users are anything but pleased once it is deployed into production. The inevitable question keeps coming up: "Why can't it just be like Google?" Users struggle to explain what they are actually asking for, but relevance is clearly a big part of their disappointment. So what went wrong with the POC?
In a POC, IT normally tests against a set of documents that is small relative to the size of the production corpus. This can set false expectations about the relevance the search engine can deliver, and here is why.
Let's say you have 1,000 documents in your POC and you run a query. If 5% of those match the query, you have 50 hits, or 5 pages of results at 10 hits per page, and there is a reasonably good chance a relevant hit shows up on page one. But now let's run that same query in production, where we have 10,000 documents. The total number of hits would potentially rise to 500, or 50 pages. Would relevant content appear on page one? Maybe, maybe not. Considering that most organizations have document counts in the hundreds of thousands, you can imagine why ranking algorithms working with a mere two- or three-keyword query struggle to put relevant content on page one. And it is only going to get harder from here. A study conducted by IDC shows just how much information we are already drowning in, and the trend is clearly getting worse (see Figure 1 below).
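The arithmetic above can be sketched in a few lines of Python. This is only an illustration: the function name `result_pages` is made up for this example, the page size of 10 is the one implied by the "50 hits or 5 pages" figures, and the 500,000-document row is a hypothetical stand-in for "hundreds of thousands."

```python
import math


def result_pages(corpus_size: int, match_rate: float, page_size: int = 10) -> tuple[int, int]:
    """Return (hits, pages) for a query that matches a fixed share of the corpus."""
    hits = round(corpus_size * match_rate)
    pages = math.ceil(hits / page_size)
    return hits, pages


# 1,000 and 10,000 come from the example above; 500,000 is a hypothetical
# corpus size used only to show how quickly the result list grows.
for size in (1_000, 10_000, 500_000):
    hits, pages = result_pages(size, 0.05)
    print(f"{size:>7} documents -> {hits:>6} hits, {pages:>5} pages")
```

The match rate stays the same in every row; only the corpus grows, and with it the number of pages a user would have to wade through.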

Many organizations now have as many documents on their network as were publicly available on the Internet in the 1990s. For those of you who remember, search engines were dismal at serving up relevant content at that time.
Enter Google. Google's approach was simple yet elegant: provide the ranking algorithm with additional information above and beyond the two or three keywords that users supplied. That additional information comes in the form of links. When quality content is published on the Internet, other users link to it and reference it. The very fact that they took the time to do so is an explicit indication that quality content has been found, so Google boosts the relevance of this content. Almost overnight, people could find relevant content easily.
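To make the idea concrete, here is a deliberately simplified sketch of using links as an extra ranking signal. It is not Google's actual algorithm, which also weighs how important each linking page is; it just adds a fixed bonus per inbound link. The function name, the `boost` value, and the sample pages are all assumptions made for illustration.

```python
from collections import Counter


def rank_with_link_boost(keyword_scores, links, boost=0.5):
    """Order documents by keyword score plus a bonus for each inbound link.

    keyword_scores: dict mapping document -> keyword-match score
    links: iterable of (source, target) pairs
    boost: hypothetical weight given to each inbound link
    """
    # Count inbound links per document, ignoring self-links.
    inbound = Counter(target for source, target in links if source != target)
    combined = {
        doc: score + boost * inbound[doc]
        for doc, score in keyword_scores.items()
    }
    return sorted(combined, key=lambda doc: combined[doc], reverse=True)


# Three pages with nearly identical keyword scores.
keyword_scores = {"a.html": 1.0, "b.html": 1.0, "c.html": 0.9}
links = [("x.html", "c.html"), ("y.html", "c.html"), ("x.html", "b.html")]

print(rank_with_link_boost(keyword_scores, links))
# ['c.html', 'b.html', 'a.html']
```

On keywords alone, `c.html` would rank last; two inbound links are enough to move it to the top.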
The bad news is that Google's breakthrough on the Internet doesn't translate to Enterprise Search. The Enterprise has many different file and data types that are not web-based. Hence the linking structure that makes up the fabric of the web does not exist in the Enterprise.
Even if it did, there are other factors that make Enterprise Search a more challenging problem to solve. When someone creates a website and publishes content, they want the information to be found, so they go to great lengths to make it as search-engine friendly as possible. Enterprise users are often rushing to hit deadlines and are less inclined to take the time to worry about findability. Security is another challenge in the Enterprise that is a non-issue for Internet search engines. According to Gartner, a typical Enterprise has six business systems supporting operations. Each of these has its own security model controlling access to the data. To provide this information in a search result, the search engine must honor the security of these systems, which is not a trivial undertaking.
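A rough sketch of what "honoring security" means at query time: results are filtered against each document's access list before the user sees them. Real systems each have their own security model, so the single `acl` dictionary, the group names, and the `trim_results` function here are simplifying assumptions rather than any product's API.

```python
def trim_results(ranked_hits, user_groups, acl):
    """Keep only the hits the user is allowed to read, preserving rank order.

    acl maps a document to the set of groups allowed to read it.
    Documents with no ACL entry are hidden (deny by default).
    """
    return [doc for doc in ranked_hits if acl.get(doc, frozenset()) & user_groups]


# Hypothetical access lists gathered from several business systems.
acl = {
    "hr/salaries.xlsx": {"hr"},
    "wiki/onboarding.docx": {"hr", "engineering", "sales"},
    "crm/renewals.csv": {"sales"},
}
ranked_hits = [
    "hr/salaries.xlsx",
    "wiki/onboarding.docx",
    "crm/renewals.csv",
    "scratch/untracked.txt",
]

print(trim_results(ranked_hits, {"engineering"}, acl))
# ['wiki/onboarding.docx']
```

Even this toy version hints at the difficulty: the search engine has to collect and keep current the permissions from every source system it indexes.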
So given the challenges of providing relevant search results, what are the options for an IT Professional?
- First, have realistic expectations about what a search engine can deliver out of the box, given the challenges above.
- Avoid going index crazy. If you can identify content of low value, don't index it just for the sake of indexing it. The smaller the index, the better chance SharePoint has to serve up relevant content.
- Consider implementing third-party add-ons that automatically tune relevance.
- If the resources are available, consider manually tuning the ranking algorithm.
- Use Scopes! A scope is a pre-search filter that can be used to search against a particular file type, file location, etc. Users often have a very good sense of what they are looking for or where it resides. Reducing the number of documents being searched will produce greater relevancy.
- Tag content. While this is labor intensive, nothing is better than having a human being classify and categorize content.
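The scope idea from the list above can be illustrated with a small, generic sketch. This is not SharePoint's scope API; the `Document` class, the `search_in_scope` function, and the sample paths are hypothetical, and the matching is a plain "all keywords present" check.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Document:
    path: str
    file_type: str
    text: str


def search_in_scope(docs, query, file_type=None, location=None):
    """Apply the scope filters first, then match keywords in what remains."""
    terms = query.lower().split()
    hits = []
    for doc in docs:
        if file_type and doc.file_type != file_type:
            continue  # outside the file-type scope
        if location and not doc.path.startswith(location):
            continue  # outside the location scope
        if all(term in doc.text.lower() for term in terms):
            hits.append(doc.path)
    return hits


docs = [
    Document("/finance/reports/q4-budget.xlsx", "xlsx", "Q4 budget forecast"),
    Document("/finance/memos/budget-note.docx", "docx", "Budget approval memo"),
    Document("/marketing/plans/budget.docx", "docx", "Campaign budget plan"),
]

print(search_in_scope(docs, "budget"))
# all three paths match
print(search_in_scope(docs, "budget", file_type="docx", location="/finance/"))
# ['/finance/memos/budget-note.docx']
```

The unscoped query returns every document that mentions "budget"; the scoped one narrows the candidates to a single hit before any ranking is needed.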