What Can You Crawl in Office 365?

As organizations move toward the cloud, there has been some confusion around management and optimization of search. I’ve had more than one conversation with customers about the continued need for a search strategy, and then someone asks, “But doesn’t the Microsoft Graph pick up everything? Why do I need to still worry about search?”

Whether online or on-premises, search remains critical to SharePoint, and there are still steps to be taken to ensure your content is properly tagged, classified, and indexed. Capturing content is fundamental to search — if it’s not crawled and indexed, you can’t find it! The process of connecting to content sources, crawling them to get content, and making that content searchable is far more complex than most people realize. It has been one of the most frustrating areas to manage with SharePoint. And yes, there are key differences in how content is crawled and indexed in SharePoint Online (Office 365) and SharePoint on-prem.

Crawler Overview

As a quick orientation, the basic function of a crawler is shown in the figure below. The concept is simple enough: the crawler connects securely to a given content source, maps the content from the source system to the crawled properties of the search engine, and feeds the engine in either a full crawl (which re-reads everything) or an incremental crawl (which picks up only what has changed since the last crawl).
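To make the difference between a full and an incremental crawl concrete, here is a minimal sketch in Python. It is a toy model, not how SharePoint implements crawling: the ToyCrawler class, the version numbers used for change detection, and the sample documents are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ToyCrawler:
    """Hypothetical crawler that copies item properties from a source into an index."""
    index: dict = field(default_factory=dict)      # item id -> crawled properties
    last_seen: dict = field(default_factory=dict)  # item id -> version at last crawl

    def full_crawl(self, source: dict) -> int:
        """Re-read every item in the source, whether or not it changed."""
        self.index.clear()
        self.last_seen.clear()
        for item_id, item in source.items():
            self._feed(item_id, item)
        return len(source)

    def incremental_crawl(self, source: dict) -> int:
        """Read only new or changed items, and drop items deleted from the source."""
        changed = 0
        for item_id, item in source.items():
            if self.last_seen.get(item_id) != item["version"]:
                self._feed(item_id, item)
                changed += 1
        for item_id in list(self.index):
            if item_id not in source:
                del self.index[item_id]
                del self.last_seen[item_id]
        return changed

    def _feed(self, item_id: str, item: dict) -> None:
        # "Feeding the engine": store the crawled properties and remember the version.
        self.index[item_id] = dict(item["properties"])
        self.last_seen[item_id] = item["version"]


# Assumed sample content source with two documents.
source = {
    "doc1": {"version": 1, "properties": {"Title": "Q1 plan"}},
    "doc2": {"version": 1, "properties": {"Title": "Budget"}},
}
crawler = ToyCrawler()
full_count = crawler.full_crawl(source)

# One document is edited and one is added; doc1 is untouched.
source["doc2"] = {"version": 2, "properties": {"Title": "Budget v2"}}
source["doc3"] = {"version": 1, "properties": {"Title": "Roadmap"}}
incremental_count = crawler.incremental_crawl(source)
```

In this sketch the incremental pass touches only doc2 and doc3, which is why incremental crawls are so much cheaper than full crawls on large sources.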

The search results you see in the content search web part or the search results web part are not coming directly from your lists and libraries, but from the search index. Think of the index as one big bucket holding all the searchable content: only what is in the index can be found through search.

When you add a new property to a list or library, or change properties that are used in a list or library, search must re-crawl the content before your changes are reflected in the search index. The same applies when you change managed properties or add new ones: the changes to your search results take effect only after the content has been re-crawled. In SharePoint Online, crawling happens automatically based on the defined crawl schedule.

Because your changes are made in the search schema, and not to the actual site, search will not automatically re-crawl the list or the library. To make sure that your changes are crawled, you can specifically re-index a list or library. When you do this, all content in that list or library is marked as changed, and the content is picked up during the next scheduled crawl and re-indexed. After that, you can start using your new managed properties in queries, query rules, and display templates.
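The re-index behavior described above can be sketched as a small Python model. Everything here is hypothetical: request_reindex and scheduled_crawl are stand-ins I made up to show the mechanism (flag every item as changed, then let the next scheduled crawl pick the items up), not real SharePoint APIs.

```python
def request_reindex(library: list) -> None:
    """Hypothetical stand-in for re-indexing a library: flag every item as changed."""
    for item in library:
        item["changed"] = True


def scheduled_crawl(library: list, index: dict) -> list:
    """Hypothetical scheduled crawl: re-index only flagged items, then clear the flag."""
    picked_up = []
    for item in library:
        if item["changed"]:
            index[item["id"]] = dict(item["fields"])
            item["changed"] = False
            picked_up.append(item["id"])
    return picked_up


# Assumed sample library; no item has been edited since the last crawl.
library = [
    {"id": 1, "fields": {"Title": "Widget spec"}, "changed": False},
    {"id": 2, "fields": {"Title": "Gadget spec"}, "changed": False},
]
index = {}

# A schema change alone flags nothing, so the next crawl has no work to do.
first_pass = scheduled_crawl(library, index)

# After a re-index request, the next scheduled crawl picks up every item.
request_reindex(library)
second_pass = scheduled_crawl(library, index)
third_pass = scheduled_crawl(library, index)
```

Note that the re-index request does not crawl anything itself. It only marks the content, so the new managed properties become usable after the next scheduled crawl, not immediately.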

What makes content capture different from one search engine to the next is the breadth of connectors, coverage of different security models and data types, the performance (both throughput and latency), the robustness, and the ease of administration. Most connectors are supplied by Microsoft’s partners, not Microsoft. BA Insight, for example, has connectors to over 60 enterprise systems.

Diagram

For example, SharePoint 2016 supports multiple crawl components, crawl databases, and content sources as shown below. There are several connectors included out of the box:

  • SharePoint sites (from SPS2003 through SP2016)
  • HTTP (websites)
  • File shares
  • Business Data Connectivity (BDC) Framework — also includes these connectors that are built on the BDC framework:
    • Exchange Public Folders
    • Lotus Notes
    • Documentum
    • Taxonomy Connector (connects to MMS)
  • People Profile Connector

In SharePoint Online, content is automatically crawled based on a defined crawl schedule. There are two variants of the crawl: 1) the continuous crawl that runs every 15 minutes and picks up new and changed documents or items and 2) the incremental crawl that follows a Microsoft-defined schedule to pick up any changes in the search configuration.

Control of the Crawl

In your SharePoint on-prem environment, administrators can control the type or frequency of crawls, but within SharePoint Online there is an automated schedule that cannot be changed. Microsoft manages the frequency of these crawls, which typically run every 4 to 8 hours, measured from the previous incremental crawl.

When people search for content on your SharePoint sites, what is in your search index determines what they’ll find. The search index contains information from all documents and pages on your site. Only managed properties are kept in the index, which means that users can only search on managed properties. To get the content and metadata from the documents into the search index, the crawled properties must be mapped to managed properties.
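A short Python sketch shows why the mapping matters. The property names and the CRAWLED_TO_MANAGED table are illustrative assumptions, not an actual tenant schema; the point is only that a crawled property with no managed-property mapping never reaches the index and therefore cannot be searched.

```python
# Hypothetical mapping from crawled properties to managed properties.
CRAWLED_TO_MANAGED = {
    "ows_Title": "Title",
    "ows_ProductType": "ProductType",
}


def to_index_entry(crawled: dict) -> dict:
    """Keep only the crawled properties that are mapped to a managed property."""
    return {
        CRAWLED_TO_MANAGED[name]: value
        for name, value in crawled.items()
        if name in CRAWLED_TO_MANAGED
    }


def search(index: dict, managed_property: str, value: str) -> list:
    """Return ids of indexed items whose managed property equals the given value."""
    return [
        doc_id
        for doc_id, entry in index.items()
        if entry.get(managed_property) == value
    ]


# Assumed crawl output for one document; ows_InternalNote has no mapping.
crawled_doc = {
    "ows_Title": "Widget spec",
    "ows_ProductType": "Widget",
    "ows_InternalNote": "draft",
}
index = {"doc1": to_index_entry(crawled_doc)}

mapped_hits = search(index, "ProductType", "Widget")
unmapped_hits = search(index, "InternalNote", "draft")
```

The document was crawled in full, but a query on the unmapped property finds nothing because that value was never written to the index.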

While administrators can re-index a site, a document library, or a list within SharePoint Online, this remotely controlled crawl becomes an issue when you’ve mapped a crawled property to a managed property and want the managed property updated to reflect this change. With an on-premises SharePoint environment, you can initiate a full crawl to capture the change and re-index your environment. With SharePoint Online, however, there is no option to start a full crawl or to re-index the entire tenant yourself. Instead, you’ll need to open a support ticket with Microsoft to re-index your tenant.

Know Your Schema

Customizing the search experience can have a direct impact on end user adoption. The search schema controls what you can search for, how you search, and how you present the results on your website or intranet. Search “discovers” information by crawling items on your site. The discovered content and metadata are called “properties” of the item. The search schema has a list of crawled properties that helps the crawler decide what content and metadata to extract. By changing the search schema, you can customize the search experience in SharePoint Online.

Why would you change your search schema? So that you can provide a search experience that best matches your unique organizational requirements and help your end users find the data they’re looking for more easily. For example, you might modify your schema to sort search results based on Managed Metadata columns that are unique to your organization, such as prioritizing results based on Product Type or Template.
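As a sketch of what such a customization looks like from the query side, the Python below builds a SharePoint Search REST request URL that sorts on a managed property. It only constructs the URL and sends nothing. The site URL, the query text, and the choice of RefinableString00 as the property holding Product Type are assumptions for illustration, and the sort will only work if the managed property is marked sortable in your search schema.

```python
from urllib.parse import quote

# Characters left unescaped so the quoted REST parameter values stay readable.
SAFE_CHARS = "',:"


def build_search_url(site_url, query_text, sort_property,
                     direction="ascending", select=("Title", "Path")):
    """Build a search REST URL that sorts results on one managed property."""
    params = {
        "querytext": "'" + query_text + "'",
        "selectproperties": "'" + ",".join(select) + "'",
        "sortlist": "'" + sort_property + ":" + direction + "'",
    }
    query = "&".join(
        name + "=" + quote(value, safe=SAFE_CHARS)
        for name, value in params.items()
    )
    return site_url.rstrip("/") + "/_api/search/query?" + query


# Hypothetical tenant and property names.
url = build_search_url(
    "https://contoso.sharepoint.com/sites/intranet",
    "contract template",
    "RefinableString00",
    select=("Title", "Path", "RefinableString00"),
)
```

An authenticated GET against a URL like this would be expected to return results ordered by the chosen property, provided the crawled property for the column has been mapped to it and the content has been re-crawled since the mapping was made.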

More Information on On-Prem vs. Online Search

Moving to Office 365 and SharePoint Online can require a major shift in how you manage not just search, but all aspects of your SharePoint environment. Many aspects of SharePoint administration have been streamlined, and many granular controls have been removed. To better understand the scope and limitations of search within SharePoint Online, be sure to check out the many resources available through the BA Insight website at bainsight.com/ba-insight-resources-for-search-solutions.