Indexing Microsoft Office Documents using OpenXML in Umbraco
Umbraco, Examine, Index, Media
A project I worked on recently needed to search documents uploaded to the Umbraco media library, including text, PDF, and Word documents. Previously, in Umbraco 7, I had used the Cogworks CogUmbracoExamineMediaIndexer package to index both PDF and Word documents, but this package does not work on Umbraco 10.
By default, Umbraco does not index the contents of media items; it only indexes the name. If you want to index the contents, you need to create an index and a value set builder that reads the file when it is uploaded to the media library and stores its text in the index.
I came across an Umbraco package for indexing PDF documents called UmbracoExamine.PDF which uses the PdfPig library for extracting text from PDF documents.
This only solved half my problem, as I still needed a way of indexing Word documents. Because the UmbracoExamine.PDF package is open source, I decided that my best option was to clone the source code and adapt it to read Word documents using the OpenXML library.
For those that do not know, Wikipedia describes OpenXML as:
Office Open XML (also informally known as OOXML) is a zipped, XML-based file format developed by Microsoft for representing spreadsheets, charts, presentations, and word processing documents.
The source code for the UmbracoExamine.PDF package was easy enough to follow. It contains quite a lot of boilerplate code for creating a custom index and populating it with values, most of which would be the same in an implementation for reading Word documents. The only difference is that instead of using PdfPig to read a PDF file, I needed to use the OpenXML library.
An OpenXML document is made up of parts, which represent the document and contain its content, whether it is a Word, Excel, or PowerPoint document. I am not an OpenXML expert, so it took a bit of trial and error to get it reading the text out of the document.
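To make the idea of a zipped package of XML parts concrete, here is a minimal sketch that uses only the .NET base class library rather than the OpenXML SDK. It is not how the package works. The class name OoxmlPeek and both method names are made up for illustration, and the sketch assumes the main document part lives at word/document.xml, which is the usual location but is strictly resolved through the package relationships.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper for illustration only; not part of any package mentioned here.
public static class OoxmlPeek
{
    // WordprocessingML main namespace, used by the <w:t> text elements.
    private static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Builds a tiny in-memory zip holding only a main document part.
    // A real .docx also needs [Content_Types].xml and relationship parts.
    public static MemoryStream BuildSamplePackage(params string[] paragraphs)
    {
        var body = new XElement(W + "body",
            paragraphs.Select(p => new XElement(W + "p",
                new XElement(W + "r", new XElement(W + "t", p)))));
        var document = new XDocument(new XElement(W + "document", body));

        var stream = new MemoryStream();
        using (var zip = new ZipArchive(stream, ZipArchiveMode.Create, leaveOpen: true))
        using (var entry = zip.CreateEntry("word/document.xml").Open())
        {
            document.Save(entry);
        }
        stream.Position = 0;
        return stream;
    }

    // Opens the package as a zip, loads the main document part as XML,
    // and joins the non-empty <w:t> values with single spaces.
    public static string ReadMainDocumentText(Stream package)
    {
        using var zip = new ZipArchive(package, ZipArchiveMode.Read, leaveOpen: true);
        var entry = zip.GetEntry("word/document.xml"); // assumed location
        if (entry == null)
        {
            return string.Empty;
        }

        using var partStream = entry.Open();
        var texts = XDocument.Load(partStream)
            .Descendants(W + "t")
            .Select(t => t.Value.Trim())
            .Where(t => t.Length > 0);
        return string.Join(" ", texts);
    }
}
```

Calling OoxmlPeek.ReadMainDocumentText(OoxmlPeek.BuildSamplePackage("Hello", "world")) returns "Hello world". The OpenXML SDK does the same job more robustly, because it resolves parts through the package relationships instead of relying on a fixed path.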
All that is needed to read the contents of a Word document is the following which gets the text from the main document part.
public class WordProcessingDocumentTextExtractor : BaseOpenXmlTextExtractor, IWordProcessingDocumentTextExtractor
{
    public string GetText(Stream fileStream)
    {
        StringBuilder builder = new StringBuilder();

        // Open the document read-only and dispose of it once the text has been read
        using (var wordprocessingDocument = WordprocessingDocument.Open(fileStream, false))
        {
            if (wordprocessingDocument.MainDocumentPart != null)
            {
                builder.AppendLine(GetTextFromPart(wordprocessingDocument.MainDocumentPart));
            }
        }

        return builder.ToString();
    }
}
The GetTextFromPart() method in the base class gets the text from the OpenXmlPart.
protected string GetTextFromPart(OpenXmlPart part)
{
    StringBuilder builder = new StringBuilder();

    using (var reader = OpenXmlReader.Create(part, false))
    {
        while (reader.Read())
        {
            var line = reader.GetText().Trim();

            if (String.IsNullOrEmpty(line) == false)
            {
                builder.AppendLine(line).Append(" ");
            }
        }
    }

    return builder.ToString();
}
The extracted pieces of text are appended to a single string, separated by whitespace. The Lucene standard analyser then turns that string into tokens, which are stored in the index.
Once I had it reading Word documents, it did not take much effort to read Excel and PowerPoint documents, although Excel is a bit more complicated because you must read the cells from each worksheet within a workbook. The result is the same as for a Word document: a space-separated string containing the text content.
Instead of including all the code here, I have added a copy to GitHub for you to download.
https://github.com/justin-nevitech/UmbracoExamine.OpenXml
To get this working in Umbraco, you need to call AddExamineOpenXml() after calling AddUmbraco() on the Umbraco builder in the ConfigureServices() method within Startup.cs.
services.AddUmbraco(_env, _config)
    .AddBackOffice()
    .AddWebsite()
    .AddComposers()
    .AddExaminePdf()
    .AddExamineText()
    .AddExamineOpenXml()
    .Build();
You can then use the index directly, or you can create a multi searcher that combines several indexes so that a single search returns a combined set of results.
services.AddExamineLuceneMultiSearcher(
    Constants.Searchers.MediaSearcher,
    new string[]
    {
        PdfIndexConstants.PdfIndexName,
        TextIndexConstants.TextIndexName,
        OpenXmlIndexConstants.OpenXmlIndexName
    });
The search method itself then creates a query that searches the nodeName and fileTextContent fields (the latter is where the file contents are stored in the index) to return any matching results.
public IEnumerable<IPublishedContent> Search(string search, int pageNumber, int pageSize, out int pageCount, out int resultCount)
{
    if (_examineManager.TryGetSearcher(Constants.Searchers.MediaSearcher, out ISearcher searcher))
    {
        var query = searcher.CreateQuery();
        IBooleanOperation? filter = null;

        // Match any words that either contain an apostrophe or a full stop or just words
        var matches = Regex.Matches(search?.RemoveStopWords() ?? String.Empty, @"(\w+[.']\w+|\w+)");

        foreach (Match match in matches)
        {
            if (filter == null)
            {
                filter = query.GroupedOr(new string[] { "nodeName", "fileTextContent" }, QueryParser.Escape(match.Value).MultipleCharacterWildcard());
            }
            else
            {
                filter = filter.And().GroupedOr(new string[] { "nodeName", "fileTextContent" }, QueryParser.Escape(match.Value).MultipleCharacterWildcard());
            }
        }

        var queryOptions = new QueryOptions((pageNumber - 1) * pageSize, pageSize);
        var results = filter?.Execute(queryOptions);

        resultCount = (int)(results?.TotalItemCount ?? 0);
        pageCount = (int)Math.Ceiling((double)resultCount / (double)pageSize);

        return _umbracoHelper.Media(results?.Select(p => p.Id) ?? new List<string>());
    }
    else
    {
        throw new Exception("Unable to find searcher");
    }
}
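The two pieces of plain logic inside that method, splitting the search phrase into terms and working out the paging values, can be tried in isolation. The following is a rough sketch that needs neither Umbraco nor Examine. The class name SearchSketch and its method names are made up for illustration, and the stop word removal step is left out because RemoveStopWords() is not part of the base class library.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only.
public static class SearchSketch
{
    // Same pattern as the search method: words containing an inner
    // apostrophe or full stop, or otherwise plain words.
    private static readonly Regex TermPattern = new Regex(@"(\w+[.']\w+|\w+)");

    // Splits a search phrase into the terms that would each become a query clause.
    public static List<string> ExtractTerms(string search) =>
        TermPattern.Matches(search ?? string.Empty)
            .Select(match => match.Value)
            .ToList();

    // Number of results to skip for a 1-based page number.
    public static int Skip(int pageNumber, int pageSize) =>
        (pageNumber - 1) * pageSize;

    // Total number of pages, rounding a partial last page up.
    public static int PageCount(int resultCount, int pageSize) =>
        (int)Math.Ceiling((double)resultCount / pageSize);
}
```

For example, ExtractTerms("don't index report.docx files") returns the four terms don't, index, report.docx and files. With 25 results and a page size of 10, PageCount(25, 10) is 3 and Skip(3, 10) is 20. A page size of zero is not guarded against here, just as in the method above.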
I have also written a version that indexes text files which you can download.