Searchable Text Within Assets in Elasticsearch

For the file and MIME types listed below, Elasticsearch can find text within assets using the search functionality in New UI and the Search Screen in the Web UI.

Full-text search within assets is currently in ramp-up status and available to early adopters. Contact your Stibo Systems representative for details.

Prerequisites

To search text within assets, the assets must:

  • Be included in the Elasticsearch index

  • Contain native, readable, or pre-processed text; OCR scanning is not performed on documents or image-based files.

Considerations

Apache Tika is used to perform a best-effort extraction of text for search. Extraction quality depends on the file (encoding, structure, embedded objects, corruption). Even for the types listed below, some files may yield little or no searchable text (e.g., scans without OCR, complex layouts, password-protected files, or formats Tika cannot fully parse). Perfect text extraction is not guaranteed. For more information, search 'Apache Tika' on the web.

When the Elasticsearch indexing or event processor background process encounters an unsupported asset file, the Execution Report logs an error with a link to the asset ID.

The following types of assets are not indexed and, therefore, their contents are not searchable:

  • Files of 50 MB or larger

  • Files with more than 100,000 characters

  • Archive formats (.ZIP, .RAR, .7z, .TAR)
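The exclusions above can be pictured as a pre-check applied to each asset before indexing. The following Python sketch is only an illustration and not the product's implementation: the function name `skip_reason`, the constant names, the assumption that 50 MB means 50 × 1024 × 1024 bytes, the assumption that a file of exactly 50 MB is skipped, and the assumption that the character limit applies to the extracted text are all hypothetical.

```python
from pathlib import PurePath
from typing import Optional

# Assumed values: the documentation states "50 MB" and "100,000 characters";
# treating 1 MB as 1024 * 1024 bytes is an assumption made for this sketch.
MAX_SIZE_BYTES = 50 * 1024 * 1024
MAX_CHARACTERS = 100_000
ARCHIVE_EXTENSIONS = {".zip", ".rar", ".7z", ".tar"}


def skip_reason(file_name: str, size_bytes: int, extracted_chars: int) -> Optional[str]:
    """Return why an asset would be skipped, or None if it passes all three checks."""
    # Archive formats are never indexed; the extension comparison ignores case.
    if PurePath(file_name).suffix.lower() in ARCHIVE_EXTENSIONS:
        return "archive format"
    # Assumption: a file that reaches the limit exactly is also skipped.
    if size_bytes >= MAX_SIZE_BYTES:
        return "file size limit reached"
    if extracted_chars > MAX_CHARACTERS:
        return "character limit exceeded"
    return None
```

For example, `skip_reason("bundle.ZIP", 10, 0)` returns `"archive format"`, while a 1 MB PDF with 20,000 characters of text returns `None`.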

Dimension-dependent assets

Indexing dimension-dependent assets is not recommended; however, it is supported with the following limits. If dimension-dependent assets are indexed, the combined size across all dimension points must be under 50 MB. Consider the scenarios and outcomes defined in the table below.

Scenario: Asset file(s) with a total size under 50 MB

  • Outcome: All content is indexed

  • BGP Execution Report text: Nothing is logged when asset(s) meet the size conditions; log entries can be reported if Elasticsearch fails to reindex due to Apache Tika issues.

Scenario: Asset file(s) with a total size exceeding 50 MB

  • Outcome: All content is skipped

  • BGP Execution Report text:

  1. For assets that are dimension independent: Sum of content for asset [ID] exceeds the max supported size (50MB)…

  2. For dimension-dependent assets: Sum of size of content(s) across contexts for asset [ID] exceeds the max supported size (50MB). Asset content found in dimension points (Language: [language1]), (Language: [language2])…
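The combined-size rule can be sketched as a sum over the content stored at each dimension point. The Python below is a hypothetical illustration, not the product's code: `size_report`, its parameters, and the byte value used for 50 MB are assumptions, and the returned strings only mirror the report text quoted above up to the point where that text is elided.

```python
from typing import Dict, Optional

# Assumed byte value of the documented 50 MB limit.
MAX_TOTAL_BYTES = 50 * 1024 * 1024


def size_report(asset_id: str, sizes_by_point: Dict[str, int],
                dimension_dependent: bool) -> Optional[str]:
    """Return report text when the summed content size breaks the limit, else None."""
    total = sum(sizes_by_point.values())
    if total < MAX_TOTAL_BYTES:
        return None  # all content is indexed and nothing is logged
    if not dimension_dependent:
        return f"Sum of content for asset {asset_id} exceeds the max supported size (50MB)"
    # List every dimension point that holds content, in insertion order.
    points = ", ".join(f"(Language: {name})" for name in sizes_by_point)
    return (
        f"Sum of size of content(s) across contexts for asset {asset_id} "
        f"exceeds the max supported size (50MB). "
        f"Asset content found in dimension points {points}"
    )
```

Under these assumptions, two language variants of 30 MB each are skipped together even though each file is under the limit on its own, because the check applies to the sum.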

Supported Types of Searchable Assets

Elasticsearch can identify text within assets for the following file types and MIME types. For more information, refer to the MIME Types topic.

Definition File Type Extension

Text and markup

TXT – plain text

HTML / HTM / XHTML – web pages

XML – structured text

CSV / TSV – tabular text data

Microsoft Office

DOC / DOCX / DOCM – Word documents

XLS / XLSX – Excel spreadsheets

PPT / PPTX – PowerPoint presentations

MDB – Access database

OpenDocument (LibreOffice / OpenOffice)

ODT – text document

ODS – spreadsheet

ODP – presentation

Email formats

EML – email message

MSG – Outlook email

Portable and rich text

PDF – portable document

RTF – rich text document

EPUB – ebook

Apple formats

PAGES – Apple Pages document

KEY – Apple Keynote presentation

IBOOKS – Apple iBooks ebook

Publishing and design

INDD – Adobe InDesign

QXD – QuarkXPress

CHM – compiled HTML help

DES / IDU – design-related formats

The following MIME types are searchable:

Definition MIME Type

Text formats

text/plain

text/html

text/csv

text/tab-separated-values

application/xml

application/xhtml+xml

Microsoft formats

application/msword

application/vnd.openxmlformats-officedocument.wordprocessingml.document

application/vnd.ms-word.document.macroenabled.12

application/vnd.ms-excel

application/vnd.openxmlformats-officedocument.spreadsheetml.sheet

application/vnd.ms-powerpoint

application/vnd.openxmlformats-officedocument.presentationml.presentation

application/x-msaccess

Email

message/rfc822

application/vnd.ms-outlook

OpenDocument formats

application/vnd.oasis.opendocument.text

application/vnd.oasis.opendocument.spreadsheet

application/vnd.oasis.opendocument.presentation

Portable and ebooks

application/pdf

application/rtf

application/epub+zip

application/x-ibooks+zip

Apple formats

application/vnd.apple.pages

application/vnd.apple.keynote

Publishing and design

application/x-indesign

application/vnd.quark.quarkxpress

application/vnd.ms-htmlhelp

application/x-design

application/x-idu
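To check ahead of time whether an asset's MIME type is among those listed above, a simple lookup is enough. The Python sketch below copies the MIME types from this topic into a set; the names `SEARCHABLE_MIME_TYPES` and `is_searchable_mime` are hypothetical, and the case-insensitive comparison and the stripping of parameters such as `charset` are assumptions about how a caller might normalize the value, not documented product behavior.

```python
# MIME types copied from the list in this topic.
SEARCHABLE_MIME_TYPES = frozenset({
    # Text formats
    "text/plain", "text/html", "text/csv", "text/tab-separated-values",
    "application/xml", "application/xhtml+xml",
    # Microsoft formats
    "application/msword",
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    "application/vnd.ms-word.document.macroenabled.12",
    "application/vnd.ms-excel",
    "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    "application/vnd.ms-powerpoint",
    "application/vnd.openxmlformats-officedocument.presentationml.presentation",
    "application/x-msaccess",
    # Email
    "message/rfc822", "application/vnd.ms-outlook",
    # OpenDocument formats
    "application/vnd.oasis.opendocument.text",
    "application/vnd.oasis.opendocument.spreadsheet",
    "application/vnd.oasis.opendocument.presentation",
    # Portable and ebooks
    "application/pdf", "application/rtf",
    "application/epub+zip", "application/x-ibooks+zip",
    # Apple formats
    "application/vnd.apple.pages", "application/vnd.apple.keynote",
    # Publishing and design
    "application/x-indesign", "application/vnd.quark.quarkxpress",
    "application/vnd.ms-htmlhelp", "application/x-design", "application/x-idu",
})


def is_searchable_mime(content_type: str) -> bool:
    """True when the media type, ignoring parameters and case, is in the list."""
    # Drop parameters such as "; charset=utf-8" before comparing.
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type in SEARCHABLE_MIME_TYPES
```

A match only means the type is eligible; as noted under Considerations, an eligible file can still yield little or no searchable text.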