Searchable Text Within Assets in Elasticsearch
For the file and MIME types listed below, Elasticsearch can find text within assets using the search functionality in New UI and the Search Screen in the Web UI.
The full-text search within assets functionality is in ramp-up status and available to early adopters. Contact your Stibo Systems representative for details.
Prerequisites
To search text within assets, the assets must:
- Be included in the Elasticsearch index
- Contain native, readable, or pre-processed text; OCR scanning is not performed on documents or image-based files.
Considerations
Apache Tika is used to perform a best-effort extraction of text for search. Extraction quality depends on the file (encoding, structure, embedded objects, corruption). Even for the types listed below, some files may yield little or no searchable text (e.g., scans without OCR, complex layouts, password-protected files, or formats Tika cannot parse fully). Perfect text extraction is not guaranteed. For more information, search 'Apache Tika' on the web.
When the Elasticsearch indexing or event processor background process encounters an unsupported asset file, the Execution Report logs an error with a link to the asset ID.
The following types of assets are not indexed and, therefore, their contents are not searchable:
- Files with a size of 50 MB or more
- Files with more than 100,000 characters
- Archive formats (.ZIP, .RAR, .7z, .TAR)
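The exclusion rules above can be sketched as a simple pre-check. This is only an illustration and not the product's implementation: the function name `skip_reason`, the constants, and the returned labels are hypothetical. It also assumes that 1 MB equals 1024 * 1024 bytes, that a file of exactly 50 MB is excluded, and that the character limit applies to the extracted text.

```python
from pathlib import PurePath
from typing import Optional

# Limits taken from the list above.
# Assumption: 1 MB is treated as 1024 * 1024 bytes.
MAX_FILE_BYTES = 50 * 1024 * 1024
# Assumption: the character limit applies to the extracted text.
MAX_TEXT_CHARS = 100_000
ARCHIVE_EXTENSIONS = {".zip", ".rar", ".7z", ".tar"}


def skip_reason(filename: str, size_bytes: int, extracted_text: str) -> Optional[str]:
    """Return a hypothetical label for why an asset would be excluded, or None."""
    # Compare the final extension case-insensitively (.ZIP and .zip are the same).
    if PurePath(filename).suffix.lower() in ARCHIVE_EXTENSIONS:
        return "archive format"
    # Assumption: a file of exactly 50 MB counts as too large.
    if size_bytes >= MAX_FILE_BYTES:
        return "file size limit"
    if len(extracted_text) > MAX_TEXT_CHARS:
        return "character limit"
    return None
```

A check like this could be run before upload to flag assets whose contents are unlikely to become searchable.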
Dimension-dependent assets
Indexing dimension-dependent assets is not recommended; however, it is supported with the following limits. If dimension-dependent assets are indexed, the combined size across all dimension points must be under 50 MB. Consider the scenarios and outcomes defined in the table below.
| Scenario | Outcome | BGP Execution Report Text |
|---|---|---|
| Asset file(s) with a total size under 50 MB | All content is indexed | Nothing is logged when asset(s) meet the size conditions; log entries can be reported if Elasticsearch fails to reindex due to Apache Tika issues. |
| Asset file(s) with a total size exceeding 50 MB | All content is skipped | |
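As a rough illustration of the combined-size rule for dimension-dependent assets, the sketch below sums the file sizes across dimension points and compares the total to the limit. The function name, the mapping of dimension point names to sizes, and the outcome strings are hypothetical. Treating 1 MB as 1024 * 1024 bytes and a total of exactly 50 MB as skipped are assumptions.

```python
from typing import Mapping

# Assumption: 1 MB is treated as 1024 * 1024 bytes.
LIMIT_BYTES = 50 * 1024 * 1024


def dimension_asset_outcome(sizes_by_dimension_point: Mapping[str, int]) -> str:
    """Return "indexed" or "skipped" based on the combined size in bytes."""
    total = sum(sizes_by_dimension_point.values())
    # The combined size must be under the limit for the content to be indexed.
    return "indexed" if total < LIMIT_BYTES else "skipped"
```

For example, two 20 MB files on different dimension points stay under the limit together, while two 30 MB files do not, even though each file is under 50 MB on its own.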
Supported Types of Searchable Assets
Elasticsearch can identify text within assets for the following file types and MIME types. For more information, refer to the MIME Types topic.
| Definition | File Type Extension |
|---|---|
| Text and markup | TXT – plain text; HTML / HTM / XHTML – web pages; XML – structured text; CSV / TSV – tabular text data |
| Microsoft Office | DOC / DOCX / DOCM – Word documents; XLS / XLSX – Excel spreadsheets; PPT / PPTX – PowerPoint presentations; MDB – Access database |
| OpenDocument (LibreOffice / OpenOffice) | ODT – text document; ODS – spreadsheet; ODP – presentation |
| Email formats | EML – email message; MSG – Outlook email |
| Portable and rich text | PDF – portable document; RTF – rich text document; EPUB – ebook |
| Apple formats | PAGES – Apple Pages document; KEY – Apple Keynote presentation; IBOOKS – Apple iBooks ebook |
| Publishing and design | INDD – Adobe InDesign; QXD – QuarkXPress; CHM – compiled HTML help; DES / IDU – design-related formats |
The following MIME types are searchable:
| Definition | MIME Type |
|---|---|
| Text formats | text/plain, text/html, text/csv, text/tab-separated-values, application/xml, application/xhtml+xml |
| Microsoft formats | application/msword, application/vnd.openxmlformats-officedocument.wordprocessingml.document, application/vnd.ms-word.document.macroenabled.12, application/vnd.ms-excel, application/vnd.openxmlformats-officedocument.spreadsheetml.sheet, application/vnd.ms-powerpoint, application/vnd.openxmlformats-officedocument.presentationml.presentation, application/x-msaccess |
| Email formats | message/rfc822, application/vnd.ms-outlook |
| OpenDocument formats | application/vnd.oasis.opendocument.text, application/vnd.oasis.opendocument.spreadsheet, application/vnd.oasis.opendocument.presentation |
| Portable and ebooks | application/pdf, application/rtf, application/epub+zip, application/x-ibooks+zip |
| Apple formats | application/vnd.apple.pages, application/vnd.apple.keynote |
| Publishing and design | application/x-indesign, application/vnd.quark.quarkxpress, application/vnd.ms-htmlhelp, application/x-design, application/x-idu |
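To check an asset's MIME type against the searchable types listed above, a lookup along the following lines may be useful. It is a minimal sketch, not part of the product: the names `SEARCHABLE_MIME_TYPES` and `is_searchable_mime_type` are hypothetical, and stripping parameters such as `charset` and ignoring letter case before the comparison are assumptions about how the value might arrive.

```python
# MIME types copied from the table above.
SEARCHABLE_MIME_TYPES = frozenset({
    # Text formats
    "text/plain", "text/html", "text/csv", "text/tab-separated-values",
    "application/xml", "application/xhtml+xml",
    # Microsoft formats
    "application/msword",
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    "application/vnd.ms-word.document.macroenabled.12",
    "application/vnd.ms-excel",
    "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    "application/vnd.ms-powerpoint",
    "application/vnd.openxmlformats-officedocument.presentationml.presentation",
    "application/x-msaccess",
    # Email formats
    "message/rfc822", "application/vnd.ms-outlook",
    # OpenDocument formats
    "application/vnd.oasis.opendocument.text",
    "application/vnd.oasis.opendocument.spreadsheet",
    "application/vnd.oasis.opendocument.presentation",
    # Portable and ebooks
    "application/pdf", "application/rtf", "application/epub+zip",
    "application/x-ibooks+zip",
    # Apple formats
    "application/vnd.apple.pages", "application/vnd.apple.keynote",
    # Publishing and design
    "application/x-indesign", "application/vnd.quark.quarkxpress",
    "application/vnd.ms-htmlhelp", "application/x-design", "application/x-idu",
})


def is_searchable_mime_type(content_type: str) -> bool:
    """Return True if the MIME type appears in the searchable list."""
    # Drop parameters such as "; charset=UTF-8" and compare case-insensitively.
    base_type = content_type.split(";", 1)[0].strip().lower()
    return base_type in SEARCHABLE_MIME_TYPES
```

A listed MIME type only means that extraction is attempted; as noted under Considerations, an individual file may still yield little or no searchable text.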