eXist-db Indexer for Algolia

eXist Indexer for Algolia is a configurable index plug-in for the eXist-db native XML database. It uses eXist's own indexing mechanisms to create, upload and incrementally sync local indexes with Algolia's cloud services.

Example deployment: autocomplete search on http://raskovnik.org

Installation

This README covers the build, manual installation, and general configuration of the plugin.

Build

Requirements: Java 17, sbt.

sbt assembly

The assembly is written to:

target/scala-2.13/exist-algolia-index-assembly-<version>.jar

Manual install

The plugin JAR must be built and then installed into eXist manually.

Build the assembly:
```
sbt assembly
```
Copy the resulting JAR into eXist's plugin/library directory.

Add the Algolia module to conf.xml inside indexer/modules:

<module id="algolia-index"
    class="org.humanistika.exist.index.algolia.AlgoliaIndex"
    application-id="YOUR-ALGOLIA-APPLICATION-ID"
    admin-api-key="YOUR-ALGOLIA-ADMIN-API-KEY"
    batch-size="1000"/>

Add the dependency entry to startup.xml:

<dependency>
    <groupId>org.humanistika.exist.index.algolia</groupId>
    <artifactId>exist-algolia-index</artifactId>
    <version>VERSION_FROM_VERSION_SBT</version>
    <relativePath>exist-algolia-index-assembly-VERSION_FROM_VERSION_SBT.jar</relativePath>
</dependency>

Restart eXist.
Reindex the configured collections so that already-present data is pushed into Algolia. The correct target depends on your own collection structure: reindex the collection or subcollection whose collection.xconf contains the Algolia index configuration.

Configuration

For a single collection in eXist, you can put data into one or more indexes in Algolia. Create an "index" element inside the "algolia" element for each index and give it the name of the Algolia index. If the index doesn't exist in Algolia, it will be created automatically.

For incremental indexing to work, you need two sets of unique ids: one for each document in the collection (documentId) and one for each rootObject (nodeId).

Algolia writes are sent in batches. The global batch-size module attribute defaults to 1000 operations per request. A collection-level <index> can override it with batchSize if a specific Algolia index needs smaller or larger chunks.
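
As a rough illustration of the batching rule described above, here is a minimal Python sketch. It is not the plugin's code (the plugin itself is built with sbt); the function names effective_batch_size and batches are hypothetical, and only the default of 1000 and the "index-level value overrides the module-level value" rule come from this README.

```python
from typing import Iterator, List, Optional, Sequence, TypeVar

T = TypeVar("T")

# Documented module-level default: 1000 operations per request.
DEFAULT_BATCH_SIZE = 1000


def effective_batch_size(module_batch_size: Optional[int],
                         index_batch_size: Optional[int]) -> int:
    """Pick the batch size: index-level batchSize wins over module-level batch-size."""
    if index_batch_size is not None:
        return index_batch_size
    if module_batch_size is not None:
        return module_batch_size
    return DEFAULT_BATCH_SIZE


def batches(operations: Sequence[T], batch_size: int) -> Iterator[List[T]]:
    """Split a sequence of pending write operations into chunks of at most batch_size."""
    if batch_size < 1:
        raise ValueError("batch_size must be a positive integer")
    for start in range(0, len(operations), batch_size):
        yield list(operations[start:start + batch_size])
```

With 2500 pending operations and no override, this yields chunks of 1000, 1000 and 500. An index configured with batchSize="250" would get ten chunks of 250 instead.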

<collection xmlns="http://exist-db.org/collection-config/1.0">
    <index>
        <algolia>
            <namespaceMappings>
                <namespaceMapping>
                    <prefix>xml</prefix>
                    <namespace>http://www.w3.org/XML/1998/namespace</namespace>
                </namespaceMapping>
            </namespaceMappings>
            <index name="my-algolia-index-1" documentId="/path/to/unique-id/@xml:id" visibleBy="/path/to/unique-id" batchSize="1000">
                <rootObject path="/path/to/element" nodeId="@xml:id">
                    <attribute name="f1" path="/further/patha"/>
                    <attribute name="f2" path="/further/pathb" type="integer"/>
                    <object name="other" path="/further/pathc">
                        <map path="/x" type="boolean"/>
                   </object>
                </rootObject>
            </index>
        </algolia>
    </index>
</collection>

An optional visibleBy attribute can be used to restrict data access when searching the Algolia index.

A rootObject is equivalent to an object inside an Algolia index. We create one "rootObject" for each document or, if you specify a path attribute on the rootObject, for each document fragment.

An attribute represents a JSON object attribute (not to be confused with an XML attribute). It is a simple key/value pair that is extracted from the XML and placed into the Algolia object ("rootObject" as we call it). All of the text nodes or attribute values indicated by the "path" on the "attribute" element are serialized to a string, and then converted if you set an explicit "type" attribute.

The path for an "attribute" may point to either an XML element or an XML attribute node. Paths must be simple. You can use namespace prefixes in the path, but you must then also set the namespaceMappings element in the collection.xconf.
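
The "serialize to a string, then convert" step can be pictured with the following Python sketch. It is only an illustration under stated assumptions: the function name convert_attribute_value is hypothetical, only the "integer" and "boolean" type names shown in the example configuration are handled, and the accepted boolean spellings (true/false/1/0) are an assumption rather than something this README specifies. The schema file remains the authority on the supported types.

```python
from typing import Optional, Union


def convert_attribute_value(text: str,
                            type_name: Optional[str] = None) -> Union[str, int, bool]:
    """Convert an already-serialized string value according to an optional type."""
    if type_name is None:
        # No explicit type: the serialized string is used as-is.
        return text

    value = text.strip()
    if type_name == "integer":
        return int(value)
    if type_name == "boolean":
        # Assumption: accept the usual XML Schema boolean spellings.
        lowered = value.lower()
        if lowered in ("true", "1"):
            return True
        if lowered in ("false", "0"):
            return False
        raise ValueError(f"not a boolean: {text!r}")
    raise ValueError(f"type not covered by this sketch: {type_name!r}")
```

For the example configuration, the value found at /further/pathb would pass through the "integer" branch, while the value at /further/patha would stay a string.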

The XML Schema file exist-algolia-index-config.xsd defines and documents the index configuration.

An object represents a JSON object, and this is where things become fun: we serialize the XML node pointed to by the "path" attribute on the "object" element to a JSON equivalent. This allows you to create highly complex and structured objects in the Algolia index from your XML.

The name attribute that is available on the "attribute" and "object" elements allows you to set the name of the field in the JSON object of the Algolia index. This means that the names of your data fields can differ between Algolia and eXist if you wish.

Reindexing Existing Data

Installing or updating the plugin does not by itself upload already-present XML documents to Algolia. After installation, reindex each configured collection in eXist so the configured rootObjects are serialized and pushed to Algolia.

In general:

reindex the full configured collection for a first-time backfill
reindex a narrower subcollection if your deployment replaced only part of the XML corpus and that subcollection has the relevant Algolia collection config
avoid reindexing broad parent collections unless they are the intended scope of the Algolia configuration

Indexing Status

The plugin writes deployment-readable indexing status to algolia-index/status.json under eXist's configured data directory. Status records are keyed by Algolia index and, where a collection is known, by collection path.

Status states:

current: the latest tracked operation for that index or collection completed successfully
degraded: Algolia rejected or failed a terminal operation such as a batch write, document delete, collection delete, or index drop
stale_local_store: the plugin could not derive collection-delete object IDs from the local Algolia store, usually because the collection was removed before a successful backfill created local state

The local and staging helper scripts fail verification when status.json contains degraded or stale_local_store records. Resolve those states before treating a deployment as successful. In practice, check the failure message in status.json and the Algolia/eXist logs, then retry the targeted reindex or run a wider backfill if the local store is missing the needed collection state.
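
If you want a similar gate in your own tooling, the check can be sketched in Python as follows. This is not the helper scripts' implementation. The exact layout of status.json is not described in this README, so the sketch walks the whole JSON document and assumes only that each record stores its state under a key named "state"; that key name and the function names are assumptions.

```python
import json
from pathlib import Path
from typing import Any, Iterator, List

# States that this README says must fail verification.
FAILING_STATES = {"degraded", "stale_local_store"}


def iter_states(node: Any) -> Iterator[str]:
    """Yield every string stored under a "state" key, at any nesting depth."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "state" and isinstance(value, str):
                yield value
            else:
                yield from iter_states(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_states(item)


def find_failing_states(status: Any) -> List[str]:
    """Return the failing states found in parsed status data, sorted."""
    return sorted(s for s in iter_states(status) if s in FAILING_STATES)


def check_status_file(status_path: Path) -> List[str]:
    """Load status.json from disk and return any failing states."""
    with status_path.open(encoding="utf-8") as handle:
        return find_failing_states(json.load(handle))
```

A deployment step could call check_status_file on algolia-index/status.json and stop when the returned list is not empty.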

Live vs local-store sync

The plugin's incremental reindex path computes diffs from the local store under algolia-index/indexes/. That means a normal xmldb:reindex(...) is not a guaranteed recovery path once the local store and live Algolia have drifted apart.

The failure mode that the verification and replay commands guard against is simple:

the local store still contains the full object set for a collection tree
live Algolia has silently lost some of those objects
a normal reindex trusts the local snapshot, emits only small diffs, and leaves the live loss in place
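
The following toy Python model shows why a diff computed against the local snapshot cannot repair that kind of loss. It is a deliberately simplified sketch, not the plugin's diff logic: objects are reduced to an objectID mapped to a content string, and the name incremental_diff is hypothetical.

```python
from typing import Dict, Set, Tuple


def incremental_diff(local_snapshot: Dict[str, str],
                     reindexed: Dict[str, str]) -> Tuple[Set[str], Set[str]]:
    """Return (objectIDs to upsert, objectIDs to delete) relative to the local snapshot."""
    upserts = {oid for oid, body in reindexed.items()
               if local_snapshot.get(oid) != body}
    deletes = set(local_snapshot) - set(reindexed)
    return upserts, deletes


# The local store believes all three objects were published.
local_snapshot = {"a": "v1", "b": "v1", "c": "v1"}
# Live Algolia has silently lost "b" and "c".
live = {"a": "v1"}
# A reindex re-serializes the XML; only "c" actually changed.
reindexed = {"a": "v1", "b": "v1", "c": "v2"}

upserts, deletes = incremental_diff(local_snapshot, reindexed)

# Apply the small diff to the live index.
live.update({oid: reindexed[oid] for oid in upserts})
for oid in deletes:
    live.pop(oid, None)

# "b" is unchanged relative to the snapshot, so it is never re-sent.
still_missing = set(reindexed) - set(live)
```

In this model only "c" is re-sent, so "b" stays missing from the live index even though the reindex completed without errors.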

Use the explicit verification command to compare exact objectID sets for one collection tree:

./scripts/exist-local.sh verify-collection-sync /db/apps/raskovnik-data/data/MBRT.RDG
./scripts/exist-stage.sh verify-collection-sync /db/apps/raskovnik-data/data/MBRT.RDG
./scripts/exist-production-hotpatch.sh verify-collection-sync /db/apps/raskovnik-data/data/MBRT.RDG

The check reads the latest local-store snapshot per document directory, filters by exact collection-tree membership, browses the live Algolia index, and compares exact objectID sets. It reports:

local count
live count
missing-in-live count
unexpected-in-live count
wrong-path live count
per-collection live counts
small sample mismatches
a sync classification and the blast radius for a safe replay
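
The core of such a comparison is plain set arithmetic. The Python sketch below is an illustration only, with hypothetical function names. It covers the collection-tree membership test and the missing/unexpected sets, and leaves out the wrong-path, per-collection and classification parts of the report because their exact rules are not described here.

```python
from typing import Dict, Iterable, List, Union


def in_collection_tree(path: str, root: str) -> bool:
    """True if path is the root collection itself or lies below it.

    Comparing against root + "/" keeps a sibling such as /db/data/X.2
    from matching the tree rooted at /db/data/X.
    """
    root = root.rstrip("/")
    return path == root or path.startswith(root + "/")


def compare_object_ids(local_ids: Iterable[str],
                       live_ids: Iterable[str]) -> Dict[str, Union[int, List[str]]]:
    """Compare exact objectID sets and summarize the differences."""
    local, live = set(local_ids), set(live_ids)
    return {
        "local_count": len(local),
        "live_count": len(live),
        "missing_in_live": sorted(local - live),
        "unexpected_in_live": sorted(live - local),
    }
```

A collection tree counts as synced in this sketch when both missing_in_live and unexpected_in_live are empty.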

For deeper read-only diagnosis:

./scripts/exist-local.sh inspect-collection-sync /db/apps/raskovnik-data/data/MBRT.RDG
./scripts/exist-stage.sh inspect-collection-sync /db/apps/raskovnik-data/data/MBRT.RDG
./scripts/exist-production-hotpatch.sh inspect-collection-sync /db/apps/raskovnik-data/data/MBRT.RDG

If the local store is healthy and live Algolia drifted or lost records, prefer the safe replay flow:

./scripts/exist-local.sh replay-collection-live /db/apps/raskovnik-data/data/MBRT.RDG
./scripts/exist-stage.sh replay-collection-live /db/apps/raskovnik-data/data/MBRT.RDG
./scripts/exist-production-hotpatch.sh replay-collection-live /db/apps/raskovnik-data/data/MBRT.RDG

If live/local sync is already clean but status.json is stale or incomplete, refresh the status file directly:

./scripts/exist-local.sh refresh-indexing-status /db/apps/raskovnik-data/data/MBRT.RDG
./scripts/exist-stage.sh refresh-indexing-status /db/apps/raskovnik-data/data/MBRT.RDG
./scripts/exist-production-hotpatch.sh refresh-indexing-status /db/apps/raskovnik-data/data/MBRT.RDG

reconcile-collection remains available only as an explicit exceptional fallback. It mutates the local store, is disabled by default for production, and should not be the routine next step after divergence.

When intentionally used, reconcile:

verifies the collection first
no-ops when already synced unless --force is used
quarantines matching local-store document directories to algolia-index/quarantine/<index>/<timestamp>__<collection-slug>/
runs xmldb:reindex(<collection-path>)
re-verifies until the live/local sets match or the command times out
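
The order of those steps can be summarized in a short Python sketch. This is a control-flow illustration only, not the script itself: verify, quarantine and reindex are hypothetical callables supplied by the caller, and the timeout and polling defaults are placeholder assumptions.

```python
import time
from typing import Callable


def reconcile(verify: Callable[[], bool],
              quarantine: Callable[[], None],
              reindex: Callable[[], None],
              *,
              force: bool = False,
              timeout_seconds: float = 600.0,   # assumed value
              poll_seconds: float = 5.0,        # assumed value
              clock: Callable[[], float] = time.monotonic,
              sleep: Callable[[float], None] = time.sleep) -> str:
    """Run verify, quarantine, reindex, then re-verify until synced or timed out."""
    # Step 1 and 2: verify first, and do nothing if already synced (unless forced).
    if verify() and not force:
        return "already-synced"

    # Step 3: move the matching local-store directories aside.
    quarantine()
    # Step 4: trigger the reindex of the collection.
    reindex()

    # Step 5: re-verify until the sets match or the deadline passes.
    deadline = clock() + timeout_seconds
    while True:
        if verify():
            return "synced"
        if clock() >= deadline:
            return "timed-out"
        sleep(poll_seconds)
```

Passing clock and sleep as parameters keeps the sketch easy to exercise without real waiting.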

Those quarantine backups are intentional forensic evidence. Keep them until you have confirmed the recovery and no longer need the old snapshots.

Incident note

The motivating example was a staging GE.RKMD incident where the local store still represented the full collection tree, live Algolia no longer did, and an ordinary reindex produced only small diffs instead of a full republish. The initiating record loss remains unknown; the new commands close the persistence gap that let that bad state survive verification.

Limiting Object Access

You can limit data access by setting the visibleBy attribute in collection.xconf and mapping it to the corresponding path in your XML data, preferably in the document header.

See the test fixture examples:

XML: VSK.TEST.xml
Configuration: collection.xconf

Enable logging in eXist (optional)

You can see what we are sending to Algolia by adding the following to your $EXIST_HOME/log4j2.xml file:

Add this as a child of the <Appenders> element:

<RollingRandomAccessFile name="algolia.index"
        filePattern="${logs}/algolia-index.${rollover.file.pattern}.log.gz"
        fileName="${logs}/algolia-index.log">
    <Policies>
        <SizeBasedTriggeringPolicy size="${rollover.max.size}"/>
    </Policies>
    <DefaultRolloverStrategy max="${rollover.max}"/>
    <PatternLayout pattern="${exist.file.pattern}"/>
</RollingRandomAccessFile>

And add this as a child of the <Loggers> element:

<Logger name="org.humanistika.exist.index.algolia" additivity="false" level="trace">
    <AppenderRef ref="algolia.index"/>
</Logger>

The log output will then appear in eXist's configured log directory, usually logs/algolia-index.log under the active eXist home or container layout, the next time eXist is started.

Current limitations

When you back up eXist, you should also back up the algolia-index directory inside eXist's configured data directory, because it holds the local representation of what is stored on the remote Algolia server. That now includes the algolia-index/quarantine/ subtree used by explicit reconcile runs. Support for integrating that local store into a native backup/restore workflow may be added later.
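
Until such integration exists, the directory can be archived alongside your regular eXist backup. The Python sketch below is one possible approach, not a supported tool: the function name backup_algolia_store is hypothetical, and you need to pass in your own data directory and archive location.

```python
import tarfile
from pathlib import Path


def backup_algolia_store(data_dir: Path, archive_path: Path) -> Path:
    """Write a gzip-compressed tar archive of <data_dir>/algolia-index."""
    store = data_dir / "algolia-index"
    if not store.is_dir():
        raise FileNotFoundError(f"no algolia-index directory under {data_dir}")

    with tarfile.open(str(archive_path), "w:gz") as archive:
        # Adding the directory is recursive, so the quarantine/ subtree
        # is included as well.
        archive.add(str(store), arcname="algolia-index")
    return archive_path
```

The archive keeps algolia-index as its top-level entry, so it can be unpacked directly into a data directory when restoring.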

Acknowledgements

Hats off to Adam Retter for sharing his superb programming skills with us in this project.

This tool was developed in the context of ongoing work at BCDH, including Raskovnik, a Serbian dictionary platform built together with the Institute of Serbian Language.

bcdh / exist-algolia-index 1.0