
Essofore Semantic Search

What is Essofore?

Essofore is a document store powered by a semantic search engine that understands the meaning of your query rather than matching keywords. Supported file formats include .txt, .pdf, .html, .docx, .ppt, and .xlsx.


Getting Started on AWS

Begin by provisioning a VM and choosing the Essofore AMI. A summary of the required steps:

  1. Provision a VM with RAM equal to roughly 10x the size of your dataset as measured on disk (see below). We recommend the x2gd instance type, or r6g if x2gd is not available in your region.
  2. Attach a secondary EBS volume on which you will store your data; the boot disk is only 8 GB in size. See this and this for how to do it, and the following links for the difference between EBS and instance store volumes [1, 2, 3].
  3. Open port 22 so you can ssh to the VM, and port 8080 so a client machine can connect to the Essofore Webserver (see the boto3 sketch below).
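
Step 3 can be done from the EC2 console by editing the instance's security group; if you prefer to script it, below is a minimal sketch using boto3. The security group ID and CIDR range are placeholders you will need to replace.

import boto3

ec2 = boto3.client("ec2")
# Allow ssh (22) and the Essofore Webserver (8080) from your client network.
ec2.authorize_security_group_ingress(
    GroupId="sg-0123456789abcdef0",   # placeholder: the VM's security group
    IpPermissions=[
        {"IpProtocol": "tcp", "FromPort": 22, "ToPort": 22,
         "IpRanges": [{"CidrIp": "203.0.113.0/24"}]},   # placeholder CIDR
        {"IpProtocol": "tcp", "FromPort": 8080, "ToPort": 8080,
         "IpRanges": [{"CidrIp": "203.0.113.0/24"}]},
    ],
)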

Essofore is an in-memory datastore, so make sure you provision a VM with enough RAM to hold your dataset. As a rough estimate, the RAM should be 10 times the size of your dataset on disk (UTF-8 encoded). This is a conservative estimate and you may be able to get away with less. For example, a 3 GB corpus calls for roughly 30 GiB of RAM, i.e., an x2gd.large or r6g.xlarge (32 GiB each). Although it may sound like a lot, for context the entire works of Shakespeare take up less than 6 MB of disk space. Amazon EC2 X2gd instances offer the lowest cost per GiB of memory on EC2 for memory-intensive applications; your other best option is r6g.

Launch the VM and ssh to it as ec2-user (not root):

ssh ec2-user@...

The Essofore Webserver should already be running on the VM and you can verify it by running:

sudo systemctl status essofore.service

If the server is not running, you can start it explicitly (see below). Before we do that, let's go over a few housekeeping items. By default Essofore is set up to save data and logs in /opt/essofore/data and /opt/essofore/logs respectively. You can change these locations to the secondary EBS volume by editing /opt/essofore/bin/application.properties as shown below (change the placeholders as necessary):

essofore.rootDir=/path/to/data
logging.file.name=/path/to/logs/essofore.log

You also need to make essofore the owner of the above two directories. Do that by running:

sudo chown -R essofore:essofore /path/to/data
sudo chown -R essofore:essofore /path/to/logs

You are now set. Restart the server for the changes to take effect (you can skip this if you have not changed any of the default settings). Do that by running:

sudo systemctl stop essofore.service
sudo systemctl start essofore.service

Inspect the new path(s) to ensure they are being used. Once the service is running, you can view the logs by running:

journalctl -u essofore -f

or inspect the log files directly in the log directory. Run ss -tpln and verify there is a process listening on port 8080 for incoming connections before moving on to the next step.

REST API

By default Essofore runs a Spring Boot webserver on port 8080. You can change the port by setting server.port in the application.properties file we encountered earlier. All operations are handled through a REST API conforming to an OpenAPI specification. The API can be browsed at http://localhost:8080/swagger-ui/index.html, where you can also perform operations interactively against the datastore. The OpenAPI specification can be accessed at http://localhost:8080/v3/api-docs.
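
As a quick sanity check from the VM itself, the sketch below fetches the OpenAPI specification and prints the HTTP status code. It assumes Python 3 and the requests package are available on the VM.

import requests

# Fetch the OpenAPI spec from the local Essofore Webserver; a 200 means the API is up.
resp = requests.get("http://localhost:8080/v3/api-docs", timeout=5)
print(resp.status_code)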

To access the UI from a web browser like Chrome on your local computer (laptop etc.), you will need to allow HTTP traffic to/from port 8080 on the VM. Refer to the AWS documentation on how to do that. An alternative is to put the Essofore Webserver behind an nginx instance; below is a sample nginx config to make that possible. We recommend you skip this exercise for now (especially if you are not familiar with nginx or you are running Essofore for the first time) and come back to it later if needed.

location ~ ^/api(/|$) {
    proxy_pass http://127.0.0.1:8080;  # Forward requests to Essofore
    proxy_http_version 1.1;
    proxy_set_header Upgrade $http_upgrade;
    proxy_set_header Connection 'upgrade';
    proxy_set_header Host $host;
    proxy_cache_bypass $http_upgrade;
    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    proxy_set_header X-Forwarded-Proto https;
    proxy_set_header X-Forwarded-Prefix /api/;
}

With the above config, all requests that start with /api will be forwarded by nginx to http://127.0.0.1:8080 where Essofore is running. You should also edit Essofore's application.properties to include the following:

springdoc.swagger-ui.path=/api/swagger-ui.html
springdoc.api-docs.path=/api/v3/api-docs
openapi.essoforeOpenAPIDefinition.base-path=/api
server.forward-headers-strategy=NATIVE
server.use-forward-headers=true

The proxy_set_header X-Forwarded-Proto https directive is needed if you will be accessing the Swagger UI via https instead of http. The Swagger UI can now be accessed at http(s)://public-ip-address-of-vm/api/swagger-ui.html from a web browser on your local computer (assuming you have allowed http/https traffic to/from the VM on the default ports 80 and 443 respectively). The above is only a partial configuration showing the essential directives; the full config is omitted for brevity, since nginx is not the focus of this manual.

You can disable the Swagger UI on production instances if needed for security by setting the following properties in application.properties:

springdoc.api-docs.enabled=false
springdoc.swagger-ui.enabled=false

Installing Python Client

In a real-world setup you would use Essofore just like you use a database such as MySQL or PostgreSQL: there is a server instance (which we covered running earlier) and one or more backend services making calls to it through the REST API. Here we show how to make calls to the server you provisioned earlier using Python.

Begin by provisioning a separate VM designed to mimic your backend service, and open port 8080 on the server so this client machine can reach it. Next, download the client code to the client VM. We recommend using an Ubuntu AMI over Amazon Linux for the client VM. The steps below are for Ubuntu (see this for how to install pipx and poetry on Amazon Linux):

git clone https://github.com/essofore/python-client.git

or if you don't have git you can do:

wget https://github.com/essofore/python-client/archive/refs/heads/main.zip
unzip main.zip

Before we can start using the client, we need to install a few prerequisites. Always refer to the official documentation for the latest instructions.

  1. Install Python. It should already be pre-installed, but check.

  2. Install pipx

sudo apt-get update
sudo apt install pipx
pipx ensurepath
exec -l $SHELL
  3. Install poetry (version 2.x is the latest at the time of this writing). Poetry should always be installed in a dedicated virtual environment to isolate it from the rest of your system.
pipx install poetry

pipx will automatically create an isolated virtual environment and install poetry in it. Refresh your shell:

exec -l $SHELL

Verify:

$ pipx list
venvs are in /home/siddjain/.local/pipx/venvs
apps are exposed on your $PATH at /home/siddjain/.local/bin
package poetry 2.1.3, installed using Python 3.10.12
- poetry
  4. Now cd to the cli directory and from there run poetry install:
cd cli
poetry install

Note that we have not activated any virtual environment while running poetry install. Poetry will create a virtual environment for you when you run poetry install if one does not exist.

This finishes installing the prerequisites and the Python client. The steps in this section do not have to be repeated. If you sync to a new version of the client you will have to re-run poetry install to pick up any new dependencies, but other than that everything in this section is a one-time setup.

Tip: run poetry show to list all the packages poetry has installed in the virtual environment.

Python CLI

  1. cd to the cli directory and from there activate the virtual environment - if it's not already activated - by running:
eval $(poetry env activate)

poetry env activate gives the command to activate the virtual environment. eval runs (evaluates) that command.

  2. cd to the cli/cli subdirectory and from there run:
python cli.py --host http://ip-address:8080

If all goes well, you should see a prompt like following:

Essofore v0.1.0. Type help or ? to list commands.
>>

What you have done here is connected to the server using a CLI just like you do with MySQL or PostgreSQL command-line clients. Let the games begin!

Inspecting the catalog

Begin by inspecting the database:

inspect catalog

You should see an empty result. Next, let us create a collection - a collection is a container that stores related documents. For example, we can create a collection to store all the Sherlock Holmes books. Do this by running:

Creating a collection

create collection --collection_id 1 --title "Sherlock Holmes"

Great! Now let's add some documents to the collection. You can try adding your own documents, or download the works of Arthur Conan Doyle (the creator of Sherlock Holmes) from Project Gutenberg into a subdirectory ./data/sherlock-holmes:

wget https://www.gutenberg.org/cache/epub/1661/pg1661.txt
wget https://www.gutenberg.org/cache/epub/244/pg244.txt
wget https://www.gutenberg.org/cache/epub/2852/pg2852.txt
wget https://www.gutenberg.org/cache/epub/2097/pg2097.txt
wget https://www.gutenberg.org/cache/epub/834/pg834.txt
wget https://www.gutenberg.org/cache/epub/3289/pg3289.txt
wget https://www.gutenberg.org/cache/epub/108/pg108.txt
wget https://www.gutenberg.org/cache/epub/2350/pg2350.txt
wget https://www.gutenberg.org/cache/epub/69700/pg69700.txt

Now let's add them to the database. Do that by running the following commands, modifying paths as necessary:

Adding documents to a collection

upload document --collection_id 1 --document_id pg3289 --title "The Valley of Fear" --file ./data/sherlock-holmes/pg3289.txt --doc_type txt
upload document --collection_id 1 --document_id pg1661 --title "The Adventures of Sherlock Holmes" --file ./data/sherlock-holmes/pg1661.txt --doc_type txt
upload document --collection_id 1 --document_id pg108 --title "The Return of Sherlock Holmes" --file ./data/sherlock-holmes/pg108.txt --doc_type txt
upload document --collection_id 1 --document_id pg2097 --title "The Sign of the Four" --file ./data/sherlock-holmes/pg2097.txt --doc_type txt
upload document --collection_id 1 --document_id pg2350 --title "His Last Bow" --file ./data/sherlock-holmes/pg2350.txt --doc_type txt
upload document --collection_id 1 --document_id pg2852 --title "The Hound of the Baskervilles" --file ./data/sherlock-holmes/pg2852.txt --doc_type txt
upload document --collection_id 1 --document_id pg834 --title "The Memoirs of Sherlock Holmes" --file ./data/sherlock-holmes/pg834.txt --doc_type txt
upload document --collection_id 1 --document_id pg69700 --title "The Case Book of Sherlock Holmes" --file ./data/sherlock-holmes/pg69700.txt --doc_type txt
upload document --collection_id 1 --document_id pg244 --title "A Study in Scarlet" --file ./data/sherlock-holmes/pg244.txt --doc_type txt

These commands are also available in cli/cli/sherlock-holmes.txt, and the CLI has a nifty feature that allows you to run all the commands in a text file like so:

playback sherlock-holmes.txt

Try querying the catalog once again and see what you get:

inspect catalog

Searching Documents in a Collection

We can now execute search queries against the documents in a collection. Do that by running the following commands:

search collection --collection_id 1 --query "Who is Sherlock Holmes?"
search collection --collection_id 1 --query "Who is Dr. Watson?"
search collection --collection_id 1 --query "How are Sherlock Holmes and Dr. Watson related?"
search collection --collection_id 1 --query "Where does Sherlock Holmes live?"
search collection --collection_id 1 --query "How old is Sherlock Holmes?"

By default the top 5 results are fetched from the database. This can be changed via the -k parameter to the CLI.

Python API

Below is the Python code that accomplishes the steps we have performed so far:

  1. create collection
  2. add documents
  3. search
# `client` is assumed to be an essofore_client client instance configured with the
# server's base URL (e.g. http://ip-address:8080); Blob and print_search_result are
# helpers provided with the client code.
from essofore_client.api.collections import create_collection, upload_document, search
from essofore_client.models.document_type import DocumentType

# 1. create collection
create_collection.sync_detailed(collection_id="1", title="Sherlock Holmes", client=client)

# 2. add documents
with open('./data/sherlock-holmes/pg3289.txt', 'rb') as f:
    upload_document.sync_detailed(client=client,
                                  collection_id="1",
                                  document_id="pg3289",
                                  title="The Valley of Fear",
                                  doc_type=DocumentType.TXT,
                                  source_url="https://www.gutenberg.org/cache/epub/3289/pg3289.txt",
                                  body=Blob(f))

# 3. search
response = search.sync_detailed(client=client, collection_id="1", q="Who is Sherlock Holmes?", k=5)
for r in response.parsed:
    print_search_result(r)

In a real-world implementation, the line with open('./data/sherlock-holmes/pg3289.txt', 'rb') as f: would be replaced by a call that fetches the document from S3 or whatever blob storage your actual data lives in (e.g., SharePoint, Confluence, etc.). Note that you do not have to store a copy of the document on your client before uploading it to the server: you open a binary stream and pipe it to the server.
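
For example, here is a minimal sketch of streaming a document straight from S3, assuming boto3 and a hypothetical bucket and key; the S3 response body is a file-like object, so it can be passed in place of a local file handle:

import boto3

s3 = boto3.client("s3")
# Hypothetical bucket/key; the object is streamed, not written to local disk.
obj = s3.get_object(Bucket="my-company-docs", Key="sherlock-holmes/pg3289.txt")
upload_document.sync_detailed(client=client,
                              collection_id="1",
                              document_id="pg3289",
                              title="The Valley of Fear",
                              doc_type=DocumentType.TXT,
                              body=Blob(obj["Body"]))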

Ingesting Public URLs

For documents available at public URLs, there is a shortcut: instead of uploading the file yourself, you give the server the public URL and it downloads the document. We show how to do that now. First, create a new collection to store the works of Shakespeare:

create collection --collection_id 2 --title Shakespeare

And now add to the collection by downloading from the public URL:

download document --collection_id 2 --document_id pg100 --title "Complete Works of William Shakespeare" --source_url https://www.gutenberg.org/cache/epub/100/pg100.txt --doc_type TXT

The Python code would look like:

from essofore_client.api.documents import download_document
from essofore_client.models.download_document_metadata import DownloadDocumentMetadata
...
response = download_document.sync_detailed(client=client,
                                           collection_id=collection_id,
                                           document_id=document_id,
                                           title=title,
                                           source_url=source_url,
                                           doc_type=doc_type,
                                           metadata=metadata)

Exit the CLI by typing exit or quit when you are done. Also remember to deactivate the poetry virtual environment when you are done.

Exploring Further

Type help to see a list of all the commands available in the CLI:

>> help

Documented commands (type help <topic>):
========================================
create_collection    help                 quit                 update_collection
delete_collection    inspect_catalog      record               update_document
delete_document      inspect_collection   rename_collection    upload_document
download_document    inspect_document     rename_document
exit                 playback             search_collection

There are commands to remove documents from a collection and to delete an entire collection. You can also rename documents and collections, but you cannot move a document from one collection to another in a single command; you have to do that via a combination of commands, as sketched below.
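
For example, a document can be "moved" by re-uploading it to the target collection and then deleting it from the source. Below is a minimal sketch using the Python client; the delete_document import path is an assumption based on the other generated endpoint modules, and the same thing can be done with the upload and delete document commands in the CLI.

from essofore_client.api.documents import delete_document  # assumed module path

# Re-upload the document to the target collection (collection 2)...
with open('./data/sherlock-holmes/pg2350.txt', 'rb') as f:
    upload_document.sync_detailed(client=client,
                                  collection_id="2",
                                  document_id="pg2350",
                                  title="His Last Bow",
                                  doc_type=DocumentType.TXT,
                                  body=Blob(f))
# ...then delete it from the source collection (collection 1).
delete_document.sync_detailed(client=client, collection_id="1", document_id="pg2350")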

Study the source code of the commands under cli/cli/commands to understand how the CLI uses the Python client library essofore-client to communicate with the database. You can write clients in any programming language you like; you are not limited to Python, since all communication with the server happens over a REST API.

API Reference

Currently, the Essofore REST API consists of the following 13 endpoints. All of them are accessible in the Swagger-UI Playground at http://localhost:8080/swagger-ui.html. The best way to understand how to make calls to the endpoints from a client is to study the code in cli/cli/commands of the Python CLI or to use the interactive Swagger-UI Playground.

GET /collections/{collectionId}

Get list of documents in a collection.

PUT /collections/{collectionId}

Update collection metadata. Due to this bug, you will not see a field for metadata in the Swagger UI, but the actual REST endpoint does support updating the metadata (i.e., the issue is only with the UI).

DELETE /collections/{collectionId}

Delete a collection.

GET /collections

Get list of all the collections in the database.

POST /collections

Create a new collection.

POST /collections/rename

Rename a collection (change its ID).

GET /collections/query

Search the documents in a collection. Among other things, this endpoint returns a distance measure with every search result, which should be used to determine the result's relevance: relevance = 1 - distance (higher is better). You should typically aim for search results with distance < 0.35, i.e., relevance > 0.65.
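
For example, continuing the Python snippet from earlier, you could keep only the sufficiently relevant results. This assumes each parsed result exposes a distance field, matching the distance measure described above.

# Keep only results below the recommended distance threshold of 0.35.
relevant = [r for r in response.parsed if r.distance < 0.35]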

GET /documents/{collectionId}/{documentId}

Get information about a document.

PUT /documents/{collectionId}/{documentId}

Change or update metadata associated with a document. Due to this bug, you will not see a field for metadata in the Swagger UI, but the actual REST endpoint does support updating the metadata (i.e., the issue is only with the UI).

DELETE /documents/{collectionId}/{documentId}

Delete a document.

POST /documents/upload

Upload a document to a collection.

POST /documents/download

Download a document from a public URL into a collection.

POST /documents/rename

Rename a document (change its ID).

Help and Support

https://groups.google.com/g/essofore