
Essofore Chat Server

What is Essofore Chat Server?

Essofore Chat Server can be run standalone or as an optional add-on to Essofore Semantic Search. When used as an add-on, the Chat Server layers an LLM (large language model) on top of Essofore Semantic Search: the results of a search query are fed to the Chat Server, which summarizes them in a more conversational form tailored to the user's query. This pattern is known as RAG (retrieval-augmented generation) and can be used, e.g., to do Q&A on top of your internal knowledge base.


Getting Started on AWS

Launch a new Chat Server EC2 instance from the AWS Console using the Chat Server AMI. For the instance type we recommend g4dn.xlarge; you will need an instance type that comes with an NVIDIA GPU supporting CUDA. Open port 8080 so that you can connect to the Chat Server and make requests to it. SSH to the instance as the ubuntu user (not root). The Chat Server should already be running, which you can verify with the command below:

sudo systemctl status chat-server.service
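If the service is running, the output will include a line like Active: active (running), which is standard systemd status output.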

If the server is not running, you can run it using:

sudo systemctl start chat-server.service

To stop a running server run:

sudo systemctl stop chat-server.service

To revise the configuration, edit /opt/chat-server/application.properties:

sudo -u essofore vi /opt/chat-server/application.properties

e.g., you can change llama.modelPath to use another LLM of your choice, so long as it is compatible with llama.cpp. By default the Chat Server ships with an instruction-tuned 8-billion-parameter version of Llama 3. If you do decide to use another model, attach a separate EBS volume to the EC2 instance, as the root volume that comes with the Chat Server has only 8 GB of free space.

To use RAG, you need to give the Chat Server the URL of the Essofore Semantic Search instance from which it can fetch search results. Do that by editing:

app.ragEndpoint=http://<ip-address>:<port>/collections/query

and replacing the IP address and port as necessary.

By default the Chat Server runs on port 8080. This can be changed by editing server.port in the config file.
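To illustrate, a hypothetical application.properties might combine the settings mentioned above as follows (the key names appear on this page; the paths and addresses are placeholders, not defaults):

# placeholder values; adjust for your environment
server.port=8080
# path to a llama.cpp-compatible model, e.g. on an attached EBS volume
llama.modelPath=/data/models/my-model.gguf
# Essofore Semantic Search endpoint used for RAG
app.ragEndpoint=http://10.0.1.25:9000/collections/query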

If you make any changes to the config, you need to stop and start the server for the changes to take effect. Do that by running:

sudo systemctl stop chat-server.service
sudo systemctl start chat-server.service
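Alternatively, systemd's restart verb combines the two:

sudo systemctl restart chat-server.service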

To view the logs run:

journalctl -u chat-server -f

It can take a full 20 minutes for the model to load onto the GPU, as the log timestamps below show:

Jan 06 01:32:49 ip-172-31-42-117 run.sh[731]: llm_load_tensors: ggml ctx size =    0.27 MiB
Jan 06 01:50:55 ip-172-31-42-117 run.sh[731]: llm_load_tensors: offloading 32 repeating layers to GPU

Be patient.

In addition to using sudo systemctl status chat-server.service to check the status of the service, you can run ss -tpln to further confirm that a process is listening on port 8080.
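For example, you can filter the output down to the port in question:

ss -tpln | grep 8080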

GPU usage can be monitored by running nvidia-smi. There is a script, monitor-gpu.sh, in the home directory that runs nvidia-smi every 5 seconds.
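The script is just a convenience; a similar effect can be had with standard tooling, e.g.:

watch -n 5 nvidia-smi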

Depending on the number of users in your organization, you may need to run multiple instances of the Chat Server to cope with the load.

Making requests to the Chat Server

The Chat Server exposes just one endpoint:

POST /completion

which takes the following inputs:

Parameter              Required            Default  Notes
query                  Yes                 -        the query
nPredict               No                  -1       controls the length of the response
temperature            No                  0.7      controls the randomness of the response
rag                    No                  false    whether to use RAG or not
ragK                   No                  5        number of search results to fetch from Essofore Semantic Search
ragRelevanceThreshold  No                  0.65     ignore search results with relevance < threshold
ragCollectionId        Yes if rag is true  -        the collection to search
ragEndpoint            No                  -        can be used to override the endpoint where Essofore Semantic Search is running

and outputs a text stream (text/event-stream) of JSON data:

{
  "text": "..."
}

The text field contains the response to the user's question. Other fields should be ignored and can change at any time.
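Before writing client code, you can try the endpoint with curl. A minimal sketch, assuming the server runs locally on the default port (the -N flag turns off output buffering so streamed events print as they arrive):

curl -N -X POST http://localhost:8080/completion \
  -H 'content-type: application/json' \
  -H 'accept: text/event-stream' \
  -d '{"query": "Who is Sherlock Holmes?", "nPredict": 128}'

Each event arrives as a line of the form data: {"text": "..."}, which the parse helper in the JavaScript example below strips off.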

Example client code using JavaScript [1]:

const completionArgs = {
  cachePrompt: true,
  penalizeNl: true,
  query: "Who is Sherlock Holmes?",
  nPredict: -1,
  temperature: 0.8,
  rag: true,
  ragRelevanceThreshold: 0.65,
  ragK: 5,
  ragCollectionId: 1,
  ragEndpoint: 'http://ip-address:port/collections/query' // optional; use to override the default
};

const url = `http://localhost:8080/completion`; // url where Chat Server is running

const options = {
  method: 'POST',
  headers: {
    'accept': 'text/event-stream',
    'content-type': 'application/json',
  },
  body: JSON.stringify(completionArgs)
};

// a streamed chunk may carry several "data: {...}" events; extract the text from each
const parse = (chunk) => {
  return chunk
    .split('\n')
    .filter(line => line.startsWith('data:'))
    .map(line => JSON.parse(line.slice('data:'.length)).text)
    .join('');
}

fetch(url, options)
  .then(res => res.body?.pipeThrough(new TextDecoderStream()).getReader())
  .then(async (reader) => {
    while (reader) {
      const { done, value } = await reader.read();
      if (done) {
        break;
      }
      process.stdout.write(parse(value));
    }
  })
  .catch(err => console.log(`event: error\ndata: {"message": "${err}"}\n\n`))
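The example assumes a runtime that provides the Fetch API, TextDecoderStream, and process.stdout, e.g. Node.js 18 or newer; in a browser, replace process.stdout.write with your own output handling.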

Help and Support

https://groups.google.com/g/essofore