Essofore Chat Server
What is Essofore Chat Server?
Essofore Chat Server can be run standalone or as an optional add-on to Essofore Semantic Search. When used as an add-on, Essofore Chat Server layers an LLM (large language model) on top of Essofore Semantic Search: the results of a search query from Essofore Semantic Search are fed to the Chat Server, which then summarizes them in a more conversational form tailored to the user's query. This pattern is known as RAG (retrieval-augmented generation) and can be used, for example, to do Q&A on top of your internal knowledge base.
Getting Started on AWS
Launch a new Chat Server EC2 instance from the AWS Console using the Chat Server AMI.
For the instance type, we recommend g4dn.xlarge. You will need an instance type that comes with an NVIDIA GPU that supports CUDA.
Open port 8080 in the instance's security group so that you can connect to the Chat Server and make requests to it.
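If you prefer the AWS CLI over the console, the equivalent steps look roughly like the sketch below. The AMI ID, key pair, and security group ID are placeholders to substitute with your own, and you may want to restrict the ingress CIDR to your own network rather than 0.0.0.0/0:
# launch a g4dn.xlarge instance from the Chat Server AMI (placeholder IDs)
aws ec2 run-instances \
  --image-id ami-xxxxxxxxxxxxxxxxx \
  --instance-type g4dn.xlarge \
  --key-name my-key-pair \
  --security-group-ids sg-xxxxxxxxxxxxxxxxx
# open port 8080 on the security group
aws ec2 authorize-security-group-ingress \
  --group-id sg-xxxxxxxxxxxxxxxxx \
  --protocol tcp \
  --port 8080 \
  --cidr 0.0.0.0/0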
SSH to the instance as the ubuntu user (not root). The Chat Server should already be running; you can verify this using the command below:
sudo systemctl status chat-server.service
If the server is not running, you can run it using:
sudo systemctl start chat-server.service
To stop a running server run:
sudo systemctl stop chat-server.service
To revise the configuration, edit /opt/chat-server/application.properties:
sudo -u essofore vi /opt/chat-server/application.properties
For example, you can change llama.modelPath to point to another LLM of your choice, so long as the model is compatible with llama.cpp.
By default, the Chat Server ships with an instruction-tuned 8-billion-parameter version of Llama 3.
If you do decide to use another model, attach a separate EBS volume to the EC2 instance: the root volume that comes with the Chat Server has only 8 GB of free space.
To use RAG, you need to give the Chat Server the URL of Essofore Semantic Search, from which it fetches search results. Do that by editing:
app.ragEndpoint=http://<ip-address>:<port>/collections/query
and replacing the ip-address and port as necessary.
By default the Chat Server runs on port 8080. This can be changed by editing server.port in the config file.
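Putting these together, the relevant portion of application.properties might look like the sketch below. The property names are the ones described above; the values are illustrative, not shipped defaults (note that llama.cpp loads models in GGUF format):
# illustrative values, not shipped defaults
llama.modelPath=/mnt/models/my-model.gguf
server.port=8080
app.ragEndpoint=http://10.0.0.5:8080/collections/query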
If you make any changes to the config, you need to restart the server for the changes to take effect. Do that by running:
sudo systemctl stop chat-server.service
sudo systemctl start chat-server.service
To view the logs run:
journalctl -u chat-server -f
It can take roughly 20 minutes for the model to load onto the GPU, as the log timestamps below show:
Jan 06 01:32:49 ip-172-31-42-117 run.sh[731]: llm_load_tensors: ggml ctx size = 0.27 MiB
Jan 06 01:50:55 ip-172-31-42-117 run.sh[731]: llm_load_tensors: offloading 32 repeating layers to GPU
Be patient.
In addition to using sudo systemctl status chat-server.service to check the status of the service, you can run ss -tpln to confirm that a process is listening on port 8080.
GPU usage can be monitored by running nvidia-smi. There is a script, monitor-gpu.sh, in the home directory that runs nvidia-smi every 5 seconds.
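The script's exact contents may vary; a one-line equivalent you can run yourself (assuming the standard watch utility is installed) is:
watch -n 5 nvidia-smi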
Depending on the number of users in your organization, you may need to run multiple instances of the Chat Server to cope with the load.
Making requests to the Chat Server
The Chat Server exposes a single endpoint:
POST /completion
which takes the following inputs:
| Parameter | Required | Default | Notes |
|---|---|---|---|
| query | Yes | - | the user's question or prompt |
| nPredict | No | -1 | maximum length of the response in tokens (-1 for no limit) |
| temperature | No | 0.7 | controls the randomness of the response |
| rag | No | false | whether to use RAG or not |
| ragK | No | 5 | number of search results to fetch from Essofore Semantic Search |
| ragRelevanceThreshold | No | 0.65 | ignore search results with relevance < threshold |
| ragCollectionId | Yes, if rag is true | - | the collection to search |
| ragEndpoint | No | - | overrides the configured app.ragEndpoint for this request |
and outputs a text stream (text/event-stream) of JSON data:
{
  "text": "..."
}
The text field contains the response to the user's question. Other fields should be ignored and can change at any time.
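For a quick smoke test, you can call the endpoint with curl. This assumes the server is reachable at localhost on the default port; -N turns off output buffering so tokens appear as they stream in:
curl -N http://localhost:8080/completion \
  -H 'Content-Type: application/json' \
  -H 'Accept: text/event-stream' \
  -d '{"query": "Who is Sherlock Holmes?", "rag": true, "ragCollectionId": 1}'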
Example client code using JavaScript [1]:
const completionArgs = {
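// extra generation options accepted by the server (not listed in the parameter table above)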
cachePrompt: true,
penalizeNl: true,
query: "Who is Sherlock Holmes?",
nPredict: -1,
temperature: 0.8,
rag: true,
ragRelevanceThreshold: 0.65,
ragK: 5,
ragCollectionId: 1,
ragEndpoint: 'http://ip-address:port/collections/query' // optional; use to override the default
};
const url = `http://localhost:8080/completion`; // url where Chat Server is running
const options = {
method: 'POST',
headers: {
'accept': 'text/event-stream',
'content-type': 'application/json',
},
body: JSON.stringify(completionArgs)
};
const parse = (data) => {
// a chunk may contain several SSE events; collect the text from each data line
return data.split('\n')
.filter(line => line.startsWith('data:'))
.map(line => {
try { return JSON.parse(line.slice(5)).text; } // 'data:'.length === 5
catch { return ''; } // a payload may be split across chunks; skip partial lines
})
.join('');
}
fetch(url, options)
.then(res => res.body?.pipeThrough(new TextDecoderStream()).getReader())
.then(async (reader) => {
while(reader) {
const { done, value } = await reader.read();
if (done) {
break;
}
process.stdout.write(parse(value));
}
})
.catch(err => console.error(`request failed: ${err}`))
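This snippet runs unmodified under Node 18 or newer, where fetch, TextDecoderStream, and the rest of the Web Streams API are available as globals; process.stdout.write prints each chunk of the response as it arrives instead of waiting for the full completion.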