<aside>
Purpose
Zentry AI Assistant aims to deliver a real-time, multilingual, and human-like voice assistant optimized for telephony and institutional automation. Its core purpose is to streamline interactions such as reception calls, helpline responses, and localized support by combining efficient STT, lightweight reasoning, and conversational output.
</aside>
<aside>
Scope
The project focuses on integrating FreeSWITCH for call handling, Whisper (CTranslate2) for high-accuracy speech-to-text, and Phi-3 Mini with RAG for reasoning. It is extensible with Meta MMS models for multilingual support and future TTS integration, ensuring adaptability across education, healthcare, and enterprise use cases. A rough sketch of the STT leg appears just after this callout.
</aside>
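As an illustration only (not the project's actual code), here is a minimal sketch of the Whisper-on-CTranslate2 STT step, assuming the faster-whisper wrapper; the model size, audio file name, and language code are placeholders:

```python
# Minimal sketch of the STT leg using faster-whisper, the CTranslate2
# backend for Whisper. Model size and file path are placeholders.
from faster_whisper import WhisperModel

# int8 quantization keeps the model light enough for edge/CPU deployment
model = WhisperModel("medium", device="cpu", compute_type="int8")

segments, info = model.transcribe("call_audio.wav", language="ml", beam_size=5)
print(f"Detected language: {info.language} (p={info.language_probability:.2f})")
for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
```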
<aside>
Key Deliverables
</aside>
<aside>
Background
Voice assistants are often cloud-reliant, expensive, and lack support for local languages. Zentry addresses these gaps by building a fully open-source, lightweight, and edge-deployable system optimized for real-world conditions like noisy phone calls. It leverages proven speech models and fast inference pipelines to provide a privacy-first and highly accurate solution.
</aside>
<aside>
Team
</aside>
<aside>
Milestone Schedule
| Date | Milestone |
| --- | --- |
| May 21 | Set up Asterisk; realised a VM is a real deal |
| May 23 | Surveyed various STT models; checked Whisper Large v3 (too poor on Malayalam) |
| May 24 | Tried vrlsc/whisper medium (good), then found thennal/whisper-medium-ml, a fine-tuned model with a WER of 11 on the Common Voice 11.0 dataset |
| June–July |  |
</aside>
Special mention to Kurian Benoy's talk on Malayalam models in the Whisper event: https://kurianbenoy.com/talks/delft-fastai/index.html#/malayalam-models-in-whisper-event
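For context, one quick way to try the thennal/whisper-medium-ml checkpoint mentioned in the milestones is the Hugging Face transformers ASR pipeline; this is a sketch rather than the project's evaluation setup, and the audio file name is a placeholder:

```python
# Sketch: transcribing a Malayalam clip with the fine-tuned checkpoint
# noted in the milestone table. The file name is a placeholder.
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="thennal/whisper-medium-ml",  # Whisper medium fine-tuned for Malayalam
    chunk_length_s=30,                  # chunk long-form audio into 30 s windows
)

print(asr("sample_call.wav")["text"])
```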
Draft of our main architecture
What's happening after STT text transcription
- `mod_audio_fork` forks the live call audio to the Python STT server (Whisper).
- The reply audio is produced as a .wav, and FreeSWITCH plays it back to the caller via `mod_external_media`.
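To make the `mod_audio_fork` step concrete, below is a minimal sketch of a Python WebSocket server that could receive the forked call audio and transcribe it. It assumes `mod_audio_fork` is configured to stream 16 kHz, 16-bit mono L16 PCM over WebSocket; the port, buffering window, and model size are all placeholders:

```python
# Sketch of a WebSocket receiver for audio forked by mod_audio_fork.
# Assumes 16 kHz, 16-bit mono L16 PCM frames; port/buffer size are placeholders.
import asyncio

import numpy as np
import websockets
from faster_whisper import WhisperModel

# int8-quantized Whisper keeps the server light enough for edge hardware
model = WhisperModel("medium", device="cpu", compute_type="int8")

async def handle_call(ws):
    """Accumulate forked PCM frames and transcribe in ~5 s chunks."""
    pcm = bytearray()
    async for message in ws:
        if not isinstance(message, bytes):
            continue  # skip any text (metadata) frames
        pcm.extend(message)
        if len(pcm) >= 2 * 16000 * 5:  # 16-bit samples at 16 kHz, ~5 seconds
            # Convert raw L16 PCM to the float32 waveform Whisper expects
            audio = np.frombuffer(bytes(pcm), dtype=np.int16).astype(np.float32) / 32768.0
            segments, _ = model.transcribe(audio, language="ml")
            print(" ".join(seg.text for seg in segments))
            pcm.clear()

async def main():
    async with websockets.serve(handle_call, "0.0.0.0", 8080):
        await asyncio.Future()  # serve until cancelled

asyncio.run(main())
```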