OMANI-Therapist-Voice: System Design & Model Integration Documentation

1. System Overview

OMANI-Therapist-Voice is a real-time, voice-first mental health chatbot for Omani Arabic speakers. It aims to hold culturally sensitive, therapeutically appropriate conversations by combining speech processing with dual-model AI validation, with a focus on low latency, safety, and clinical effectiveness. It is built on industry-standard tooling, with LangChain as the LLM framework.

2. High-Level Architecture

Components:

Frontend (React/Vite): Real-time voice interface for users.
Backend (Node.js/Express): Acts as the API gateway and handles audio conversion, transcription (Azure STT), and speech synthesis (Azure TTS).
LLM Microservice (Python/FastAPI + LangChain): Handles all LLM orchestration, prompt management, safety/cultural validation, and OpenAI API calls.
External Services: Azure Speech (STT & TTS), OpenAI GPT-4o (chat), OpenAI GPT-4.1 (validator).

Data Flow:

User speaks into the frontend interface.
Audio is sent to the backend for transcription.
The backend converts the audio to WAV and transcribes it using Azure STT (Omani Arabic + English).
Transcription and chat history are sent to the LLM microservice (LangChain) for response generation and safety validation.
The LLM microservice:
- Generates a response (GPT-4o, with cultural/clinical prompt)
- Validates the response (GPT-4.1, structured JSON for risk/cultural/clinical checks)
- Applies the crisis protocol or modifies the response if validation requires it
- Returns the final reply and safety metadata to the backend
The backend synthesizes the reply as Omani Arabic speech using Azure TTS.
Audio is streamed back to the frontend for playback.
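
The generate, validate, and apply steps inside the LLM microservice can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's actual code: the verdict field names (risk_level, culturally_appropriate, clinically_appropriate, suggested_revision), the risk level values, the placeholder reply strings, and the respond function are all hypothetical. The two model calls are passed in as plain callables, so the sketch runs without LangChain or OpenAI; in the real service they would wrap the GPT-4o and GPT-4.1 calls.

```python
import json
from dataclasses import dataclass
from typing import Callable, Optional

# Placeholder texts (assumptions); the real crisis protocol wording is not given here.
CRISIS_REPLY = "<crisis protocol message>"
FALLBACK_REPLY = "<safe fallback message>"


@dataclass
class Verdict:
    """Validator output. All field names are assumed for illustration."""
    risk_level: str                      # assumed values: "none", "elevated", "crisis"
    culturally_appropriate: bool
    clinically_appropriate: bool
    suggested_revision: Optional[str] = None


def parse_verdict(raw_json: str) -> Verdict:
    """Parse the validator's JSON, defaulting to the most conservative values.

    Failing closed on malformed or incomplete output is a design assumption.
    """
    try:
        data = json.loads(raw_json)
    except json.JSONDecodeError:
        data = {}
    if not isinstance(data, dict):
        data = {}
    return Verdict(
        risk_level=str(data.get("risk_level", "crisis")),
        culturally_appropriate=bool(data.get("culturally_appropriate", False)),
        clinically_appropriate=bool(data.get("clinically_appropriate", False)),
        suggested_revision=data.get("suggested_revision"),
    )


def respond(
    transcript: str,
    history: list,
    generate: Callable[[str, list], str],   # stands in for the GPT-4o chat call
    validate: Callable[[str, str], str],    # stands in for the GPT-4.1 validator call
) -> dict:
    """Generate a draft, validate it, then pass, modify, or escalate."""
    draft = generate(transcript, history)
    verdict = parse_verdict(validate(transcript, draft))

    if verdict.risk_level == "crisis":
        reply, action = CRISIS_REPLY, "crisis_protocol"
    elif not (verdict.culturally_appropriate and verdict.clinically_appropriate):
        reply, action = verdict.suggested_revision or FALLBACK_REPLY, "modified"
    else:
        reply, action = draft, "passed"

    # Final reply plus safety metadata, as returned to the backend.
    return {"reply": reply, "safety": {"action": action, "risk_level": verdict.risk_level}}
```

Injecting the model calls as callables keeps the decision logic testable with stubs, independent of network access. Treating unparseable validator output as a crisis is the most cautious choice; a deployment might prefer a retry or the fallback reply instead.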

3. Detailed Component Design