1. System Overview
OMANI-Therapist-Voice is a real-time, voice-first mental health conversational chatbot for Omani Arabic speakers. The system provides culturally sensitive, therapeutic-grade conversations using advanced speech processing and dual-model AI validation, with a focus on low-latency, safety, and clinical effectiveness. It uses all the industry standard techstack like Langchain as the LLM framework.
2. High-Level Architecture
Components:
- Frontend (React/Vite): Real-time voice interface for users.
- Backend (Node.js/Express): Handles audio processing, API gateway, and TTS.
- LLM Microservice (Python/FastAPI + LangChain): Handles all LLM orchestration, prompt management, safety/cultural validation, and OpenAI API calls.
- External Services: Azure Speech (STT & TTS), OpenAI GPT-4o (chat), OpenAI GPT-4.1 (validator).
Data Flow:
- User speaks into the frontend interface.
- Audio is sent to the backend for transcription.
- Backend converts audio to wav, transcribes using Azure STT (Omani Arabic + English).
- Transcription and chat history are sent to the LLM microservice (LangChain) for response generation and safety validation.
- The LLM microservice:
- Generates a response (GPT-4o, with cultural/clinical prompt)
- Validates the response (GPT-4.1, structured JSON for risk/cultural/clinical checks)
- Applies crisis protocol or modification if needed
- Returns the final reply and safety metadata to the backend
- Backend synthesizes the reply to Omani Arabic speech using Azure TTS.
- Audio is streamed back to the frontend for playback.
3. Detailed Component Design