Bo Kolstrup

Data Scientist with experience in predictive modeling and data analysis

View the Project on GitHub Bokols/ai_voice_assistant_nlp_project

🗣️ Improving AI Voice Assistants: A Danish NLP Evaluation and Enhancement Framework

Explore the app components:

🔍 Overview

This project presents a comprehensive framework to evaluate, improve, and visualize the performance of AI voice assistants in Danish. It combines synthetic data simulation, advanced NLP preprocessing, fine-tuned transformer models, and interactive visualizations. Designed for underrepresented languages, the pipeline identifies critical linguistic and contextual errors and enhances the performance and interpretability of intent recognition models.

⚠️ Note: This project uses a synthetic dataset for demonstration. It reflects common linguistic structures but does not represent real user behavior. Results should be interpreted accordingly.


🎯 Objective

📌 Business Context

Voice assistants are integral to digital ecosystems, but underrepresented languages such as Danish lack robust NLP support. Misunderstandings in native-language interactions reduce user trust and satisfaction.

🎯 Goal

To build an end-to-end system for:


🧱 Project Components

1. 🧹 Data Cleaning & Preprocessing

2. 📊 Exploratory Data Analysis (EDA)

EDA modules include:

3. 🤖 Model Training & Evaluation

Intent Classification

| Model       | Accuracy | Precision | Recall | F1-score |
|-------------|----------|-----------|--------|----------|
| Danish BERT | 0.976    | 0.976     | 0.976  | 0.976    |
| XLM-RoBERTa | 0.973    | 0.973     | 0.973  | 0.973    |
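
For readers who want to reproduce this kind of report, here is a minimal sketch of how the four metrics can be computed for a single-label intent classifier. The averaging scheme used in this project is not stated here, so micro-averaging is an assumption. It is shown because, for single-label multiclass data, micro-averaged precision, recall, and F1 all equal accuracy, which is one possible explanation for the identical columns above. The function name `micro_metrics` and the intent labels are hypothetical.

```python
from collections import Counter


def micro_metrics(y_true, y_pred):
    """Micro-averaged precision, recall, F1 and accuracy for single-label data.

    Assumption: micro-averaging. The project's actual averaging scheme
    is not documented in this README.
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equally long")

    tp, fp, fn = Counter(), Counter(), Counter()
    for true_label, pred_label in zip(y_true, y_pred):
        if true_label == pred_label:
            tp[true_label] += 1
        else:
            # A wrong prediction is a false positive for the predicted class
            # and a false negative for the true class.
            fp[pred_label] += 1
            fn[true_label] += 1

    total_tp = sum(tp.values())
    total_fp = sum(fp.values())
    total_fn = sum(fn.values())

    precision = total_tp / (total_tp + total_fp)
    recall = total_tp / (total_tp + total_fn)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = total_tp / len(y_true)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical intent labels, for illustration only.
y_true = ["weather", "music", "timer", "music", "weather"]
y_pred = ["weather", "music", "music", "music", "weather"]
print(micro_metrics(y_true, y_pred))
```

With four of five predictions correct, all four values come out at 0.8 (up to floating-point rounding). Macro- or weighted-averaged scores would generally differ from accuracy unless the classes are balanced and errors are evenly spread.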

Paraphrase Similarity
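
Paraphrase similarity is typically scored by comparing sentence embeddings with cosine similarity. The sketch below shows only that comparison step, using toy vectors in place of real embeddings. In practice the vectors would come from a sentence-embedding model such as those provided by SentenceTransformers. The function names, the toy vectors, and the 0.8 threshold are all hypothetical and not taken from this project.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention: a zero vector is similar to nothing
    return dot / (norm_a * norm_b)


def is_paraphrase(emb_a, emb_b, threshold=0.8):
    """Treat two utterances as paraphrases when similarity meets the threshold.

    The threshold is an illustrative assumption; a real value would be tuned
    on labelled paraphrase pairs.
    """
    return cosine_similarity(emb_a, emb_b) >= threshold


# Toy 3-dimensional "embeddings", for illustration only.
emb_original = [0.9, 0.1, 0.3]
emb_rephrased = [0.8, 0.2, 0.4]
emb_unrelated = [-0.2, 0.9, -0.5]

print(is_paraphrase(emb_original, emb_rephrased))
print(is_paraphrase(emb_original, emb_unrelated))
```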

4. 📈 Comparative Analysis


💻 Technologies Used

| Tool/Library         | Purpose                                 |
|----------------------|-----------------------------------------|
| Python               | Data processing, modeling               |
| pandas, NumPy        | Data handling and transformation        |
| spaCy                | NLP preprocessing with Danish pipelines |
| Transformers         | BERT & XLM-RoBERTa model training       |
| SentenceTransformers | Semantic similarity modeling            |
| Matplotlib/Seaborn   | Data visualization                      |
| Streamlit            | Interactive dashboard (planned)         |
| Parquet/JSON         | Efficient data storage and reporting    |

📌 Key Takeaways


🔮 Future Improvements

  1. Add a Streamlit app to demo model predictions and satisfaction analysis.
  2. Integrate real Danish datasets to validate findings in production.
  3. Add voice-to-text preprocessing to simulate full assistant workflows.
  4. Improve anomaly detection for out-of-scope intents and unusual phrasing.
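
As a starting point for item 4, a common baseline for out-of-scope detection is to reject a prediction when the classifier's highest softmax probability falls below a threshold. The sketch below illustrates that idea only. It is not part of the current project, and the function names, the intent labels, and the 0.7 threshold are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_or_reject(logits, labels, threshold=0.7):
    """Return (label, confidence), or ("out_of_scope", confidence) when unsure.

    The threshold is an illustrative assumption; a real value would be tuned
    on held-out data that includes out-of-scope utterances.
    """
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    if probs[best] < threshold:
        return "out_of_scope", probs[best]
    return labels[best], probs[best]


# Hypothetical intent labels and classifier logits, for illustration only.
labels = ["weather", "music", "timer"]
print(predict_or_reject([5.0, 0.0, 0.0], labels))  # confident prediction
print(predict_or_reject([0.1, 0.0, 0.0], labels))  # near-uniform scores, rejected
```

Maximum-softmax thresholding is simple but known to be overconfident on some unusual inputs, so it is best treated as a baseline to compare stronger detectors against.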

📬 Contact

Built with ❤️ by a data scientist passionate about multilingual NLP and human-centered AI.
For inquiries or collaboration ideas, feel free to connect via LinkedIn or raise an issue in the repo.