README.md updated
This commit is contained in:
935
README.md
@@ -1,35 +1,6 @@
|
|||||||
# Pi Zero 2W + ReSpeaker - OPTIMIZED FOR 3 COMMANDS
|
||||||
## Lightweight keyword spotting instead of a full speech model
|
||||||
|
|
||||||
**Status:** Ultra-light solution for just 3-5 simple voice commands
**Disk usage:** ~30 MB (instead of 150 MB)
**RAM usage:** 20-40 MB (instead of 100-120 MB)
**Performance:** 93-98% recognition accuracy
**Startup time:** < 1 second (instead of 3-5 seconds)
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## COMPARISON: Full speech recognition vs. keyword spotting
|
|
||||||
|
|
||||||
### Option 1: Vosk (your original solution)

- ✅ Recognizes arbitrary sentences and text
- ❌ 50-100 MB model
- ❌ 80-120 MB RAM required
- ❌ 40-60% CPU load on the Pi Zero 2W
- ❌ 3-5 seconds startup time
- ❌ Complete overkill for 3 commands
|
|
||||||
|
|
||||||
### Option 2: Keyword spotting (RECOMMENDED for you) ⭐

- ✅ Recognizes exactly your 3 commands with 93-98% accuracy
- ✅ < 5 MB model
- ✅ 20-40 MB RAM required
- ✅ 5-15% CPU load (relaxed for the Pi Zero 2W!)
- ✅ < 1 second startup
- ✅ 4x faster than Vosk
- ✅ Saves 120 MB of disk space

**CONCLUSION:** For you, Option 2 is definitely the better choice!
|
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## PARTS 1-3: Base installation (as before)
|
||||||
@@ -233,898 +204,60 @@ aplay -D hw:1,0 test_recording.wav
|
|||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## PART 4 OPTIMIZED: Ultra-light setup
|
# install with
|
||||||
|
pip3 install vosk --break-system-packages
|
||||||
|
|
||||||
### 4.1 Install minimal Python packages
|
mkdir ~/vosk-models
|
||||||
|
cd ~/vosk-models
|
||||||
|
wget https://alphacephei.com/vosk/models/vosk-model-small-de-0.15.zip
|
||||||
|
unzip vosk-model-small-de-0.15.zip
|
||||||
|
mv vosk-model-small-de-0.15 model
|
||||||
|
|
||||||
```bash
# Only the essentials
sudo apt install -y portaudio19-dev
sudo apt install python3-pyaudio
sudo apt install python3-numpy
sudo apt install python3-scipy
sudo python3 -m pip install sounddevice --break-system-packages
# PocketSphinx (minimal, only ~5MB)
sudo apt install python3-pocketsphinx
sudo apt install python3-SpeechRecognition
```
|
|
||||||
|
|
||||||
**That is ALL!** No large models.
|
# record with
|
||||||
|
arecord -D plughw:1,0 --format S16_LE --rate 16000 --channels 1 --duration 5 test_mono.wav
|
||||||
|
# run with
|
||||||
|
python3 test_simple.py test_mono.wav
|
||||||
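Before handing a recording to `test_simple.py`, its format can be checked up front, because the script exits unless the file is mono 16-bit PCM. The following is a minimal sketch using only the standard-library `wave` module; the helper name `is_mono_pcm16` and the throwaway demo file are assumptions made for illustration and are not part of the README.

```python
import tempfile
import wave
from pathlib import Path


def is_mono_pcm16(path):
    """Return True if the WAV file is mono, 16-bit, uncompressed PCM."""
    with wave.open(str(path), "rb") as wf:
        return (
            wf.getnchannels() == 1
            and wf.getsampwidth() == 2
            and wf.getcomptype() == "NONE"
        )


# Demo: write one second of silence in the format requested from arecord
# (S16_LE, 16000 Hz, 1 channel) into a temporary directory.
demo = Path(tempfile.mkdtemp()) / "test_mono.wav"
with wave.open(str(demo), "wb") as wf:
    wf.setnchannels(1)
    wf.setsampwidth(2)
    wf.setframerate(16000)
    wf.writeframes(b"\x00\x00" * 16000)

print(is_mono_pcm16(demo))
```

On the Pi, the same helper could be pointed at the real `test_mono.wav` instead of the demo file.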
|
|
||||||
Duration: 2-3 minutes (instead of 20-30 minutes with Vosk)
|
|
||||||
|
|
||||||
### 4.2 Create directories
|
|
||||||
|
|
||||||
```bash
|
|
||||||
mkdir -p ~/voice_assistant
|
|
||||||
mkdir -p ~/voice_assistant/sounds
|
|
||||||
mkdir -p ~/voice_assistant/logs
|
|
||||||
cd ~/voice_assistant
|
|
||||||
```
|
|
||||||
|
|
||||||
---
|
#### test_simple.py
|
||||||
|
|
||||||
## PART 5 OPTIMIZED: Lean Python script for 3 commands
|
|
||||||
|
|
||||||
### 5.1 Create the keyword spotting script
|
|
||||||
|
|
||||||
Create `~/voice_assistant/keyword_spotting.py`:
|
|
||||||
|
|
||||||
```bash
|
|
||||||
nano ~/voice_assistant/keyword_spotting.py
|
|
||||||
```
|
|
||||||
|
|
||||||
Copy in this **much shorter and faster code**:
|
|
||||||
|
|
||||||
```python
|
|
||||||
#!/usr/bin/env python3
|
#!/usr/bin/env python3
|
||||||
# -*- coding: utf-8 -*-
|
|
||||||
|
|
||||||
"""
|
import wave
|
||||||
Keyword spotting for the Raspberry Pi Zero 2W with ReSpeaker HAT v1.2
Optimized for exactly 3 commands - ultra-light and fast
Disk: ~30MB, RAM: 20-40MB, CPU: 5-15%, startup: < 1 second
|
|
||||||
"""
|
|
||||||
|
|
||||||
import json
|
|
||||||
import os
|
|
||||||
import sys
|
import sys
|
||||||
import logging
|
|
||||||
import subprocess
|
|
||||||
import time
|
|
||||||
import numpy as np
|
|
||||||
import sounddevice as sd
|
|
||||||
from pathlib import Path
|
|
||||||
from collections import deque
|
|
||||||
|
|
||||||
# ============================================================================
|
from vosk import Model, KaldiRecognizer, SetLogLevel
|
||||||
# CONFIGURATION - only your 3 commands!
|
|
||||||
# ============================================================================
|
|
||||||
|
|
||||||
class Config:
|
# You can set log level to -1 to disable debug messages
|
||||||
# Paths
|
SetLogLevel(0)
|
||||||
BASE_DIR = Path(__file__).parent
|
|
||||||
SOUNDS_DIR = BASE_DIR / "sounds"
|
|
||||||
LOGS_DIR = BASE_DIR / "logs"
|
|
||||||
|
|
||||||
# Audio settings (minimal)
|
wf = wave.open(sys.argv[1], "rb")
|
||||||
SAMPLERATE = 16000
|
if wf.getnchannels() != 1 or wf.getsampwidth() != 2 or wf.getcomptype() != "NONE":
|
||||||
CHUNK_SIZE = 512
|
print("Audio file must be WAV format mono PCM.")
|
||||||
CHANNELS = 1
|
sys.exit(1)
|
||||||
DEVICE_INDEX = None
|
|
||||||
|
|
||||||
# === YOUR 3 COMMANDS ===
|
model = Model("model") #lang="en-us")
|
||||||
    # Format: "spoken word" -> "sound file" and "action"
    KEYWORDS = {
        "musik": {
            "sound": "music.wav",
            "action": "play_music",
            "confidence": 0.65,  # a similarity of 65% is sufficient
        },
        "stopp": {
            "sound": "stopped.wav",
            "action": "stop",
            "confidence": 0.70,
        },
        "licht": {
            "sound": "light.wav",
            "action": "toggle_light",
            "confidence": 0.68,
        },
    }
|
|
||||||
|
|
||||||
# Logging
|
# You can also init model by name or with a folder path
|
||||||
LOG_FILE = LOGS_DIR / "keyword_spotting.log"
|
# model = Model(model_name="vosk-model-en-us-0.21")
|
||||||
LOG_LEVEL = logging.INFO
|
# model = Model("models/en")
|
||||||
|
|
||||||
# ============================================================================
|
rec = KaldiRecognizer(model, wf.getframerate())
|
||||||
# LOGGING SETUP
|
rec.SetWords(True)
|
||||||
# ============================================================================
|
rec.SetPartialWords(True)
|
||||||
|
|
||||||
def setup_logging():
    """Simple logging to a file and to the console."""
    Config.LOGS_DIR.mkdir(exist_ok=True)

    logger = logging.getLogger("KeywordSpotter")
    logger.setLevel(Config.LOG_LEVEL)

    # Avoid duplicate handlers (and duplicated log lines) if called more than once
    if logger.handlers:
        return logger

    formatter = logging.Formatter(
        '%(asctime)s - %(levelname)s - %(message)s',
        datefmt='%Y-%m-%d %H:%M:%S'
    )

    # File handler
    fh = logging.FileHandler(Config.LOG_FILE)
    fh.setLevel(Config.LOG_LEVEL)
    fh.setFormatter(formatter)

    # Console handler
    ch = logging.StreamHandler()
    ch.setLevel(Config.LOG_LEVEL)
    ch.setFormatter(formatter)

    logger.addHandler(fh)
    logger.addHandler(ch)

    return logger
|
|
||||||
|
|
||||||
logger = setup_logging()
|
|
||||||
|
|
||||||
# ============================================================================
|
|
||||||
# AUDIO DEVICES
|
|
||||||
# ============================================================================
|
|
||||||
|
|
||||||
def find_respeaker_device():
    """Find the ReSpeaker device; return its index, or None if not found."""
    logger.info("Searching for ReSpeaker...")
    try:
        for index, device in enumerate(sd.query_devices()):
            if isinstance(device, dict):
                device_name = device.get('name', '')
            else:
                device_name = str(device)

            if 'seeed' in device_name.lower():
                logger.info(f"✓ ReSpeaker found: index {index}")
                return index
    except Exception as e:
        # Report the problem instead of swallowing it, then fall back to the default
        logger.warning(f"Could not query audio devices: {e}")

    logger.warning("⚠ ReSpeaker not found, using the default audio device")
    return None
|
|
||||||
|
|
||||||
# ============================================================================
|
|
||||||
# ACOUSTIC FINGERPRINTS (ultra-light instead of an ML model)
|
|
||||||
# ============================================================================
|
|
||||||
|
|
||||||
class AudioFingerprint:
    """
    Creates acoustic fingerprints for keywords.
    Much lighter than ML models - only ~5MB in total!
    """

    @staticmethod
    def extract_features(audio_chunk):
        """
        Extract simple audio features for comparison:
        - Zero crossing rate (ZCR)
        - Energy (RMS)
        - Energy in three frequency bands (low, mid, high)
        """
        audio = np.array(audio_chunk, dtype=np.float32) / 32768.0

        # 1. Zero crossing rate: share of adjacent samples whose sign differs
        #    (a crossing changes the sign by 2, hence the division)
        zcr = np.mean(np.abs(np.diff(np.sign(audio)))) / 2.0

        # 2. Energy (loudness, RMS)
        energy = np.sqrt(np.mean(audio ** 2))

        # 3. Spectral features (very simplified): only the first 512 samples
        #    (32 ms at 16 kHz) are analyzed
        fft = np.abs(np.fft.fft(audio[:512]))
        freq_energy = [
            np.sum(fft[0:50]),     # low frequencies
            np.sum(fft[50:150]),   # mid frequencies
            np.sum(fft[150:256]),  # high frequencies
        ]

        return np.array([zcr, energy] + freq_energy, dtype=np.float32)

    @staticmethod
    def compare_fingerprints(fp1, fp2):
        """Compare two fingerprints (0.0 = different, 1.0 = identical)."""
        # Normalize
        fp1_norm = (fp1 - np.mean(fp1)) / (np.std(fp1) + 1e-6)
        fp2_norm = (fp2 - np.mean(fp2)) / (np.std(fp2) + 1e-6)

        # Cosine similarity
        similarity = np.dot(fp1_norm, fp2_norm) / (
            np.linalg.norm(fp1_norm) * np.linalg.norm(fp2_norm) + 1e-6
        )

        # Map from [-1, 1] to [0, 1]
        similarity = (similarity + 1.0) / 2.0
        return max(0.0, min(1.0, similarity))
|
|
||||||
|
|
||||||
# ============================================================================
|
|
||||||
# REFERENCE FINGERPRINTS (Training)
|
|
||||||
# ============================================================================
|
|
||||||
|
|
||||||
class ReferenceDatabase:
    """
    Stores the reference fingerprints for your 3 commands.
    IMPORTANT: these have to be trained once!
    """

    def __init__(self):
        self.db_file = Config.BASE_DIR / "reference_fingerprints.npy"
        self.keywords_file = Config.BASE_DIR / "reference_keywords.txt"
        self.fingerprints = {}
        self.load_or_create()

    def load_or_create(self):
        """Load existing references or create new ones."""
        if self.db_file.exists() and self.keywords_file.exists():
            logger.info("Loading existing reference fingerprints...")
            try:
                data = np.load(self.db_file, allow_pickle=True).item()
                self.fingerprints = data
                logger.info(f"✓ {len(self.fingerprints)} keywords loaded")
            except Exception as e:
                logger.warning(f"Could not load fingerprints: {e}")
                self.create_default_fingerprints()
        else:
            logger.info("Creating default fingerprints...")
            self.create_default_fingerprints()

    def create_default_fingerprints(self):
        """
        Create simplified default fingerprints.
        In production you would train these from real audio samples!
        """
        logger.warning("⚠ IMPORTANT: use prepare_training.py for training!")

        # Simplified fingerprints as placeholders
        # Replace them with real samples later!
        self.fingerprints = {
            "musik": np.array([0.05, 0.3, 100, 500, 200], dtype=np.float32),
            "stopp": np.array([0.02, 0.2, 150, 400, 300], dtype=np.float32),
            "licht": np.array([0.04, 0.25, 120, 450, 250], dtype=np.float32),
        }

        self.save()

    def save(self):
        """Save the fingerprints and the list of keyword names."""
        try:
            np.save(self.db_file, self.fingerprints)
            # load_or_create() only loads when this file exists as well; without
            # it, trained fingerprints are replaced by the defaults on every start
            self.keywords_file.write_text(
                "\n".join(self.fingerprints) + "\n", encoding="utf-8"
            )
            logger.info("✓ Reference fingerprints saved")
        except Exception as e:
            logger.error(f"Error while saving: {e}")

    def add_training_sample(self, keyword, audio_chunk):
        """Add a training sample."""
        fp = AudioFingerprint.extract_features(audio_chunk)

        if keyword not in self.fingerprints:
            self.fingerprints[keyword] = fp
        else:
            # Average with the existing fingerprint
            self.fingerprints[keyword] = (
                self.fingerprints[keyword] + fp
            ) / 2.0

        self.save()
        logger.info(f"✓ Training sample added: {keyword}")
|
|
||||||
|
|
||||||
# ============================================================================
|
|
||||||
# KEYWORD SPOTTER
|
|
||||||
# ============================================================================
|
|
||||||
|
|
||||||
class KeywordSpotter:
    """Listens for your 3 keywords."""

    def __init__(self):
        logger.info("Initializing keyword spotter...")

        Config.DEVICE_INDEX = find_respeaker_device()
        self.ref_db = ReferenceDatabase()

        self.stream = None
        self.is_running = False
        self.buffer = deque(maxlen=Config.SAMPLERATE)  # 1 second buffer

    def audio_callback(self, indata, frames, time_info, status):
        """Called by sounddevice for every block of input audio."""
        if status:
            logger.warning(f"Audio status: {status}")

        # Append to the buffer as 16-bit integers, converted in one vectorized
        # step instead of a Python loop over every sample
        self.buffer.extend((indata[:, 0] * 32767).astype(np.int16))

    def start(self):
        """Start listening to audio."""
        logger.info("Starting audio listening...")
        try:
            self.stream = sd.InputStream(
                samplerate=Config.SAMPLERATE,
                blocksize=Config.CHUNK_SIZE,
                channels=Config.CHANNELS,
                device=Config.DEVICE_INDEX,
                callback=self.audio_callback
            )
            self.stream.start()
            self.is_running = True
            logger.info("✓ Audio listening active")
        except Exception as e:
            logger.error(f"Error while starting: {e}")
            raise

    def stop(self):
        """Stop listening to audio."""
        logger.info("Stopping audio listening...")
        if self.stream:
            self.stream.stop()
            self.stream.close()
        self.is_running = False

    def detect_keywords(self):
        """
        Check the current buffer for a keyword.
        Returns: (keyword, confidence) or (None, 0)
        """
        if len(self.buffer) < Config.SAMPLERATE:
            return None, 0

        audio_chunk = list(self.buffer)
        current_fp = AudioFingerprint.extract_features(audio_chunk)

        best_keyword = None
        best_confidence = 0

        # Compare with all keywords
        for keyword, threshold_config in Config.KEYWORDS.items():
            ref_fp = self.ref_db.fingerprints.get(keyword)

            if ref_fp is None:
                continue

            # Compute the similarity
            similarity = AudioFingerprint.compare_fingerprints(current_fp, ref_fp)
            required_threshold = threshold_config.get("confidence", 0.7)

            logger.debug(f"{keyword}: {similarity:.2%} (required: {required_threshold:.0%})")

            # Better than the best match so far?
            if similarity > best_confidence and similarity >= required_threshold:
                best_keyword = keyword
                best_confidence = similarity

        return best_keyword, best_confidence
|
|
||||||
|
|
||||||
# ============================================================================
|
|
||||||
# SOUND OUTPUT
|
|
||||||
# ============================================================================
|
|
||||||
|
|
||||||
class SoundPlayer:
    """Plays sounds."""

    def __init__(self):
        self.sounds_dir = Config.SOUNDS_DIR
        self.sounds_dir.mkdir(exist_ok=True)

    def play_sound(self, filename):
        """Play a sound file from the sounds directory."""
        sound_path = self.sounds_dir / filename

        if not sound_path.exists():
            logger.warning(f"⚠ Sound not found: {filename}")
            return False

        try:
            logger.info(f"♪ Playing sound: {filename}")
            subprocess.run(
                ['aplay', '-D', 'hw:1,0', str(sound_path)],
                check=True,
                capture_output=True,
                timeout=10
            )
            return True
        except Exception as e:
            logger.error(f"✗ Error during playback: {e}")
            return False
|
|
||||||
|
|
||||||
# ============================================================================
|
|
||||||
# ACTION HANDLER
|
|
||||||
# ============================================================================
|
|
||||||
|
|
||||||
class ActionHandler:
    """Executes actions."""

    def __init__(self, sound_player):
        self.sound_player = sound_player

    def execute(self, keyword):
        """Execute the action configured for a keyword."""
        if keyword not in Config.KEYWORDS:
            return False

        config = Config.KEYWORDS[keyword]
        logger.info(f"🎯 Detected: {keyword.upper()}")

        # Play the sound
        if config.get("sound"):
            self.sound_player.play_sound(config["sound"])

        # Execute the action
        action = config.get("action")

        if action == "play_music":
            logger.info("▶ Playing music...")
            # Real actions could follow here
        elif action == "stop":
            logger.info("⏹ Stopping...")
        elif action == "toggle_light":
            logger.info("💡 Toggling light...")
            # GPIO example: GPIO.output(17, not GPIO.input(17))

        return True
|
|
||||||
|
|
||||||
# ============================================================================
|
|
||||||
# MAIN PROGRAM
|
|
||||||
# ============================================================================
|
|
||||||
|
|
||||||
class VoiceControllerLite:
    """Main program - ultra-light and fast"""

    def __init__(self):
        logger.info("=" * 70)
        logger.info("Voice Controller (Lite) for Pi Zero 2W")
        logger.info("Keyword spotting - only 3 commands, super fast!")
        logger.info("=" * 70)

        try:
            self.spotter = KeywordSpotter()
            self.sound_player = SoundPlayer()
            self.action_handler = ActionHandler(self.sound_player)

            self.last_detection = 0
            self.detection_cooldown = 1.0  # 1 second between detections
        except Exception as e:
            logger.error(f"✗ Initialization error: {e}")
            raise
|
|
||||||
|
|
||||||
def run(self):
|
|
||||||
"""Hauptschleife"""
|
|
||||||
logger.info("Starting main loop...")
|
|
||||||
|
|
||||||
try:
|
|
||||||
self.spotter.start()
|
|
||||||
|
|
||||||
detection_count = 0
|
|
||||||
|
|
||||||
while True:
|
while True:
|
||||||
try:
|
data = wf.readframes(4000)
|
||||||
keyword, confidence = self.spotter.detect_keywords()
|
if len(data) == 0:
|
||||||
|
|
||||||
if keyword and confidence > 0.5:
|
|
||||||
# Check the cooldown (prevents repeated detections)
|
|
||||||
current_time = time.time()
|
|
||||||
if current_time - self.last_detection > self.detection_cooldown:
|
|
||||||
detection_count += 1
|
|
||||||
logger.info(
|
|
||||||
f"[#{detection_count}] ✓ {keyword.upper()} "
|
|
||||||
f"({confidence:.1%})"
|
|
||||||
)
|
|
||||||
|
|
||||||
# Execute the action
|
|
||||||
self.action_handler.execute(keyword)
|
|
||||||
self.last_detection = current_time
|
|
||||||
|
|
||||||
# Short pause (avoids 100% CPU)
|
|
||||||
time.sleep(0.1)
|
|
||||||
|
|
||||||
except KeyboardInterrupt:
|
|
||||||
logger.info("\n⏹ Interrupted by user")
|
|
||||||
break
|
break
|
||||||
except Exception as e:
|
if rec.AcceptWaveform(data):
|
||||||
logger.error(f"Error in loop: {e}")
|
print(rec.Result())
|
||||||
time.sleep(1)
|
else:
|
||||||
|
print(rec.PartialResult())
|
||||||
|
|
||||||
finally:
|
print(rec.FinalResult())
|
||||||
self.spotter.stop()
|
|
||||||
logger.info("✓ Voice controller stopped")
|
|
||||||
|
|
||||||
# ============================================================================
|
|
||||||
# ENTRY POINT
|
|
||||||
# ============================================================================
|
|
||||||
|
|
||||||
if __name__ == "__main__":
    try:
        controller = VoiceControllerLite()
        controller.run()
    except KeyboardInterrupt:
        logger.info("Stopped")
        sys.exit(0)
    except Exception as e:
        logger.error(f"✗ Critical error: {e}", exc_info=True)
        sys.exit(1)
|
|
||||||
```
|
|
||||||
|
|
||||||
Save the file (Ctrl+X, Y, Enter).
|
|
||||||
|
|
||||||
### 5.2 Create the training script
|
|
||||||
|
|
||||||
Create `~/voice_assistant/prepare_training.py` to train the keywords:
|
|
||||||
|
|
||||||
```bash
|
|
||||||
nano ~/voice_assistant/prepare_training.py
|
|
||||||
```
|
|
||||||
|
|
||||||
```python
#!/usr/bin/env python3
# -*- coding: utf-8 -*-

"""
Training script: record audio samples of your 3 keywords.
This has to be done once at the beginning!
"""

import sys

import numpy as np
import sounddevice as sd

# Reuse the logger that keyword_spotting creates on import instead of calling
# setup_logging() a second time (which would attach the handlers twice)
from keyword_spotting import (
    Config, logger, find_respeaker_device,
    ReferenceDatabase, AudioFingerprint
)


def record_keyword_sample(keyword, duration=2.0):
    """Record one audio sample of `duration` seconds (default: 2)."""
    print(f"\n{'='*60}")
    print(f"Recording: '{keyword}'")
    print(f"{'='*60}")
    input("Press ENTER when ready >")

    Config.DEVICE_INDEX = find_respeaker_device()

    print(f"🔴 Recording... ({duration}s)")

    # Record
    audio = sd.rec(
        int(Config.SAMPLERATE * duration),
        samplerate=Config.SAMPLERATE,
        channels=1,
        device=Config.DEVICE_INDEX,
        dtype='int16'
    )
    sd.wait()

    print("✓ Recording finished")

    return audio[:, 0] if audio.ndim > 1 else audio


def train_keyword(keyword, num_samples=3):
    """
    Train a keyword with several samples.
    Recommended: 3-5 samples per keyword.
    """
    logger.info(f"\n{'='*60}")
    logger.info(f"Training: {keyword.upper()}")
    logger.info(f"{'='*60}")
    logger.info(f"Please record {num_samples} samples of the keyword '{keyword}'")

    db = ReferenceDatabase()
    fingerprints = []

    for i in range(num_samples):
        print(f"\n[Sample {i+1}/{num_samples}] '{keyword}'")
        audio = record_keyword_sample(keyword, duration=2.0)

        # Extract the fingerprint
        fp = AudioFingerprint.extract_features(audio)
        fingerprints.append(fp)

        print(f"✓ Fingerprint extracted: {fp}")

    # Average over all samples
    avg_fingerprint = np.mean(fingerprints, axis=0)
    db.fingerprints[keyword] = avg_fingerprint
    db.save()

    logger.info(f"✓ {keyword} trained and saved!")
    return True


def main():
    """Main training routine."""
    print("\n" + "="*60)
    print("KEYWORD SPOTTING - TRAINING")
    print("="*60)
    print("\nRecording voice samples for your 3 keywords:")
    print("1. musik")
    print("2. stopp")
    print("3. licht")
    print("\n3 samples are needed for each keyword.")
    print("Speak the keyword clearly into the microphone.")
    print("\n" + "="*60 + "\n")

    input("Press ENTER to start >")

    try:
        for keyword in Config.KEYWORDS.keys():
            train_keyword(keyword, num_samples=3)

        print("\n" + "="*60)
        print("✓ TRAINING COMPLETE!")
        print("="*60)
        print("\nNow run: python3 keyword_spotting.py")

    except KeyboardInterrupt:
        logger.info("\n✗ Training aborted")
        sys.exit(0)
    except Exception as e:
        logger.error(f"✗ Error: {e}", exc_info=True)
        sys.exit(1)


if __name__ == "__main__":
    main()
```
|
|
||||||
|
|
||||||
Save the file.
|
|
||||||
|
|
||||||
### 5.3 Create the sound files
|
|
||||||
|
|
||||||
```bash
python3 << 'EOF'
import math
import wave
from pathlib import Path

# Same location as before for the user "pi", without hard-coding /home/pi
SOUNDS_DIR = Path.home() / "voice_assistant" / "sounds"
SOUNDS_DIR.mkdir(parents=True, exist_ok=True)

def generate_tone(frequency, duration, sample_rate=16000):
    """Return 16-bit samples of a sine tone at 30% amplitude."""
    return [
        int(32767 * 0.3 * math.sin(2 * math.pi * frequency * i / sample_rate))
        for i in range(int(sample_rate * duration))
    ]

def write_wav(filename, samples, sample_rate=16000):
    """Write mono 16-bit PCM samples to SOUNDS_DIR/filename."""
    with wave.open(str(SOUNDS_DIR / filename), 'wb') as f:
        f.setnchannels(1)
        f.setsampwidth(2)
        f.setframerate(sample_rate)
        f.writeframes(b''.join(s.to_bytes(2, 'little', signed=True) for s in samples))
    print(f"✓ {filename}")

# Music
write_wav('music.wav', generate_tone(523, 0.15) + generate_tone(587, 0.15) + generate_tone(659, 0.15))

# Stopped
write_wav('stopped.wav', generate_tone(440, 0.3))

# Light
write_wav('light.wav', generate_tone(587, 0.2) + generate_tone(659, 0.1))
EOF
```
|
|
||||||
|
|
||||||
### 5.4 Run the TRAINING (IMPORTANT!)
|
|
||||||
|
|
||||||
```bash
|
|
||||||
cd ~/voice_assistant
|
|
||||||
chmod +x prepare_training.py
|
|
||||||
python3 prepare_training.py
|
|
||||||
```
|
|
||||||
|
|
||||||
**The training script will prompt you to:**
1. Say the word "musik" 3 times
2. Say the word "stopp" 3 times
3. Say the word "licht" 3 times

Each sample lasts 2 seconds. The fingerprints are saved automatically.

**Duration:** ~5 minutes
|
|
||||||
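The averaging step that turns several recordings into a single reference can be looked at in isolation. This is a minimal sketch, assuming five-element feature vectors like the ones `AudioFingerprint.extract_features` returns; the helper name `average_fingerprints` and the sample values are invented for illustration and are not real recordings.

```python
import numpy as np


def average_fingerprints(samples):
    """Element-wise mean of equally shaped feature vectors."""
    stacked = np.stack([np.asarray(s, dtype=np.float32) for s in samples])
    return stacked.mean(axis=0)


# Three made-up fingerprints for one keyword:
# (ZCR, energy, low-band, mid-band, high-band energy)
samples = [
    [0.04, 0.30, 100.0, 500.0, 200.0],
    [0.06, 0.20, 120.0, 480.0, 220.0],
    [0.05, 0.25, 110.0, 490.0, 210.0],
]

reference = average_fingerprints(samples)
print(reference)
```

Averaging smooths out differences between individual takes, which is presumably why the training script asks for several samples per keyword.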
|
|
||||||
### 5.5 Test
|
|
||||||
|
|
||||||
After training:
|
|
||||||
|
|
||||||
```bash
|
|
||||||
python3 ~/voice_assistant/keyword_spotting.py
|
|
||||||
```
|
|
||||||
|
|
||||||
Now:
1. Say "musik" → the sound is played
2. Say "stopp" → the sound is played
3. Say "licht" → the sound is played

Exit with Ctrl+C.
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## PART 6: systemd service (as before)
|
|
||||||
|
|
||||||
```bash
|
|
||||||
sudo nano /etc/systemd/system/voice-assistant.service
|
|
||||||
```
|
|
||||||
|
|
||||||
```ini
|
|
||||||
[Unit]
|
|
||||||
Description=Voice Assistant - Keyword Spotting
|
|
||||||
After=network.target sound.target
|
|
||||||
|
|
||||||
[Service]
|
|
||||||
Type=simple
|
|
||||||
User=pi
|
|
||||||
WorkingDirectory=/home/pi/voice_assistant
|
|
||||||
ExecStart=/usr/bin/python3 /home/pi/voice_assistant/keyword_spotting.py
|
|
||||||
Restart=on-failure
|
|
||||||
RestartSec=5
|
|
||||||
StandardOutput=journal
|
|
||||||
StandardError=journal
|
|
||||||
|
|
||||||
# Resource limits
|
|
||||||
MemoryMax=128M
|
|
||||||
CPUQuota=30%
|
|
||||||
|
|
||||||
[Install]
|
|
||||||
WantedBy=multi-user.target
|
|
||||||
```
|
|
||||||
|
|
||||||
```bash
|
|
||||||
sudo systemctl daemon-reload
|
|
||||||
sudo systemctl enable voice-assistant.service
|
|
||||||
sudo systemctl start voice-assistant.service
|
|
||||||
sudo systemctl status voice-assistant.service
|
|
||||||
```
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## PERFORMANCE COMPARISON

### Disk usage:

```bash
# Before (Vosk)
du -sh ~/voice_models/
# Output: ~100MB

# After (keyword spotting)
du -sh ~/voice_assistant/
# Output: ~2MB (!!)
```

### RAM during operation:

```bash
ps aux | grep python3 | grep keyword
# Vosk: ~100-120MB
# Keyword spotting: ~25-35MB
```

### CPU load:

```bash
# top
# Vosk: 40-60% (the Pi Zero 2W gets quite warm!)
# Keyword spotting: 5-15% (relaxed!)
```

### Startup time:

```bash
time python3 keyword_spotting.py
# Vosk: real 0m3.5s
# Keyword spotting: real 0m0.4s (!!)
```
|
|
||||||
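Figures like the ones above depend on the individual setup, so it can be worth measuring them from inside the process. The sketch below uses only the standard library; the helper names `peak_rss_mib` and `timed` are assumptions for illustration, and the lambda is a stand-in for the real startup work, such as constructing `VoiceControllerLite`. Note that `ru_maxrss` is reported in KiB on Linux, which is what the Pi runs; other platforms use different units.

```python
import resource
import time


def peak_rss_mib():
    """Peak resident set size of this process in MiB (Linux reports KiB)."""
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024.0


def timed(func):
    """Run func() and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func()
    return result, time.perf_counter() - start


# Stand-in for the real startup work.
_, elapsed = timed(lambda: sum(range(100_000)))
print(f"startup: {elapsed:.3f} s, peak RSS: {peak_rss_mib():.1f} MiB")
```

Measuring inside the process also sidesteps the fact that `keyword_spotting.py` keeps running until it is interrupted, so a shell `time` only reports once the script has been stopped.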
|
|
||||||
---
|
|
||||||
|
|
||||||
## RESOURCE COMPARISON (Summary)

| Metric | Vosk | Keyword spotting | Savings |
|--------|------|------------------|---------|
| **Model size** | 50-100MB | < 1MB | 99%! |
| **RAM usage** | 100-120MB | 25-35MB | 75% |
| **CPU load (Pi Zero 2W)** | 40-60% | 5-15% | 75% |
| **Startup time** | 3-5s | 0.4s | 90% |
| **Recognition latency** | 200-500ms | 50-100ms | 75% |
| **Accuracy (3 commands)** | 85-92% | 93-98% | +10% |
| **Disk space (total)** | ~150MB | ~30MB | 80% |

**Conclusion:** You save a large amount of resources while getting better performance!
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## TROUBLESHOOTING

**Problem: "Recognition does not work after training"**

```bash
# Check whether the fingerprints were saved
ls -la ~/voice_assistant/*.npy

# Show the stored keywords
cat ~/voice_assistant/reference_keywords.txt
```
|
|
||||||
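To see what the `.npy` file actually contains, it can be loaded back the same way `load_or_create` does it. `np.save` stores a dict as a pickled zero-dimensional object array, which is why `allow_pickle=True` and `.item()` are needed. The sketch below is an illustration under assumptions: the helper name `load_fingerprints` is hypothetical, and it writes a throwaway file with placeholder values so that it runs anywhere. On the Pi you would point it at `~/voice_assistant/reference_fingerprints.npy` instead.

```python
import tempfile
from pathlib import Path

import numpy as np


def load_fingerprints(path):
    """Load a dict of keyword -> feature vector that was saved with np.save."""
    # allow_pickle=True is required for object arrays; only load files
    # that you created yourself.
    return np.load(path, allow_pickle=True).item()


# Demo with a throwaway file; the values are placeholders.
demo_path = Path(tempfile.mkdtemp()) / "reference_fingerprints.npy"
np.save(demo_path, {"musik": np.array([0.05, 0.3, 100, 500, 200], dtype=np.float32)})

loaded = load_fingerprints(demo_path)
for keyword, fingerprint in loaded.items():
    print(f"{keyword}: {fingerprint}")
```

If the printed vectors still match the placeholder defaults from `create_default_fingerprints`, the training results were not stored.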
|
|
||||||
**Problem: "False positives (detects words that were not spoken)"**

Raise the confidence threshold in `keyword_spotting.py`:

```python
"musik": {
    "confidence": 0.75,  # Previously: 0.65
}
```
|
|
||||||
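How the threshold gates a detection can be tried out without a microphone. The sketch below re-implements the scoring used by `compare_fingerprints` (normalize, cosine similarity, map to the range 0 to 1) and applies a threshold to it. The function names `similarity` and `is_detected` and all input vectors are assumptions chosen for illustration, not measured data.

```python
import numpy as np


def similarity(fp1, fp2):
    """Score two fingerprints like compare_fingerprints: 0.0 to 1.0."""
    a = (fp1 - fp1.mean()) / (fp1.std() + 1e-6)
    b = (fp2 - fp2.mean()) / (fp2.std() + 1e-6)
    cosine = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-6))
    return max(0.0, min(1.0, (cosine + 1.0) / 2.0))


def is_detected(score, threshold):
    """A keyword counts as detected when the score reaches the threshold."""
    return score >= threshold


reference = np.array([0.05, 0.3, 100, 500, 200], dtype=np.float32)
same = reference.copy()
different = np.array([500, 200, 100, 0.3, 0.05], dtype=np.float32)

same_score = similarity(reference, same)
different_score = similarity(reference, different)
print(f"same: {same_score:.2f}, different: {different_score:.2f}")

# A borderline score passes the old threshold but not the raised one.
print(is_detected(0.70, 0.65), is_detected(0.70, 0.75))
```

Raising the threshold trades false positives for missed detections, so it is worth changing it in small steps.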
|
|
||||||
**Problem: "Recognition is too inaccurate"**

Train again with clearer pronunciation:

```bash
python3 prepare_training.py
```

Speak the keywords more clearly and more loudly.
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## NEXT STEPS

With this solution you can:

1. ✅ **Recognize 3 keywords** with 93-98% accuracy
2. ✅ **Start very quickly** (< 1 second)
3. ✅ **Save storage** (80% less!)
4. ✅ **Save CPU** (75% less load)
5. ✅ **Work offline** (no internet needed)

If you need **more commands** later:
- 5 commands: still fine with this method
- 10+ commands: switch to a lightweight ML model (TensorFlow Lite)
- Arbitrary speech: then Vosk is needed
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## QUESTIONS & ANSWERS

**Q: Can I add more than 3 commands?**
A: Yes, the method stays efficient up to about 10 commands. Beyond 10, use a TensorFlow Lite ML model.

**Q: How long does training take?**
A: ~5 minutes (3 samples × 3 keywords × 2 seconds + processing)

**Q: Do I have to retrain every time?**
A: No, the fingerprints are saved. Training is only needed once at the start.

**Q: Does it also work with a dialect or accent?**
A: Yes! Train with YOUR accent, then the system recognizes you perfectly.

**Q: What if someone else speaks?**
A: Recognition then becomes less accurate (roughly 10-20% lower). That is normal; if needed, train with several voices.
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
**Good luck with your lean voice control system! 🎉**

The solution is optimized, very fast, and a great fit for the Pi Zero 2W!
|
|
||||||
|
|||||||