AI Engineer Assessment Pipeline
Dual-model email generation assistant with custom LLM evaluation metrics
Python · Streamlit · Groq · Llama 3.3 70B · Gemma2 9B · ROUGE-L · LLM-as-a-Judge
Overview
An email generation assistant that benchmarks two LLMs head-to-head, Llama 3.3 70B and Gemma2 9B, both served via Groq. A custom evaluation framework uses cross-model judging, so that no model grades its own output, to reduce self-grading bias.
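The cross-model judging idea can be sketched as a simple assignment rule: whichever model wrote the email, a different model scores it. This is a minimal illustration, not the project's actual code. The model identifiers and the `assign_judge` helper are hypothetical names chosen for the example.

```python
# Hypothetical identifiers: the overview names only the model families,
# not the exact strings passed to the inference API.
MODELS = ("llama-3.3-70b", "gemma2-9b")


def assign_judge(generator: str, models: tuple[str, ...] = MODELS) -> str:
    """Return a judge model that is guaranteed to differ from the generator."""
    if generator not in models:
        raise ValueError(f"unknown model: {generator}")
    candidates = [m for m in models if m != generator]
    if not candidates:
        raise ValueError("cross-model judging needs at least two models")
    return candidates[0]


# Each generator is paired with the other model as its judge.
judging_plan = {gen: assign_judge(gen) for gen in MODELS}
```

With two models the plan is a straight swap. The same rule extends to a larger pool, where a rotation or a majority vote among the non-generating models would be one possible design.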
Key Features
- Streamlit UI for email generation and side-by-side model comparison
- Custom eval metrics: Fact Recall, Tone Alignment, and hybrid Fluency/ROUGE-L
- LLM-as-a-Judge pattern with cross-model judging to avoid self-grading bias
- Groq inference for fast evaluation cycles across both models
- Structured evaluation report with per-metric scores and reasoning
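As a rough illustration of how the metrics above could be computed, the sketch below implements ROUGE-L as an F1 score over the longest common subsequence (LCS) of tokens, blends it with a judge-assigned fluency score, and estimates fact recall. Several details are assumptions made for the example rather than the project's implementation: whitespace tokenization, the plain F1 variant of ROUGE-L, the 50/50 blend weight, substring matching for facts, and all function names.

```python
def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence, using a rolling DP row."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """ROUGE-L as the F1 of LCS-based precision and recall.

    Assumption: lowercase whitespace tokens; reference implementations
    may tokenize differently or weight recall more heavily.
    """
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


def hybrid_fluency(judge_score: float, rouge_l: float,
                   judge_weight: float = 0.5) -> float:
    """Blend a judge fluency score in [0, 1] with ROUGE-L.

    Assumption: the 0.5 default weight is illustrative only.
    """
    return judge_weight * judge_score + (1 - judge_weight) * rouge_l


def fact_recall(email: str, facts: list[str]) -> float:
    """Fraction of required facts found in the email.

    Assumption: case-insensitive substring matching; an LLM judge
    could replace this check to catch paraphrased facts.
    """
    if not facts:
        return 0.0
    text = email.lower()
    return sum(fact.lower() in text for fact in facts) / len(facts)
```

For example, `rouge_l_f1(generated_email, reference_email)` could feed `hybrid_fluency` alongside the score returned by the judging model, and the per-metric results could then be collected into the structured report.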
Status
Completed (assessment)