AI Engineer Assessment Pipeline

Dual-model email generation assistant with custom LLM evaluation metrics

Python · Streamlit · Groq · Llama 3.3 70B · Gemma2 9B · ROUGE-L · LLM-as-a-Judge

Overview

An email generation assistant that benchmarks two LLMs head-to-head: Llama 3.3 70B and Gemma2 9B, both served via Groq. Outputs are scored by a custom evaluation framework with cross-model judging, so that no model grades its own output.
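
As a rough illustration of cross-model judging, the sketch below pairs each generator with the other model as its judge. It assumes the judge is simply the other of the two models; the model labels, `assign_judges`, `cross_judge`, and the `judge_fn` callback are hypothetical names for illustration, not the project's actual code, and no Groq call is made here.

```python
from typing import Callable, Dict, Tuple

# Hypothetical labels; the exact Groq model IDs used are not stated in the source.
MODELS: Tuple[str, str] = ("llama-3.3-70b", "gemma2-9b")


def assign_judges(models: Tuple[str, ...]) -> Dict[str, str]:
    """Map each generator to the *other* model, so no model grades itself."""
    if len(models) != 2 or models[0] == models[1]:
        raise ValueError("cross-model judging here needs exactly two distinct models")
    first, second = models
    return {first: second, second: first}


def cross_judge(
    emails: Dict[str, str],
    judge_fn: Callable[[str, str], dict],
) -> Dict[str, dict]:
    """Score each generated email with the opposite model.

    `emails` maps generator name -> generated email text.
    `judge_fn(judge_model, email_text)` is assumed to call the judge LLM
    and return a dict of scores; it is injected so it can be stubbed.
    """
    judges = assign_judges(tuple(emails))
    return {
        generator: {"judge": judges[generator], **judge_fn(judges[generator], text)}
        for generator, text in emails.items()
    }
```

In practice `judge_fn` would wrap a chat-completion request to the judge model and parse its scores; passing it in as a parameter keeps the pairing logic testable without network access.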

Key Features

Streamlit UI for email generation and side-by-side model comparison
Custom eval metrics: Fact Recall, Tone Alignment, and hybrid Fluency/ROUGE-L
LLM-as-a-Judge pattern with cross-model judging to avoid self-grading bias
Groq inference for fast evaluation cycles across both models
Structured evaluation report with per-metric scores and reasoning
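
Two of the metrics above can be approximated without an LLM. The following is a minimal sketch, not the project's implementation: ROUGE-L is computed as an F1 score over the longest common subsequence of whitespace tokens, and Fact Recall is assumed to be the fraction of required facts that appear verbatim (case-insensitively) in the email. The function names and the substring-matching rule are assumptions; the original may rely on a library or on the judge model instead.

```python
from typing import List


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence, using a rolling DP row."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """ROUGE-L F1 over lowercase whitespace tokens (simplified tokenization)."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


def fact_recall(email: str, facts: List[str]) -> float:
    """Share of required facts found in the email (assumed substring match)."""
    if not facts:
        return 0.0
    text = email.lower()
    return sum(fact.lower() in text for fact in facts) / len(facts)
```

Substring matching is strict: a paraphrased fact ("3 in the afternoon" for "3pm") counts as missed, which is one reason a judge model may be preferable for this metric.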

Status

Completed (assessment)