AI Engineer Assessment Pipeline

Dual-model email generation assistant with custom LLM evaluation metrics

Python · Streamlit · Groq · Llama 3.3 70B · Gemma2 9B · ROUGE-L · LLM-as-a-Judge

Overview

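An email generation assistant that benchmarks two LLMs head-to-head, Llama 3.3 70B and Gemma2 9B, both served via Groq. A custom evaluation framework scores each model's emails, and cross-model judging (each model's output is graded by the other model) keeps a model from grading its own work, which reduces self-grading bias.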
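The cross-model judging idea can be illustrated with a minimal sketch. The function name `assign_judges` and the model labels below are hypothetical, chosen for illustration only; they are not taken from the project's code and are not confirmed Groq model identifiers. The sketch only shows the pairing rule, not the judging calls themselves.

```python
def assign_judges(models):
    """Map each candidate model to a judge that is a different model.

    Hypothetical helper: rotates the list by one position so that every
    model is judged by its neighbour and never by itself.
    """
    if len(models) < 2 or len(set(models)) != len(models):
        raise ValueError("cross-model judging needs at least two distinct models")
    return {m: models[(i + 1) % len(models)] for i, m in enumerate(models)}


# Illustrative labels only (assumption), not verified API model IDs.
MODELS = ["llama-3.3-70b", "gemma2-9b"]
JUDGES = assign_judges(MODELS)

for candidate, judge in JUDGES.items():
    print(f"{candidate} is judged by {judge}")
```

With two models the rotation reduces to a simple swap: each model judges the other's email.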

Key Features

  • Streamlit UI for email generation and side-by-side model comparison
  • Custom eval metrics: Fact Recall, Tone Alignment, and hybrid Fluency/ROUGE-L
  • LLM-as-a-Judge pattern with cross-model judging to avoid self-grading bias
  • Groq inference for fast evaluation cycles across both models
  • Structured evaluation report with per-metric scores and reasoning
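As a rough illustration of two of the metrics listed above, the sketch below computes a ROUGE-L F1 score from the longest common subsequence of whitespace tokens, plus a simplified fact-recall ratio. Several things here are assumptions: the function names are hypothetical, the tokenization is plain lowercased whitespace splitting, F1 (beta = 1) is used for ROUGE-L, and fact recall is approximated by case-insensitive substring matching. The project may well compute these differently (for example with an LLM judge), and how ROUGE-L is blended with fluency in the hybrid metric is not shown because the source does not specify it.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """ROUGE-L F1 over lowercased whitespace tokens (assumed tokenization)."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


def fact_recall(email, facts):
    """Fraction of required facts that appear verbatim in the email.

    Simplification (assumption): a fact counts as recalled only if it
    occurs as a case-insensitive substring.
    """
    if not facts:
        return 0.0
    text = email.lower()
    hits = sum(1 for fact in facts if fact.lower() in text)
    return hits / len(facts)


score = rouge_l_f1("the meeting is on friday", "the meeting is moved to friday")
recall = fact_recall("Hi Sam, the demo is on Friday at 3pm.", ["Friday", "3pm", "Room 4"])
print(round(score, 3), round(recall, 3))
```

Substring matching is cheap and deterministic, but it misses paraphrased facts, which is one reason an LLM judge is often used for this kind of check.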

Status

Completed (assessment)