Case Study: Designing a Human-in-the-Loop Workflow for Question-Type Annotation & Answer Quality Evaluation

A multi-phase annotation platform for classifying question types, evaluating AI-generated answers, and refining answers to improve model performance.

Background

Before large language models became widespread, the client’s team was building a dataset to evaluate and improve automated question-answering systems. To train and assess these systems effectively, they needed a human-centered workflow that could:

  • Classify the type of question being asked

  • Evaluate whether a candidate answer was appropriate for that question type

  • Improve the answer when necessary

  • Include a review layer to ensure annotation quality and consistency

The existing process lacked structure, guidelines were unclear, and annotators interpreted the task differently. There was no unified workflow to support consistent outputs. I was brought in as both a linguistics subject-matter expert and a workflow/UX designer to redesign the system from the ground up.

My Role

Linguistics SME, Workflow Architect, UX Designer

I contributed to the redesign by:

  • Creating the multi-phase annotation workflow

  • Defining and refining the question-type taxonomy

  • Advising the client’s research team on linguistically grounded workflow improvements

  • Designing clear UI interactions for annotators, reviewers, and QA

  • Developing the answer-quality rubric and type-specific guidelines

  • Establishing the review and approval layers

  • Aligning engineers, annotators, QA, and researchers to ensure a consistent, scalable system

The Problem

The project required a reliable way to annotate questions and evaluate AI-generated answers, but:

  • Question types were inconsistently defined

  • Annotators lacked clarity for ambiguous or edge-case questions

  • Answer acceptability was subjective without structured criteria

  • There was no system for revising poor answers

  • Reviewers had limited visibility into the reasoning behind annotations

  • Inconsistencies across annotators reduced the usefulness of the dataset

The workflow needed to be clarified, formalized, and made usable for large teams.

The Solution

I designed a three-stage, human-in-the-loop workflow supported by type-specific guidelines, structured UI patterns, and linguistically informed rubrics.

The finalized workflow:

Annotator 1 → Annotator 2 (Peer Review & Refinement) → QA Final Gate

This balanced quality, efficiency, and cognitive load while producing high-fidelity data.

Phase 1 — Annotator 1: Question Classification & Answer Evaluation

Annotator 1 performed the initial pass: classifying each question against the refined taxonomy, judging whether the candidate answer met the type-specific criteria, and revising the answer when it fell short.

Phase 2 — Annotator 2: Peer Review & Refinement

Annotator 2 acted as a hybrid reviewer and editor.

They could:

  • Approve Annotator 1’s classification and answer

  • Refine or rewrite the answer further

  • Adjust the question type if misclassified

  • Send the task back with feedback

This peer-review layer improved consistency and reduced QA burden.

Phase 3 — QA Final Gate

QA reviewers saw the finalized output with the full annotation history.

They could:

  • Accept

  • Send back for correction

  • Reject

This ensured final-level quality before the data was approved for use.
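
To make the routing concrete, below is a minimal sketch of the three-stage flow modeled as a simple state machine. The stage and action names mirror the workflow described above; the class names, transition table, and example task are illustrative assumptions, not the platform’s actual implementation.

```python
# Minimal sketch of the three-stage review flow as a state machine.
# Stage and action names mirror the workflow above; the transition table
# and example task are illustrative, not the platform's actual code.
from dataclasses import dataclass, field
from enum import Enum, auto


class Stage(Enum):
    ANNOTATOR_1 = auto()   # question classification & answer evaluation
    ANNOTATOR_2 = auto()   # peer review & refinement
    QA = auto()            # final gate
    APPROVED = auto()
    REJECTED = auto()


class Action(Enum):
    SUBMIT = auto()        # Annotator 1 submits a classification + answer
    APPROVE = auto()       # Annotator 2 approves (with or without edits)
    SEND_BACK = auto()     # Annotator 2 or QA returns the task with feedback
    ACCEPT = auto()        # QA accepts the finalized output
    REJECT = auto()        # QA rejects the task outright


# Allowed transitions: (current stage, action) -> next stage
TRANSITIONS = {
    (Stage.ANNOTATOR_1, Action.SUBMIT): Stage.ANNOTATOR_2,
    (Stage.ANNOTATOR_2, Action.APPROVE): Stage.QA,
    (Stage.ANNOTATOR_2, Action.SEND_BACK): Stage.ANNOTATOR_1,
    (Stage.QA, Action.ACCEPT): Stage.APPROVED,
    (Stage.QA, Action.SEND_BACK): Stage.ANNOTATOR_2,
    (Stage.QA, Action.REJECT): Stage.REJECTED,
}


@dataclass
class Task:
    question: str
    answer: str
    question_type: str | None = None
    stage: Stage = Stage.ANNOTATOR_1
    # Full annotation history, visible to QA at the final gate
    history: list[str] = field(default_factory=list)

    def apply(self, action: Action, note: str = "") -> None:
        """Advance the task if the action is valid at the current stage."""
        key = (self.stage, action)
        if key not in TRANSITIONS:
            raise ValueError(f"{action.name} not allowed at stage {self.stage.name}")
        self.history.append(f"{self.stage.name}: {action.name} {note}".strip())
        self.stage = TRANSITIONS[key]


if __name__ == "__main__":
    task = Task(question="When was the bridge built?", answer="In 1932.")
    task.question_type = "factual/date"          # set by Annotator 1
    task.apply(Action.SUBMIT, "classified, answer kept as-is")
    task.apply(Action.APPROVE, "minor rewording")
    task.apply(Action.ACCEPT)
    print(task.stage.name, task.history)
```

Modeling the flow as a single transition table keeps the send-back paths (QA → Annotator 2, Annotator 2 → Annotator 1) explicit and easy to audit.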

Linguistics SME Contributions

I supported the client’s research team by:

Refining the question-type taxonomy

  • Identified overlapping or ambiguous categories

  • Proposed clearer, linguistically grounded distinctions

  • Suggested new categories where appropriate

Shaping answer-quality criteria

  • Designed expectations for each question type

  • Created templates for “what a good answer looks like” (see the rubric sketch after this list)

  • Suggested formatting standards to increase clarity
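
For illustration, here is a minimal sketch of how type-specific answer-quality criteria could be encoded so that annotators, reviewers, and QA apply the same checks. The question types and criteria below are hypothetical placeholders, not the client’s actual taxonomy or rubric.

```python
# Illustrative sketch of type-specific answer-quality criteria encoded as
# data, so annotators, reviewers, and QA apply the same checks. The question
# types and criteria are hypothetical placeholders, not the client's rubric.
from dataclasses import dataclass


@dataclass(frozen=True)
class AnswerCriteria:
    expected_form: str               # what a good answer looks like for this type
    must_include: tuple[str, ...]    # required elements
    formatting: str                  # formatting standard surfaced in the UI


RUBRIC = {
    "factual/date": AnswerCriteria(
        expected_form="a single date or date range, no extra commentary",
        must_include=("explicit date",),
        formatting="one sentence maximum",
    ),
    "procedural/how-to": AnswerCriteria(
        expected_form="ordered steps covering the full procedure",
        must_include=("numbered steps", "final outcome"),
        formatting="numbered list, imperative mood",
    ),
    "yes-no": AnswerCriteria(
        expected_form="a direct yes or no followed by a short justification",
        must_include=("explicit yes or no",),
        formatting="answer first, justification second",
    ),
}


def criteria_for(question_type: str) -> AnswerCriteria:
    """Look up the expectations an annotator should apply for a given type."""
    return RUBRIC[question_type]


if __name__ == "__main__":
    print(criteria_for("procedural/how-to").expected_form)
```

Keeping the expectations in one structure like this makes it straightforward to surface the same guidance to annotators, reviewers, and QA in the UI.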

Improving the overall workflow

  • Recommended cognitively efficient task sequencing

  • Proposed clearer surfacing of guidelines within the UI

  • Helped restructure the tasks based on linguistic reasoning rather than operational convenience

This strengthened the client’s broader research methodology and improved the interpretability of their data.

Results & Impact

  • Clear, scalable annotation workflow that standardized outputs across annotators

  • Higher inter-annotator agreement due to structured rubrics and type-specific guidelines (see the measurement sketch at the end of this section)

  • Higher-quality answer corrections using linguistically informed templates

  • Reduced QA workload due to the peer-review stage

  • Generalizable workflow adaptable to future question types and evaluation tasks

  • Improved research design through SME-level input on categories, formats, and workflows
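
For reference, the agreement improvement noted above is the kind of outcome that can be quantified with a chance-corrected statistic such as Cohen’s kappa. The sketch below uses scikit-learn and made-up labels; it is not the metric or data from this project.

```python
# Sketch of quantifying inter-annotator agreement on question-type labels
# with Cohen's kappa. The labels are made-up examples; the project's actual
# measurement approach may have differed.
from sklearn.metrics import cohen_kappa_score

# Question-type labels assigned independently by two annotators
annotator_1 = ["factual", "procedural", "yes-no", "factual", "procedural"]
annotator_2 = ["factual", "procedural", "yes-no", "yes-no", "procedural"]

kappa = cohen_kappa_score(annotator_1, annotator_2)
print(f"Cohen's kappa: {kappa:.2f}")  # 1.0 = perfect agreement, 0 = chance level
```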