Case Study: Designing a Human-in-the-Loop Workflow for Question-Type Annotation & Answer Quality Evaluation

A multi-phase annotation platform for classifying question types, evaluating AI-generated answers, and refining answers to improve model performance.

Background

Before large language models became widespread, the client’s team was building a dataset to evaluate and improve automated question-answering systems. To train and assess these systems effectively, they needed a human-centered workflow that could:

  • Classify the type of question being asked

  • Evaluate whether a candidate answer was appropriate for that question type

  • Improve the answer when necessary

  • Include a review layer to ensure annotation quality and consistency

The existing process lacked structure, guidelines were unclear, and annotators interpreted the task differently. There was no unified workflow to support consistent outputs. I was brought in as both a linguistics subject-matter expert and a workflow/UX designer to redesign the system from the ground up.

My Role

Linguistics SME, Workflow Architect, UX Designer

I contributed to the redesign by:

  • Creating the multi-phase annotation workflow

  • Defining and refining the question-type taxonomy

  • Advising the client’s research team on linguistically grounded workflow improvements

  • Designing clear UI interactions for annotators, reviewers, and QA

  • Developing the answer-quality rubric and type-specific guidelines

  • Establishing the review and approval layers

  • Aligning engineers, annotators, QA, and researchers to ensure a consistent, scalable system

The Problem

The project required a reliable way to annotate questions and evaluate AI-generated answers, but:

  • Question types were inconsistently defined

  • Annotators lacked clarity for ambiguous or edge-case questions

  • Answer acceptability was subjective without structured criteria

  • There was no system for revising poor answers

  • Reviewers had limited visibility into the reasoning behind annotations

  • Inconsistencies across annotators reduced the usefulness of the dataset

The workflow needed to be clarified, formalized, and made usable for large teams.

The Solution

I designed a three-stage, human-in-the-loop workflow supported by type-specific guidelines, structured UI patterns, and linguistically informed rubrics.

The finalized workflow:

Annotator 1 → Annotator 2 (Peer Review & Refinement) → QA Final Gate

This balanced quality, efficiency, and cognitive load while producing high-fidelity data.

Phase 1 — Annotator 1: Question Classification & Answer Evaluation

Annotator 1 performed the initial pass: classifying each question against the refined taxonomy, judging whether the candidate answer met the type-specific criteria, and revising the answer when it fell short.

Phase 2 — Annotator 2: Peer Review & Refinement

Annotator 2 acted as a hybrid reviewer and editor.

They could:

  • Approve Annotator 1’s classification and answer

  • Refine or rewrite the answer further

  • Adjust the question type if misclassified

  • Send the task back with feedback

This peer-review layer improved consistency and reduced QA burden.

Phase 3 — QA Final Gate

QA reviewers saw the finalized output with the full annotation history.

They could:

  • Accept

  • Send back for correction

  • Reject

This ensured final-level quality before the data was approved for use.
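
To make the routing concrete, below is a minimal sketch of the three-stage flow modeled as a simple state machine. The stage and action names mirror the workflow described above; the class names, transition table, and example task are illustrative assumptions, not the platform’s actual implementation.

```python
# Minimal sketch of the three-stage review flow as a state machine.
# Stage and action names mirror the workflow above; the transition table
# and example task are illustrative, not the platform's actual code.
from dataclasses import dataclass, field
from enum import Enum, auto


class Stage(Enum):
    ANNOTATOR_1 = auto()   # question classification & answer evaluation
    ANNOTATOR_2 = auto()   # peer review & refinement
    QA = auto()            # final gate
    APPROVED = auto()
    REJECTED = auto()


class Action(Enum):
    SUBMIT = auto()        # Annotator 1 submits a classification + answer
    APPROVE = auto()       # Annotator 2 approves (with or without edits)
    SEND_BACK = auto()     # Annotator 2 or QA returns the task with feedback
    ACCEPT = auto()        # QA accepts the finalized output
    REJECT = auto()        # QA rejects the task outright


# Allowed transitions: (current stage, action) -> next stage
TRANSITIONS = {
    (Stage.ANNOTATOR_1, Action.SUBMIT): Stage.ANNOTATOR_2,
    (Stage.ANNOTATOR_2, Action.APPROVE): Stage.QA,
    (Stage.ANNOTATOR_2, Action.SEND_BACK): Stage.ANNOTATOR_1,
    (Stage.QA, Action.ACCEPT): Stage.APPROVED,
    (Stage.QA, Action.SEND_BACK): Stage.ANNOTATOR_2,
    (Stage.QA, Action.REJECT): Stage.REJECTED,
}


@dataclass
class Task:
    question: str
    answer: str
    question_type: str | None = None
    stage: Stage = Stage.ANNOTATOR_1
    # Full annotation history, visible to QA at the final gate
    history: list[str] = field(default_factory=list)

    def apply(self, action: Action, note: str = "") -> None:
        """Advance the task if the action is valid at the current stage."""
        key = (self.stage, action)
        if key not in TRANSITIONS:
            raise ValueError(f"{action.name} not allowed at stage {self.stage.name}")
        self.history.append(f"{self.stage.name}: {action.name} {note}".strip())
        self.stage = TRANSITIONS[key]


if __name__ == "__main__":
    task = Task(question="When was the bridge built?", answer="In 1932.")
    task.question_type = "factual/date"          # set by Annotator 1
    task.apply(Action.SUBMIT, "classified, answer kept as-is")
    task.apply(Action.APPROVE, "minor rewording")
    task.apply(Action.ACCEPT)
    print(task.stage.name, task.history)
```

Modeling the flow as a single transition table keeps the send-back paths (QA → Annotator 2, Annotator 2 → Annotator 1) explicit and easy to audit.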

Linguistics SME Contributions

I supported the client’s research team by:

Refining the question-type taxonomy

  • Identified overlapping or ambiguous categories

  • Proposed clearer, linguistically grounded distinctions

  • Suggested new categories where appropriate

Shaping answer-quality criteria

  • Designed expectations for each question type

  • Created templates for “what a good answer looks like” (see the rubric sketch after this list)

  • Suggested formatting standards to increase clarity
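
For illustration, here is a minimal sketch of how type-specific answer-quality criteria could be encoded so that annotators, reviewers, and QA apply the same checks. The question types and criteria below are hypothetical placeholders, not the client’s actual taxonomy or rubric.

```python
# Illustrative sketch of type-specific answer-quality criteria encoded as
# data, so annotators, reviewers, and QA apply the same checks. The question
# types and criteria are hypothetical placeholders, not the client's rubric.
from dataclasses import dataclass


@dataclass(frozen=True)
class AnswerCriteria:
    expected_form: str               # what a good answer looks like for this type
    must_include: tuple[str, ...]    # required elements
    formatting: str                  # formatting standard surfaced in the UI


RUBRIC = {
    "factual/date": AnswerCriteria(
        expected_form="a single date or date range, no extra commentary",
        must_include=("explicit date",),
        formatting="one sentence maximum",
    ),
    "procedural/how-to": AnswerCriteria(
        expected_form="ordered steps covering the full procedure",
        must_include=("numbered steps", "final outcome"),
        formatting="numbered list, imperative mood",
    ),
    "yes-no": AnswerCriteria(
        expected_form="a direct yes or no followed by a short justification",
        must_include=("explicit yes or no",),
        formatting="answer first, justification second",
    ),
}


def criteria_for(question_type: str) -> AnswerCriteria:
    """Look up the expectations an annotator should apply for a given type."""
    return RUBRIC[question_type]


if __name__ == "__main__":
    print(criteria_for("procedural/how-to").expected_form)
```

Keeping the expectations in one structure like this makes it straightforward to surface the same guidance to annotators, reviewers, and QA in the UI.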

Improving the overall workflow

  • Recommended cognitively efficient task sequencing

  • Proposed clearer surfacing of guidelines within the UI

  • Helped restructure the tasks based on linguistic reasoning rather than operational convenience

This strengthened the client’s broader research methodology and improved the interpretability of their data.

Results & Impact

  • Clear, scalable annotation workflow that standardized outputs across annotators

  • Higher inter-annotator agreement due to structured rubrics and type-specific guidelines (see the measurement sketch at the end of this section)

  • Higher-quality answer corrections using linguistically informed templates

  • Reduced QA workload due to the peer-review stage

  • Generalizable workflow adaptable to future question types and evaluation tasks

  • Improved research design through SME-level input on categories, formats, and workflows
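
For reference, the agreement improvement noted above is the kind of outcome that can be quantified with a chance-corrected statistic such as Cohen’s kappa. The sketch below uses scikit-learn and made-up labels; it is not the metric or data from this project.

```python
# Sketch of quantifying inter-annotator agreement on question-type labels
# with Cohen's kappa. The labels are made-up examples; the project's actual
# measurement approach may have differed.
from sklearn.metrics import cohen_kappa_score

# Question-type labels assigned independently by two annotators
annotator_1 = ["factual", "procedural", "yes-no", "factual", "procedural"]
annotator_2 = ["factual", "procedural", "yes-no", "yes-no", "procedural"]

kappa = cohen_kappa_score(annotator_1, annotator_2)
print(f"Cohen's kappa: {kappa:.2f}")  # 1.0 = perfect agreement, 0 = chance level
```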