How To: Build a Scalable Offline-MMC Chat Evaluation Pipeline in Labelbox

Hands-on guide to building a scalable Offline Multi-Modal Chat (MMC) Evaluation Pipeline in Labelbox — a framework we built to support the surge in enterprise GenAI conversational evaluations.

:rocket: Overview

With the surge in Multi-Modal Chat (MMC) projects, the demand for robust Offline MMC evaluation is rapidly accelerating. It’s becoming a cornerstone of enterprise GenAI model assessments, enabling teams to reliably annotate and review conversations turn-by-turn — even without live model responses — as text and visual context become integral to dialogue.

However, clear documentation on how to do this with the SDK is scarce. This guide offers a practical, step-by-step framework to help you bridge that gap and implement effective offline MMC evaluation.

For enterprise customers, we built and delivered a full offline MMC evaluation pipeline using the Labelbox UI and SDK. This post walks through how we did it, including:

  • Structuring your offline MMC JSONs for import into the editor, based on Labelbox’s official documentation
  • Embedding inline content and message flow (images, videos, PDFs, audio, etc.)
  • Attaching dynamic turn instructions for both precisely structured and free-flowing conversations
  • Lessons learned from debugging multi-turn annotations
  • Tips for teams scaling up review workflows

:hammer_and_wrench: How We Structure MMC row_data

Labelbox requires a strict structure to import multi-modal chat (MMC) data correctly. Each conversation is represented using:

  • messages: Turn-by-turn conversation data
  • actorId: Identifies who is speaking (typically human or model)
  • childMessageIds: Defines the threading (who replies to whom)
  • content: An array of objects such as text, IMAGE_OVERLAY, fileData, or html
  • fileUri: The HTTPS path to a cloud-hosted attachment file; used for fileData content items
  • mimeType: The MIME type of the attachment referenced by fileUri (e.g., image/png)

Example:

{
  "actorId": "",
  "childMessageIds": [],
  "content": [
     {
        "type": "text",
        "content": "What do you see in this image?"
     }, 
     {
        "type": "fileData",
        "fileUri": "https://link-to-my-image",
        "mimeType": "image/png"
     }
  ]
}

To support navigation, each message includes childMessageIds, and the entire chat is rooted in rootMessageIds.
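Putting those pieces together, a complete row_data payload for a short exchange looks roughly like the sketch below. This is a minimal two-message example; the envelope fields (type, version, actors) follow Labelbox’s published offline MMC format, but verify them against the current documentation before importing at scale.

{
  "type": "application/vnd.labelbox.conversational.model-chat-evaluation",
  "version": 2,
  "actors": {
    "user-1": { "role": "human", "metadata": { "name": "User" } },
    "model-1": { "role": "model", "metadata": { "name": "Assistant" } }
  },
  "messages": {
    "message-1": {
      "actorId": "user-1",
      "content": [
        { "type": "text", "content": "What do you see in this image?" },
        { "type": "fileData", "fileUri": "https://link-to-my-image", "mimeType": "image/png" }
      ],
      "childMessageIds": ["message-2"]
    },
    "message-2": {
      "actorId": "model-1",
      "content": [
        { "type": "text", "content": "A lighthouse on a rocky coast at sunset." }
      ],
      "childMessageIds": []
    }
  },
  "rootMessageIds": ["message-1"]
}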

We generated this data programmatically (see the sketch after this list) to guarantee consistency in:

  • Message threading
  • Turn order
  • Actor attribution
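Here is a minimal generator in that spirit. It is a sketch, not our production code; build_row_data and the alternating-turn input list are illustrative assumptions. Chaining each message to the next via childMessageIds means threading and turn order can never drift apart:

# Hypothetical helper: builds the messages map for a strictly alternating
# human/model conversation and chains turns via childMessageIds.
def build_row_data(turns):
    messages = {}
    for i, (actor_id, text) in enumerate(turns):
        message_id = f"message-{i + 1}"
        messages[message_id] = {
            "actorId": actor_id,
            "content": [{"type": "text", "content": text}],
            # Point at the next turn; the final message has no children.
            "childMessageIds": [f"message-{i + 2}"] if i + 1 < len(turns) else [],
        }
    return {"messages": messages, "rootMessageIds": ["message-1"]}

row_data = build_row_data([
    ("user-1", "What do you see in this image?"),
    ("model-1", "A lighthouse on a rocky coast at sunset."),
])

Because message IDs are derived from the loop index, actor attribution and ordering come for free; merge the result into the envelope shown earlier before upload.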

:locked: Important Note: Any links to external assets (images, audio files, PDFs, videos) must be publicly accessible over HTTPS, shared through pre-signed URLs, or reachable inside a VPN or SSO-authenticated environment where annotators are already logged in. Otherwise, Labelbox cannot render them in the labeling interface.
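For private buckets, pre-signed URLs are usually the lightest-weight option. Here is a minimal sketch with boto3, assuming your assets live in S3; the bucket and key names are placeholders, and the expiry must outlive your labeling window:

import boto3

# Generate a time-limited HTTPS link for a private S3 object so the
# Labelbox editor can render it without making the bucket public.
s3 = boto3.client("s3")
presigned_url = s3.generate_presigned_url(
    "get_object",
    Params={"Bucket": "my-assets-bucket", "Key": "images/thumbnail.png"},
    ExpiresIn=7 * 24 * 3600,  # seconds; set this longer than the labeling window
)
print(presigned_url)  # use this value as the fileUri in your row_data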

:framed_picture: Example: Inline Visual Recommendations

Here’s an example MMC scenario used for evaluating a recommendation model. The model responds to user movie queries with suggestions, embedding movie thumbnails directly into the response via the content array described above.

Turn 1: Greek

User Input: “Greek”
Model Response:

Turn 1: Opa! Enjoy these picks that celebrate Greek culture and mythology.
<img src='...trip_to_greece.jpg' /> <img src='...maestro_in_blue.jpg' /> <img src='...dead_europe.jpg' />

Turn 2: Big fat

User Input: “Big fat”
Model Response:

Turn 2: Are you looking for "Big Fat Liar," or just feeling big and fat? Either way, we've got options.
<img src='...big_fat_liar.jpg' /> <img src='...sausage_party.jpg' /> <img src='...big_daddy.jpg' />

Turn 3: Big fat Greek wedding

User Input: “Big fat Greek wedding”
Model Response:

Turn 3: We don't have "My Big Fat Greek Wedding" right now, but if you're looking for a charming rom-com, these might fit the bill.
<img src='...last_holiday.jpg' /> <img src='...falling_in_love.jpg' /> <img src='...the_holiday.jpg' />

This structure is ideal for surfacing and evaluating GenAI responses that include dynamic, visual elements.
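Under the hood, each model response above is a single message whose content array mixes a text item with an html item carrying the thumbnails. Here is a sketch of the Turn 1 response, assuming the html content type listed earlier accepts raw markup in its content field (the image URLs are elided, as in the turns above):

{
  "actorId": "model-1",
  "childMessageIds": [],
  "content": [
    {
      "type": "text",
      "content": "Opa! Enjoy these picks that celebrate Greek culture and mythology."
    },
    {
      "type": "html",
      "content": "<img src='...trip_to_greece.jpg' /> <img src='...maestro_in_blue.jpg' /> <img src='...dead_europe.jpg' />"
    }
  ]
}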

:pushpin: Attaching Turn Instructions

Labelbox allows you to attach metadata to each Data Row using RAW_TEXT attachments. We used this feature to provide turn-level instructions — enabling annotators to see context-specific guidance.

Here’s a ready-to-use script to attach instructions using the SDK:

import json
from instructions import instruction_data  # Your JSON dictionary with instructions
from labelbox import Client

client = Client(api_key="___your_api_key_here___")

data_row_ids = ["___your_data_row_id_here___", "___your_data_row_id_here___"]
for data_row_id in data_row_ids:
    print("Adding instructions to data row: ", data_row_id, "...")
    data_row = client.get_data_row(data_row_id)
    attachment_value = json.dumps(instruction_data)

    # Get all attachments and find the one named "turn-instructions"
    attachments = data_row.attachments()
    attachment_list = list(attachments)
    turn_instructions_attachment = None

    # Find the "turn-instructions" attachment
    for attachment in attachment_list:
        if attachment.attachment_name == "turn-instructions":
            turn_instructions_attachment = attachment
            break

    # Update if found, create if not
    if turn_instructions_attachment:
        print("...updating existing 'turn-instructions' attachment")
        turn_instructions_attachment.update(
            value=attachment_value,
        )
    else:
        print("...no 'turn-instructions' attachment found. Creating a new one.")
        data_row.create_attachment(
            attachment_type="RAW_TEXT",
            attachment_value=attachment_value,
            attachment_name="turn-instructions",
        )

Note: If a data row already contains turn instructions, the script updates the existing attachment in place; if it has none, a new "turn-instructions" attachment is created.

Sample Instruction Data

instruction_data = {
    "turn_1": """
    Instructions: with the old key
    - Recording Distance: Medium (5 feet); 
    - Microphone Orientation: Slightly angled away; 
    - Background Noise: Mild room hum; 
    - Speech: Hesitant pauses (e.g., \"uh\" and \"um\")
    - Prompt: Hey, um… can you tell me, um… how many planets are in our solar system?
    """,

    "turn_2": """
    Instructions: with the new key
    - Recording Distance: Far (8 feet); 
    - Microphone Orientation: Slightly covered by user's hand; 
    - Background Noise: Moderate hum from nearby appliance; 
    - Speech: Normal volume but slightly muffled
    - Prompt: Which planet is the largest, and how far **muffled** is it from the Sun? 
    """,
}

:white_check_mark: The script supports both turn_1 and Turn_1 as keys.

:prohibited: Common Pitfalls (and Fixes)

  • If the UI skips turns → Validate childMessageIds and their order (a validator sketch follows this list).
  • Instruction metadata not visible? → Ensure attachment_type="RAW_TEXT".
  • Malformed visuals? → Sanitize HTML and confirm URLs are accessible.
  • Labeling errors on deep threads? → Revisit your message graph and validate nesting.
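For the threading pitfalls in particular, a small pre-import check pays for itself. The validator below is our own sketch, not part of the SDK; it assumes the conversation graph is a tree and catches dangling childMessageIds, cycles, and unreachable messages before Labelbox ever sees the file:

# Hypothetical pre-import check for an offline MMC payload. Assumes the
# conversation is a tree (each message has at most one parent).
def validate_message_graph(row_data):
    messages = row_data["messages"]

    # Every referenced child must exist.
    for message_id, message in messages.items():
        for child_id in message["childMessageIds"]:
            if child_id not in messages:
                raise ValueError(f"{message_id} references missing message {child_id}")

    # Walk from the roots; revisiting a node means a cycle or a shared child.
    seen, stack = set(), list(row_data["rootMessageIds"])
    while stack:
        message_id = stack.pop()
        if message_id in seen:
            raise ValueError(f"cycle or duplicate parent at {message_id}")
        seen.add(message_id)
        stack.extend(messages[message_id]["childMessageIds"])

    # Messages never reached from a root will not render in the editor.
    orphans = set(messages) - seen
    if orphans:
        raise ValueError(f"unreachable messages: {sorted(orphans)}")

Running this over each generated payload before upload turns the first and last pitfalls above into build-time failures instead of labeling-time surprises.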

:chart_increasing: Impact and Outcome

The system supports:

  • 100+ multi-turn conversations with embedded context
  • Review and evaluation of LLM-generated assistant responses
  • Fully offline workflows using pre-generated JSON data

Once deployed, internal teams could generate and label new rows using standardized scripts — no manual engineering needed.

:speech_balloon: Want to Give It a Shot in Your Org?

If you’re working with GenAI outputs, multi-turn evaluations, or human-in-the-loop reviews of already generated model responses, this pipeline is a solid starting point.

If you found this guide helpful, feel free to drop a like and share your thoughts or questions in the comments — we’d love to hear from you!

Let’s build more reliable and scalable evaluation pipelines together. :rocket:
