Hands-on guide to building a scalable Offline Multi-Modal Chat (MMC) Evaluation Pipeline in Labelbox — a framework we built to support the surge in enterprise GenAI conversational evaluations.
Overview
With the surge in Multi-Modal Chat (MMC) projects, the demand for robust Offline MMC evaluation is rapidly accelerating. It’s becoming a cornerstone of enterprise GenAI model assessments, enabling teams to reliably annotate and review conversations turn-by-turn — even without live model responses — as text and visual context become integral to dialogue.
However, clear documentation on how to do this with the SDK is scarce, so this guide offers a practical, step-by-step framework to help you bridge that gap and implement effective offline-MMC evaluation.
For enterprise customers, we built and delivered a full offline MMC evaluation pipeline using the Labelbox UI and SDK. This post walks through how we did it, including:
- Structuring your offline-MMC JSONs to import into the editor based on Labelbox’s official documentation
- Embedding inline content and message flow (images, videos, PDFs, audio, and more)
- Attaching dynamic turn instructions for precisely structured & free-flowing conversations
- Lessons learned from debugging multi-turn annotations
- Tips for teams scaling up review workflows
How We Structure MMC row_data
Labelbox requires a strict structure to import multi-modal chat (MMC) data correctly. Each conversation is represented using:
- messages: Turn-by-turn conversation data
- actorId: Identifies who is speaking (typically human or model)
- childMessageIds: Defines the threading (who replies to whom)
- content: An array of objects such as text, IMAGE_OVERLAY, fileData, or html
- fileUri: Contains an https path to a public cloud-hosted attachment file. This field is used for fileData type messages.
- mimeType: The mimeType of your attachment's fileUri data
Example:
{
  "actorId": "",
  "childMessageIds": [],
  "content": [
    {
      "type": "text",
      "content": "What do you see in this image?"
    },
    {
      "type": "fileData",
      "fileUri": "https://link-to-my-image",
      "mimeType": "image/png"
    }
  ]
}
To support navigation, each message includes childMessageIds, and the entire chat is rooted in rootMessageIds.
We generated this data programmatically to guarantee consistency in:
- Message threading
- Turn order
- Actor attribution
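To make that concrete, here is a minimal sketch of the kind of generator we mean, assuming a simple alternating human/model conversation. The messages dictionary keyed by message ID and the top-level envelope are illustrative; the field names follow the structure described above, but check Labelbox's official offline-MMC documentation for the exact import envelope.

import json
import uuid


def build_conversation(turns):
    """Build an offline-MMC-style message graph from a list of
    (user_text, model_content) pairs. Field names follow the structure
    described above; the envelope is illustrative, not the exact schema."""
    messages = {}
    root_message_ids = []
    previous_model_id = None

    for user_text, model_content in turns:
        user_id = str(uuid.uuid4())
        model_id = str(uuid.uuid4())

        # Human turn: plain text, threaded to the model reply.
        messages[user_id] = {
            "actorId": "human",
            "childMessageIds": [model_id],
            "content": [{"type": "text", "content": user_text}],
        }
        # Model turn: arbitrary content chunks (text, fileData, html, ...).
        messages[model_id] = {
            "actorId": "model",
            "childMessageIds": [],
            "content": model_content,
        }

        if previous_model_id is None:
            root_message_ids.append(user_id)  # first turn roots the chat
        else:
            messages[previous_model_id]["childMessageIds"].append(user_id)
        previous_model_id = model_id

    return {"messages": messages, "rootMessageIds": root_message_ids}


conversation = build_conversation([
    ("What do you see in this image?",
     [{"type": "fileData", "fileUri": "https://link-to-my-image", "mimeType": "image/png"}]),
])
print(json.dumps(conversation, indent=2))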
Important Note: Any links to external assets such as images, audio files, PDFs, or videos must be publicly accessible (via HTTPS), shared through pre-signed URLs, or internally available via VPN or SSO-authenticated environments where users are already logged in. Otherwise, Labelbox will not be able to render them in the labeling interface.
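If your assets live in private cloud storage, pre-signed URLs are usually the simplest option. Below is a minimal sketch assuming the files sit in an S3 bucket; the bucket name, object key, and expiry are placeholders.

import boto3

# Assumes AWS credentials are already configured in the environment.
s3 = boto3.client("s3")

# Generate a time-limited HTTPS URL that Labelbox can render without extra auth.
presigned_url = s3.generate_presigned_url(
    "get_object",
    Params={"Bucket": "my-mmc-assets", "Key": "thumbnails/trip_to_greece.jpg"},
    ExpiresIn=7 * 24 * 3600,  # keep the link valid for the expected labeling window
)

# Use presigned_url as a fileUri (or inside an <img src='...'>) in your row_data.
print(presigned_url)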
Example: Inline Visual Recommendations
Here’s an example MMC scenario used for evaluating a recommendation model. The model responds to user movie queries with suggestions, embedding movie thumbnails directly into the response using the content attribute in the SDK.
Turn 1: Greek
User Input: “Greek”
Model Response:
Turn 1: Opa! Enjoy these picks that celebrate Greek culture and mythology.
<img src='...trip_to_greece.jpg' /> <img src='...maestro_in_blue.jpg' /> <img src='...dead_europe.jpg' />
Turn 2: Big fat
User Input: “Big fat”
Model Response:
Turn 2: Are you looking for "Big Fat Liar," or just feeling big and fat? Either way, we've got options.
<img src='...big_fat_liar.jpg' /> <img src='...sausage_party.jpg' /> <img src='...big_daddy.jpg' />
Turn 3: Big fat Greek wedding
User Input: “Big fat Greek wedding”
Model Response:
Turn 3: We don't have "My Big Fat Greek Wedding" right now, but if you're looking for a charming rom-com, these might fit the bill.
<img src='...last_holiday.jpg' /> <img src='...falling_in_love.jpg' /> <img src='...the_holiday.jpg' />
This structure is ideal for surfacing and evaluating GenAI responses that include dynamic, visual elements.
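For reference, the model reply in Turn 1 could be expressed as a message object along these lines. This is a sketch assuming the html content type carries inline markup and reuses the same content field as the text example; the image URLs are placeholders, so confirm the exact chunk shape against the official content-type reference.

# Illustrative only: a model turn mixing a text chunk with inline HTML thumbnails.
model_turn_1 = {
    "actorId": "model",
    "childMessageIds": [],
    "content": [
        {
            "type": "text",
            "content": "Turn 1: Opa! Enjoy these picks that celebrate Greek culture and mythology.",
        },
        {
            "type": "html",
            "content": (
                "<img src='https://example.com/trip_to_greece.jpg' /> "
                "<img src='https://example.com/maestro_in_blue.jpg' /> "
                "<img src='https://example.com/dead_europe.jpg' />"
            ),
        },
    ],
}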
Attaching Turn Instructions
Labelbox allows you to attach metadata to each Data Row using RAW_TEXT attachments. We used this feature to provide turn-level instructions, enabling annotators to see context-specific guidance.
Here’s a ready-to-use script to attach instructions using the SDK:
import json

from instructions import instruction_data  # Your JSON dictionary with instructions
from labelbox import Client

client = Client(api_key="___your_api_key_here___")
data_row_ids = ["___your_data_row_id_here___", "___your_data_row_id_here___"]

for data_row_id in data_row_ids:
    print("Adding instructions to data row: ", data_row_id, "...")
    data_row = client.get_data_row(data_row_id)
    attachment_value = json.dumps(instruction_data)

    # Get all attachments and find the one named "turn-instructions"
    attachments = data_row.attachments()
    attachment_list = list(attachments)
    turn_instructions_attachment = None

    # Find the "turn-instructions" attachment
    for attachment in attachment_list:
        if attachment.attachment_name == "turn-instructions":
            turn_instructions_attachment = attachment
            break

    # Update if found, create if not
    if turn_instructions_attachment:
        print("...updating existing 'turn-instructions' attachment")
        turn_instructions_attachment.update(
            value=attachment_value,
        )
    else:
        print("...no 'turn-instructions' attachment found. Creating a new one.")
        data_row.create_attachment(
            attachment_type="RAW_TEXT",
            attachment_value=attachment_value,
            attachment_name="turn-instructions",
        )
Note: If a data row already contains a "turn-instructions" attachment, the script updates the existing instructions; if the data row has none, it creates a new attachment.
Sample Instruction Data
instruction_data = {
    "turn_1": """
    Instructions: with the old key
    - Recording Distance: Medium (5 feet);
    - Microphone Orientation: Slightly angled away;
    - Background Noise: Mild room hum;
    - Speech: Hesitant pauses (e.g., \"uh\" and \"um\")
    - Prompt: Hey, um… can you tell me, um… how many planets are in our solar system?
    """,
    "turn_2": """
    Instructions: with the new key
    - Recording Distance: Far (8 feet);
    - Microphone Orientation: Slightly covered by user's hand;
    - Background Noise: Moderate hum from nearby appliance;
    - Speech: Normal volume but slightly muffled
    - Prompt: Which planet is the largest, and how far **muffled** is it from the Sun?
    """,
}
The script supports both turn_1 and Turn_1 as keys.
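If your instruction dictionaries mix key casing, you can optionally normalize them before attaching. This is a small sketch that slots in before the json.dumps call in the script above; it is not part of the delivered script.

def normalize_turn_keys(instructions: dict) -> dict:
    """Lower-case keys so "Turn_1" and "turn_1" are treated identically."""
    return {key.lower(): value for key, value in instructions.items()}


attachment_value = json.dumps(normalize_turn_keys(instruction_data))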
Common Pitfalls (and Fixes)
- If the UI skips turns → Validate childMessageIds and their order (see the validation sketch after this list).
- Instruction metadata not visible? → Ensure attachment_type="RAW_TEXT".
- Malformed visuals? → Sanitize HTML and confirm URLs are accessible.
- Labeling errors on deep threads? → Revisit your message graph and validate nesting.
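To catch the threading issues in the first and last items before import, a lightweight graph check helps. Here is a minimal sketch against the field names used above (messages, childMessageIds, rootMessageIds, actorId); adapt it to your actual row_data envelope.

def validate_message_graph(row_data: dict) -> list:
    """Return a list of problems found in the messages / rootMessageIds graph."""
    problems = []
    messages = row_data.get("messages", {})
    roots = row_data.get("rootMessageIds", [])

    if not roots:
        problems.append("rootMessageIds is empty")

    referenced = set()
    for message_id, message in messages.items():
        for child_id in message.get("childMessageIds", []):
            referenced.add(child_id)
            if child_id not in messages:
                problems.append(f"{message_id} references missing child {child_id}")
        if message.get("actorId") not in ("human", "model"):  # adjust to your actor IDs
            problems.append(f"{message_id} has unexpected actorId {message.get('actorId')!r}")

    # Every message should be either a root or referenced as someone's child.
    for message_id in messages:
        if message_id not in roots and message_id not in referenced:
            problems.append(f"{message_id} is orphaned (not a root and never referenced)")

    return problems

Running this over every conversation before import surfaces broken threading and orphaned messages early.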
Impact and Outcome
The system supports:
- 100+ multi-turn conversations with embedded context
- Review and evaluation of LLM-generated assistant responses
- Fully offline workflows using pre-generated JSON data
Once deployed, internal teams could generate and label new rows using standardized scripts — no manual engineering needed.
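As an example of what those standardized scripts can look like, here is a hedged sketch that registers a batch of pre-generated conversation JSONs as data rows via the SDK, assuming each conversation JSON is already hosted at a publicly accessible HTTPS URL (the dataset name and URLs are placeholders).

from labelbox import Client

client = Client(api_key="___your_api_key_here___")

# Each entry points at a hosted offline-MMC conversation JSON.
conversation_urls = [
    "https://my-bucket/conversations/conversation_001.json",
    "https://my-bucket/conversations/conversation_002.json",
]

dataset = client.create_dataset(name="offline-mmc-evaluation")
task = dataset.create_data_rows(
    [{"row_data": url, "global_key": url.rsplit("/", 1)[-1]} for url in conversation_urls]
)
task.wait_till_done()
print("Import errors:", task.errors)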
Want to Give It a Shot in Your Org?
If you’re working with GenAI outputs, multi-turn evaluations, or human-in-the-loop reviews of already-generated model responses:
- The offline-MMC format in Labelbox is well-documented and extensible
- The SDK supports rich automation for importing and updating data
If you found this guide helpful, feel free to drop a like and share your thoughts or questions in the comments — we’d love to hear from you!
Let’s build more reliable and scalable evaluation pipelines together.