Hands-on guide to building a scalable Offline Multi-Modal Chat (MMC) Evaluation Pipeline in Labelbox — a framework we built to support the surge in enterprise GenAI conversational evaluations.
Overview
With the surge in Multi-Modal Chat (MMC) projects, the demand for robust Offline MMC evaluation is rapidly accelerating. It’s becoming a cornerstone of enterprise GenAI model assessments, enabling teams to reliably annotate and review conversations turn-by-turn — even without live model responses — as text and visual context become integral to dialogue.
However, clear documentation on how to do this with the SDK is scarce, so this guide offers a practical, step-by-step framework to help you bridge that gap and implement effective offline-MMC evaluation.
For enterprise customers, we built and delivered a full offline MMC evaluation pipeline using the Labelbox UI and SDK. This post walks through how we did it, including:
- Structuring your offline-MMC JSONs to import into the editor based on Labelbox’s official documentation
- Embedding inline content and message flow (images, videos, PDFs, audio, and more)
- Attaching dynamic turn instructions for precisely structured & free-flowing conversations
- Lessons learned from debugging multi-turn annotations
- Tips for teams scaling up review workflows
How We Structure MMC row_data
Labelbox requires a strict structure to import multi-modal chat (MMC) data correctly. Each conversation is represented using:
- messages: Turn-by-turn conversation data
- actorId: Identifies who is speaking (typically human or model)
- childMessageIds: Defines the threading (who replies to whom)
- content: An array of objects such as text, IMAGE_OVERLAY, fileData, or html
- fileUri: Contains an https path to a public cloud-hosted attachment file. This field is used for fileData type messages.
- mimeType: The mimeType of your attachment's fileUri data
Example:
{
  "actorId": "",
  "childMessageIds": [],
  "content": [
    {
      "type": "text",
      "content": "What do you see in this image?"
    },
    {
      "type": "fileData",
      "fileUri": "https://link-to-my-image",
      "mimeType": "image/png"
    }
  ]
}
To support navigation, each message includes childMessageIds, and the entire chat is rooted in rootMessageIds.
We generated this data programmatically to guarantee consistency in:
- Message threading
- Turn order
- Actor attribution
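To make that concrete, here is a minimal sketch of the kind of generator we mean, assuming a simple alternating human/model conversation. The messages dictionary keyed by message ID and the top-level envelope are illustrative; the field names follow the structure described above, but check Labelbox's official offline-MMC documentation for the exact import envelope.

import json
import uuid


def build_conversation(turns):
    """Build an offline-MMC-style message graph from a list of
    (user_text, model_content) pairs. Field names follow the structure
    described above; the envelope is illustrative, not the exact schema."""
    messages = {}
    root_message_ids = []
    previous_model_id = None

    for user_text, model_content in turns:
        user_id = str(uuid.uuid4())
        model_id = str(uuid.uuid4())

        # Human turn: plain text, threaded to the model reply.
        messages[user_id] = {
            "actorId": "human",
            "childMessageIds": [model_id],
            "content": [{"type": "text", "content": user_text}],
        }
        # Model turn: arbitrary content chunks (text, fileData, html, ...).
        messages[model_id] = {
            "actorId": "model",
            "childMessageIds": [],
            "content": model_content,
        }

        if previous_model_id is None:
            root_message_ids.append(user_id)  # first turn roots the chat
        else:
            messages[previous_model_id]["childMessageIds"].append(user_id)
        previous_model_id = model_id

    return {"messages": messages, "rootMessageIds": root_message_ids}


conversation = build_conversation([
    ("What do you see in this image?",
     [{"type": "fileData", "fileUri": "https://link-to-my-image", "mimeType": "image/png"}]),
])
print(json.dumps(conversation, indent=2))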
Important Note: Any links to external assets such as images, audio files, PDFs, or videos must be publicly accessible (via HTTPS), shared through pre-signed URLs, or internally available via VPN or SSO-authenticated environments where users are already logged in. Otherwise, Labelbox will not be able to render them in the labeling interface.
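If your assets live in private cloud storage, pre-signed URLs are usually the simplest option. Below is a minimal sketch assuming the files sit in an S3 bucket; the bucket name, object key, and expiry are placeholders.

import boto3

# Assumes AWS credentials are already configured in the environment.
s3 = boto3.client("s3")

# Generate a time-limited HTTPS URL that Labelbox can render without extra auth.
presigned_url = s3.generate_presigned_url(
    "get_object",
    Params={"Bucket": "my-mmc-assets", "Key": "thumbnails/trip_to_greece.jpg"},
    ExpiresIn=7 * 24 * 3600,  # keep the link valid for the expected labeling window
)

# Use presigned_url as a fileUri (or inside an <img src='...'>) in your row_data.
print(presigned_url)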
Example: Inline Visual Recommendations
Here’s an example MMC scenario used for evaluating a recommendation model. The model responds to user movie queries with suggestions, embedding movie thumbnails directly into the response using the content attribute in the SDK.
Turn 1: Greek
User Input: “Greek”
Model Response:
Turn 1: Opa! Enjoy these picks that celebrate Greek culture and mythology.
<img src='...trip_to_greece.jpg' /> <img src='...maestro_in_blue.jpg' /> <img src='...dead_europe.jpg' />
Turn 2: Big fat
User Input: “Big fat”
Model Response:
Turn 2: Are you looking for "Big Fat Liar," or just feeling big and fat? Either way, we've got options.
<img src='...big_fat_liar.jpg' /> <img src='...sausage_party.jpg' /> <img src='...big_daddy.jpg' />
Turn 3: Big fat Greek wedding
User Input: “Big fat Greek wedding”
Model Response:
Turn 3: We don't have "My Big Fat Greek Wedding" right now, but if you're looking for a charming rom-com, these might fit the bill.
<img src='...last_holiday.jpg' /> <img src='...falling_in_love.jpg' /> <img src='...the_holiday.jpg' />
This structure is ideal for surfacing and evaluating GenAI responses that include dynamic, visual elements.
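For reference, the model reply in Turn 1 could be expressed as a message object along these lines. This is a sketch assuming the html content type carries inline markup and reuses the same content field as the text example; the image URLs are placeholders, so confirm the exact chunk shape against the official content-type reference.

# Illustrative only: a model turn mixing a text chunk with inline HTML thumbnails.
model_turn_1 = {
    "actorId": "model",
    "childMessageIds": [],
    "content": [
        {
            "type": "text",
            "content": "Turn 1: Opa! Enjoy these picks that celebrate Greek culture and mythology.",
        },
        {
            "type": "html",
            "content": (
                "<img src='https://example.com/trip_to_greece.jpg' /> "
                "<img src='https://example.com/maestro_in_blue.jpg' /> "
                "<img src='https://example.com/dead_europe.jpg' />"
            ),
        },
    ],
}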
Attaching Turn Instructions
Labelbox allows you to attach metadata to each Data Row using RAW_TEXT attachments. We used this feature to provide turn-level instructions, enabling annotators to see context-specific guidance.
Here’s a ready-to-use script to attach instructions using the SDK:
import json

from instructions import instruction_data  # Your JSON dictionary with instructions
from labelbox import Client

client = Client(api_key="___your_api_key_here___")
data_row_ids = ["___your_data_row_id_here___", "___your_data_row_id_here___"]

for data_row_id in data_row_ids:
    print("Adding instructions to data row: ", data_row_id, "...")
    data_row = client.get_data_row(data_row_id)
    attachment_value = json.dumps(instruction_data)

    # Get all attachments and find the one named "turn-instructions"
    attachments = data_row.attachments()
    attachment_list = list(attachments)
    turn_instructions_attachment = None

    # Find the "turn-instructions" attachment
    for attachment in attachment_list:
        if attachment.attachment_name == "turn-instructions":
            turn_instructions_attachment = attachment
            break

    # Update if found, create if not
    if turn_instructions_attachment:
        print("...updating existing 'turn-instructions' attachment")
        turn_instructions_attachment.update(
            value=attachment_value,
        )
    else:
        print("...no 'turn-instructions' attachment found. Creating a new one.")
        data_row.create_attachment(
            attachment_type="RAW_TEXT",
            attachment_value=attachment_value,
            attachment_name="turn-instructions",
        )
Note: If a data row already contains a "turn-instructions" attachment, the script updates the existing instructions; if the data row has none, it creates a new attachment.
Sample Instruction Data
instruction_data = {
    "turn_1": """
    Instructions: with the old key
    - Recording Distance: Medium (5 feet);
    - Microphone Orientation: Slightly angled away;
    - Background Noise: Mild room hum;
    - Speech: Hesitant pauses (e.g., \"uh\" and \"um\")
    - Prompt: Hey, um… can you tell me, um… how many planets are in our solar system?
    """,
    "turn_2": """
    Instructions: with the new key
    - Recording Distance: Far (8 feet);
    - Microphone Orientation: Slightly covered by user's hand;
    - Background Noise: Moderate hum from nearby appliance;
    - Speech: Normal volume but slightly muffled
    - Prompt: Which planet is the largest, and how far **muffled** is it from the Sun?
    """,
}
The script supports both turn_1 and Turn_1 as keys.
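If your instruction dictionaries mix key casing, you can optionally normalize them before attaching. This is a small sketch that slots in before the json.dumps call in the script above; it is not part of the delivered script.

def normalize_turn_keys(instructions: dict) -> dict:
    """Lower-case keys so "Turn_1" and "turn_1" are treated identically."""
    return {key.lower(): value for key, value in instructions.items()}


attachment_value = json.dumps(normalize_turn_keys(instruction_data))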
Common Pitfalls (and Fixes)
- If the UI skips turns → Validate childMessageIds and their order (see the validation sketch after this list).
- Instruction metadata not visible? → Ensure attachment_type="RAW_TEXT".
- Malformed visuals? → Sanitize HTML and confirm URLs are accessible.
- Labeling errors on deep threads? → Revisit your message graph and validate nesting.
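To catch the threading issues in the first and last items before import, a lightweight graph check helps. Here is a minimal sketch against the field names used above (messages, childMessageIds, rootMessageIds, actorId); adapt it to your actual row_data envelope.

def validate_message_graph(row_data: dict) -> list:
    """Return a list of problems found in the messages / rootMessageIds graph."""
    problems = []
    messages = row_data.get("messages", {})
    roots = row_data.get("rootMessageIds", [])

    if not roots:
        problems.append("rootMessageIds is empty")

    referenced = set()
    for message_id, message in messages.items():
        for child_id in message.get("childMessageIds", []):
            referenced.add(child_id)
            if child_id not in messages:
                problems.append(f"{message_id} references missing child {child_id}")
        if message.get("actorId") not in ("human", "model"):  # adjust to your actor IDs
            problems.append(f"{message_id} has unexpected actorId {message.get('actorId')!r}")

    # Every message should be either a root or referenced as someone's child.
    for message_id in messages:
        if message_id not in roots and message_id not in referenced:
            problems.append(f"{message_id} is orphaned (not a root and never referenced)")

    return problems

Running this over every conversation before import surfaces broken threading and orphaned messages early.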
Impact and Outcome
The system supports:
- 100+ multi-turn conversations with embedded context
- Review and evaluation of LLM-generated assistant responses
- Fully offline workflows using pre-generated JSON data
Once deployed, internal teams could generate and label new rows using standardized scripts — no manual engineering needed.
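As an example of what those standardized scripts can look like, here is a hedged sketch that registers a batch of pre-generated conversation JSONs as data rows via the SDK, assuming each conversation JSON is already hosted at a publicly accessible HTTPS URL (the dataset name and URLs are placeholders).

from labelbox import Client

client = Client(api_key="___your_api_key_here___")

# Each entry points at a hosted offline-MMC conversation JSON.
conversation_urls = [
    "https://my-bucket/conversations/conversation_001.json",
    "https://my-bucket/conversations/conversation_002.json",
]

dataset = client.create_dataset(name="offline-mmc-evaluation")
task = dataset.create_data_rows(
    [{"row_data": url, "global_key": url.rsplit("/", 1)[-1]} for url in conversation_urls]
)
task.wait_till_done()
print("Import errors:", task.errors)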
Want to Give It a Shot in Your Org?
If you’re working with GenAI outputs, multi-turn evaluations, or human-in-the-loop reviews of already-generated model responses:
- The offline-MMC format in Labelbox is well-documented and extensible
- The SDK supports rich automation for importing and updating data
If you found this guide helpful, feel free to drop a like and share your thoughts or questions in the comments — we’d love to hear from you!
Let’s build more reliable and scalable evaluation pipelines together.