How to Upload Custom Embedding(s) and Why You Should Care

Hello Community :wave:,

Labelbox recently introduced the ability to import and export custom vector embeddings without significant changes to your existing data ingestion process. This makes it easier to bring your own embeddings into your workflow.

But what are embeddings?

In machine learning, an embedding, often referred to as a feature vector, is a numerical representation of an asset generated by a neural network. Embeddings reflect content similarity: assets with similar content have embeddings that are close to each other.

You can go deeper into the subject in this article: Understanding Vector Similarity for Machine Learning | by Frederik vom Lehn | Advanced Deep Learning | Medium

At Labelbox, we use cosine similarity.

Cosine similarity is a measure of similarity between two non-zero vectors in an inner product space. It is based on the cosine of the angle between the vectors and is obtained by calculating the dot product of the vectors and dividing it by the product of their magnitudes (also known as the norms).

Mathematically, cosine similarity is represented as:

cos(θ) = (A · B) / (||A|| * ||B||)

Where:

  • A and B are two vectors.

  • θ is the angle between A and B.

  • A · B represents the dot product of vectors A and B.

  • ||A|| and ||B|| denote the magnitudes of vectors A and B, respectively.

Cosine similarity ranges from -1 to 1, where:

  • 1 indicates vectors pointing in the same direction (0° angle between them), regardless of their magnitudes.

  • 0 indicates orthogonal vectors (90° angle between them).

  • -1 indicates diametrically opposite vectors (180° angle between them).

In the context of machine learning and data analysis, cosine similarity is commonly used to determine the similarity between documents, images, and other types of data represented as vectors in a high-dimensional space.
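
To make the formula concrete, here is a minimal NumPy sketch of it. The function name `cosine_similarity` and the toy vectors are illustrative assumptions; this is not how Labelbox implements it internally.

```python
import numpy as np

def cosine_similarity(a, b):
    """cos(theta) = (A . B) / (||A|| * ||B||) for two non-zero vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    norm_product = np.linalg.norm(a) * np.linalg.norm(b)
    if norm_product == 0:
        # The measure is only defined for non-zero vectors
        raise ValueError("cosine similarity is undefined for zero vectors")
    return float(np.dot(a, b) / norm_product)

# Toy vectors covering the three reference cases listed above
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])
print(same_direction, orthogonal, opposite)
```

Note that the first pair differs in magnitude but still scores close to 1, because only the angle between the vectors matters.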

Why would I need embeddings?

Embeddings play a crucial role in data curation for model training, particularly in machine learning and deep learning applications. Here are several reasons why embeddings are important:

  1. Representation of data: Embeddings provide a way to represent complex and high-dimensional data, such as text, images, audio, or graphs, as dense vectors in a lower-dimensional space. This makes it easier to work with the data and feed it into machine learning models.

  2. Similarity measurement: Embeddings enable the measurement of similarity between data points by calculating the distances or similarities between their vector representations. This is especially useful for tasks such as clustering, anomaly detection, and recommendation systems.

  3. Dimensionality reduction: High-dimensional data can lead to overfitting, increased computational complexity, and the curse of dimensionality. Embeddings can help reduce the dimensionality of the data, making it more manageable and less prone to overfitting.

  4. Transfer learning: Pre-trained embeddings, such as word embeddings (e.g., Word2Vec, GloVe) or image embeddings (e.g., ResNet, VGG), capture semantic and contextual information from large datasets. These embeddings can be fine-tuned for specific tasks, allowing models to leverage the knowledge learned from large-scale datasets.

  5. Data visualization: Visualizing high-dimensional data can be challenging. Embeddings can be used in combination with dimensionality reduction techniques to project data into a lower-dimensional space, making it easier to visualize and explore patterns or relationships.

In summary, embeddings facilitate the transformation of complex data into a more manageable and meaningful representation, which is essential for training accurate and efficient machine learning models.
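
As an illustration of the similarity measurement point, a similarity search can be sketched as ranking stored embeddings by their cosine similarity to a query vector. The helper `top_k_similar` and the 2-dimensional toy data are assumptions made for this example, and a brute-force scan like this is only reasonable for small collections.

```python
import numpy as np

def top_k_similar(query, embeddings, k=3):
    """Return (index, score) pairs for the k embeddings most similar to query."""
    embeddings = np.asarray(embeddings, dtype=float)
    query = np.asarray(query, dtype=float)
    # After L2 normalization, a dot product equals the cosine similarity
    emb_unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    query_unit = query / np.linalg.norm(query)
    scores = emb_unit @ query_unit
    order = np.argsort(-scores)[:k]  # highest score first
    return [(int(i), float(scores[i])) for i in order]

# Toy 2-dimensional "embeddings"; real ones have hundreds or thousands of dimensions
toy_embeddings = [
    [1.0, 0.0],   # index 0: same direction as the query
    [0.9, 0.1],   # index 1: close to the query
    [0.0, 1.0],   # index 2: orthogonal to the query
    [-1.0, 0.0],  # index 3: opposite to the query
]
results = top_k_similar([1.0, 0.0], toy_embeddings, k=2)
print(results)
```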

By default, Labelbox computes embeddings when you ingest your data. For reference, please see: Similarity search

How can I compute my own embeddings and send them to Labelbox?

We will use a public model from Hugging Face (:hugs: :wave:) to showcase this:

!pip install -q "labelbox"
!pip install -q transformers

import labelbox as lb
import transformers
transformers.logging.set_verbosity(50)
import torch
import torch.nn.functional as F
from PIL import Image
import requests
from tqdm import tqdm
import numpy as np

# Add your API key
API_KEY = ""
client = lb.Client(API_KEY)

# Get images from a Labelbox dataset.
# Make sure the image URLs are accessible, obtaining a token from your cloud provider if necessary.
DATASET_ID = ""

dataset = client.get_dataset(DATASET_ID)

export_task = dataset.export_v2()

export_task.wait_till_done()
if export_task.errors:
    print(export_task.errors)
export_json = export_task.result

data_row_urls = [dr_url['data_row']['row_data'] for dr_url in export_json]

# Get ResNet-50 from HuggingFace
image_processor = transformers.AutoImageProcessor.from_pretrained("microsoft/resnet-50")
model = transformers.ResNetModel.from_pretrained("microsoft/resnet-50")

# Create a new embedding in your workspace. Set dims to match your model's output size; ResNet-50 produces 2048-dimensional vectors.
new_custom_embedding_id = client.create_embedding(name="My new awesome embedding", dims=2048).id

# Or use an existing embedding from your workspace
# existing_embedding_id = client.get_embedding_by_name(name="ResNet img 2048").id

img_emb = []
# URLs whose embedding was computed successfully, kept in the same order as img_emb
processed_urls = []

for url in tqdm(data_row_urls):
    try:
        response = requests.get(url, stream=True)
        if response.status_code == 200:
            # Open the image, convert to RGB, and resize to 224x224
            image = Image.open(response.raw).convert('RGB').resize((224, 224))

            # Preprocess the image for model input
            img_hf = image_processor(image, return_tensors="pt")

            # Pass the image through the model to get embeddings
            with torch.no_grad():
                last_layer = model(**img_hf, output_hidden_states=True).last_hidden_state
                # Average-pool the spatial feature map, then flatten to shape (1, 2048)
                resnet_embeddings = F.adaptive_avg_pool2d(last_layer, (1, 1))
                resnet_embeddings = torch.flatten(resnet_embeddings, start_dim=1, end_dim=3)
                img_emb.append(resnet_embeddings.cpu().numpy())
                processed_urls.append(url)
        else:
            # Skip URLs that could not be downloaded
            continue
    except Exception as e:
        print(f"Error processing URL: {url}. Exception: {e}")
        continue

data_rows = []

# Create data rows payload to send to a dataset.
# Pair each embedding with the URL it was computed from; zipping against data_row_urls
# would misalign the pairs whenever a URL was skipped above.
for url, embedding in tqdm(zip(processed_urls, img_emb), total=len(img_emb)):
    data_rows.append({
        "row_data": url,
        "embeddings": [{"embedding_id": new_custom_embedding_id, "vector": embedding[0].tolist()}]
    })

# Upload to a new dataset
dataset = client.create_dataset(name='image_custom_embedding_resnet', iam_integration=None)
task = dataset.create_data_rows(data_rows)
# Wait for the import to finish before checking for errors
task.wait_till_done()
print(task.errors)
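
Before uploading, it can be worth checking that every vector matches the dimension declared when the embedding was created (2048 in the example above). Below is a minimal sketch of such a check; `find_bad_vectors` is a hypothetical helper written for this post, not part of the Labelbox SDK, and the sample rows use placeholder URLs and a placeholder embedding ID.

```python
import math

def find_bad_vectors(data_rows, expected_dims):
    """Return (row_index, reason) for every payload entry with an unusable vector."""
    problems = []
    for i, row in enumerate(data_rows):
        for emb in row.get("embeddings", []):
            vector = emb["vector"]
            if len(vector) != expected_dims:
                problems.append((i, f"expected {expected_dims} dims, got {len(vector)}"))
            elif not all(math.isfinite(v) for v in vector):
                problems.append((i, "vector contains NaN or infinite values"))
    return problems

# Placeholder payload in the same shape as the data_rows list built above
sample_rows = [
    {"row_data": "https://example.com/a.jpg",
     "embeddings": [{"embedding_id": "placeholder-id", "vector": [0.1, 0.2, 0.3]}]},
    {"row_data": "https://example.com/b.jpg",
     "embeddings": [{"embedding_id": "placeholder-id", "vector": [0.1, 0.2]}]},
    {"row_data": "https://example.com/c.jpg",
     "embeddings": [{"embedding_id": "placeholder-id", "vector": [0.1, float("nan"), 0.3]}]},
]
problems = find_bad_vectors(sample_rows, expected_dims=3)
for index, reason in problems:
    print(f"Row {index}: {reason}")
```

With the real payload, you would call it as `find_bad_vectors(data_rows, expected_dims=2048)` and only upload once the returned list is empty.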

Once you have uploaded your data with the embeddings, you can see them at the data row level.

You can now use them to find similar assets within Catalog.

Ref: labelbox-python/examples/integrations/huggingface/huggingface.ipynb at develop · Labelbox/labelbox-python · GitHub
