Learn how to seamlessly integrate machine learning models with your database using SuperDuperDB to create efficient and scalable image classification systems. This blog post explores how to connect ML models directly to the database, streamlining the process and reducing operational overhead.
For this demo, we are using MongoDB as an example. SuperDuperDB also supports other databases, including vector-supported SQL databases like Postgres and non-vector-supported databases like MySQL and DuckDB. Please check the documentation for more information about its functionality and the range of data integrations it supports.
ML models and data often reside in separate silos, leading to complex MLOps pipelines. Current solutions require extracting and processing data through multiple tools, creating high operational overhead. Low-code tools and vector databases simplify integration but introduce flexibility issues and additional management complexities.
Unifying ML Models and Databases with SuperDuperDB
Integrating data and machine learning in a single environment is essential for smooth ML deployment. This approach avoids the complexities of MLOps, ETL processes, and managing separate vector databases.
SuperDuperDB achieves this by directly connecting ML models with databases. It's an open-source framework that simplifies the integration of ML models, APIs, and vector search engines with your existing database infrastructure.
SuperDuperDB allows you to:
- Directly integrate ML models with your database for seamless data pre-processing, training and storage.
- Easily manage various ML models for different data types (text, images).
- Instantly process new queries and perform real-time predictions within your database.
A key benefit of SuperDuperDB is its flexibility: you can integrate your own Python functions and models to meet diverse machine learning needs, as sketched below.
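As a quick illustration of that flexibility, here is a minimal sketch that wraps a plain Python function as a model component. It assumes the ObjectModel wrapper exported by recent superduperdb versions (older releases expose a similar Model class), so treat the exact import as version-dependent:
from superduperdb import ObjectModel

# Hypothetical example: wrap any Python callable as a model component
word_count = ObjectModel(
    identifier='word-count',
    object=lambda text: len(text.split()),
)
Once applied with db.apply(word_count), such a model can be used in listeners and predictions just like the built-in integrations.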
For this example, we will use a Logistic Regression model and MongoDB.
Let's integrate the Logistic Regression model into your MongoDB database using SuperDuperDB.
Step 1: Install SuperDuperDB, Scikit-learn, Torch and Torchvision
pip install superduperdb
pip install scikit-learn torchvision torch
Step 2: Connect SuperDuperDB to your MongoDB database and define the collection
For this demo, we will be using a local MongoDB instance. However, this can be switched to MongoDB with authentication or even a MongoDB Atlas URI.
from superduperdb import superduper
from superduperdb.backends.mongodb import Collection

db = superduper('mongodb://localhost:27017/test')
my_collection = Collection('documents')
The MongoDB URI can point either to a locally hosted MongoDB instance or to a MongoDB Atlas cluster. See the documentation for more information.
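For instance, switching to an authenticated instance or an Atlas cluster is just a change of URI (the credentials and cluster address below are placeholders):
# Authenticated MongoDB instance (placeholder credentials)
db = superduper('mongodb://username:password@localhost:27017/test')

# MongoDB Atlas (placeholder cluster address)
db = superduper('mongodb+srv://username:password@cluster0.example.mongodb.net/test')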
Step 3: Data Collection and Pre-processing
Assuming that you have a pandas dataframe with two columns, where the column image_array consists of image arrays and the column category consists of the target labels (0 or 1).
import pandas as pd

df = pd.DataFrame({
    'image_array': image_arrays,
    'category': category
})
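If you don't yet have image data at hand, one possible way to build image_arrays and category is from CIFAR-10 via torchvision (installed in Step 1). This is just an illustration; any two-class source of uint8 image arrays works equally well:
import numpy as np
from torchvision.datasets import CIFAR10

# Download CIFAR-10 and keep two classes (0: airplane, 1: automobile)
dataset = CIFAR10(root='./data', download=True)
pairs = [(np.array(img), label) for img, label in dataset if label in (0, 1)]

image_arrays = [img for img, _ in pairs]
category = [label for _, label in pairs]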
Split the data into train and test datasets, where train_data and test_data are lists of dictionaries with image data and labels.
from superduperdb import Document
from superduperdb.ext.numpy import array

N_DATAPOINTS = 1000

# Register a numpy datatype matching the images' dtype and shape
# (all images are assumed to share the same dtype and shape)
datatype = array(dtype=df['image_array'].iloc[0].dtype, shape=df['image_array'].iloc[0].shape)
db.apply(datatype)

train_data = []
for index, row in df.iloc[:N_DATAPOINTS].iterrows():
    train_data.append(Document({
        '_fold': 'train',
        'image': datatype(row['image_array']),
        'label': int(row['category'])
    }))

test_data = []
for index, row in df.iloc[N_DATAPOINTS:N_DATAPOINTS + 50].iterrows():
    test_data.append(Document({
        '_fold': 'test',
        'image': datatype(row['image_array']),
        'label': int(row['category'])
    }))
Creating Custom Datatypes in SuperDuperDB
SuperDuperDB allows you to create custom datatypes to handle various data types that your database backend might not natively support. This flexibility enables you to insert any type of data into your database seamlessly.
For example, you can create custom datatypes for vectors, tensors, arrays, PDFs, images, audio files, videos, and more. Please check Create datatype for more examples and details on how to construct custom datatypes.
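As a rough sketch of what defining one looks like (the exact DataType signature may differ between superduperdb versions, so check the linked docs), a pickle-based datatype for arbitrary Python objects could be built like this:
import pickle

from superduperdb.components.datatype import DataType

# Hypothetical custom datatype: serialize any Python object as pickled bytes
my_pickle = DataType(
    identifier='my-pickle',
    encoder=lambda x: pickle.dumps(x),
    decoder=lambda x: pickle.loads(x),
)
db.apply(my_pickle)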
Step 4: Data Insertion
db.execute(my_collection.insert_many(train_data))
Step 5: Feature Engineering
We integrate a torchvision model with the database to compute features for the training data inside the DB using superduperdb.
Create a model class with a simple embedding model:
import numpy as np
import torchvision.models as models
from torchvision import transforms
from PIL import Image

class TorchVisionEmbedding:
    def __init__(self):
        # Load a pretrained ResNet-18 and switch it to evaluation mode
        self.resnet = models.resnet18(pretrained=True)
        self.resnet.eval()

    def preprocess(self, image_array):
        # Convert the raw array to a PIL image and apply the standard
        # ImageNet preprocessing pipeline
        image = Image.fromarray(image_array.astype(np.uint8))
        preprocess = transforms.Compose([
            transforms.Resize(256),
            transforms.CenterCrop(224),
            transforms.ToTensor(),
            transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
        ])
        return preprocess(image)
Wrap it in a TorchModel object:
from superduperdb.ext.torch import TorchModel

model = TorchVisionEmbedding()
superdupermodel = TorchModel(
    identifier='my-torchvision-model',
    object=model.resnet,
    preprocess=model.preprocess,
    postprocess=lambda x: x.numpy().tolist()
)
Now let's integrate the model and compute features for the training data inside the db. With the help of Listener from superduperdb we can:
- apply a model to compute outputs on a query; the outputs are refreshed every time new data is added
- save the outputs to db.databackend
More information about Listener can be found in the documentation.
from superduperdb import Listener

jobs, listener = db.apply(
    Listener(
        model=superdupermodel,
        select=my_collection.find(),
        key='image',
        identifier='features'
    )
)
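To sanity-check that the features were written back, you can peek at one stored document. The exact layout of the _outputs key varies between superduperdb versions, so treat this as a rough probe:
# Fetch a single document and inspect its keys; expect an '_outputs'
# entry holding the features computed by the 'features' listener
r = db.execute(my_collection.find_one())
print(r.unpack().keys())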
Step 6: Training
We use the computed features, available as listener.outputs, to train the classification model.
from sklearn.linear_model import LogisticRegression
from superduperdb.ext.sklearn.model import SklearnTrainer, Estimator

model = LogisticRegression()
model = Estimator(
    object=model,
    identifier='my-image-model',
    trainer=SklearnTrainer(
        identifier='my-image-trainer',
        key=(listener.outputs, 'label'),
        select=my_collection.find(),
    )
)
Add the model to the DB and train it with a single command, db.apply(model):
db.apply(model)
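If you want to confirm what has been registered, db.show lists stored components (the output in the comment is illustrative):
# List the models known to the database
print(db.show('model'))  # e.g. ['my-torchvision-model', 'my-image-model']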
SuperDuperrr, you've successfully trained an ML model!!!
Now that we have trained a Logistic Regression model by integrating it into the database, we will use the trained model to classify the test data.
Insert the test data into the db:
my_test_collection = Collection('my_test_data')
db.execute(my_test_collection.insert_many(test_data))
Compute features for the test data
jobs, listener_test_images = db.apply(
    Listener(
        model=superdupermodel,
        select=my_test_collection.find(),
        key='image',
        identifier='test_features'
    )
)
Classification of test data
You can easily classify the test data by loading the trained model with db.load('model', 'my-image-model') and calling .predict() on the computed features:
import numpy as np

# Get a sample test document
for doc in db.execute(my_test_collection.find().limit(1)):
    sample_doc = doc

# The listener stored the computed features under '_outputs'
sample_feature_image = np.array(sample_doc['_outputs']['test_features::0'])
print("predicted", db.load('model', 'my-image-model').predict(sample_feature_image.reshape(1, -1)))
print("actual", [sample_doc['label']])
Conclusion
In this blog post, we demonstrated how to implement an image classification system in your database, using MongoDB as an example, with SuperDuperDB. By leveraging the strengths of SuperDuperDB, we can build a scalable and efficient pipeline for image classification, integrated directly with your database of choice. This approach can be extended to various other machine learning tasks, making it a versatile solution for modern data-driven applications.
Whether you are using MongoDB, Postgres, MySQL, DuckDB, or another database, SuperDuperDB provides the tools necessary for seamless integration and efficient ML model deployment.
To explore more, check out our other use cases in the documentation.
Contributors are welcome!
SuperDuperDB is open-source and permissively licensed under the Apache 2.0 license. We encourage developers interested in open-source development to contribute in our discussion forums and issue boards, and by making their own pull requests. We'll see you on GitHub!