BeMyAficionado
Building Semantic Search for E-commerce Using Product Embeddings and OpenSearch

September 27, 2024 by varunshrivastava

Hey! So, I recently worked on this cool proof of concept (POC) where I tried to combine product embeddings with OpenSearch to enable a k-nearest neighbours (k-NN) search. Basically, I wanted to see if I could make searching through a product catalog smarter by using embeddings (which are essentially vector representations of the product information) instead of just basic text search.

In the end, I was able to set up a system where you could search for products based on their semantic meaning rather than just the keywords, which is a huge step forward in building intuitive search experiences for e-commerce. Let me break down what I did and how it all worked.


Table of Contents

  • The Problem I Was Solving
  • Understanding Vector Dimensions
  • OpenSearch and KNN Integration
    • Setting up OpenSearch
    • Creating the Index
  • Product Data and Embeddings
  • Ingesting Product Data
  • Data Preparation
  • Running the Ingestion Pipeline
  • Results
    • Search Query: Show me floral printed dresses
    • Keyword Search Results
    • Vector Search Results
  • Final Thoughts

The Problem I Was Solving

We’ve all seen e-commerce websites where the search just doesn’t understand what we want.

For example, if you search for “blue running shoes” some sites will only return products with the exact words “blue” and “running” in the title or description. That’s pretty limiting, right?

What if a product is described as “sky-colored jogger sneakers”?

Traditional text-based search would miss that.

This is where embeddings come in. Embeddings turn text (like product descriptions) into vectors (numbers in a high-dimensional space), which allows you to compare them by how similar they are instead of whether the exact words match. That’s the basis of k-NN search, where you find the items closest to each other in vector space.

I decided to build a system that would take e-commerce products, generate embeddings from their descriptions, and then store those embeddings in OpenSearch. This would allow us to perform k-NN searches based on those embeddings. In simpler terms: instead of searching based on exact words, we’d be searching based on meaning!

Understanding Vector Dimensions

Vector dimensions play a crucial role in capturing the semantic relationships between words and concepts. The higher the number of dimensions, the more accurately the vector can represent the nuances and complexities of language.

Imagine a simple 2-dimensional vector space, where each dimension represents a specific feature or characteristic. In this simplified scenario, let’s consider the dimensions as “size” and “color”. We can represent words or concepts as points in this 2D space, with their coordinates determined by their respective size and color values.

For example, consider the words “apple” and “banana.” We can represent “apple” as a point with coordinates (2, 1), indicating a medium size and a red color. Similarly, “banana” could be represented as (5, 2), indicating a larger size and a yellow color.

         Feature – Size (X)   Feature – Color (Y)
Apple    2                    1
Banana   5                    2

Now, let’s consider two distinct words, like “laptop” and “ocean.” We might represent “laptop” as (3, 4), indicating a medium size and a gray color, while “ocean” could be (1, 6), indicating a small size value and a blue color.

         Feature – Size (X)   Feature – Color (Y)
Laptop   3                    4
Ocean    1                    6

In this 2D space, we can calculate the distance between these points (words) using the Euclidean distance formula. Similar words, like “apple” and “banana,” sit closer to each other than to an unrelated word like “ocean,” which lies farther away from both.
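This toy example can be checked in a few lines of Python (the coordinates are the made-up values from the tables above, not real embeddings):

```python
import math

# Toy 2D "embeddings" from the tables above: (size, color)
points = {
    "apple": (2, 1),
    "banana": (5, 2),
    "ocean": (1, 6),
}

def euclidean(a, b):
    """Straight-line distance between two points in vector space."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Similar words end up close together...
print(round(euclidean(points["apple"], points["banana"]), 3))  # -> 3.162
# ...while unrelated words sit farther apart
print(round(euclidean(points["apple"], points["ocean"]), 3))   # -> 5.099
```

The same idea scales to 1536 dimensions; the formula just sums over more coordinates.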

However, language is far more complex than just two dimensions. Words and concepts have multiple facets, including context, connotations, and nuances. Higher-dimensional vector spaces allow us to capture these intricacies more accurately.

For instance, a 1536-dimensional vector space, like the one used by Amazon Bedrock’s Titan model, can represent words and concepts with a much higher level of detail. Each dimension could correspond to a different aspect of meaning, such as synonyms, antonyms, parts of speech, sentiment, and more.

In this high-dimensional space, semantically similar words and concepts will be clustered together, while dissimilar ones will be farther apart. This enables more accurate and meaningful semantic searches, as words with similar meanings will have vectors that are closer together in this vast vector space.


OpenSearch and KNN Integration

Let’s start with the OpenSearch part, which is where the magic happens. OpenSearch is an open-source search engine, similar to Elasticsearch.

What’s cool is that it supports k-NN, meaning you can use it to search through vectors (embeddings) instead of just text.

This is the architecture that I followed for the POC:

Setting up OpenSearch

First, I had to connect to OpenSearch using AWS credentials. Here’s how I did that:

from requests_aws4auth import AWS4Auth
from opensearchpy import OpenSearch
from opensearchpy.connection import RequestsHttpConnection
import boto3

# Pull credentials from the ambient AWS session (env vars, profile, or role)
credentials = boto3.Session().get_credentials()
# 'aoss' targets OpenSearch Serverless; use service='es' for a managed domain
awsauth = AWS4Auth(region='us-east-1', service='aoss',
                   refreshable_credentials=credentials)

host = 'https://your-opensearch-domain'
client = OpenSearch(
    hosts=[host],
    http_auth=awsauth,
    use_ssl=True,
    verify_certs=True,
    connection_class=RequestsHttpConnection
)

Here, I authenticated using AWS’s boto3 and connected it to my OpenSearch instance. Once I had this connection, I could start creating indexes and working with the data.

Creating the Index

I needed to create an index that would store my product data, including the embeddings. An index in OpenSearch is like a database table where you can define the structure of the documents you want to store. Here’s how I created the index:

def create_index(index_name, field_mappings):
    index_body = {
        "settings": {
            "number_of_shards": 1,
            "number_of_replicas": 1,
            "index.knn": True  # Enable KNN vector search
        },
        "mappings": field_mappings
    }

    response = client.indices.create(index=index_name, body=index_body)
    print(f"Index Created: {response}")

I used a mapping for the product data that included fields like name, price, category, and embedding. The important one is embedding, which is where I would store the product’s vector representation. I set the index.knn setting to True to enable vector search.
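For reference, the field_mappings argument for this product index might look like the sketch below. The field names are illustrative, but the embedding field must be a knn_vector whose dimension matches the embedding model (1536 for the Titan model used later):

```python
# Illustrative mapping for the product index. The "embedding" field is the
# knn_vector that powers k-NN search; its dimension must match the model.
product_mappings = {
    "properties": {
        "name": {"type": "text"},
        "category": {"type": "keyword"},
        "price": {"type": "float"},
        "description": {"type": "text"},
        "embedding": {
            "type": "knn_vector",
            "dimension": 1536,  # Titan text embeddings are 1536-dimensional
        },
    }
}

# create_index("products", product_mappings)
```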

Product Data and Embeddings

Now for the fun part – getting embeddings.

I used Amazon Bedrock’s Titan model to generate the embeddings. Titan is great for creating text embeddings with higher dimensions (1536, in this case), which is perfect for our k-NN search.

Here’s how I generated embeddings:

from langchain_aws import BedrockEmbeddings

def generate_text_embeddings(text):
    text_embeddings = BedrockEmbeddings(
        region_name="us-east-1",
        model_id="amazon.titan-embed-text-v1"
    )
    embedding_vector = text_embeddings.embed_query(text)
    return embedding_vector

Basically, I took each product’s description and passed it through the Titan model, which converted the text into a 1536-dimensional vector. This vector represents the product in a way that captures its meaning, not just its words.

Ingesting Product Data

Once I had my embeddings, I needed to get the product data (including the embeddings) into OpenSearch. But before doing that, I wanted to check if the products were already in my DynamoDB table to avoid duplicate entries.

I wrote a function that checks if the product exists in DynamoDB:

from botocore.exceptions import ClientError

def pk_exists(pk):
    """Check whether a single product is already marked as ingested."""
    response = table.get_item(Key={'pk': pk})
    if 'Item' in response:
        return response['Item']['status'] == 'created'
    return False

# for bulk fetch (BatchGetItem accepts at most 100 keys per request)
def bulk_pk_exists(pks):
    try:
        # Prepare the list of keys for the BatchGetItem request
        keys_to_check = [{'pk': pk} for pk in pks]

        # Perform the BatchGetItem operation
        response = table.meta.client.batch_get_item(
            RequestItems={
                table.name: {
                    'Keys': keys_to_check
                }
            }
        )

        # Check which of the requested keys came back
        existing_items = response['Responses'].get(table.name, [])

        # Default every key to False (does not exist)
        results = {pk: False for pk in pks}

        # Mark the keys whose stored status is 'created'
        for item in existing_items:
            if item.get('status') == 'created':
                results[item['pk']] = True

        return results

    except ClientError as e:
        raise Exception(f"Unable to check partition keys: {e.response['Error']['Message']}")

The batch version checks multiple products in a single request, and I wrote another function to write new items into DynamoDB.

With this setup, I could safely write new products to both DynamoDB (for tracking purposes) and OpenSearch (for search purposes). Here’s how I ingested the data into OpenSearch:

def bulk_index_documents(index_name, documents, success_callback):
    # Interleave action metadata and document bodies, as the bulk API expects
    bulk_data = []
    for doc in documents:
        bulk_data.append({"index": {"_index": index_name}})
        bulk_data.append(doc)

    response = client.bulk(body=bulk_data)
    if not response['errors']:
        success_callback(documents, response)

This bulk ingestion allowed me to index multiple products at once, which made the whole process much faster.


Data Preparation

To prepare the product data, I wrote some helper functions to clean up the descriptions, image URLs, and other fields:

  • Flattened descriptions: I took complex product descriptions and parsed them into simpler, flattened text.
  • SKU sanitation: Made sure the SKU (a unique product identifier) was in a consistent format.

This was all handled in a generate_product_document function that yielded cleaned-up product data. For each product, I created a document field, which was a nicely formatted string containing all the relevant product details. I also added the embedding field to this product document.

Running the Ingestion Pipeline

With everything set up, I ran the ingestion script to generate embeddings for each product and index them in OpenSearch.

Here’s what that process looked like:

  1. Generate embeddings for each product description.
  2. Check if the product exists in DynamoDB (to avoid duplicates).
  3. Bulk index the products into OpenSearch.
  4. Store product metadata (including a status flag) in DynamoDB.

This looped through all the products in the dataset, and whenever a batch of 100 products was processed, it was written to both OpenSearch and DynamoDB.

Results

I’m sure you’re all interested in the results, and I’m not going to leave you without them. Although I can’t share everything here, I can share some of the top results and link you to the entire image repository.

Search Query: Show me floral printed dresses

Keyword Search Results

Floral Printed Dress using Keyword Search

Vector Search Results

If you look at the results above, you can clearly see that the vector-based results are quite accurate. They don’t just match on a keyword like “floral”; they also understand which dresses actually carry a floral printed pattern. This shows that vector-based search grasps the semantics behind the user’s query, and it’s amazing to see it come to life.

Final Thoughts

So that’s pretty much it! By the end of this process, I had a working k-NN search for product embeddings. The cool thing about this setup is that it allows for much more intuitive search experiences. For example, a user could search for a product using natural language, and the system would return results based on the meaning of the query, not just the keywords.

This was a fun and rewarding POC, and I’m excited to see how I can build on it. In the future, I could add more product attributes or fine-tune the embedding generation process for even better search results. But for now, this was a great step forward!


Let me know if you have any thoughts or questions! 😄


Filed Under: Programming, Technology Tagged With: aws, bedrock, LLM, programming, python, sagemaker, semantic search, vector
