In this blog post, we discuss embeddings and how to create them with open-source large language models.

What are Embeddings?
Embeddings are vector representations of words, phrases, or whole sentences that capture meaning and context. They translate categorical or language data into numbers, which makes language-based information easier for computers to process and analyze.
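To make the idea concrete, here is a minimal sketch with hand-made three-dimensional vectors. The words, the numbers, and the `cosine` helper are invented for illustration; a real model learns its vectors from data and uses hundreds of dimensions.

```python
import numpy as np

# Hypothetical, hand-written embeddings (real ones are learned by a model)
toy_embeddings = {
    "cat": np.array([0.9, 0.8, 0.1]),
    "dog": np.array([0.8, 0.9, 0.2]),
    "car": np.array([0.1, 0.2, 0.9]),
}

def cosine(a, b):
    """Cosine similarity: 1.0 means same direction, 0.0 means orthogonal."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

cat_dog = cosine(toy_embeddings["cat"], toy_embeddings["dog"])
cat_car = cosine(toy_embeddings["cat"], toy_embeddings["car"])

print("cat vs dog:", round(cat_dog, 3))
print("cat vs car:", round(cat_car, 3))
```

Words with related meanings point in similar directions, so "cat" scores higher against "dog" than against "car".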

Creating embeddings for open-source large language models:
Creating embeddings with an LLM can seem complicated, but the process breaks down into a few simple steps:

Step 1: Select the pre-trained language model
The first step is to choose a pre-trained language model that suits your needs. Popular open-source options include BERT, RoBERTa, and GPT-2. These models are already trained on large amounts of text data, so they capture much of the structure and meaning of language. For example:

# Import the tokenizer and model classes from Hugging Face Transformers
from transformers import AutoTokenizer, AutoModel

# Select pre-trained model
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

In this example, we load the Hugging Face model “bert-base-uncased” and its tokenizer using the transformers library.

Step 2: Tokenization
Once you have selected a language model, the next step is tokenization. Tokenization divides text into smaller units, such as words or subwords, which act as input to the model. For Indian languages, specialized tokenizers that understand textual and linguistic nuances can be used. For example:

sentence = "Embeddings are powerful in natural language processing."

tokens = tokenizer(sentence, return_tensors="pt")
print("Tokenized Input:", tokens)

In this example, we convert the sentence into tokens. The output contains the token IDs (input_ids) along with token_type_ids and an attention_mask.
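To show what "subwords" means, here is a toy sketch of greedy longest-match splitting, in the spirit of the WordPiece scheme that BERT uses. The vocabulary and the function names are made up for illustration; the real tokenizer has a much larger vocabulary and extra normalization rules.

```python
# Hypothetical mini-vocabulary; "##" marks a piece that continues a word
toy_vocab = {"embed", "##ding", "##s", "are", "power", "##ful", "[UNK]"}

def subword_tokenize(word, vocab):
    """Greedy longest-match-first split of one lowercase word into subwords."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return ["[UNK]"]  # no piece fits: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

def tokenize(sentence, vocab):
    """Lowercase, split on whitespace, then split each word into subwords."""
    tokens = []
    for word in sentence.lower().split():
        tokens.extend(subword_tokenize(word, vocab))
    return tokens

print(tokenize("Embeddings are powerful", toy_vocab))
```

With the real tokenizer from Step 1, `tokenizer.convert_ids_to_tokens(tokens["input_ids"][0].tolist())` shows the actual pieces the model receives.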

Step 3: Create Embeddings
Next, pass the tokenized input to the model, which returns the embeddings. These vectors represent the tokens in numerical form, as learned by the model. Example:

outputs = model(**tokens)
embeddings = outputs.last_hidden_state
print("Embeddings:", embeddings)

In this example, we pass the tokens to the model and get embeddings as output, with one vector per token.
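Because the model returns one vector per token, a common way to get a single vector per sentence is mean pooling: average the token vectors while ignoring padding. The sketch below uses small NumPy arrays as a stand-in for `outputs.last_hidden_state` and the tokenizer's `attention_mask`, so it runs without downloading a model. The `mean_pool` helper is our own, not part of the transformers library.

```python
import numpy as np

def mean_pool(hidden_states, attention_mask):
    """Average token vectors per sentence, ignoring padded positions.

    hidden_states:  (batch, seq_len, hidden_size)
    attention_mask: (batch, seq_len), 1 for real tokens and 0 for padding
    """
    mask = attention_mask[:, :, None].astype(hidden_states.dtype)
    summed = (hidden_states * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)  # avoid division by zero
    return summed / counts

# Made-up stand-in for last_hidden_state: 2 sentences, 4 tokens, 3 dimensions
hidden = np.arange(24, dtype=np.float32).reshape(2, 4, 3)
mask = np.array([[1, 1, 1, 1],
                 [1, 1, 0, 0]])  # the second sentence has two padding tokens

sentence_vectors = mean_pool(hidden, mask)
print(sentence_vectors.shape)
print(sentence_vectors)
```

Mean pooling is only one option. Some applications use the vector of the first token instead, so treat this as a reasonable default and not the only correct choice.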

Step 4: Fine-Tuning (Optional)
If you have a specific domain or task in mind, you can fine-tune the pre-trained model on your own dataset. This process adjusts the parameters of the model to match your target application and domain-specific language.
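The sketch below illustrates the idea of fine-tuning in miniature. It is not a recipe for BERT: a tiny random embedding table stands in for the pre-trained weights, the labeled examples are invented, and the gradient steps are written by hand in NumPy. What it shows is the core mechanism, namely that training on task data updates the pre-trained parameters themselves and not only a new classifier on top.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for pre-trained weights: a tiny embedding table
vocab = {"great": 0, "good": 1, "bad": 2, "awful": 3, "product": 4}
pretrained = rng.normal(scale=0.1, size=(len(vocab), 4))

# Made-up domain data: (words, label) with 1 = positive, 0 = negative
examples = [
    (["great", "product"], 1),
    (["good", "product"], 1),
    (["bad", "product"], 0),
    (["awful", "product"], 0),
]

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def average_loss(table, w, b):
    """Mean binary cross-entropy over the toy dataset."""
    losses = []
    for words, label in examples:
        vec = table[[vocab[t] for t in words]].mean(axis=0)
        p = sigmoid(vec @ w + b)
        losses.append(-(label * np.log(p) + (1 - label) * np.log(1 - p)))
    return float(np.mean(losses))

def fine_tune(table, epochs=300, lr=0.5):
    """Update both the classifier head and the embedding table by SGD."""
    table = table.copy()  # keep the original weights untouched
    w, b = np.zeros(table.shape[1]), 0.0
    for _ in range(epochs):
        for words, label in examples:
            ids = [vocab[t] for t in words]
            vec = table[ids].mean(axis=0)
            error = sigmoid(vec @ w + b) - label  # gradient w.r.t. the logit
            grad_w, grad_vec = error * vec, error * w
            w -= lr * grad_w
            b -= lr * error
            table[ids] -= lr * grad_vec / len(ids)  # each token gets its share
    return table, w, b

initial_loss = average_loss(pretrained, np.zeros(4), 0.0)
tuned, w, b = fine_tune(pretrained)
final_loss = average_loss(tuned, w, b)
print("loss before:", round(initial_loss, 3), "after:", round(final_loss, 3))
```

For a real model, the same loop is usually handled by a deep learning framework, for example the `Trainer` class in the transformers library, which computes the gradients automatically.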

Step 5: Apply Embeddings
Once the embeddings are created, we can use them for NLP tasks such as sentiment analysis, named entity recognition, machine translation, and much more. Example:

import torch
from sklearn.metrics.pairwise import cosine_similarity

# Sample sentences
sentence1 = "Embeddings are powerful in natural language processing."
sentence2 = "Word embeddings capture semantic meaning in language."

# Tokenize the sentences
tokens1 = tokenizer(sentence1, return_tensors="pt")
tokens2 = tokenizer(sentence2, return_tensors="pt")

# Get token embeddings for each sentence (no gradients needed for inference)
with torch.no_grad():
    outputs1 = model(**tokens1)
    outputs2 = model(**tokens2)

# Average over the token dimension to get one vector per sentence.
# cosine_similarity expects 2D arrays of shape (n_samples, n_features).
embeddings1 = outputs1.last_hidden_state.mean(dim=1).numpy()
embeddings2 = outputs2.last_hidden_state.mean(dim=1).numpy()

# Calculate cosine similarity between the two sentence vectors
similarity_score = cosine_similarity(embeddings1, embeddings2)
print("Cosine Similarity Score:", similarity_score[0][0])

This is an example of a simple NLP task. We apply all the steps described above and calculate the similarity between two sentences using cosine similarity.

Benefits of embeddings
Using embeddings in natural language processing offers many benefits:

Semantic understanding: Embeddings capture the meaning of words and sentences, enabling language models to understand the context and the relationships between different words.

Compact representation: Embeddings compress large amounts of information into compact vector representations, making them computationally efficient and easy to handle.

Transfer learning: Pre-trained embeddings can be used as starting points for language-related tasks, reducing the need to train large models from scratch.
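To put a number on the compactness point above, the following sketch compares a one-hot vector with a dense embedding for a single token. The vocabulary size (30,522) and embedding width (768) match bert-base-uncased; the token ID and the random values are placeholders chosen only for illustration.

```python
import numpy as np

vocab_size = 30522   # vocabulary size of bert-base-uncased
hidden_size = 768    # embedding width of bert-base-uncased

token_id = 2023      # arbitrary token ID chosen for illustration

# One-hot: a vector as long as the vocabulary with a single 1
one_hot = np.zeros(vocab_size, dtype=np.float32)
one_hot[token_id] = 1.0

# Dense embedding: random numbers standing in for learned values
dense = np.random.default_rng(0).normal(size=hidden_size).astype(np.float32)

size_ratio = one_hot.nbytes / dense.nbytes
print("one-hot bytes:", one_hot.nbytes)
print("dense bytes:  ", dense.nbytes)
print("size ratio:   ", round(size_ratio, 1))
```

The dense vector is far smaller, and unlike the one-hot vector it can place related tokens close to each other.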

Embeddings and LLMs in E-Commerce:

In today’s world of e-commerce, businesses rely on language processing to understand consumer preferences and improve searchability.

LLMs enable e-commerce platforms to provide personalized product recommendations and increase customer satisfaction.

For example, in a Bagisto (Laravel eCommerce) store, customers search for products. Bagisto stores each customer’s search history and passes that data to the LLM, which turns it into embeddings by following the steps above.

Based on the customer’s history, the model calculates the similarity between that history and the products using cosine similarity. Bagisto then shows the most similar products to the customer.
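The ranking step can be sketched as follows. This is an illustration and not Bagisto’s actual implementation: the `recommend` function, the product names, and the three-dimensional vectors are all invented, and a real store would obtain the vectors from the model.

```python
import numpy as np

def recommend(history_vectors, product_vectors, product_names, top_k=2):
    """Rank products by cosine similarity to the averaged search history."""
    profile = history_vectors.mean(axis=0)
    profile = profile / np.linalg.norm(profile)
    normed = product_vectors / np.linalg.norm(
        product_vectors, axis=1, keepdims=True
    )
    scores = normed @ profile                 # cosine similarity per product
    order = np.argsort(scores)[::-1][:top_k]  # highest scores first
    return [(product_names[i], float(scores[i])) for i in order]

# Invented embeddings standing in for model output
history = np.array([[0.9, 0.1, 0.0],      # e.g. a search for "running shoes"
                    [0.8, 0.2, 0.1]])     # e.g. a search for "trail sneakers"
products = np.array([[0.85, 0.15, 0.05],  # sports shoes
                     [0.10, 0.90, 0.10],  # coffee maker
                     [0.70, 0.10, 0.30]]) # hiking boots
names = ["sports shoes", "coffee maker", "hiking boots"]

recommendations = recommend(history, products, names)
for name, score in recommendations:
    print(f"{name}: {score:.3f}")
```

Averaging the history into one profile vector keeps the example short. A production system might weight recent searches more heavily or use a vector database for the lookup.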

By combining LLMs with e-commerce, we can build many applications, such as customized chatbots, recommendation systems, AI translation, and AI-generated product descriptions.

Conclusion:
Embeddings play an important role in modern natural language processing applications.

By representing words and phrases as numeric vectors, embeddings enable language models to better understand language and perform a variety of language-related tasks.

By following the steps described above, you can create embeddings with open-source large language models.

Darshan

10 Aug, 2023
How to create Embeddings for Open Source-LLM’s