# Embedding Text

Learn how to use the Smally API to create high-quality text embeddings.

## Basic Usage

The `/v1/embeddings` endpoint converts text into a 384-dimensional embedding vector:
```bash
curl -X POST "http://localhost:8000/v1/embeddings" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Machine learning is transforming technology",
    "normalize": false
  }'
```
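The examples later in this guide use a small `embed()` helper. This is a minimal sketch using the `requests` library; the helper name, the `API_KEY` placeholder, and the localhost base URL are assumptions for illustration, not part of the API:

```python
import requests

API_KEY = "YOUR_API_KEY"  # placeholder; substitute your real key
BASE_URL = "http://localhost:8000"  # assumed local deployment, matching the curl example

def embed(text, normalize=False):
    """Call /v1/embeddings and return the embedding vector."""
    response = requests.post(
        f"{BASE_URL}/v1/embeddings",
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={"text": text, "normalize": normalize},
    )
    response.raise_for_status()
    return response.json()["embedding"]
```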
## Request Parameters

### `text` (required)

The text to embed. Can be a word, sentence, or paragraph.

- Type: `string`
- Max length: 2000 characters
- Max tokens: 128
```json
{
  "text": "Your text here"
}
```
### `normalize` (optional)

Whether to L2-normalize the embedding vector.

- Type: `boolean`
- Default: `false`
```json
{
  "text": "Your text here",
  "normalize": true
}
```
**When to use normalization:**

- Cosine similarity calculations: unit-length vectors make cosine similarity a plain dot product (see the sketch below)
- Consistent magnitude across all embeddings
- Some distance metrics work better with normalized vectors
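As a quick check of the first point, once vectors are unit-length, cosine similarity reduces to a dot product. A short numpy sketch with made-up vectors (not real API output):

```python
import numpy as np

# Made-up 3-dimensional vectors for illustration; real embeddings have 384 dimensions
a = np.array([0.2, -0.5, 0.3])
b = np.array([0.1, -0.4, 0.6])

# L2 normalization, as applied when "normalize": true
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)

# Cosine similarity of the raw vectors equals the dot product of the unit vectors
cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
assert np.isclose(cosine, np.dot(a_unit, b_unit))
```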
## Response Format

```json
{
  "embedding": [0.0234, -0.1567, 0.0892, ...],
  "tokens": 8,
  "cached": false,
  "model": "all-MiniLM-L6-v2"
}
```
### Fields

- `embedding`: 384-dimensional float array
- `tokens`: Number of tokens in the input text
- `cached`: Whether the result was served from cache
- `model`: Model identifier used for embeddings
## Use Cases

### Semantic Search

Find similar documents using cosine similarity:
```python
import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Embed query and documents (embed() as defined in Basic Usage)
query_emb = embed("machine learning algorithms")
doc1_emb = embed("Neural networks and deep learning")
doc2_emb = embed("The weather forecast for tomorrow")

# Calculate similarity
sim1 = cosine_similarity(query_emb, doc1_emb)  # High similarity
sim2 = cosine_similarity(query_emb, doc2_emb)  # Low similarity
```
### Clustering

Group similar texts together:

```python
from sklearn.cluster import KMeans

# Embed multiple texts (embed() as defined in Basic Usage)
texts = [
    "Machine learning basics",
    "Deep neural networks",
    "Cooking pasta recipes",
    "Italian cuisine guide",
]
embeddings = [embed(text) for text in texts]

# Cluster into two groups; random_state makes the run reproducible
kmeans = KMeans(n_clusters=2, random_state=0, n_init=10)
labels = kmeans.fit_predict(embeddings)
# e.g. [0, 0, 1, 1] - groups the ML and cooking topics (label numbering may vary)
```
### Duplicate Detection

Find near-duplicate content:

```python
threshold = 0.95  # High similarity threshold

def is_duplicate(text1, text2):
    # embed() and cosine_similarity() as defined above
    emb1 = embed(text1)
    emb2 = embed(text2)
    similarity = cosine_similarity(emb1, emb2)
    return similarity > threshold
```
### Question Answering

Match questions to answers:

```python
import numpy as np

question = "How do I reset my password?"
answers = [
    "Visit the password reset page and enter your email",
    "Contact support for billing questions",
    "Check our API documentation for integration help",
]

# Find best matching answer (embed() and cosine_similarity() as defined above)
question_emb = embed(question)
answer_embs = [embed(ans) for ans in answers]
similarities = [cosine_similarity(question_emb, ans_emb)
                for ans_emb in answer_embs]
best_answer = answers[np.argmax(similarities)]
```
## Best Practices

### Input Text Quality

✅ Good inputs:

- Complete sentences or phrases
- Clean, well-formatted text
- Consistent language and style

❌ Poor inputs:

- Single words (except for specific use cases)
- Extremely long paragraphs (truncated at 128 tokens)
- Mixed languages in the same text
### Batch Processing

Process multiple texts efficiently:

```python
import asyncio
import aiohttp

API_KEY = "YOUR_API_KEY"  # placeholder; substitute your real key

async def embed_batch(texts):
    async with aiohttp.ClientSession() as session:
        tasks = [
            embed_async(session, text)
            for text in texts
        ]
        return await asyncio.gather(*tasks)

async def embed_async(session, text):
    async with session.post(
        'http://localhost:8000/v1/embeddings',
        headers={'Authorization': f'Bearer {API_KEY}'},
        json={'text': text, 'normalize': False}
    ) as response:
        return await response.json()

# Embed 100 texts concurrently
texts = [...]  # Your texts
embeddings = asyncio.run(embed_batch(texts))
```
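Firing very large batches all at once can exhaust your rate limit. A common mitigation, sketched here with an illustrative concurrency cap of 10 (an assumption, not an API requirement), is to bound in-flight requests with `asyncio.Semaphore`:

```python
import asyncio
import aiohttp

async def embed_batch_limited(texts, max_concurrency=10):
    # max_concurrency is an illustrative default; tune it to your rate limits
    semaphore = asyncio.Semaphore(max_concurrency)

    async def limited(session, text):
        async with semaphore:
            return await embed_async(session, text)  # embed_async as defined above

    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(limited(session, t) for t in texts))
```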
### Caching

Leverage automatic caching for frequently used texts:

```python
# First request: Fresh computation
result1 = embed("common query")  # cached: false

# Second request: Served from cache
result2 = embed("common query")  # cached: true, ~1ms
```

See the Caching guide for details.
## Error Handling

### Text Too Long

```json
{
  "error": "text_too_long",
  "message": "Text exceeds maximum token limit of 128"
}
```

Solution: Split long texts into chunks (see the sketch below) or summarize.
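A minimal chunking sketch, using a whitespace word split as a rough proxy for tokens (the real tokenizer may count differently, so the 100-word chunk size is a deliberately conservative, illustrative choice):

```python
def chunk_text(text, max_words=100):
    """Split text into chunks of at most max_words words (rough proxy for the 128-token limit)."""
    words = text.split()
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]

long_document = "..."  # your long input text
chunks = chunk_text(long_document)
# Embed each chunk separately (embed() as defined in Basic Usage)
chunk_embeddings = [embed(chunk) for chunk in chunks]
```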
### Empty Text

```json
{
  "error": "invalid_request",
  "message": "Text cannot be empty"
}
```

Solution: Validate input before sending.
### Rate Limit

```json
{
  "error": "rate_limit_exceeded",
  "message": "Monthly quota exhausted"
}
```

Solution: Check the `X-RateLimit-Reset` header and wait, or upgrade your tier. A combined handling sketch for all three errors follows.
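Putting the three cases together, here is a hedged client-side handling sketch. The HTTP status check (429 for rate limiting) follows common REST conventions and is an assumption here, not something this guide specifies:

```python
import requests

def embed_safe(text):
    if not text.strip():
        raise ValueError("Text cannot be empty")  # avoid the invalid_request error

    response = requests.post(
        "http://localhost:8000/v1/embeddings",
        headers={"Authorization": f"Bearer {API_KEY}"},  # API_KEY as defined earlier
        json={"text": text},
    )
    if response.status_code == 429:  # assumed status for rate_limit_exceeded
        reset = response.headers.get("X-RateLimit-Reset")
        raise RuntimeError(f"Rate limited; quota resets at {reset}")

    body = response.json()
    if "error" in body:  # e.g. text_too_long on over-long input
        raise RuntimeError(f"{body['error']}: {body['message']}")
    return body["embedding"]
```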
## Performance Tips

- **Use caching**: Identical inputs are cached automatically
- **Batch requests**: Use async/concurrent requests for multiple texts
- **Normalize wisely**: Only normalize when needed (e.g., cosine similarity)
- **Monitor rate limits**: Check response headers to avoid quota exhaustion
## Next Steps
- Caching - Understand how caching works
- Rate Limits - Monitor and optimize usage
- API Reference - Full API documentation