Rate Limits
Smally enforces rate limits based on your organization's subscription tier to ensure fair usage and system stability.
Rate Limit Tiers
| Tier | Monthly Requests | Cost |
|---|---|---|
| Free | 10,000 | $0 |
| Pro | 100,000 | $29/mo |
| Scale | 1,000,000 | $199/mo |
| Enterprise | Unlimited | Custom |
How Rate Limiting Works
Rate limits are enforced per organization and reset monthly:
- Each API request decrements your remaining quota
- Quota resets on the 1st of each month at 00:00 UTC
- All API keys in an organization share the same quota
Organization "Acme Corp" (Pro tier):
├─ API Key 1: "Production" ╲
├─ API Key 2: "Staging" ├─ Share 100,000/month
└─ API Key 3: "Development" ╱
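The reset rule above (the 1st of each month at 00:00 UTC) can be computed client-side, for example to show a countdown. This is a minimal sketch; `next_quota_reset` is a hypothetical helper name, not part of the Smally API, and it expects a timezone-aware datetime.

```python
from datetime import datetime, timezone


def next_quota_reset(now: datetime) -> datetime:
    """Return the first instant of the next calendar month, in UTC."""
    now_utc = now.astimezone(timezone.utc)
    # December rolls over to January of the following year.
    year = now_utc.year + (1 if now_utc.month == 12 else 0)
    month = 1 if now_utc.month == 12 else now_utc.month + 1
    return datetime(year, month, 1, tzinfo=timezone.utc)


mid_january = datetime(2025, 1, 15, 9, 30, tzinfo=timezone.utc)
print(next_quota_reset(mid_january).isoformat())
```

The X-RateLimit-Reset response header remains the authoritative value; a local calculation like this is only useful when no recent response is available.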
Checking Your Rate Limit
Response Headers
Every API response includes rate limit headers:
HTTP/1.1 200 OK
X-RateLimit-Limit: 100000
X-RateLimit-Remaining: 95432
X-RateLimit-Reset: 2025-02-01T00:00:00Z
- X-RateLimit-Limit: Total monthly quota
- X-RateLimit-Remaining: Requests left this month
- X-RateLimit-Reset: When quota resets (ISO 8601)
Parsing Headers
import requests
from datetime import datetime

response = requests.post(...)

limit = int(response.headers['X-RateLimit-Limit'])
remaining = int(response.headers['X-RateLimit-Remaining'])
# datetime.fromisoformat() rejects a trailing 'Z' before Python 3.11, so normalize it
reset = datetime.fromisoformat(
    response.headers['X-RateLimit-Reset'].replace('Z', '+00:00')
)

print(f"Used: {limit - remaining}/{limit}")
print(f"Resets: {reset}")
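If several parts of an application need these values, it may help to parse the three headers once into a small object. The sketch below assumes the header names and ISO 8601 format shown above; `RateLimitStatus` and `parse_rate_limit` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Mapping


@dataclass(frozen=True)
class RateLimitStatus:
    limit: int
    remaining: int
    reset: datetime

    @property
    def used(self) -> int:
        return self.limit - self.remaining

    def seconds_until_reset(self, now: datetime) -> float:
        # Never negative, even if the reset time has already passed.
        return max(0.0, (self.reset - now).total_seconds())


def parse_rate_limit(headers: Mapping[str, str]) -> RateLimitStatus:
    """Build a RateLimitStatus from the X-RateLimit-* response headers."""
    reset_raw = headers["X-RateLimit-Reset"].replace("Z", "+00:00")
    return RateLimitStatus(
        limit=int(headers["X-RateLimit-Limit"]),
        remaining=int(headers["X-RateLimit-Remaining"]),
        reset=datetime.fromisoformat(reset_raw),
    )


status = parse_rate_limit({
    "X-RateLimit-Limit": "100000",
    "X-RateLimit-Remaining": "95432",
    "X-RateLimit-Reset": "2025-02-01T00:00:00Z",
})
print(f"Used: {status.used}/{status.limit}")
```

With the requests library, `response.headers` can be passed in directly because it is a case-insensitive mapping; a plain dict must use the exact header spelling.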
Organizations
View usage in real time on the Organizations page:
http://localhost:8000/organizations
The page shows:
- Current usage
- Historical trends
- Per-key breakdown
- Usage alerts
Rate Limit Exceeded
When you exceed your quota, requests return 429 Too Many Requests:
{
  "error": "rate_limit_exceeded",
  "message": "Monthly quota exhausted. Resets on 2025-02-01T00:00:00Z"
}
Error Response
HTTP/1.1 429 Too Many Requests
X-RateLimit-Limit: 10000
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 2025-02-01T00:00:00Z
Retry-After: 86400

{
  "error": "rate_limit_exceeded",
  "message": "Monthly quota exhausted"
}
Retry-After: Seconds until quota reset
Handling Rate Limits
import time

import requests

def embed_with_retry(text, max_retries=3):
    for attempt in range(max_retries):
        response = requests.post(...)

        if response.status_code == 200:
            return response.json()

        if response.status_code == 429:
            # Rate limited: Retry-After is the number of seconds until the quota resets
            retry_after = int(response.headers.get('Retry-After', 60))

            if attempt < max_retries - 1:
                print(f"Rate limited. Waiting {retry_after}s...")
                time.sleep(retry_after)
            else:
                raise Exception("Rate limit exceeded")
        else:
            # Raise for any other 4xx/5xx status
            response.raise_for_status()

    raise Exception("Max retries exceeded")
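Because Retry-After counts the seconds until the monthly reset, it can be days long, and sleeping for that long is rarely useful. One possible refinement is to wait only when the delay is short and fail fast otherwise. This is a sketch under that assumption; `retry_delay` and the 300-second cap are illustrative choices, not Smally recommendations.

```python
from typing import Mapping, Optional


def retry_delay(headers: Mapping[str, str], max_wait_seconds: int = 300) -> Optional[int]:
    """Return how long to sleep before retrying, or None if waiting is impractical."""
    try:
        retry_after = int(headers.get("Retry-After", ""))
    except ValueError:
        return None  # header missing or not a whole number of seconds
    if retry_after < 0 or retry_after > max_wait_seconds:
        return None  # too long to block on; surface the error to the caller
    return retry_after


print(retry_delay({"Retry-After": "30"}))
print(retry_delay({"Retry-After": "86400"}))
```

A caller would sleep for the returned number of seconds, or raise immediately and queue the work for after the reset when the result is `None`.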
Optimizing Rate Limit Usage
1. Leverage Caching
Identical requests are cached and don't count toward rate limits:
# First call: Uses 1 quota
embed("common query") # cached: false
# Second call: FREE! (cached)
embed("common query") # cached: true
Impact: Can reduce quota usage by 50-80% for typical workloads.
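To see what a given cache hit rate means for quota, the arithmetic can be written out directly. The sketch below assumes that only uncached requests count against the quota, as stated above; `billable_requests` is a hypothetical helper and the 150,000 figure is just an example workload.

```python
def billable_requests(total_requests: int, cache_hit_rate: float) -> int:
    """Estimate how many requests count against the quota for a given hit rate."""
    if not 0.0 <= cache_hit_rate <= 1.0:
        raise ValueError("cache_hit_rate must be between 0 and 1")
    # Only cache misses consume quota.
    return round(total_requests * (1.0 - cache_hit_rate))


for hit_rate in (0.5, 0.8):
    print(f"{hit_rate:.0%} hit rate: {billable_requests(150_000, hit_rate)} billable")
```

The actual hit rate depends on how repetitive the workload is, so it is worth measuring before relying on it for tier planning.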
2. Deduplicate Requests
# Bad: 1000 requests (many duplicates)
texts = ["hello"] * 500 + ["world"] * 500
for text in texts:
    embed(text)  # 1000 API calls

# Good: 2 requests, results reused locally
unique_texts = set(texts)
embeddings = {text: embed(text) for text in unique_texts}
# 2 API calls; every other occurrence is looked up in `embeddings`
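When results must line up with the original list, the deduplication can be wrapped in a helper that calls the API once per distinct text and then maps results back in order. This is a minimal sketch; `embed_deduplicated` is a hypothetical name, and `fake_embed` stands in for a real embedding call so the example runs offline.

```python
from typing import Callable, Dict, List, Sequence


def embed_deduplicated(
    texts: Sequence[str],
    embed_fn: Callable[[str], List[float]],
) -> List[List[float]]:
    """Call embed_fn once per distinct text and return results in input order."""
    results_by_text: Dict[str, List[float]] = {}
    for text in texts:
        if text not in results_by_text:
            results_by_text[text] = embed_fn(text)
    return [results_by_text[text] for text in texts]


calls = []


def fake_embed(text: str) -> List[float]:
    # Stand-in for a real API call; records each invocation.
    calls.append(text)
    return [float(len(text))]


vectors = embed_deduplicated(["hello"] * 500 + ["world"] * 500, fake_embed)
print(len(vectors), "results from", len(calls), "calls")
```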
3. Batch Processing
Process in batches to monitor usage:
def embed_batch(texts, batch_size=100):
    results = []

    for i in range(0, len(texts), batch_size):
        batch = texts[i:i+batch_size]

        # Check rate limit before batch
        response = requests.post(...)
        remaining = int(response.headers['X-RateLimit-Remaining'])

        if remaining < batch_size:
            print(f"Warning: Only {remaining} requests left")
            # Maybe wait or upgrade tier

        results.extend([embed(text) for text in batch])

    return results
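Instead of only printing a warning, the remaining quota can be used to decide up front how much work to send now and how much to defer. The sketch below assumes one request per text, as in the loop above; `split_by_budget` is a hypothetical helper, and `remaining` would come from the most recent X-RateLimit-Remaining header.

```python
from typing import List, Sequence, Tuple


def split_by_budget(
    texts: Sequence[str],
    batch_size: int,
    remaining: int,
) -> Tuple[List[List[str]], List[str]]:
    """Return (batches that fit in the remaining quota, texts to defer)."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    affordable = max(0, min(remaining, len(texts)))
    to_send = list(texts[:affordable])
    deferred = list(texts[affordable:])
    batches = [to_send[i:i + batch_size] for i in range(0, len(to_send), batch_size)]
    return batches, deferred


batches, deferred = split_by_budget([f"doc {n}" for n in range(250)], 100, remaining=180)
print([len(b) for b in batches], len(deferred))
```

Deferred texts can be retried after the quota resets or after a tier upgrade.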
4. Monitor Usage
Set up alerts when approaching limits:
def check_rate_limit_warning(response, threshold=0.9):
    limit = int(response.headers['X-RateLimit-Limit'])
    remaining = int(response.headers['X-RateLimit-Remaining'])

    used_pct = (limit - remaining) / limit

    if used_pct >= threshold:
        print(f"⚠️ WARNING: {used_pct*100:.1f}% of quota used!")
        # Send alert, upgrade tier, etc.
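A check like this fires on every response once the threshold is passed, which can flood an alert channel. One way to avoid that is to track several thresholds and report each only the first time it is crossed. This is a sketch, not a Smally feature; `UsageAlerter` and its default thresholds are illustrative, and it assumes the limit is a positive number.

```python
from typing import Iterable, List


class UsageAlerter:
    """Report each usage threshold once, the first time it is crossed."""

    def __init__(self, thresholds: Iterable[float] = (0.5, 0.8, 0.9)) -> None:
        self._pending = sorted(thresholds)

    def check(self, limit: int, remaining: int) -> List[float]:
        used_fraction = (limit - remaining) / limit
        crossed = [t for t in self._pending if used_fraction >= t]
        # Keep only thresholds that have not fired yet.
        self._pending = [t for t in self._pending if used_fraction < t]
        return crossed


alerter = UsageAlerter()
for remaining in (60_000, 45_000, 45_000, 5_000):
    for threshold in alerter.check(limit=100_000, remaining=remaining):
        print(f"Crossed {threshold:.0%} of quota")
```

A new `UsageAlerter` would need to be created after each monthly reset so the thresholds can fire again.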
Rate Limit Best Practices
Development vs Production
Use separate organizations, each with its own API keys, because all keys within one organization share a quota:
# Development (Free tier - 10k/month)
DEV_API_KEY = "sk_dev_..."
# Production (Pro tier - 100k/month)
PROD_API_KEY = "sk_prod_..."
This prevents dev/testing from consuming production quota.
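In practice the keys are usually read from the environment rather than hard-coded. The sketch below shows one way to do that; the variable names `APP_ENV`, `SMALLY_DEV_API_KEY`, and `SMALLY_PROD_API_KEY` are assumptions chosen for illustration, not names the Smally client requires.

```python
import os
from typing import Mapping


def select_api_key(env: Mapping[str, str] = os.environ) -> str:
    """Pick the API key for the current environment (defaults to development)."""
    environment = env.get("APP_ENV", "development")
    variable = "SMALLY_PROD_API_KEY" if environment == "production" else "SMALLY_DEV_API_KEY"
    try:
        return env[variable]
    except KeyError:
        raise RuntimeError(f"Set {variable} before calling the API") from None


example_env = {
    "APP_ENV": "production",
    "SMALLY_DEV_API_KEY": "sk_dev_example",
    "SMALLY_PROD_API_KEY": "sk_prod_example",
}
print(select_api_key(example_env))
```

Defaulting to the development key means a misconfigured machine consumes Free tier quota rather than production quota.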
Estimate Usage
Before deploying, estimate monthly usage:
# Example calculation
requests_per_user_per_day = 50
active_users = 100
days_per_month = 30
monthly_requests = (
    requests_per_user_per_day *
    active_users *
    days_per_month
)  # = 150,000
# Exceeds the Pro tier (100k), so this workload needs the Scale tier (1M)
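The tier choice can also be derived from the estimate by comparing it with the quotas in the Rate Limit Tiers table. This is a minimal sketch; `TIER_QUOTAS` and `recommend_tier` are hypothetical names, and the quotas are copied from the table above.

```python
from typing import List, Tuple

# Monthly quotas from the Rate Limit Tiers table, smallest first.
TIER_QUOTAS: List[Tuple[str, int]] = [
    ("Free", 10_000),
    ("Pro", 100_000),
    ("Scale", 1_000_000),
]


def recommend_tier(monthly_requests: int) -> str:
    """Return the smallest tier whose monthly quota covers the estimate."""
    for name, quota in TIER_QUOTAS:
        if monthly_requests <= quota:
            return name
    return "Enterprise"  # unlimited requests


print(recommend_tier(50 * 100 * 30))
```

An estimate close to a tier boundary leaves little room for growth, so it may be worth adding a safety margin before choosing.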
Implement Backoff
When approaching limits, slow down requests:
def adaptive_embed(text):
    response = requests.post(...)

    remaining = int(response.headers['X-RateLimit-Remaining'])
    limit = int(response.headers['X-RateLimit-Limit'])

    # Slow down when < 10% remaining
    if remaining < limit * 0.1:
        time.sleep(1)  # Add delay

    return response.json()
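A fixed one-second delay is the simplest option. A variation is to scale the delay as the quota shrinks, so requests slow down gradually instead of all at once. The sketch below assumes a linear ramp; `backoff_delay`, the 10% starting point, and the 5-second ceiling are illustrative values.

```python
def backoff_delay(
    remaining: int,
    limit: int,
    start_fraction: float = 0.1,
    max_delay: float = 5.0,
) -> float:
    """Return seconds to pause: 0 above start_fraction, rising to max_delay at 0 remaining."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    remaining_fraction = max(0.0, remaining / limit)
    if remaining_fraction >= start_fraction:
        return 0.0
    # Linear ramp from 0 (at start_fraction) to max_delay (at zero remaining).
    return max_delay * (1.0 - remaining_fraction / start_fraction)


for remaining in (50_000, 5_000, 0):
    print(remaining, backoff_delay(remaining, limit=100_000))
```

The returned value would be passed to `time.sleep()` before the next request. Slowing down spreads the remaining quota over more time; it does not increase it.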
Upgrading Your Tier
When to Upgrade
Upgrade when you consistently:
- Hit rate limits before month end
- Need higher throughput
- Want production SLAs
How to Upgrade
Contact sales to upgrade your tier, or upgrade from the Organizations page:
http://localhost:8000/organizations
Changes take effect immediately.
Enterprise Custom Limits
Enterprise tier offers:
- Unlimited requests
- Custom rate limits (e.g., per-second instead of per-month)
- Dedicated infrastructure
- SLA guarantees
- Priority support
Contact sales: sales@smally.ai
Troubleshooting
Unexpected Rate Limit
Problem: Hit rate limit earlier than expected
Possible causes:
- Multiple API keys: All keys in org share quota
  # Check all keys in your organization
  curl http://localhost:8000/v1/organizations/ORG_ID/keys
- Uncached requests: Not leveraging cache effectively
  # Add logging to check cache hit rate
  if response['cached']:
      cache_hits += 1
- Testing in production: Dev/test using production keys
  # Use separate keys
  if ENV == 'development':
      api_key = DEV_KEY  # Free tier
  else:
      api_key = PROD_KEY  # Pro tier
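To check whether the cache is being used effectively, hits and misses can be counted from the `cached` field of each response body. This is a sketch that assumes the parsed JSON body is a dict with a boolean `cached` key, as in the snippets above; `CacheStats` is a hypothetical helper.

```python
from dataclasses import dataclass


@dataclass
class CacheStats:
    hits: int = 0
    misses: int = 0

    def record(self, response_body: dict) -> None:
        """Count one response as a cache hit or miss."""
        if response_body.get("cached"):
            self.hits += 1
        else:
            self.misses += 1

    @property
    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


stats = CacheStats()
for body in ({"cached": False}, {"cached": True}, {"cached": True}, {"cached": True}):
    stats.record(body)
print(f"Cache hit rate: {stats.hit_rate:.0%}")
```

A low hit rate on a workload that should be repetitive suggests the request texts differ in small ways, such as whitespace or casing.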
Rate Limit Not Resetting
Problem: Limit didn't reset on 1st of month
Solutions:
# Check server time (must be UTC)
curl http://localhost:8000/health
# Verify organization tier
curl http://localhost:8000/v1/organizations/ORG_ID
Contact support if the issue persists.
Next Steps
- Caching - Reduce quota usage with caching
- API Reference - Full API documentation
- Organizations - Monitor usage