Build Model-Agnostic AI Apps with Monolyth
Dozens of new AI models are trained and fine-tuned daily, with the process only accelerating. Your app deserves to remain functional even when an AI model becomes outdated. Build with the best models for each moment.
Learn Once, Build with Hundreds
Monolyth standardizes all model request parameters and responses to ensure there's no need to change your code when switching between models or providers.
Cheaper, Faster, Better
Monolyth searches for the lowest prices, lowest latencies, highest throughputs, and the best results across dozens of providers for your projects, so you don't have to.
API for Mission Critical Models
Can't find an API provider for your favorite AI model? Request one.
Models
Discover the ideal model for your project here.
Use the Models API to programmatically list all models.
API Keys
API keys authenticate your requests and bill model usage to your account. Create an API key to use curl or the OpenAI SDK with Monolyth by setting the api_base.
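As a sketch of the request shape, the snippet below assembles an authenticated POST using only the standard library. The base URL and key are placeholder assumptions, not Monolyth's documented values; use the ones from your dashboard.

```python
import json
import urllib.request

API_KEY = "sk-monolyth-..."                  # placeholder key
API_BASE = "https://api.monolyth.ai/v1"      # assumed endpoint; check your dashboard

def build_request(path: str, payload: dict) -> urllib.request.Request:
    """Assemble an authenticated, OpenAI-style JSON POST request."""
    return urllib.request.Request(
        url=API_BASE + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {API_KEY}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

req = build_request("/chat/completions", {"model": "gpt-4o", "messages": []})
# urllib.request.urlopen(req) would send it.
```

The same headers work with curl (`-H "Authorization: Bearer ..."`) or the OpenAI SDK once its base URL points at Monolyth.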
Chat Requests
The chat completion API processes a list of messages, returning a single, model-generated response. It handles both multi-turn conversations and single-turn tasks efficiently.
Chat Completion
OpenAI SDK Integration
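As a minimal sketch, an OpenAI-compatible chat completion payload looks like the following. The model slug is illustrative; use any model listed by Monolyth's Models API.

```python
import json

# Minimal OpenAI-compatible chat completion payload.
payload = {
    "model": "meta-llama/llama-3-70b-instruct",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain SSE in one sentence."},
    ],
}

body = json.dumps(payload)
# POST `body` to {api_base}/chat/completions with your API key, e.g. via
# curl or the OpenAI SDK after pointing api_base at Monolyth.
```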
Chat Responses
Responses are similar to the OpenAI Chat API: choices are always presented as an array, even for a single completion. Each choice includes a delta property for streams and a message property for other cases, simplifying code reuse across different models.
The finish_reason may differ based on the model provider. The model property indicates the specific model used by the API.
Response Body Example
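An illustrative response body in the OpenAI-compatible shape (all field values are made up):

```json
{
  "id": "chatcmpl-123",
  "object": "chat.completion",
  "created": 1700000000,
  "model": "meta-llama/llama-3-70b-instruct",
  "choices": [
    {
      "index": 0,
      "message": { "role": "assistant", "content": "Hello! How can I help?" },
      "finish_reason": "stop"
    }
  ],
  "usage": { "prompt_tokens": 9, "completion_tokens": 8, "total_tokens": 17 }
}
```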
Streaming Chat Responses
Monolyth supports streaming responses using Server-Sent Events (SSE). To enable streaming, add stream: true to the request.
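A sketch of the streaming request and of parsing one SSE line, assuming the OpenAI-style chunk format where each `data:` line carries a JSON chunk and `[DONE]` marks the end of the stream:

```python
import json

# Enable streaming by adding "stream": true to the chat request.
payload = {
    "model": "meta-llama/llama-3-70b-instruct",   # illustrative model slug
    "messages": [{"role": "user", "content": "Count to three."}],
    "stream": True,
}

def parse_sse_line(line: str):
    """Extract the JSON chunk from one SSE 'data:' line, or None."""
    if not line.startswith("data: "):
        return None
    data = line[len("data: "):]
    if data == "[DONE]":          # sentinel marking end of stream
        return None
    return json.loads(data)

chunk = parse_sse_line('data: {"choices":[{"delta":{"content":"1"}}]}')
```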
Assistant Prefill
Monolyth lets a model continue a partial assistant response that you supply, which is useful for steering model behavior. However, this feature is not supported by all models. In role-play scenarios, prefilling responses helps the model stay in character, ensuring a consistent persona throughout extended interactions.
Role | Prompt |
---|---|
System | You are an AI English teacher named Catherine. Your goal is to provide English language teaching to users who visit the AI English Teacher Co. website. Users will be confused if you don't respond in the character of Catherine. Please respond to the user's question within tags. |
User | |
Assistant (Prefilled) | |
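Concretely, a prefilled request ends the messages array with a partial assistant turn for the model to continue (where the model and provider support it). The content below is illustrative:

```python
# Prefilled assistant turn: the final message has role "assistant" and the
# model continues from its content.
messages = [
    {"role": "system", "content": "You are an AI English teacher named Catherine."},
    {"role": "user", "content": "When should I use 'since' instead of 'for'?"},
    # Partial response for the model to complete, keeping it in character:
    {"role": "assistant", "content": "[Catherine] "},
]
```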
Image Inputs for Vision LLM
Some models, like LLaVA, can take in images and answer questions about them. We recommend sending only one image per request. Each image counts as 576 tokens.
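In the OpenAI-compatible schema, an image is sent as one part of a mixed content array. The URL below is a placeholder:

```python
# Vision-style message: content is a list mixing text and image parts.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What is in this image?"},
            {"type": "image_url", "image_url": {"url": "https://example.com/cat.png"}},
        ],
    }
]
```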
Function Calling
Some models, like Hermes 2 Pro, support function calling, which is useful for building custom applications.
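A sketch of one tool definition in the OpenAI-compatible function-calling schema; `get_weather` is a hypothetical function, not part of any API:

```python
# One tool definition in the OpenAI-compatible function-calling schema.
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",          # hypothetical function
            "description": "Get the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
]
# Pass `tools` alongside `model` and `messages`; the model may reply with a
# tool_calls entry naming the function and JSON arguments to call it with.
```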
Third-Party Integration Examples
LangChain can be integrated with Monolyth to develop context-aware and reasoning-driven applications using language models.
Chat Parameters
Chat parameters are settings used to control how a large language model generates text. These parameters can significantly affect the model's output.
Some models or providers may not support all parameters; unsupported parameters will usually be ignored.
Parameter | Type | Range/Options | Default | Description |
---|---|---|---|---|
temperature | float | 0.0 to 2.0 | 1.0 | Affects the range of the model's outputs. Lower settings result in more consistent and expected outputs, while higher settings promote a wider array of unique and varied responses. A setting of 0 produces essentially identical responses to the same input. |
top_p | float | 0.0 to 1.0 | 1.0 | Restricts the model to consider only a subset of the most probable tokens, specifically those whose cumulative probability reaches a certain threshold, P. Smaller values result in more deterministic outputs, while the default value allows exploration across the entire spectrum of possible tokens. |
top_k | integer | 0 or above | 0 | Limits the model to a smaller set of token choices at each step. A value of 1 forces the model to select the most probable next token, resulting in predictable outcomes. By default, this parameter is disabled, allowing the model to explore all possible choices. |
frequency_penalty | float | -2.0 to 2.0 | 0.0 | Reduces token repetition by penalizing tokens based on their frequency in the input. The penalty increases with the token's occurrence, discouraging the use of frequently appearing tokens. Negative values promote the reuse of tokens. |
presence_penalty | float | -2.0 to 2.0 | 0.0 | Modifies the likelihood of reusing tokens from the input. Higher values decrease repetition, whereas negative values increase it. The penalty is constant and does not depend on the frequency of token occurrence. |
repetition_penalty | float | 0.0 to 2.0 | 1.0 | Minimizes token repetition from the input. Increasing this value decreases the likelihood of repeating tokens, enhancing output uniqueness. Excessively high values may disrupt output coherence, leading to less fluent sentences. |
min_p | float | 0.0 to 1.0 | 0.0 | Sets the threshold for the least probable token to be considered, as a fraction of the most likely token's probability. For example, a setting of 0.1 means only tokens with at least 10% of the highest probability token's likelihood are included. |
seed | integer | - | - | Specifying a seed ensures deterministic sampling, where identical requests yield consistent results. However, some models may not guarantee this determinism. |
max_tokens | integer | 1 or above | 1024 or undefined | Defines the maximum number of tokens the model can generate, capped by the context length minus the prompt length. |
logit_bias | map | - | - | Takes a JSON object mapping token IDs to bias values ranging from -100 to 100. This bias adjusts the model's logits before sampling. While the impact varies by model, biases between -1 and 1 modify token selection likelihood. Extremes (-100 or 100) effectively ban or ensure a token's selection. |
logprobs | boolean | - | - | Returns the log probabilities of each output token if set to true. |
top_logprobs | integer | 0 to 20 | - | Specifies the number of top tokens to return with their log probabilities at each position. Requires logprobs to be true. |
response_format | map | - | - | Dictates the output format of the model. Use { "type": "json_object" } for JSON mode, ensuring the output is valid JSON. Only a few models support this feature. |
stop | array | - | - | Halts generation upon encountering any specified token in the array. |
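Several of the parameters above can be combined in one request. The values below are illustrative; any parameter a given model or provider does not support is usually ignored:

```python
# Sampling parameters combined in one OpenAI-compatible request payload.
payload = {
    "model": "meta-llama/llama-3-70b-instruct",   # illustrative model slug
    "messages": [{"role": "user", "content": "Write a haiku about rain."}],
    "temperature": 0.8,          # more varied outputs than the default
    "top_p": 0.9,                # sample from the top 90% probability mass
    "max_tokens": 128,           # cap the completion length
    "frequency_penalty": 0.2,    # mildly discourage repeated tokens
    "stop": ["\n\n"],            # halt at the first blank line
    "seed": 42,                  # best-effort determinism
}
```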
Embeddings Requests
To get an embedding, send your text to the embeddings API endpoint with the model name. The response will include an embedding as a list of numbers that you can save in a vector database.
REST API
OpenAI SDK Integration
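As a sketch, an OpenAI-compatible embeddings payload and a small helper for pulling the vector out of the response look like this. The model name and `dimensions` value are illustrative:

```python
# Embeddings request payload (OpenAI-compatible schema).
payload = {
    "model": "text-embedding-3-small",   # illustrative model name
    "input": ["The quick brown fox"],
    "dimensions": 256,   # optional: shrink the vector where supported
}

def first_vector(response: dict) -> list:
    """Pull the first embedding vector out of a response body."""
    return response["data"][0]["embedding"]

# Demo with a stubbed response body (a real one has ~256 floats here):
vec = first_vector({"data": [{"embedding": [0.1, 0.2, 0.3]}]})
```

POST the payload to the embeddings endpoint; the returned vector can be stored directly in a vector database.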
Embeddings Responses
The default embedding vector length is 1536 for text-embedding-3-small and 3072 for text-embedding-3-large. You can adjust the dimensions parameter to reduce the size while preserving conceptual integrity. More details are available in the embedding use case section.
Embeddings Parameters
Parameter | Description |
---|---|
input | Input text to embed, encoded as a string or array of tokens. To embed multiple inputs in a single request, pass an array of strings or array of token arrays. The input must not exceed the max input tokens for the model (8192 tokens for text-embedding-ada-002), cannot be an empty string, and any array must be 2048 dimensions or less. |
model | ID of the model to use. You can use the List models API to see all of your available models, or see our Model overview for descriptions of them. |
dimensions | The number of dimensions the resulting output embeddings should have. Only supported in text-embedding-3 and later models. |
Error Codes
Code | Description |
---|---|
400 | Bad Request (invalid or missing params, CORS) |
401 | Invalid credentials (OAuth session expired, disabled/invalid API key) or insufficient account credits |
408 | Request Timeout |
429 | Too Many Requests |
500 | Unhandled error, often due to an unavailable model or provider |
Retry
Generative AI models and various providers may occasionally fail due to reasons like rate limits, network issues, moderations, or server downtime.
You can configure automatic retries for failed requests.
Retry 5 Attempts
Custom Error Code Retries
Monolyth automatically retries requests for these error codes: [429, 500, 502, 503, 504].
To retry on different error codes, specify them in your retry configuration.
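The same behavior can be sketched client-side. This is not Monolyth's built-in mechanism, just a generic retry loop over the default retryable codes with exponential backoff:

```python
import time

RETRY_CODES = {429, 500, 502, 503, 504}  # default retryable codes

def with_retries(send, max_attempts=5, retry_on=RETRY_CODES, base_delay=1.0):
    """Call send() until a non-retryable status comes back, backing off
    exponentially between attempts. `send` returns (status, body)."""
    for attempt in range(max_attempts):
        status, body = send()
        if status not in retry_on or attempt == max_attempts - 1:
            return status, body
        time.sleep(base_delay * 2 ** attempt)
    return status, body

# Demo: a fake transport that fails twice with 429, then succeeds.
responses = iter([(429, ""), (429, ""), (200, "ok")])
status, body = with_retries(lambda: next(responses), base_delay=0.0)
```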
Timeout
Handle the unpredictability of generative model latencies by automatically ending requests that surpass a set duration, allowing for efficient error management or faster alternative requests.
Set a 10-second Timeout
Set a 1-second Timeout with Retry
Monolyth issues a standard 408 error for timed-out requests.
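Client-side, the same deadline behavior can be sketched with a thread pool: give up once the timeout passes and treat the failure like the 408 described above. This is a generic sketch, not a Monolyth API:

```python
import concurrent.futures

def call_with_timeout(fn, seconds):
    """Run fn() and give up after `seconds`, raising TimeoutError
    (analogous to an HTTP 408) when the deadline passes."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(fn)
        try:
            return future.result(timeout=seconds)
        except concurrent.futures.TimeoutError:
            raise TimeoutError("request exceeded deadline (treat like HTTP 408)")

result = call_with_timeout(lambda: "fast response", seconds=10)
```

Note that the executor still waits for the abandoned worker on exit; a production client would cancel the underlying request instead.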
Fallbacks
Monolyth directs each request to the optimal model providers. When multiple providers serve the same model, we try each one sequentially in optimal order (lowest cost, highest throughput) until one succeeds or all options are exhausted.
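The fallback order described above can be sketched as a pure function. The provider entries and `send` callable are illustrative stand-ins for real transports:

```python
def complete_with_fallback(providers, send):
    """Try providers in (lowest cost, highest throughput) order until one
    succeeds. `send(provider)` returns a response or raises on failure."""
    ordered = sorted(providers, key=lambda p: (p["cost"], -p["throughput"]))
    last_error = None
    for provider in ordered:
        try:
            return send(provider)
        except Exception as err:        # provider failed; fall through to next
            last_error = err
    raise RuntimeError("all providers exhausted") from last_error

providers = [
    {"name": "a", "cost": 2.0, "throughput": 90},
    {"name": "b", "cost": 1.0, "throughput": 40},
]

def send(p):
    if p["name"] == "b":                # cheapest provider is down in this demo
        raise ConnectionError("provider b unavailable")
    return f"answer from {p['name']}"

result = complete_with_fallback(providers, send)
```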
Performance
Does Monolyth affect the latency of API requests?
Monolyth utilizes global edge workers to minimize latency, typically adding only 20-40ms to API request roundtrips.
Individual model performance
You can check each model's individual page to learn more about its throughput and status.
Status | Description |
---|---|
⚫ OFFLINE | The model or provider does not exist yet. |
🔵 COLD | The model or provider is idle and not yet actively processing requests. |
🟢 ACTIVE | The model or provider is operational. |
🟡 DEGRADED | The model or provider is operational but experiencing slower response times and occasional errors. |
Data Privacy
Monolyth does not log your API input prompts by default.
We collect token usage and request counts to calculate your credit usage and to provide aggregated, anonymous statistics on trending AI models.
Some AI model providers might log your input prompts. We'll tag these on model pages with a Privacy Policy indicator to keep you informed, though it's not an exhaustive guide to all third-party data practices.
You can always change data handling preferences in your settings. Turning off logging here also stops external model providers from saving your prompts, which they might otherwise use for things like improving their models.
Support
Join our Discord to discuss potential new features, ask questions, and get support.