Build Model-Agnostic AI Apps with Monolyth
Dozens of new AI models are trained and fine-tuned daily, with the process only accelerating. Your app deserves to remain functional even when an AI model becomes outdated. Build with the best models for each moment.
Learn Once, Build with Hundreds
Monolyth standardizes all model request parameters and responses to ensure there's no need to change your code when switching between models or providers.
Cheaper, Faster, Better
Monolyth searches for the lowest prices, lowest latencies, highest throughputs, and the best results across dozens of providers for your projects, so you don't have to.
API for Mission Critical Models
Can't find an API provider for your favorite AI model? Request one.
Models
Discover the ideal model for your project here.
Use the Models API to programmatically list all models.
API Keys
API keys authenticate your requests and bill model usage to your account. Create an API key to use curl or the OpenAI SDK with Monolyth by setting the api_base.
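As a sketch of the request shape, the snippet below assembles an authenticated POST using only the standard library. The base URL and key are placeholder assumptions, not Monolyth's documented values; use the ones from your dashboard.

```python
import json
import urllib.request

API_KEY = "sk-monolyth-..."                  # placeholder key
API_BASE = "https://api.monolyth.ai/v1"      # assumed endpoint; check your dashboard

def build_request(path: str, payload: dict) -> urllib.request.Request:
    """Assemble an authenticated, OpenAI-style JSON POST request."""
    return urllib.request.Request(
        url=API_BASE + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {API_KEY}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

req = build_request("/chat/completions", {"model": "gpt-4o", "messages": []})
# urllib.request.urlopen(req) would send it.
```

The same headers work with curl (`-H "Authorization: Bearer ..."`) or the OpenAI SDK once its base URL points at Monolyth.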
Chat Requests
The chat completion API processes a list of messages, returning a single, model-generated response. It handles both multi-turn conversations and single-turn tasks efficiently.
Chat Completion
OpenAI SDK Integration
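As a minimal sketch, an OpenAI-compatible chat completion payload looks like the following. The model slug is illustrative; use any model listed by Monolyth's Models API.

```python
import json

# Minimal OpenAI-compatible chat completion payload.
payload = {
    "model": "meta-llama/llama-3-70b-instruct",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain SSE in one sentence."},
    ],
}

body = json.dumps(payload)
# POST `body` to {api_base}/chat/completions with your API key, e.g. via
# curl or the OpenAI SDK after pointing api_base at Monolyth.
```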
Chat Responses
Responses are similar to the OpenAI Chat API: choices are always presented as an array, even for a single completion. Each choice includes a delta property for streams and a message property for other cases, simplifying code reuse across different models.
The finish_reason may differ based on the model provider. The model property indicates the specific model used by the API.
Response Body Example
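An illustrative response body in the OpenAI-compatible shape (all field values are made up):

```json
{
  "id": "chatcmpl-123",
  "object": "chat.completion",
  "created": 1700000000,
  "model": "meta-llama/llama-3-70b-instruct",
  "choices": [
    {
      "index": 0,
      "message": { "role": "assistant", "content": "Hello! How can I help?" },
      "finish_reason": "stop"
    }
  ],
  "usage": { "prompt_tokens": 9, "completion_tokens": 8, "total_tokens": 17 }
}
```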
Streaming Chat Responses
Monolyth supports streaming responses using Server-Sent Events (SSE). To enable streaming, add stream: true to the request.
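A sketch of the streaming request and of parsing one SSE line, assuming the OpenAI-style chunk format where each `data:` line carries a JSON chunk and `[DONE]` marks the end of the stream:

```python
import json

# Enable streaming by adding "stream": true to the chat request.
payload = {
    "model": "meta-llama/llama-3-70b-instruct",   # illustrative model slug
    "messages": [{"role": "user", "content": "Count to three."}],
    "stream": True,
}

def parse_sse_line(line: str):
    """Extract the JSON chunk from one SSE 'data:' line, or None."""
    if not line.startswith("data: "):
        return None
    data = line[len("data: "):]
    if data == "[DONE]":          # sentinel marking end of stream
        return None
    return json.loads(data)

chunk = parse_sse_line('data: {"choices":[{"delta":{"content":"1"}}]}')
```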
Assistant Prefill
Monolyth lets a model continue a partial assistant response that you supply, which is useful for steering model behavior. However, this feature is not supported by all models. In role-play scenarios, prefilling responses helps the model stay in character, ensuring a consistent persona throughout extended interactions.
Role | Prompt |
---|---|
System | You are an AI English teacher named Catherine. Your goal is to provide English language teaching to users who visit the AI English Teacher Co. website. Users will be confused if you don't respond in the character of Catherine. Please respond to the user's question within tags. |
User | |
Assistant (Prefilled) | |
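Concretely, a prefilled request ends the messages array with a partial assistant turn for the model to continue (where the model and provider support it). The content below is illustrative:

```python
# Prefilled assistant turn: the final message has role "assistant" and the
# model continues from its content.
messages = [
    {"role": "system", "content": "You are an AI English teacher named Catherine."},
    {"role": "user", "content": "When should I use 'since' instead of 'for'?"},
    # Partial response for the model to complete, keeping it in character:
    {"role": "assistant", "content": "[Catherine] "},
]
```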
Image Inputs for Vision LLM
Some models, like LLaVA, can take in images and answer questions about them. We recommend sending only one image per request. Each image counts as 576 tokens.
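In the OpenAI-compatible schema, an image is sent as one part of a mixed content array. The URL below is a placeholder:

```python
# Vision-style message: content is a list mixing text and image parts.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What is in this image?"},
            {"type": "image_url", "image_url": {"url": "https://example.com/cat.png"}},
        ],
    }
]
```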
Function Calling
Some models, like Hermes 2 Pro, support function calling, which is useful for building custom applications.
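A sketch of one tool definition in the OpenAI-compatible function-calling schema; `get_weather` is a hypothetical function, not part of any API:

```python
# One tool definition in the OpenAI-compatible function-calling schema.
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",          # hypothetical function
            "description": "Get the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
]
# Pass `tools` alongside `model` and `messages`; the model may reply with a
# tool_calls entry naming the function and JSON arguments to call it with.
```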
Third-Party Integration Examples
LangChain can be integrated with Monolyth to develop context-aware and reasoning-driven applications using language models.
Chat Parameters
Chat parameters are settings used to control how a large language model generates text. These parameters can significantly affect the model's output.
Some models or providers may not support all parameters; unsupported parameters will usually be ignored.
Parameter | Type | Range/Options | Default | Description |
---|---|---|---|---|
temperature | float | 0.0 to 2.0 | 1.0 | Affects the range of the model's outputs. Lower settings result in more consistent and expected outputs, while higher settings promote a wider array of unique and varied responses. A setting of 0 produces essentially identical responses to the same input. |
top_p | float | 0.0 to 1.0 | 1.0 | Restricts the model to consider only a subset of the most probable tokens, specifically those whose cumulative probability reaches a certain threshold, P. Smaller values result in more deterministic outputs, while the default value allows exploration across the entire spectrum of possible tokens. |
top_k | integer | 0 or above | 0 | Limits the model to a smaller set of token choices at each step. A value of 1 forces the model to select the most probable next token, resulting in predictable outcomes. By default, this parameter is disabled, allowing the model to explore all possible choices. |
frequency_penalty | float | -2.0 to 2.0 | 0.0 | Reduces token repetition by penalizing tokens based on their frequency in the input. The penalty increases with the token's occurrence, discouraging the use of frequently appearing tokens. Negative values promote the reuse of tokens. |
presence_penalty | float | -2.0 to 2.0 | 0.0 | Modifies the likelihood of reusing tokens from the input. Higher values decrease repetition, whereas negative values increase it. The penalty is constant and does not depend on the frequency of token occurrence. |
repetition_penalty | float | 0.0 to 2.0 | 1.0 | Minimizes token repetition from the input. Increasing this value decreases the likelihood of repeating tokens, enhancing output uniqueness. Excessively high values may disrupt output coherence, leading to less fluent sentences. |
min_p | float | 0.0 to 1.0 | 0.0 | Sets the threshold for the least probable token to be considered, as a fraction of the most likely token's probability. For example, a setting of 0.1 means only tokens with at least 10% of the highest probability token's likelihood are included. |
seed | integer | - | - | Specifying a seed ensures deterministic sampling, where identical requests yield consistent results. However, some models may not guarantee this determinism. |
max_tokens | integer | 1 or above | 1024 or undefined | Defines the maximum number of tokens the model can generate, capped by the context length minus the prompt length. |
logit_bias | map | - | - | Takes a JSON object mapping token IDs to bias values ranging from -100 to 100. This bias adjusts the model's logits before sampling. While the impact varies by model, biases between -1 and 1 modify token selection likelihood. Extremes (-100 or 100) effectively ban or ensure a token's selection. |
logprobs | boolean | - | - | Returns the log probabilities of each output token if set to true. |
top_logprobs | integer | 0 to 20 | - | Specifies the number of top tokens to return with their log probabilities at each position. Requires logprobs to be true. |
response_format | map | - | - | Dictates the output format of the model. Use { "type": "json_object" } for JSON mode, ensuring the output is valid JSON. Only a few models support this feature. |
stop | array | - | - | Halts generation upon encountering any specified token in the array. |
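Several of the parameters above can be combined in one request. The values below are illustrative; any parameter a given model or provider does not support is usually ignored:

```python
# Sampling parameters combined in one OpenAI-compatible request payload.
payload = {
    "model": "meta-llama/llama-3-70b-instruct",   # illustrative model slug
    "messages": [{"role": "user", "content": "Write a haiku about rain."}],
    "temperature": 0.8,          # more varied outputs than the default
    "top_p": 0.9,                # sample from the top 90% probability mass
    "max_tokens": 128,           # cap the completion length
    "frequency_penalty": 0.2,    # mildly discourage repeated tokens
    "stop": ["\n\n"],            # halt at the first blank line
    "seed": 42,                  # best-effort determinism
}
```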
Embeddings Requests
To get an embedding, send your text to the embeddings API endpoint with the model name. The response will include an embedding as a list of numbers that you can save in a vector database.
REST API
OpenAI SDK Integration
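As a sketch, an OpenAI-compatible embeddings payload and a small helper for pulling the vector out of the response look like this. The model name and `dimensions` value are illustrative:

```python
# Embeddings request payload (OpenAI-compatible schema).
payload = {
    "model": "text-embedding-3-small",   # illustrative model name
    "input": ["The quick brown fox"],
    "dimensions": 256,   # optional: shrink the vector where supported
}

def first_vector(response: dict) -> list:
    """Pull the first embedding vector out of a response body."""
    return response["data"][0]["embedding"]

# Demo with a stubbed response body (a real one has ~256 floats here):
vec = first_vector({"data": [{"embedding": [0.1, 0.2, 0.3]}]})
```

POST the payload to the embeddings endpoint; the returned vector can be stored directly in a vector database.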
Embeddings Responses
The default embedding vector length is 1536 for text-embedding-3-small and 3072 for text-embedding-3-large. You can adjust the dimensions parameter to reduce the size while preserving conceptual integrity. More details are available in the embedding use case section.
Embeddings Parameters
Parameter | Description |
---|---|
input | Input text to embed, encoded as a string or array of tokens. To embed multiple inputs in a single request, pass an array of strings or array of token arrays. The input must not exceed the max input tokens for the model (8192 tokens for text-embedding-ada-002), cannot be an empty string, and any array must be 2048 dimensions or less. |
model | ID of the model to use. You can use the List models API to see all of your available models, or see our Model overview for descriptions of them. |
dimensions | The number of dimensions the resulting output embeddings should have. Only supported in text-embedding-3 and later models. |
Error Codes
Code | Description |
---|---|
400 | Bad Request (invalid or missing params, CORS) |
401 | Invalid credentials (OAuth session expired, disabled/invalid API key) or insufficient account credits |
408 | Request Timeout |
429 | Too Many Requests |
500 | Unhandled error, often due to an unavailable model or provider |
Retry
Generative AI models and various providers may occasionally fail due to reasons like rate limits, network issues, moderations, or server downtime.
You can configure automatic retries for failed requests.
Retry 5 Attempts
Custom Error Code Retries
Monolyth automatically retries requests for these error codes: [429, 500, 502, 503, 504].
To retry on different error codes, specify them in your retry configuration.
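The same behavior can be sketched client-side. This is not Monolyth's built-in mechanism, just a generic retry loop over the default retryable codes with exponential backoff:

```python
import time

RETRY_CODES = {429, 500, 502, 503, 504}  # default retryable codes

def with_retries(send, max_attempts=5, retry_on=RETRY_CODES, base_delay=1.0):
    """Call send() until a non-retryable status comes back, backing off
    exponentially between attempts. `send` returns (status, body)."""
    for attempt in range(max_attempts):
        status, body = send()
        if status not in retry_on or attempt == max_attempts - 1:
            return status, body
        time.sleep(base_delay * 2 ** attempt)
    return status, body

# Demo: a fake transport that fails twice with 429, then succeeds.
responses = iter([(429, ""), (429, ""), (200, "ok")])
status, body = with_retries(lambda: next(responses), base_delay=0.0)
```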
Timeout
Handle the unpredictability of generative model latencies by automatically ending requests that surpass a set duration, allowing for efficient error management or faster alternative requests.
Set a 10-second Timeout
Set a 1-second Timeout with Retry
Monolyth issues a standard 408 error for timed-out requests.
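Client-side, the same deadline behavior can be sketched with a thread pool: give up once the timeout passes and treat the failure like the 408 described above. This is a generic sketch, not a Monolyth API:

```python
import concurrent.futures

def call_with_timeout(fn, seconds):
    """Run fn() and give up after `seconds`, raising TimeoutError
    (analogous to an HTTP 408) when the deadline passes."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(fn)
        try:
            return future.result(timeout=seconds)
        except concurrent.futures.TimeoutError:
            raise TimeoutError("request exceeded deadline (treat like HTTP 408)")

result = call_with_timeout(lambda: "fast response", seconds=10)
```

Note that the executor still waits for the abandoned worker on exit; a production client would cancel the underlying request instead.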
Fallbacks
Monolyth directs each request to the optimal model providers. When multiple providers serve the same model, we try each one sequentially in optimal order (lowest cost, highest throughput) until one succeeds or all options are exhausted.
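The fallback order described above can be sketched as a pure function. The provider entries and `send` callable are illustrative stand-ins for real transports:

```python
def complete_with_fallback(providers, send):
    """Try providers in (lowest cost, highest throughput) order until one
    succeeds. `send(provider)` returns a response or raises on failure."""
    ordered = sorted(providers, key=lambda p: (p["cost"], -p["throughput"]))
    last_error = None
    for provider in ordered:
        try:
            return send(provider)
        except Exception as err:        # provider failed; fall through to next
            last_error = err
    raise RuntimeError("all providers exhausted") from last_error

providers = [
    {"name": "a", "cost": 2.0, "throughput": 90},
    {"name": "b", "cost": 1.0, "throughput": 40},
]

def send(p):
    if p["name"] == "b":                # cheapest provider is down in this demo
        raise ConnectionError("provider b unavailable")
    return f"answer from {p['name']}"

result = complete_with_fallback(providers, send)
```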
Performance
Does Monolyth affect the latency of API requests?
Monolyth utilizes global edge workers to minimize latency, typically adding only 20-40ms to API request roundtrips.
Individual model performance
You can check each model's individual page to learn more about its throughput and status.
Status | Description |
---|---|
⚫ OFFLINE | The model or provider does not exist yet. |
🔵 COLD | The model or provider is idle and not yet actively processing requests. |
🟢 ACTIVE | The model or provider is operational. |
🟡 DEGRADED | The model or provider is operational but experiencing slower response times and occasional errors. |
Data Privacy
Monolyth does not log your API input prompts by default.
We collect token usage and request counts to calculate your credit usage and to provide aggregated, anonymous statistics on trending AI models.
Some AI model providers might log your input prompts. We'll tag these on model pages with a Privacy Policy indicator to keep you informed, though it's not an exhaustive guide to all third-party data practices.
You can always change data handling preferences in your settings. Turning off logging here also stops external model providers from saving your prompts, which they might otherwise use for things like improving their models.
Support
Join our Discord to discuss potential new features, ask questions, and get support.