
Retrieves text embeddings from OpenAI's embedding models (text-embedding-3-small or text-embedding-3-large). Supports optional dimension truncation and batch processing via the OpenAI Batch API.

Usage

embed(
  texts,
  org = "openai",
  size = c("small", "large"),
  dimensions = NULL,
  batch = FALSE,
  timeout = 60,
  ...
)

Arguments

texts

Character vector. The text(s) to embed. Required.

org

Character string. The LLM provider. Currently must be "openai". Argument exists for future expansion.

size

Character string. The embedding model size. Allowed: "small" (default), "large". Maps to "text-embedding-3-small" and "text-embedding-3-large".

dimensions

Integer or NULL. The desired number of dimensions for the output embeddings. If NULL (default), the model's full dimensions are used (1536 for "small", 3072 for "large"). If set, must be a positive integer; for the text-embedding-3 models it cannot exceed the model's full size.

batch

Logical. Use batch processing via the OpenAI Batch API? Default FALSE.

timeout

Numeric. Request timeout in seconds. Applies to synchronous API calls or the batch initiation steps. Default is 60.

...

Currently unused. For future expansion.

Value

  • If batch = FALSE: A list where each element is a numeric vector representing the embedding for the corresponding input text. Returns NULL on API error.

  • If batch = TRUE: The OpenAI batch job ID (character string). Returns NULL if batch initiation fails.
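The list returned for batch = FALSE is often easier to work with as a matrix. The following is a minimal sketch of that conversion followed by a pairwise cosine similarity; the embedding values are made-up stand-ins rather than real model output, and the names emb_matrix, unit_rows, and cosine_sim are illustrative only.

```r
# Stand-in for the list returned by embed(); these values are invented
# for illustration and are not real model output.
embeddings_list <- list(
  c(1, 0, 0),
  c(0.6, 0.8, 0),
  c(0, 0, 1)
)

# Guard against the NULL that embed() returns on API error.
if (!is.null(embeddings_list)) {
  # One row per input text, one column per embedding dimension.
  emb_matrix <- do.call(rbind, embeddings_list)

  # Scale each row to unit length, then take all pairwise dot products.
  row_norms <- sqrt(rowSums(emb_matrix^2))
  unit_rows <- emb_matrix / row_norms
  cosine_sim <- unit_rows %*% t(unit_rows)

  print(round(cosine_sim, 3))
}
```

Row i, column j of cosine_sim holds the similarity between input texts i and j.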

Details

For non-batch requests (batch = FALSE), the function sends the texts to the OpenAI embeddings endpoint. The standard endpoint accepts multiple texts in a single request (up to API limits). If more than 50 texts are supplied with batch = FALSE, the function emits a message suggesting batch mode for potential cost savings.
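To make the single-request behaviour concrete, here is a rough sketch of what a JSON body carrying several texts could look like. It assumes the jsonlite package is installed; the fields model, input, and dimensions are those of the OpenAI embeddings endpoint, but how embed() builds its request internally is not documented here, so treat this as an illustration rather than the package's actual code.

```r
library(jsonlite)

texts <- c("The quick brown fox jumps over the lazy dog.",
           "R is a language for statistical computing.")

# Hypothetical reconstruction of a synchronous request body.
# as.list() keeps `input` a JSON array even when only one text is given.
request_body <- list(
  model = "text-embedding-3-small",
  input = as.list(texts),
  dimensions = 256
)

body_json <- toJSON(request_body, auto_unbox = TRUE)
cat(body_json, "\n")

# The documented threshold above which batch mode is suggested.
if (length(texts) > 50) {
  message("More than 50 texts supplied; batch = TRUE may reduce cost.")
}
```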

For batch requests (batch = TRUE), the function prepares and uploads an input file to OpenAI and initiates a batch job targeting the embeddings endpoint. It returns the batch job ID. Use check_batch() to monitor the job and workspace_batch() to retrieve the results later. For completed embedding batch jobs, workspace_batch() returns a list of numeric vectors.
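The OpenAI Batch API expects its input file in JSON Lines format, with one request object per line. A minimal sketch of building such a file for the embeddings endpoint follows. It assumes the jsonlite package is installed; the custom_id scheme ("text-1", "text-2", ...) is a hypothetical choice for this example, and the file that embed() actually writes may differ.

```r
library(jsonlite)

batch_texts <- c("Embed this first.", "Embed this second.", "And this third.")

# Build one request object per text and serialise each to a single JSON line.
jsonl_lines <- vapply(seq_along(batch_texts), function(i) {
  request <- list(
    custom_id = paste0("text-", i),  # hypothetical ID scheme
    method    = "POST",
    url       = "/v1/embeddings",
    body      = list(
      model = "text-embedding-3-small",
      input = batch_texts[i]
    )
  )
  as.character(toJSON(request, auto_unbox = TRUE))
}, character(1))

# Write the lines to a temporary .jsonl file, ready for upload.
input_file <- tempfile(fileext = ".jsonl")
writeLines(jsonl_lines, input_file)
```

A custom_id on each line makes it possible to match results back to inputs, since the Batch API does not guarantee that output order matches input order.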

Currently, only org = "openai" is supported.

Examples

if (FALSE) { # \dontrun{
# Ensure API key is set
# Sys.setenv(OPENAI_API_KEY = "YOUR_OPENAI_KEY")

my_texts <- c("The quick brown fox jumps over the lazy dog.",
              "R is a language for statistical computing.")

# --- Synchronous (Non-Batch) Example ---
embeddings_list <- embed(texts = my_texts, size = "small")
if (!is.null(embeddings_list)) {
  print(paste("Number of embeddings received:", length(embeddings_list)))
  print(paste("Dimensions of first embedding:", length(embeddings_list[[1]])))
}

# Example with dimension truncation
embeddings_short <- embed(texts = my_texts, size = "small", dimensions = 256)
if (!is.null(embeddings_short)) {
  print(paste("Dimensions of truncated embedding:", length(embeddings_short[[1]])))
}

# Example triggering the batch-mode suggestion message (more than 50 texts)
long_texts <- rep("Test text.", 60)
embeddings_long_warn <- embed(texts = long_texts) # Will show message

# --- Batch Example ---
batch_texts <- c("Embed this first.", "Embed this second.", "And this third.")
embedding_batch_id <- embed(texts = batch_texts, batch = TRUE)
if (!is.null(embedding_batch_id)) {
  print(paste("Embedding batch job created with ID:", embedding_batch_id))
  # Use check_batch(embedding_batch_id) and
  # workspace_batch(embedding_batch_id) later...
}
} # }