Get Text Embeddings from OpenAI

Retrieves text embeddings from OpenAI's text-embedding-3-small or text-embedding-3-large models. Supports optional dimension truncation and batch processing via the OpenAI Batch API.

Usage

embed(
  texts,
  org = "openai",
  size = c("small", "large"),
  dimensions = NULL,
  batch = FALSE,
  timeout = 60,
  ...
)

Arguments

texts: Character vector. The text(s) to embed. Required.
org: Character string. The LLM provider. Currently must be "openai". Argument exists for future expansion.
size: Character string. The embedding model size. Allowed: "small" (default), "large". Maps to "text-embedding-3-small" and "text-embedding-3-large".
dimensions: Integer or NULL. The desired number of dimensions for the output embeddings. If NULL (default), the model's full dimensions are used (1536 for "small", 3072 for "large"). If set, must be a positive integer; OpenAI may impose further constraints, such as not exceeding the model's full size.
batch: Logical. Use batch processing via the OpenAI Batch API? Default FALSE.
timeout: Numeric. Request timeout in seconds. Applies to synchronous API calls and to the batch initiation steps. Default is 60.
...: Currently unused. For future expansion.
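
To make "dimension truncation" concrete, the sketch below shortens a vector by hand: it keeps the leading components and rescales the result to unit length, as OpenAI's guidance recommends when shortening an embedding manually. The helper truncate_embedding() is hypothetical and not part of this package, and whether the API performs exactly these steps internally when dimensions is set is an assumption. Passing dimensions to embed() remains the simpler route.

```r
# Hypothetical helper, for illustration only: shorten an embedding by hand.
truncate_embedding <- function(embedding, dimensions) {
  stopifnot(
    is.numeric(embedding),
    dimensions >= 1,
    dimensions <= length(embedding)
  )
  # Keep only the leading components ...
  shortened <- embedding[seq_len(dimensions)]
  # ... then rescale to unit length so cosine and dot-product comparisons
  # behave as they do for full-size embeddings.
  shortened / sqrt(sum(shortened^2))
}

# Toy 4-dimensional vector standing in for a real embedding.
full <- c(3, 4, 12, 5)
short <- truncate_embedding(full, 2)
print(short)
```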

Value

If batch = FALSE: A list where each element is a numeric vector representing the embedding for the corresponding input text. Returns NULL on API error.
If batch = TRUE: The OpenAI batch job ID (character string). Returns NULL if batch initiation fails.
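
A minimal sketch of working with the non-batch return value, assuming it has the list-of-numeric-vectors shape described above. The toy 3-dimensional vectors stand in for a real result so the snippet runs without an API key; the variable names are illustrative.

```r
# Stand-in for the list returned by embed(texts, batch = FALSE).
# Real embeddings are much longer than these toy vectors.
embeddings_list <- list(
  c(1, 0, 0),
  c(0, 1, 0),
  c(1, 1, 0)
)

# embed() returns NULL on API error, so guard before using the result.
if (!is.null(embeddings_list)) {
  # Stack the per-text vectors into a matrix: one row per input text.
  embedding_matrix <- do.call(rbind, embeddings_list)

  # Pairwise cosine similarity between all rows.
  row_norms <- sqrt(rowSums(embedding_matrix^2))
  cosine_sim <- tcrossprod(embedding_matrix) / (row_norms %o% row_norms)
  print(round(cosine_sim, 3))
}
```

Row i of embedding_matrix corresponds to texts[i], so cosine_sim[i, j] compares the i-th and j-th input texts.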

Details

For non-batch requests (batch = FALSE), the function sends the texts to the OpenAI embeddings endpoint. The endpoint accepts multiple texts in a single request, subject to OpenAI's API limits. If more than 50 texts are supplied with batch = FALSE, a message suggests using batch mode for potential cost savings.
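
The sketch below shows roughly what a non-batch request body could look like, using the model, input, and dimensions fields of the OpenAI embeddings endpoint (POST /v1/embeddings). The helper build_embedding_payload() is hypothetical; how embed() assembles its request internally, and the exact wording of its message, are assumptions.

```r
# Hypothetical helper, for illustration only: not part of this package.
build_embedding_payload <- function(texts,
                                    size = c("small", "large"),
                                    dimensions = NULL) {
  size <- match.arg(size)
  stopifnot(is.character(texts), length(texts) > 0)

  payload <- list(
    model = paste0("text-embedding-3-", size),
    input = as.list(texts)  # several texts travel in one request
  )

  # Only include `dimensions` when the caller asked for truncation.
  if (!is.null(dimensions)) {
    stopifnot(
      is.numeric(dimensions),
      length(dimensions) == 1,
      dimensions > 0,
      dimensions == as.integer(dimensions)
    )
    payload$dimensions <- as.integer(dimensions)
  }

  # Mirrors the documented hint for large synchronous requests.
  if (length(texts) > 50) {
    message("More than 50 texts supplied; batch = TRUE may reduce cost.")
  }

  payload
}

payload <- build_embedding_payload(
  c("The quick brown fox.", "R is a language."),
  size = "large",
  dimensions = 256
)
str(payload)
```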

For batch requests (batch = TRUE), the function prepares and uploads an input file to OpenAI, initiates a batch job targeting the embeddings endpoint, and returns the batch job ID. Use check_batch() to monitor the job and workspace_batch() to retrieve the results later. For completed embedding batch jobs, workspace_batch() returns a list of numeric vectors.
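
For orientation, the OpenAI Batch API expects its input file in JSON Lines format, one request object per line, each with custom_id, method, url, and body fields. The sketch below builds such request objects for the embeddings endpoint. The helper build_batch_requests(), the "request-N" naming of custom_id, and the choice of one text per line are assumptions for illustration, not a description of what embed() actually uploads.

```r
# Hypothetical helper, for illustration only: not part of this package.
build_batch_requests <- function(texts, model = "text-embedding-3-small") {
  lapply(seq_along(texts), function(i) {
    list(
      custom_id = paste0("request-", i),  # assumed naming scheme
      method = "POST",
      url = "/v1/embeddings",
      body = list(model = model, input = texts[[i]])
    )
  })
}

requests <- build_batch_requests(
  c("Embed this first.", "Embed this second.")
)

# Serialize to JSON Lines only if jsonlite happens to be installed.
if (requireNamespace("jsonlite", quietly = TRUE)) {
  jsonl <- vapply(
    requests,
    function(r) as.character(jsonlite::toJSON(r, auto_unbox = TRUE)),
    character(1)
  )
  cat(jsonl, sep = "\n")
}
```

The custom_id is what lets results be matched back to inputs, since a batch output file is not guaranteed to preserve input order.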

Currently, only org = "openai" is supported.

Examples

if (FALSE) { # \dontrun{
# Ensure API key is set
# Sys.setenv(OPENAI_API_KEY = "YOUR_OPENAI_KEY")

my_texts <- c("The quick brown fox jumps over the lazy dog.",
              "R is a language for statistical computing.")

# --- Synchronous (Non-Batch) Example ---
embeddings_list <- embed(texts = my_texts, size = "small")
if (!is.null(embeddings_list)) {
  print(paste("Number of embeddings received:", length(embeddings_list)))
  print(paste("Dimensions of first embedding:", length(embeddings_list[[1]])))
}

# Example with dimension truncation
embeddings_short <- embed(texts = my_texts, size = "small", dimensions = 256)
if (!is.null(embeddings_short)) {
  print(paste("Dimensions of truncated embedding:", length(embeddings_short[[1]])))
}

# Example triggering the batch-mode suggestion (more than 50 texts)
long_texts <- rep("Test text.", 60)
embeddings_long_msg <- embed(texts = long_texts) # Shows a message, not a warning

# --- Batch Example ---
batch_texts <- c("Embed this first.", "Embed this second.", "And this third.")
embedding_batch_id <- embed(texts = batch_texts, batch = TRUE)
if (!is.null(embedding_batch_id)) {
  print(paste("Embedding batch job created with ID:", embedding_batch_id))
  # Use check_batch(embedding_batch_id) and
  # workspace_batch(embedding_batch_id) later...
}
} # }