Overview

tidyvec is a lightweight vector database for the tidyverse ecosystem. It enables you to:

  • Store and query vector embeddings alongside your data in tibbles
  • Generate embeddings for text and images
  • Find similar items using vector similarity search
  • Visualize embedding spaces
  • Seamlessly integrate with dplyr, ggplot2, and other tidyverse packages

Why tidyvec?

While specialized vector search tools like FAISS and Pinecone offer high performance for large-scale applications, they often require leaving the familiar tidyverse workflow. tidyvec bridges this gap by:

  • Keeping it tidy: Store embeddings right in your tibbles
  • Familiar syntax: Use standard dplyr verbs before and after vector operations
  • Low friction: No need to switch contexts between data wrangling and similarity search
  • Multimodal support: Work with text, images, or any data type that can be embedded

Installation

You can install tidyvec from GitHub:

# install.packages("remotes")
remotes::install_github("flmnh-ai/tidyvec")

For neural embedding models via HuggingFace, you’ll also need Python with the required dependencies:

# Set up Python environment with required packages
tidyvec::setup_python()

Basic Usage

library(tidyverse)
library(tidyvec)

# Create a collection of books
books <- tibble(
  title = c(
    "The Art of Data Science",
    "Advanced R Programming",
    "Tidy Data Visualization",
    "Statistical Learning Methods",
    "Machine Learning with R"
  ),
  description = c(
    "A comprehensive guide to data analysis using modern techniques",
    "Deep dive into R programming for advanced users",
    "Creating beautiful visualizations with ggplot2 and the tidyverse",
    "Introduction to statistical learning methods and their applications",
    "Practical machine learning approaches with R examples"
  )
)

# Create a TF-IDF embedder and embed the descriptions
embedder <- embedder_tfidf(books$description)
books_vec <- books %>%
  vec(embedding_fn = embedder) %>%
  embed(content_column = "description")

# Find similar books using the `%~%` operator
"data visualization techniques" %~% books_vec %>%
  select(title, similarity)
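
To make the TF-IDF step less opaque, here is a minimal base R sketch of the general technique: count terms per document, weight them by inverse document frequency, and score a query by cosine similarity. This is not tidyvec's implementation. `tokenize()`, `tfidf_fit()`, `tfidf_embed()`, and `cosine_sim()` are hypothetical helper names, and the smoothed IDF formula is just one common variant.

```r
# Hypothetical helpers illustrating TF-IDF; none of these are tidyvec functions.
tokenize <- function(x) {
  tokens <- strsplit(tolower(x), "[^a-z0-9]+")
  lapply(tokens, function(t) t[nzchar(t)])
}

tfidf_fit <- function(docs) {
  tokens <- tokenize(docs)
  vocab <- sort(unique(unlist(tokens)))
  # Document frequency: number of documents that contain each term
  df <- vapply(vocab, function(term) {
    as.numeric(sum(vapply(tokens, function(t) term %in% t, logical(1))))
  }, numeric(1))
  # Smoothed inverse document frequency (one common variant)
  idf <- log((1 + length(docs)) / (1 + df)) + 1
  list(vocab = vocab, idf = unname(idf))
}

tfidf_embed <- function(model, texts) {
  tokens <- tokenize(texts)
  # One row per text, one column per vocabulary term
  mat <- t(vapply(tokens, function(t) {
    tf <- as.numeric(table(factor(t, levels = model$vocab)))
    tf * model$idf
  }, numeric(length(model$vocab))))
  colnames(mat) <- model$vocab
  mat
}

cosine_sim <- function(query_vec, mat) {
  as.numeric(mat %*% query_vec) /
    (sqrt(rowSums(mat^2)) * sqrt(sum(query_vec^2)))
}

docs <- c(
  "Creating beautiful visualizations with ggplot2 and the tidyverse",
  "Deep dive into R programming for advanced users"
)
model <- tfidf_fit(docs)
doc_vecs <- tfidf_embed(model, docs)
query_vec <- tfidf_embed(model, "beautiful visualizations")[1, ]
scores <- cosine_sim(query_vec, doc_vecs)
scores  # 0.5 for the first document, 0 for the second
```

Because this sketch does no stemming, a query term only matches a document that contains exactly the same token. A query with no terms in the vocabulary has zero norm and yields NaN scores.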

Working with Images

# Create a CLIP embedder for images
clip_embedder <- embedder_hf("openai/clip-vit-base-patch32", modality = "multimodal")

# Get paths to example images included with the package
img_paths <- c(
  cat = system.file("extdata/images", "cat.jpg", package = "tidyvec"),
  dog = system.file("extdata/images", "dog.jpg", package = "tidyvec"),
  beach = system.file("extdata/images", "beach.jpg", package = "tidyvec"),
  mountain = system.file("extdata/images", "mountain.jpg", package = "tidyvec"),
  city = system.file("extdata/images", "city.jpg", package = "tidyvec")
)

# Create an image collection
images <- tibble(
  id = names(img_paths),
  path = unname(img_paths),
  category = c("pet", "pet", "nature", "nature", "urban")
) %>%
  vec(embedding_fn = clip_embedder) %>%
  embed(content_column = "path")

# Find images similar to text
"a cat playing" %~% images %>%
  select(id, path, similarity)

# Find similar images and visualize them
"a dog on a beach" %~% images %>%
  viz_images(path_column = "path", label_columns = c("id", "category"))

Key Features

Vector Collections

The vec() function transforms a tibble into a vector collection:

# Create a basic collection
my_collection <- tibble(text = c("sample text", "another example")) %>%
  vec()

# With a custom embedding function
my_collection <- tibble(text = c("sample text", "another example")) %>%
  vec(embedding_fn = my_custom_embedder)
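
`my_custom_embedder` is not defined in the example above. As a rough sketch, assuming an embedding function takes a character vector and returns a numeric matrix with one row per input (an assumption, not something this README confirms), a toy embedder could look like this:

```r
# Toy embedder: L2-normalised letter counts (a-z), one row per input text.
# The input/output contract shown here is an assumption, not tidyvec's
# documented interface.
my_custom_embedder <- function(texts) {
  mat <- t(vapply(texts, function(txt) {
    chars <- strsplit(tolower(txt), "")[[1]]
    counts <- as.numeric(table(factor(chars, levels = letters)))
    norm <- sqrt(sum(counts^2))
    if (norm > 0) counts / norm else counts
  }, numeric(26), USE.NAMES = FALSE))
  colnames(mat) <- letters
  mat
}

emb <- my_custom_embedder(c("sample text", "another example"))
dim(emb)  # 2 rows (inputs) by 26 columns (letters)
```

Letter counts carry very little meaning. The point is only the shape of the function: text in, fixed-width numeric vectors out.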

Embedding Generation

Generate embeddings using built-in or custom embedding functions:

# TF-IDF embeddings
documents <- tibble(text = c("sample text", "another example")) %>%
  vec(embedding_fn = embedder_tfidf(.$text)) %>%
  embed(content_column = "text")

# HuggingFace neural embeddings
comments <- tibble(text = c("I love this product", "Terrible experience")) %>%
  vec(embedding_fn = embedder_hf("sentence-transformers/all-MiniLM-L6-v2")) %>%
  embed(content_column = "text")

Similarity Search

Find similar items with the nearest() function or the %~% operator:

# Find nearest neighbors
my_collection %>%
  nearest("query text", n = 5)

# Or using the similarity operator
"query text" %~% my_collection

Embedding Visualization

Visualize your embedding space:

# Assumes the collection has been embedded and has `id` and `category` columns
my_collection %>%
  viz_embeddings(method = "umap", labels = "id", color = "category")
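
Visualizing an embedding space means projecting high-dimensional vectors down to two dimensions and plotting the result. As a dependency-free stand-in for UMAP, the sketch below uses PCA via `stats::prcomp()` on made-up embeddings. It illustrates the projection idea only and does not show how `viz_embeddings()` works internally.

```r
set.seed(42)

# Made-up 8-dimensional embeddings for two illustrative categories
emb <- rbind(
  matrix(rnorm(10 * 8, mean = 0), nrow = 10),
  matrix(rnorm(10 * 8, mean = 3), nrow = 10)
)
category <- rep(c("a", "b"), each = 10)

# PCA (not UMAP): project onto the first two principal components
pca <- prcomp(emb, center = TRUE, scale. = FALSE)
coords <- data.frame(
  x = pca$x[, 1],
  y = pca$x[, 2],
  category = category
)

plot(
  coords$x, coords$y,
  col = ifelse(coords$category == "a", "steelblue", "tomato"),
  pch = 19, xlab = "PC1", ylab = "PC2"
)
```

PCA is linear and deterministic, so it is a useful sanity check before turning to nonlinear methods such as UMAP, whose layouts depend on tuning parameters and random initialization.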

Advanced Examples

Combining with Tidyverse Operations

# These examples assume books_vec also has a numeric `year` column,
# which the Basic Usage tibble does not define

# Filter first, then search
books_vec %>%
  filter(year >= 2020) %>%
  nearest("visualization techniques", n = 2)

# Search first, then filter results
books_vec %>%
  nearest("R programming", n = 10) %>%
  filter(similarity > 0.5) %>%
  arrange(desc(year))

Building a Simple RAG (Retrieval-Augmented Generation) System

# Split document into chunks
chunk_text <- c(
  "R is a programming language for statistical computing.",
  "The tidyverse is a collection of R packages for data science."
  # ... more chunks
)

# Derive chunk IDs from the number of chunks so the column lengths always match
document_chunks <- tibble(
  chunk_id = paste0("chunk", seq_along(chunk_text)),
  text = chunk_text,
  source = "R Documentation"
) %>%
  vec(embedding_fn = embedder_tfidf(.$text)) %>%
  embed(content_column = "text")

# Query relevant chunks
query_results <- document_chunks %>%
  nearest("How do I visualize data in R?", n = 3)

# Inspect the retrieved chunks; their text is the context you would pass
# to an LLM to generate an answer
query_results %>%
  select(text, similarity)
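
The generation step usually comes down to pasting the retrieved chunks into a prompt. A minimal sketch follows, with a plain data frame standing in for `query_results`. `build_prompt()` is a hypothetical helper, the similarity values are made up, and the actual LLM call is omitted because this README does not name a provider.

```r
# Hypothetical helper: combine retrieved chunks and a question into one prompt.
build_prompt <- function(question, chunks) {
  context <- paste0("[", seq_along(chunks), "] ", chunks, collapse = "\n")
  paste0(
    "Answer the question using only the context below.\n\n",
    "Context:\n", context, "\n\n",
    "Question: ", question
  )
}

# Stand-in for `query_results`; the similarity values are made up
retrieved <- data.frame(
  text = c(
    "The tidyverse is a collection of R packages for data science.",
    "R is a programming language for statistical computing."
  ),
  similarity = c(0.42, 0.31),
  stringsAsFactors = FALSE
)

prompt <- build_prompt("How do I visualize data in R?", retrieved$text)
cat(prompt)
```

Numbering the chunks lets the model cite which passage supports its answer, and lets you trace that citation back to `chunk_id` and `source`.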

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

This package is licensed under the MIT License - see the LICENSE file for details.