Getting Started with TidyVec

Introduction

TidyVec is a lightweight R package that brings vector embeddings and similarity search to the tidyverse ecosystem. It lets you store embeddings alongside your data in tibbles and run vector search without giving up your familiar dplyr verbs.

This vignette will show you how to:

Create vector collections
Generate embeddings for text and images
Find similar items using vector search
Visualize embedding spaces
Combine vector search with tidyverse operations

Installation

You can install TidyVec from GitHub:

# install.packages("remotes")
remotes::install_github("flmnh-ai/tidyvec")

Load the required packages:

library(dplyr)
library(ggplot2)
library(tidyvec)

Creating a Vector Collection

At its core, TidyVec treats vector collections as enhanced tibbles. Any tibble can be converted to a vector collection using the vec() function:

# Create a simple dataset
books <- tibble(
  id = c("book1", "book2", "book3", "book4", "book5"),
  title = c(
    "The Art of Data Science",
    "Advanced R Programming",
    "Tidy Data Visualization",
    "Statistical Learning Methods",
    "Machine Learning with R"
  ),
  author = c("Smith", "Jones", "Brown", "Davis", "Wilson"),
  year = c(2018, 2020, 2019, 2021, 2022),
  description = c(
    "A comprehensive guide to data analysis using modern techniques",
    "Deep dive into R programming for advanced users",
    "Creating beautiful visualizations with ggplot2 and the tidyverse",
    "Introduction to statistical learning methods and their applications",
    "Practical machine learning approaches with R examples"
  )
)

# Convert to a vector collection
books_vec <- vec(books)
books_vec
#> Tidyvec collection with 5 items
#> Embedding column: embedding 
#> Has embedding function: No 
#> # A tibble: 5 × 6
#>   id    title                        author  year description          embedding
#>   <chr> <chr>                        <chr>  <dbl> <chr>                <list>   
#> 1 book1 The Art of Data Science      Smith   2018 A comprehensive gui… <NULL>   
#> 2 book2 Advanced R Programming       Jones   2020 Deep dive into R pr… <NULL>   
#> 3 book3 Tidy Data Visualization      Brown   2019 Creating beautiful … <NULL>   
#> 4 book4 Statistical Learning Methods Davis   2021 Introduction to sta… <NULL>   
#> 5 book5 Machine Learning with R      Wilson  2022 Practical machine l… <NULL>

By default, a column named embedding will be created to store vector embeddings.
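
To make the storage model concrete, the sketch below builds a plain tibble with a list-column of numeric vectors, which is the same shape as the embedding column printed above. It uses only dplyr (which re-exports tibble()); the vectors are made-up values, and nothing here should be read as a description of how vec() works internally.

```r
library(dplyr)

# Hypothetical 3-dimensional vectors, invented for illustration only
toy <- tibble(
  id = c("a", "b", "c"),
  embedding = list(c(1, 0, 0), c(0, 1, 0), c(1, 1, 0))
)

# List-columns survive ordinary dplyr verbs
toy_filtered <- toy %>%
  filter(id != "b") %>%
  mutate(dim = lengths(embedding))

toy_filtered
```

Because each cell holds a whole vector, row-wise verbs such as filter() and arrange() keep every embedding attached to its record.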

Generating Embeddings

For this example, we’ll use a simple TF-IDF embedder for our text data:

# Create a TF-IDF embedder from our descriptions
embedder <- embedder_tfidf(books$description)

# Recreate the collection, this time with an embedding function attached
books_vec <- vec(books, embedding_fn = embedder)

# Generate embeddings for the description column
books_vec <- embed(books_vec, content_column = "description")

# Look at the result
books_vec
#> Tidyvec collection with 5 items
#> Embedding column: embedding 
#> Has embedding function: Yes 
#> Embedding dimension: 5 
#> # A tibble: 5 × 6
#>   id    title                        author  year description          embedding
#>   <chr> <chr>                        <chr>  <dbl> <chr>                <list>   
#> 1 book1 The Art of Data Science      Smith   2018 A comprehensive gui… <dbl [5]>
#> 2 book2 Advanced R Programming       Jones   2020 Deep dive into R pr… <dbl [5]>
#> 3 book3 Tidy Data Visualization      Brown   2019 Creating beautiful … <dbl [5]>
#> 4 book4 Statistical Learning Methods Davis   2021 Introduction to sta… <dbl [5]>
#> 5 book5 Machine Learning with R      Wilson  2022 Practical machine l… <dbl [5]>

Each book now has an embedding vector derived from its description.
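
For readers unfamiliar with TF-IDF, here is a minimal base R sketch of the general technique. It is not the implementation behind embedder_tfidf(), whose tokenization and weighting are not described in this vignette; the three documents are invented, and the whitespace tokenizer and unsmoothed log(N / df) weighting are assumptions chosen for brevity.

```r
# Minimal TF-IDF sketch in base R (illustrative, not tidyvec's implementation)
docs <- c(
  "r programming for data",
  "data visualization with r",
  "statistical learning methods"
)

# Tokenize on whitespace after lower-casing
tokens <- strsplit(tolower(docs), "\\s+")
vocab <- sort(unique(unlist(tokens)))

# Term frequency: one row per document, one column per vocabulary term
tf <- t(vapply(tokens, function(tok) {
  as.numeric(table(factor(tok, levels = vocab)))
}, numeric(length(vocab))))
colnames(tf) <- vocab

# Inverse document frequency: terms found in fewer documents get larger weights
idf <- log(length(docs) / colSums(tf > 0))

# Scale each term-frequency column by its IDF weight
tfidf <- sweep(tf, 2, idf, `*`)
round(tfidf, 3)
```

Each row of tfidf is one document's embedding. Terms shared by several documents, such as "data", receive lower weights than terms unique to one document.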

Finding Similar Items

Now we can find books similar to a text query:

# Find books similar to a query
query_results <- books_vec %>%
  nearest("machine learning and statistics", n = 3)

# View the results with similarity scores
query_results %>%
  select(title, author, similarity)
#> Tidyvec collection with 3 items
#> Embedding column: embedding 
#> Has embedding function: Yes 
#> # A tibble: 3 × 3
#>   title                        author similarity
#>   <chr>                        <chr>       <dbl>
#> 1 Statistical Learning Methods Davis       0.816
#> 2 Tidy Data Visualization      Brown       0.5  
#> 3 Machine Learning with R      Wilson      0.408

We can also use the similarity operator %~% for a more concise syntax:

# Using the similarity operator
"programming in R" %~% books_vec %>%
  select(title, author, similarity)
#> Tidyvec collection with 5 items
#> Embedding column: embedding 
#> Has embedding function: Yes 
#> # A tibble: 5 × 3
#>   title                        author similarity
#>   <chr>                        <chr>       <dbl>
#> 1 Advanced R Programming       Jones       1    
#> 2 Machine Learning with R      Wilson      0.577
#> 3 The Art of Data Science      Smith       0    
#> 4 Tidy Data Visualization      Brown       0    
#> 5 Statistical Learning Methods Davis       0
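
The vignette does not state which metric produces the similarity column. A common choice for embeddings is cosine similarity, and the sketch below shows how such a score and a top-n ranking could be computed in base R. The cosine_sim() helper and the toy vectors are hypothetical, written only for this illustration.

```r
# Cosine similarity between two numeric vectors
cosine_sim <- function(a, b) {
  denom <- sqrt(sum(a^2)) * sqrt(sum(b^2))
  if (denom == 0) return(0)  # convention here: a zero vector matches nothing
  sum(a * b) / denom
}

# Toy embeddings, invented for illustration
items <- list(
  first  = c(1, 0, 1),
  second = c(0, 1, 0),
  third  = c(1, 1, 1)
)
query <- c(1, 0, 1)

# Score every item against the query, then keep the two best matches
scores <- vapply(items, cosine_sim, numeric(1), b = query)
top2 <- head(sort(scores, decreasing = TRUE), 2)
top2
```

A score of 1 means the vectors point in the same direction, and 0 means they share no non-zero dimensions, which matches the pattern of exact 1 and 0 values in the output above.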

Combining with Tidyverse Operations

What makes TidyVec special is how seamlessly it integrates with tidyverse workflows. You can use standard dplyr operations before or after vector search:

# Filter to recent books, then find similar ones
books_vec %>%
  filter(year >= 2020) %>%
  nearest("R methods", n = 2) %>%
  select(title, year, similarity)
#> Tidyvec collection with 2 items
#> Embedding column: embedding 
#> Has embedding function: Yes 
#> # A tibble: 2 × 3
#>   title                    year similarity
#>   <chr>                   <dbl>      <dbl>
#> 1 Advanced R Programming   2020      1    
#> 2 Machine Learning with R  2022      0.577

You can also perform vector search first, then filter the results:

# Find similar books, then filter by similarity threshold
books_vec %>%
  nearest("R methods", n = 5) %>%
  filter(similarity > 0.2) %>%
  select(title, similarity)
#> Tidyvec collection with 2 items
#> Embedding column: embedding 
#> Has embedding function: Yes 
#> # A tibble: 2 × 2
#>   title                   similarity
#>   <chr>                        <dbl>
#> 1 Advanced R Programming       1    
#> 2 Machine Learning with R      0.577
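
The order of filtering and searching matters. The sketch below uses plain dplyr on a tibble with a precomputed, invented similarity column, with slice_max() standing in for the top-n step, to show how the two orderings can return different rows. It does not call nearest() and makes no claim about its internals.

```r
library(dplyr)

# Hypothetical scores, invented for illustration
scored <- tibble(
  title = c("A", "B", "C", "D"),
  year = c(2018, 2020, 2021, 2022),
  similarity = c(0.9, 0.6, 0.1, 0.4)
)

# Filter first, then take the top 2: both rows come from the recent subset
filter_then_top <- scored %>%
  filter(year >= 2020) %>%
  slice_max(similarity, n = 2)

# Take the top 2 first, then filter: fewer than 2 rows can survive
top_then_filter <- scored %>%
  slice_max(similarity, n = 2) %>%
  filter(year >= 2020)

filter_then_top$title
top_then_filter$title
```

Filtering first guarantees up to n results from the subset you care about. Filtering afterwards is better suited to quality thresholds such as similarity > 0.2.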

Working with Images

TidyVec also works with multimodal data, including images. For this, we typically use neural embedding models such as CLIP, loaded from Hugging Face.

Note: Using neural embedders requires Python and some dependencies. Run setup_python() to set up the required environment.

# Set up the Python environment
setup_python()

# Create a CLIP embedder
clip_embedder <- embedder_hf("openai/clip-vit-base-patch32", modality = "multimodal")

# Get paths to example images included with the package
img_paths <- c(
  cat = system.file("images", "cat.jpeg", package = "tidyvec"),
  dog = system.file("images", "dog.jpeg", package = "tidyvec"),
  beach = system.file("images", "beach.jpeg", package = "tidyvec"),
  mountain = system.file("images", "mountain.jpeg", package = "tidyvec"),
  city = system.file("images", "city.jpeg", package = "tidyvec")
)

# Create an image collection
images <- tibble(
  id = names(img_paths),
  path = unname(img_paths),
  category = c("pet", "pet", "nature", "nature", "urban")
) %>%
  vec(embedding_fn = clip_embedder) %>%
  embed(content_column = "path")

# Find images similar to text
"a cat playing" %~% images %>%
  select(id, path, similarity)

# Find images similar to another image
nearest(images, system.file("images", "dog-on-beach.jpeg", package = "tidyvec"), n = 2)

# Find images similar to text and visualize
"a dog on a mountain" %~% images %>%
  viz_images(path_column = "path", label_columns = c("id", "category"), n = 2)

"a dog on a beach" %~% images %>%
  viz_images(path_column = "path", label_columns = c("id", "category"), n = 2)
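
One practical caveat when building paths this way: system.file() returns an empty string, not an error, when a file is not found. The sketch below shows one way to catch that before embedding. The check_paths() helper is hypothetical and not part of tidyvec, and the example uses files from the base stats package so that it runs without tidyvec installed.

```r
# Drop paths that are empty or do not exist, warning about what was removed.
# check_paths() is a hypothetical helper written for this sketch.
check_paths <- function(paths) {
  missing <- !nzchar(paths) | !file.exists(paths)
  if (any(missing)) {
    warning("Missing files: ", paste(names(paths)[missing], collapse = ", "))
  }
  paths[!missing]
}

# Demonstrated with the stats package so the sketch runs anywhere
paths <- c(
  present = system.file("DESCRIPTION", package = "stats"),
  absent  = system.file("images", "no-such-file.jpeg", package = "stats")
)
usable <- suppressWarnings(check_paths(paths))
names(usable)
```

Running a check like this on img_paths before calling embed() surfaces a missing file early, before an empty path reaches the embedder.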

Visualizing Embedding Spaces

TidyVec provides a simple way to visualize your embedding spaces using dimensionality reduction techniques:

# Visualize our image embeddings
images %>%
  viz_embeddings(method = "tsne", labels = "id", color = "category", perplexity = 1)
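
To show what a dimensionality reduction step does, here is a base R sketch that projects embeddings to two dimensions with PCA via prcomp(). PCA is swapped in for t-SNE here because it ships with base R; the embedding values are random numbers generated for illustration, not real CLIP output, and this is not the code behind viz_embeddings().

```r
# Toy embeddings: 5 items x 4 dimensions, random values for illustration
set.seed(42)
emb <- matrix(
  rnorm(20), nrow = 5,
  dimnames = list(c("cat", "dog", "beach", "mountain", "city"), NULL)
)

# Project onto the first two principal components
pca <- prcomp(emb, center = TRUE, scale. = FALSE)
coords <- data.frame(
  id = rownames(emb),
  x = pca$x[, 1],
  y = pca$x[, 2]
)

# Base graphics scatter plot with one label per item
plot(coords$x, coords$y, xlab = "PC1", ylab = "PC2", pch = 19)
text(coords$x, coords$y, labels = coords$id, pos = 3)
```

If the embeddings live in a list-column, do.call(rbind, collection$embedding) is one way to turn them into the matrix that prcomp() expects, assuming all vectors have the same length.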

Advanced Use Cases

RAG (Retrieval-Augmented Generation)

TidyVec can serve as the retrieval step of a simple RAG system: embed the document chunks, then fetch the chunks closest to a user query.

# Split document into chunks
document_chunks <- tibble(
  id = paste0("chunk", 1:10),
  text = c(
    "R is a programming language for statistical computing.",
    "The tidyverse is a collection of R packages for data science.",
    "ggplot2 is used for data visualization in R.",
    "dplyr provides functions for data manipulation.",
    "tidyr helps to create tidy data.",
    "purrr enhances R's functional programming capabilities.",
    "readr provides functions to read rectangular data.",
    "tibble is a modern reimagining of the data frame.",
    "stringr provides functions for string manipulation.",
    "forcats provides tools for working with categorical variables."
  ),
  source = "R Documentation"
) %>%
  vec(embedding_fn = embedder_tfidf(.$text)) %>%
  embed(content_column = "text")

# User query
query_results <- document_chunks %>%
  nearest("How do I visualize data in R?", n = 3)

# Inspect the retrieved chunks; these would be passed to an LLM as context
query_results %>%
  select(text, similarity)
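
The generation step is not shown in this vignette. As a rough sketch of what could come next, the code below assembles retrieved chunks into a prompt string using base R only. The retrieved data frame stands in for the output of nearest() with invented similarity values, the prompt wording is an assumption, and the call to an actual model is deliberately left out.

```r
# Hypothetical retrieval results, standing in for the output of nearest()
retrieved <- data.frame(
  text = c(
    "ggplot2 is used for data visualization in R.",
    "The tidyverse is a collection of R packages for data science."
  ),
  similarity = c(0.7, 0.3)  # invented values
)
question <- "How do I visualize data in R?"

# Number the chunks and join them into one context block
context <- paste0(
  "[", seq_len(nrow(retrieved)), "] ", retrieved$text,
  collapse = "\n"
)

# Assemble the prompt; sending it to a model is left out on purpose
prompt <- paste0(
  "Answer the question using only the context below.\n\n",
  "Context:\n", context, "\n\n",
  "Question: ", question
)
cat(prompt)
```

Numbering the chunks makes it possible to ask the model to cite which chunk supports each part of its answer.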

Custom Embedders

You can easily create custom embedding functions:

# Create a simple embedder that counts occurrences of each vocabulary term.
# Note: gregexpr() matches substrings, so "r" also matches the r in "modern".
word_freq_embedder <- function(vocabulary = c("r", "data", "programming", "statistics", "visualization")) {
  function(text) {
    text <- tolower(text)
    vapply(vocabulary, function(word) {
      sum(gregexpr(word, text)[[1]] > 0)
    }, numeric(1))
  }
}

# Use custom embedder
simple_embedder <- word_freq_embedder()
books_vec <- books %>%
  vec(embedding_fn = simple_embedder) %>%
  embed(content_column = "description")

# Query with custom embedder
"data visualization" %~% books_vec %>%
  select(title, similarity)
#> Tidyvec collection with 5 items
#> Embedding column: embedding 
#> Has embedding function: Yes 
#> # A tibble: 5 × 2
#>   title                        similarity
#>   <chr>                             <dbl>
#> 1 The Art of Data Science           0.316
#> 2 Tidy Data Visualization           0.316
#> 3 Advanced R Programming            0    
#> 4 Statistical Learning Methods      0    
#> 5 Machine Learning with R           0
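
Because the embedder above matches substrings, the term "r" is counted inside words such as "comprehensive" and "modern", which is why two books without the standalone word "r" in their descriptions still get non-zero values on that dimension. If whole-word counts are what you want, one possible variant is sketched below. The name word_count_embedder and the tokenization rule are assumptions made for this example, and the function is shown standalone, not wired into vec().

```r
# Variant that counts whole-word matches only.
# word_count_embedder is a hypothetical name used for this sketch.
word_count_embedder <- function(vocabulary = c("r", "data", "programming")) {
  function(text) {
    # Split one lower-cased string into alphanumeric tokens
    words <- strsplit(tolower(text), "[^a-z0-9]+")[[1]]
    # Count exact token matches for each vocabulary term
    vapply(vocabulary, function(word) sum(words == word), numeric(1))
  }
}

embed_words <- word_count_embedder()
counts <- embed_words("Advanced R programming: data, data, and more data")
counts
```

Like the original, this version handles one string at a time. Swapping it in would change the similarity scores shown above, since the "r" dimension would no longer pick up letters inside other words.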

Conclusion

TidyVec provides a lightweight, tidyverse-friendly way to work with vector embeddings. By treating embeddings as just another column in your tibbles, you get all the power of vector search while maintaining the flexibility and familiarity of the tidyverse.

Key benefits:

Seamless integration with dplyr, ggplot2, and other tidyverse packages
Support for multiple modalities (text, images)
Flexible embedding generation
Simple, intuitive API
Visualization capabilities

For larger datasets or more advanced use cases, consider a dedicated similarity search library such as FAISS or a specialized vector database. For many common applications, though, TidyVec provides an elegant and easy-to-use solution.

Nick Gauthier

2025-02-28