RAG — Retrieval Augmented Generation — sounds complicated. The idea is simple: instead of asking an AI to answer from its training data, you give it your own documents to search through first. The result is an AI that answers questions specifically about your content, not the general internet.

I built this entirely in n8n with no Python, no custom code, and no paid AI APIs. The whole stack runs on free tiers: HuggingFace for embeddings, Pinecone for vector storage, and OpenRouter for the language model. Here's exactly how it works.

What you'll need

A self-hosted n8n instance, a free Pinecone account, a free HuggingFace account, a free OpenRouter account, and Google Drive for storing your FAQ documents. Total monthly cost: €0.

What It Does

The workflow has two completely separate pipelines running in the same n8n canvas. The first pipeline handles ingestion — it watches a Google Drive folder, and whenever you add or update a file, it automatically processes the document, creates embeddings, and stores them in Pinecone. The second pipeline handles chat — it takes a user question, searches Pinecone for relevant content, and uses an AI model to generate a natural language answer based on what it found.

In practice: you drop a FAQ text file into a Google Drive folder. Within seconds it's indexed. Now anyone can chat with it and get accurate answers — without ever seeing the source document.

The complete workflow — ingestion pipeline (left) and chat/retrieval pipeline (right)

Pipeline 1 — Document Ingestion

This is the left side of the workflow. It runs automatically whenever a file is created in the connected Google Drive folder.

Step 1 — Google Drive Trigger

The trigger fires on fileCreated events. In n8n, add a Google Drive Trigger node, authenticate with your Google account, and select the folder you want to watch. Any file dropped into this folder starts the ingestion pipeline.

Step 2 — Download File

The trigger gives you file metadata but not the content. A Download File node fetches the actual text. Set Operation to Download and pass the file ID from the trigger output. For text files and FAQs this works perfectly out of the box.

Step 3 — Recursive Character Text Splitter

Raw text can't go straight into an embedding model — it needs to be broken into chunks first. The Recursive Character Text Splitter node handles this. I used a chunk size of 500 characters with an overlap of 50 characters. The overlap ensures that sentences cut at a boundary still have context on both sides, which improves retrieval accuracy.

⚠ Chunk size affects answer quality

Too large and each chunk contains too many topics — retrieval becomes imprecise. Too small and you lose context. 400–600 characters works well for FAQ-style content. For longer technical docs, go up to 1000.
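To make the chunking concrete, here's a minimal sliding-window sketch in Python. It's not n8n's actual implementation (the Recursive Character Text Splitter prefers to break on paragraph and sentence boundaries), but the size/overlap arithmetic is the same:

```python
def split_text(text, chunk_size=500, overlap=50):
    """Naive sliding-window splitter: fixed-size chunks that overlap
    by `overlap` characters. n8n's Recursive Character Text Splitter
    is smarter about where it cuts, but the windowing idea matches."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

chunks = split_text("0123456789" * 120)  # 1,200 characters of input
print([len(c) for c in chunks])  # [500, 500, 300]
```

Note how the last 50 characters of each chunk reappear at the start of the next one — that's the overlap preserving context across boundaries.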

Step 4 — HuggingFace Embeddings

Each text chunk gets converted into a vector — a list of numbers that represents its meaning. I used sentence-transformers/all-MiniLM-L6-v2, which produces 384-dimensional vectors. It's fast, free, and accurate enough for FAQ retrieval. Add your HuggingFace API key (free to get at huggingface.co) in the credentials.
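To make "a vector that represents meaning" concrete: similarity between two embeddings is usually measured as the cosine of the angle between them, which is also the metric the Pinecone index is configured with. A toy sketch, with 4-dimensional vectors standing in for real 384-dimensional model output:

```python
import math

def cosine_similarity(a, b):
    """~1.0 for vectors pointing the same way (similar meaning),
    ~0.0 for unrelated ones."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Made-up toy vectors; a real model emits 384 numbers per chunk
question  = [0.9, 0.1, 0.0, 0.4]   # "how do refunds work?"
faq_chunk = [0.8, 0.2, 0.1, 0.5]   # a chunk about refunds
unrelated = [0.0, 0.9, -0.7, 0.1]  # a chunk about something else

print(cosine_similarity(question, faq_chunk) >
      cosine_similarity(question, unrelated))  # True
```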

Step 5 — Pinecone Vector Store (Insert)

The embeddings go into Pinecone. In the Pinecone Vector Store node, set the operation to Insert, connect your Pinecone API key, and select your index. My index is named n8n1 with dimension 384 — this must match the output dimension of your embedding model exactly. If they don't match, Pinecone will throw an error.

Pinecone dashboard — n8n1 (384 dims, for HuggingFace all-MiniLM-L6-v2) and n8n (512 dims, for a different model)
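The dimension mismatch is worth guarding against before data ever reaches Pinecone. A sketch of building the batch of records to insert — the record IDs and field names here are illustrative, not Pinecone's exact schema, and in this workflow the real insert happens inside the n8n node:

```python
INDEX_DIMENSION = 384  # must equal the n8n1 index's dimension exactly

def build_upsert_batch(chunks, embeddings):
    """Pair each chunk with its vector, failing fast on a dimension
    mismatch instead of waiting for Pinecone to reject the request."""
    batch = []
    for i, (chunk, vector) in enumerate(zip(chunks, embeddings)):
        if len(vector) != INDEX_DIMENSION:
            raise ValueError(
                f"embedding has {len(vector)} dims, "
                f"index expects {INDEX_DIMENSION}"
            )
        batch.append({
            "id": f"chunk-{i}",          # illustrative ID scheme
            "values": vector,
            "metadata": {"text": chunk},
        })
    return batch
```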

Pipeline 2 — Chat and Retrieval

This is the right side of the workflow. It's completely separate from ingestion and runs on demand whenever someone sends a message.

Step 1 — When Chat Message Received

Same trigger as the customer service agent — n8n's built-in chat interface. This gives you a working chat UI at your n8n URL with zero setup. Good for internal use or demos.

Step 2 — AI Agent

The AI Agent node is the coordinator. It receives the user's message, decides it needs to search for relevant content, calls the Pinecone Vector Store tool, gets the results, and generates a final answer. Two things connect to it: the OpenRouter Chat Model (the brain) and the Pinecone Vector Store (the tool it calls to search).

Step 3 — OpenRouter Chat Model

Same setup as the customer service agent — OpenRouter with a free model. For RAG specifically, the model doesn't need to be especially smart because all the hard work is done by the retrieval step. The model just needs to summarise and present what Pinecone found. A free Llama model handles this well.

Step 4 — Pinecone Vector Store (Retrieve)

This is the same Pinecone index, but configured for retrieval instead of insert. When the agent calls this tool, it converts the user's question into an embedding using the same HuggingFace model, searches the index for the most semantically similar chunks, and returns the top matches. The agent then uses those chunks to write its answer.

This is the core of RAG: the answer is grounded in your actual documents, not the model's training data. If the information isn't in Pinecone, the agent says it doesn't know — which is exactly what you want for a FAQ bot.
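The retrieval step can be sketched as a brute-force nearest-neighbour search. Pinecone does the same thing at scale with a proper index; the 2-dimensional vectors below are placeholders for real 384-dimensional embeddings of the query and chunks:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(x * x for x in b)))

def retrieve(query_vec, store, top_k=3):
    """store: list of (chunk_text, vector) pairs. Returns the top_k
    chunk texts most similar to the query, best match first."""
    ranked = sorted(store, key=lambda item: cosine(query_vec, item[1]),
                    reverse=True)
    return [text for text, _ in ranked[:top_k]]

# Toy 2-dim store standing in for the Pinecone index
store = [
    ("Refunds take 5 business days.", [1.0, 0.0]),
    ("Shipping is free over €50.",    [0.0, 1.0]),
]
print(retrieve([0.1, 0.9], store, top_k=1))  # the shipping chunk
```

The top matches are what the agent receives as context; everything it says in its answer should be traceable back to them.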

The System Prompt

The system prompt for a RAG agent is simpler than you might expect. The main thing is telling it to only answer from retrieved context and to be honest when it can't find an answer:

You are a helpful FAQ assistant. You have access to a knowledge base 
of FAQ documents via the Pinecone search tool.

When a user asks a question:
1. Search the knowledge base first
2. Answer based only on what you find there
3. If you can't find relevant information, say so clearly — 
   don't make up an answer

Keep answers concise and direct.

The instruction "don't make up an answer" is important. Without it, language models will hallucinate plausible-sounding but wrong answers when they don't find good matches. Explicit instructions to admit ignorance make the bot much more trustworthy.

Setting Up Pinecone

Creating the index takes 2 minutes. Go to pinecone.io → Create Index → give it a name → set Dimensions to 384 (for all-MiniLM-L6-v2) → Metric: Cosine → Create. The free Starter plan gives you 2GB storage and 1 million read units per month — more than enough for FAQ use.

One thing to be careful about: the dimension number is locked when you create the index. If you later switch embedding models, you need to create a new index. That's why I have two indexes in my Pinecone dashboard — one for testing with a different model (512 dims) and one for this workflow (384 dims).

What I'd Do Differently

Add metadata to chunks. Right now when Pinecone returns a result, it's just raw text. If you store the source filename as metadata alongside each chunk, the agent can tell users exactly which document the answer came from. This makes the bot far more trustworthy and easier to audit.
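A sketch of what that would look like at ingestion time — the field names (`source`, the `filename-i` ID scheme) are my own choices, not anything the workflow currently does:

```python
def chunks_with_metadata(chunks, filename):
    """Attach the source filename to every chunk record so retrieval
    results can be cited back to a document. The vector for each
    record would be filled in by the embedding step."""
    return [
        {
            "id": f"{filename}-{i}",
            "metadata": {"text": chunk, "source": filename},
        }
        for i, chunk in enumerate(chunks)
    ]

records = chunks_with_metadata(
    ["Refunds take 5 days.", "Shipping is free."], "faq.txt"
)
print(records[1]["metadata"]["source"])  # faq.txt
```

With that in place, the agent's answer can end with "Source: faq.txt" instead of an unattributed claim.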

Handle document updates properly. The current setup only triggers on fileCreated. If you update an existing FAQ file, the old chunks stay in Pinecone alongside the new ones. The proper fix is to delete all chunks associated with a file before re-ingesting it. This requires storing the file ID as metadata and adding a delete step before insert.
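An in-memory sketch of that delete-then-insert logic. A plain dict stands in for the Pinecone index here, since the exact delete API (by ID, by ID prefix, or by metadata filter) depends on your Pinecone plan and index type:

```python
index = {}  # record_id -> record; stands in for the Pinecone index

def reingest(file_id, new_records):
    """Drop every chunk that came from file_id, then insert the fresh
    ones, so an updated document leaves no stale chunks behind."""
    stale = [rid for rid, rec in index.items()
             if rec["metadata"]["file_id"] == file_id]
    for rid in stale:
        del index[rid]
    for rec in new_records:
        index[rec["id"]] = rec

# First ingestion, then the file is edited and re-ingested
reingest("drive-123", [{"id": "drive-123-0",
                        "metadata": {"file_id": "drive-123",
                                     "text": "old answer"}}])
reingest("drive-123", [{"id": "drive-123-0",
                        "metadata": {"file_id": "drive-123",
                                     "text": "new answer"}}])
print(len(index))  # 1 — no stale chunk survived the update
```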

Add a fallback response. When no relevant chunks are found, the agent sometimes gives a vague "I don't have information on that" message. A better approach is to add an IF node that detects low-confidence retrievals and routes them to a fixed response like "I couldn't find that in our FAQ — please contact support at [email]."
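The routing logic itself is a one-liner once you have the retrieval scores. A sketch, with a threshold value that is purely illustrative and would need tuning against real queries:

```python
FALLBACK = ("I couldn't find that in our FAQ. "
            "Please contact support at [email].")
MIN_SCORE = 0.5  # illustrative threshold; tune on real traffic

def route(matches):
    """matches: (score, chunk_text) pairs from the vector search.
    Returns the chunks worth answering from, or the fixed fallback
    message when nothing clears the confidence threshold."""
    good = [chunk for score, chunk in matches if score >= MIN_SCORE]
    return good if good else FALLBACK
```

In n8n this maps to an IF node checking the top match's score, with the true branch going to the agent and the false branch returning the canned message.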

Conclusion

The complete workflow is about 8 nodes. The ingestion side runs automatically in the background whenever documents change. The chat side responds in a few seconds. The whole thing costs €0 to run because every component has a free tier that's generous enough for real use.

What makes this worth building for clients: they don't need to know anything about vectors, embeddings, or RAG. They just drop files into a Google Drive folder and their chatbot stays up to date automatically. That's the value — the complexity is invisible.