So you wanna build a local RAG?

(blog.yakkomajuri.com)

390 points by pedriquepacheco 71 days ago

26 comments

simonw 71 days ago
My advice for building something like this: don't get hung up on a need for vector databases and embedding. Full text search or even grep/rg are a lot faster and cheaper to work with - no need to maintain a vector database index - and turn out to work really well if you put them in some kind of agentic tool loop. The big benefit of semantic search was that it could handle fuzzy searching - returning results that mention dogs if someone searches for canines, for example. Give a good LLM a search tool and it can come up with searches like "dog OR canine" on its own - and refine those queries over multiple rounds of searches. Plus it means you don't have to solve the chunking problem!
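To make the tool-loop idea concrete, here is a minimal sketch rather than anyone's actual implementation. The `DOCS` corpus is made up, and `propose_queries` is a hypothetical stand-in for the LLM: a real agent would generate and refine the queries itself after seeing each round's results.

```python
import re

# Made-up corpus for illustration.
DOCS = {
    "a.txt": "Our canine unit trains every Tuesday.",
    "b.txt": "The cat sat on the mat.",
    "c.txt": "Adopt a dog from the local shelter.",
}

def search(query):
    """Tiny search tool: terms joined by ' OR ', case-insensitive whole-word match."""
    terms = [t.strip().lower() for t in query.split(" OR ") if t.strip()]
    hits = []
    for name, text in DOCS.items():
        words = set(re.findall(r"[a-z0-9]+", text.lower()))
        if any(term in words for term in terms):
            hits.append(name)
    return sorted(hits)

def propose_queries(question):
    """Hypothetical stand-in for the LLM: yields progressively broader queries."""
    yield "canines"          # first attempt, too literal for this corpus
    yield "dog OR canine"    # refined after seeing no results

def agentic_search(question, min_hits=2, max_rounds=5):
    """Keep searching until enough documents are found or rounds run out."""
    query, hits = None, []
    for round_no, query in enumerate(propose_queries(question), start=1):
        if round_no > max_rounds:
            break
        hits = search(query)
        if len(hits) >= min_hits:
            break
    return query, hits

print(agentic_search("Which documents mention canines?"))
```

The stopping rule (`min_hits`) is also an assumption; in practice the model decides when it has seen enough.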
- navar 70 days ago
 I created a small app that shows the difference between embedding-based ("semantic") and bm25 search: http://search-sensei.s3-website-us-east-1.amazonaws.com/ (warning! It will download ~50MB of data for the model weights and onnx runtime on first load, but should otherwise run smoothly even on a phone) It runs a small embedding model in the browser and returns search results in "real time". It has a few illustrative examples where semantic search returns the intended results. For example bm25 does not understand that "j lo" or "jlo" refer to Jennifer Lopez. Similarly embedding based methods can better deal with things like typos. EDIT: search is performed over 1000 news articles randomly sampled from 2016 to 2024
- andai 71 days ago
 https://www.anthropic.com/engineering/contextual-retrieval Anthropic found embeddings + BM25 (keyword search) gave the best results. (Well, after contextual summarization, and fusion, and reranking, and shoving the whole thing into an LLM...) But sadly they didn't say how BM25 did on its own, which is the really interesting part to me. In my own (small scale) tests with embeddings, I found that I'd be looking right at the page that contained the literal words in my query and embeddings would fail to find it... Ctrl+F wins again!
 - bredren 70 days ago
 FWIW, the org decided against vector embeddings for Claude Code due in part to maintenance. See 41:05 here: https://youtu.be/IDSAMqip6ms
 - mips_avatar 70 days ago
 It would also blow up the price/latency of Claude Code if every chunk of every file had to be read into Haiku -> summarized -> sent to an embedding model -> reindexed into a project index, and that index stored somewhere. Since there’s a lot of context inherent in things like the file structure, storing the central context in Claude.md is a lot simpler. I don’t think them not using vector embeddings in the project space is anything other than an indication that it’s hard to manage embeddings in Claude Code.
 - andai 70 days ago
 Some agents integrate with code intelligence tools which do use embeddings, right? (As well as "mechanical" solutions like LSPs, I imagine.) I think it's just a case of "this isn't something we need to solve, other companies solve it already and then our thing can integrate with that." Or maybe it's really just marginal gains compared with iterative grepping. I don't know. (Still amazed how well that works!)
 mips_avatar 70 days ago
 I think your last point captures it, for various reasons (RL, inherent structure of code) iterative grepping is unreasonably effective. Interestingly Cursor does use embedding vectors for codebase indexing: https://cursor.com/docs/context/codebase-indexing Seems like sometimes Cursor has a better understanding of the vibe of my codebase than Claude Code, maybe this is part of it. Or maybe it’s just really marginally important in codebase indexing. Vector dbs still have a huge benefit in less verifiable domains.
 - skrebbel 70 days ago
 who's "the org"?
 - noobcoder 70 days ago
 No cross encoders?
- whakim 71 days ago
 In my experience the semantic/lexical search problem is better understood as a precision/recall tradeoff. Lexical search (along with boolean operators, exact phrase matching, etc.) has very high precision at the expense of lower recall, whereas semantic search sits at a higher recall/lower precision point on the curve.
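A small worked example of that tradeoff, as a sketch only. The document IDs and the two result lists are invented to illustrate the shape of the curve; they are not measurements from any real system.

```python
def precision_recall(retrieved, relevant):
    """Precision = share of retrieved docs that are relevant;
    recall = share of relevant docs that were retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Invented data: four documents actually answer the query.
relevant = {"d1", "d2", "d3", "d4"}
lexical = ["d1", "d2"]                            # exact matches only
semantic = ["d1", "d2", "d3", "d7", "d8", "d9"]   # broader, noisier

print("lexical ", precision_recall(lexical, relevant))   # (1.0, 0.5)
print("semantic", precision_recall(semantic, relevant))  # (0.5, 0.75)
```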
 - simonw 71 days ago
 Yeah, that sounds about right to me. The most effective approach does appear to be a hybrid of embeddings and BM25, which is worth exploring if you have the capacity to do so. For most cases though sticking with BM25 is likely to be "good enough" and a whole lot cheaper to build and run.
 - mips_avatar 70 days ago
 Depends on the app and how often you need to change your embeddings, but I run my own hybrid semantic/bm25 search on my MacBook Pro across millions of documents without too much trouble.
 - echion 58 days ago
 Can you elaborate a bit on your setup if you have time?
- cwmoore 71 days ago
 I recently came across a “prefer the most common synonym” problem in Google Maps while searching for a pool hall: even literally ‘billiards’ returned results for swimming pools and chlorine. I wonder if some more NOTs aren’t necessary… interested in learning about RAGs though I’m a little behind the curve.
- mips_avatar 71 days ago
 In my app the best lexical search approaches completely broke my agent. For my RAG system the LLM would on average take 2.1 lexical searches to get the results it needed. That wasn’t terrible, but it meant sometimes it needed up to 5 searches to find it, which blew up user latency. Now that I have a hybrid semantic search + lexical search it only requires 1.1 searches per result.
 - nostrebored 71 days ago
 The problem is not using parallel tool calling or not returning a search array. We do this across large data sets and don’t see much of a problem. It also means you can swap algorithms on the fly. Building a BM25 index over a few thousand documents is not very expensive locally. rg and grep are freeish. If you have information on folder contents you can let your agent decide at execution time based on information need. Embeddings just aren’t the most interesting thing here if you’re running a frontier FM.
 - mips_avatar 70 days ago
 Search arrays help, but parallel tool calling assumes you’ve solved two hard problems: generating diverse query variations, and verifying which result is correct. Most retrieval doesn’t have clean verification. The better approach is making search good enough that you sidestep verification as much as possible (hopefully you are only requiring the model to make a judgment call within its search array). In my case (OpenStreetMap data), lexical recall is unstable, but embeddings usually get it right if you narrow the search space enough, and a missed query is a stronger signal to the model that it’s done something wrong. Besides, if you could reliably verify results, you’ve essentially built an RL harness, which is a lot harder to do than building an effective search system and probably worth more.
- froobius 71 days ago
 Hmm it can capture more than just single words though, e.g. meaningful phrases or paragraphs that could be written in many ways.
- leetrout 71 days ago
 Simon, have you ever given a talk or written about this sort of pragmatism? A spin on how to achieve this with Datasette is an easy thing to imagine IMO.
 - simonw 71 days ago
 I did a livestream thing about building RAG against FTS search in Datasette last year: https://simonwillison.net/2024/Jun/21/search-based-rag/
- scosman 70 days ago
 Alternative advice: just test and see what works best for your use case. Totally agreed embeddings are often overkill. However, sometimes they really help. The flow is something like:
   - Iterate over your docs to build eval data: hundreds of pairs of [synthetic query, correct answer]. Focus on content from the docs not general LLM knowledge.
   - Kick off a few parallel evaluations of different RAG configurations to see what works best for your use case: BM25, Vector, Hybrid. You can do a second pass to tune parameters: embedding model, top k, re-ranking, etc.
 I built a free system that does all this (synthetic data from docs, evals, test various RAG configs without coding each version). https://docs.kiln.tech/docs/evaluations/evaluate-rag-accuracy-q-and-a-evals
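The evaluation flow above can be sketched in a few lines. This is a toy under stated assumptions, not how Kiln works: the corpus, the `SYNONYMS` table, and both retrievers are invented, and the eval pairs map a query to the expected document ID rather than to a free-text answer, so that scoring is a simple recall@k.

```python
import re

# Invented corpus: doc_id -> text.
DOCS = {
    "delete-profile": "Deleting your profile removes all data permanently.",
    "billing": "Invoices are emailed on the first day of each month.",
    "login": "Reset your password from the sign-in page.",
}

# Hypothetical synonym table standing in for a smarter retrieval config.
SYNONYMS = {"close": "deleting", "account": "profile", "receipt": "invoices"}

def tokens(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def keyword_retriever(query):
    """Rank docs by number of shared words; drop docs with no overlap."""
    q = set(tokens(query))
    scored = [(len(q & set(tokens(text))), doc_id) for doc_id, text in DOCS.items()]
    return [doc_id for score, doc_id in sorted(scored, key=lambda s: (-s[0], s[1])) if score > 0]

def expanded_retriever(query):
    """Same retriever, but query words are first mapped through SYNONYMS."""
    return keyword_retriever(" ".join(SYNONYMS.get(t, t) for t in tokens(query)))

# Eval data: (synthetic query, ID of the doc that should be retrieved).
EVAL = [
    ("close account", "delete-profile"),
    ("where is my receipt", "billing"),
    ("reset password", "login"),
]

def recall_at_k(retriever, eval_pairs, k=1):
    """Fraction of queries whose expected doc appears in the top k results."""
    hits = sum(1 for query, doc_id in eval_pairs if doc_id in retriever(query)[:k])
    return hits / len(eval_pairs)

for name, retriever in [("keyword", keyword_retriever), ("expanded", expanded_retriever)]:
    print(name, round(recall_at_k(retriever, EVAL), 2))
```

Swapping in real BM25, vector, and hybrid retrievers only requires that each one takes a query and returns a ranked list of document IDs.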
 - simonw 70 days ago
 That's excellent advice, the only downside being that collecting that eval data remains difficult and time-consuming. But if you want to build truly great search that's the approach to take.
 - scosman 70 days ago
 Agree totally. I’m spending half my time focused on that problem (mostly synthetic data gen with guidance), and the other half on how to optimize once it works.
 - sbene970 70 days ago
 At this point you could also optimize your agentic flow directly in DSPy using a ColBERT model / RAGatouille for retrieval.
 - scosman 70 days ago
 Not there yet. The biggest vectors for optimizing aren’t in the agents yet (RAG method, embedding model, etc)
- victorbuilds 70 days ago
 This matches what I found building an AI app for kids. Started with embeddings because everyone said to, then ripped it out and went with simple keyword matching. The extra complexity wasn't worth it for my use case. Most of the magic comes from the LLM anyway, not the retrieval layer.
- dmezzetti 70 days ago
 Are multiple LLM queries faster than vector search? Even with the example "dog OR canine" that leads to two LLM inference calls vs one. LLM inference is also more expensive than vector search. In general RAG != Vector Search though. If a SQL query, grep, full text search or other does the job then by all means. But for relevance-based search, vector search shines.
- 7734128 70 days ago
 No reason to try to avoid semantic search. Dead easy to implement, works across languages to some extent and the fuzziness is worth quite a lot. You're realistically going to need chunks of some kind anyway to feed the LLM, and once you've got those it's just a few lines of code to get a basic persistent ChromaDB going.
- tra3 71 days ago
 I built a simple emacs package based on this idea [0]. It works surprisingly well, but I don't know how far it scales. It's likely not as frugal from a token usage perspective. [0]: https://github.com/dmitrym0/dm-gptel-simple-org-memory
- drittich 70 days ago
 Do you have a standard prompt you use for this? I have definitely seen agentic tools doing this for me, e.g., when searching the local file system, but I'm not sure if it is native behaviour for tool-using LLMs or if it is coerced via prompts.
 - simonw 70 days ago
 No, I've not got a good one for this yet. I've found the modern models (or the Claude Code etc harness) know how to do this already by default - you can ask them a question and give them a search tool and they'll start running and iterating on searches by themselves.
- enraged_camel 71 days ago
 Yes, exactly. We have our AI feature configured to use our pre-existing TypeSense integration and it's stunningly competent at figuring out exactly what search queries to use across which collections in order to find relevant results.
 - busssard 71 days ago
 If this is coupled with powerful search engines beyond Elastic then we are getting somewhere. Other nonmonotonic engines that can find structural information are out there.
- petercooper 70 days ago
 So kinda GAR - Generation-Augmented Retrieval :-)
- pstuart 71 days ago
 Perhaps SQLite with FTS5? Or even better, getting DuckDB into the party as its ecosystem seems ripe for this type of work.
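For anyone who hasn't tried the FTS5 route, a minimal sketch using Python's built-in `sqlite3` module. It assumes your SQLite build was compiled with FTS5 (most are, but it is a compile-time option); the table name and sample rows are made up.

```python
import sqlite3

# Assumes the underlying SQLite library includes the FTS5 extension.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE docs USING fts5(title, body)")
conn.executemany(
    "INSERT INTO docs (title, body) VALUES (?, ?)",
    [
        ("Shelter", "Adopt a dog from the local shelter"),
        ("K9", "The canine unit trains on Tuesday"),
        ("Cats", "The cat sat on the mat"),
    ],
)

def search(query, limit=5):
    """Run an FTS5 MATCH query, ordered by the built-in bm25() ranking."""
    rows = conn.execute(
        "SELECT title FROM docs WHERE docs MATCH ? ORDER BY bm25(docs) LIMIT ?",
        (query, limit),
    )
    return [title for (title,) in rows]

# The kind of boolean query an LLM could write for itself.
print(search("dog OR canine"))
```

Note the default tokenizer does no stemming, so "dog" will not match "dogs" unless you configure the `porter` tokenizer.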
- paulyy_y 70 days ago
 Burying the lede here - your solution for avoiding using vector search is either offloading to 1) user, expecting them to remember the right terms or 2) using LLM to craft the search query? And having it iterate multiple times? Holy mother of inefficiency, this agentic focus is making us all brain dead. Vector DBs and embeddings are dead simple to figure out, implement, and maintain. Especially for a local RAG, which is the primary context here. If I want to find my latest tabular notes on some obscure game dealing with medical concepts, I should be able to just literally type that. It shouldn't require me remembering the medical terms, or having some local (or god forbid, remote) LLM iterate through a dozen combos. FWIW I also think this is a matter of how well one structures their personal KB. If you follow strict metadata/structure and have coherent/logical writing, you'll have better chance of getting results with text matching. For someone optimizing for vector space search, and minimizing the need for upfront logical structuring, it will not work out well.
 - simonw 70 days ago
 My opinion on this really isn't very extreme. Claude Code is widely regarded to be the best coding agent tool right now and it uses search, not embeddings. I use it to answer questions about files on my computer all the time.
mips_avatar 71 days ago
One thing I didn’t see here that might be hurting your performance is a lack of semantic chunking. It sounds like you’re embedding entire docs, which kind of breaks down if the docs contain multiple concepts. A better approach for recall is using some kind of chunking program to get semantic chunks (I like spaCy though you have to configure it a bit). Then once you have your chunks you need to append context on how each chunk relates to the rest of your doc before you do your embedding. I have found Anthropic's approach to contextual retrieval to be very performant in my RAG systems (https://www.anthropic.com/engineering/contextual-retrieval); you can just use gpt-oss-20b as the model for generation of context. Unless I’ve misunderstood your post and you are doing some form of this in your pipeline, you should see a dramatic improvement in performance once you implement this.
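A rough sketch of the chunk-then-add-context step, with two loud assumptions: the splitter below is a plain regex sentence splitter with a length cap (a stand-in for spaCy, not true semantic chunking), and `describe_context` is a hypothetical placeholder for the LLM call that writes a sentence situating the chunk in its document.

```python
import re

def split_sentences(text):
    """Naive sentence splitter; a real pipeline might use spaCy instead."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def chunk(text, max_chars=120):
    """Greedily pack whole sentences into chunks of at most max_chars."""
    chunks, current = [], ""
    for sentence in split_sentences(text):
        if current and len(current) + 1 + len(sentence) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks

def describe_context(doc_title, chunk_text):
    """Hypothetical stand-in for an LLM that explains where the chunk fits."""
    return f"From the document '{doc_title}'."

def contextualize(doc_title, text, max_chars=120):
    """Prefix each chunk with its context; this string is what gets embedded."""
    return [f"{describe_context(doc_title, c)} {c}" for c in chunk(text, max_chars)]

text = (
    "RAG retrieves documents. It then passes them to a model. "
    "Chunking splits documents into pieces."
)
for piece in contextualize("Intro", text, max_chars=60):
    print(piece)
```

The contextualized string is what you would embed and index; the original chunk text is still what you would show to the user or the model.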
- yakkomajuri 71 days ago
 hey, author (not op) here. we do do semantic chunking! I think maybe I gave the impression that we don't because of the mention of aggregating context but I tested this with questions that would require aggregating context from 15+ documents (meaning 2x that in chunks), hence the comment in the post!
 - NebulaStorm456 70 days ago
 Is there a way to convert documents into a hierarchical, connected graph data structure in which documents reference each other, similar to how we use personal knowledge tools like Obsidian, with the ability to traverse this graph? Is the GraphRAG technique trying to do exactly this?
 - mips_avatar 70 days ago
 Not exactly what you’re looking for but Wilson Lin’s search engine creates a graph from the DOM for context. Here’s his write up: https://blog.wilsonl.in/search-engine/
 - mips_avatar 71 days ago
 Ah so you’re generating context from multiple docs for your chunks? How do you decide which docs get aggregated?
 - nostrebored 71 days ago
 Haven’t seen an answer better than “vibes” here. Especially with data across multiple domains.
 - mips_avatar 71 days ago
 I mean as long as they're not too long I suppose you could use just about any heuristic for grouping sources. Just seems like it would be hard to generate succinct context if you mess it up.
abhashanand1501 70 days ago
My advice - use the same rigor as other software development for a RAG application. Have a test suite (of say 100 cases) which says that for this question the correct response is this. Use an LLM judge to score each of the outputs of the RAG system. Now iterate till you get a score of 85 or so. Every change of prompts and strategy then triggers this check, which ensures that a score of 85 is always maintained.
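As a sketch of that regression gate: `judge` below is a hypothetical stand-in for the LLM judge (it just does a substring check and returns 0 or 100), and the two toy "RAG systems" are dictionaries. Only the shape of the loop is the point: score every case, average, compare against the threshold.

```python
def judge(question, expected, actual):
    """Hypothetical stand-in for an LLM judge: 100 if the expected answer
    appears in the output, else 0. A real judge would grade more finely."""
    return 100 if expected.lower() in actual.lower() else 0

def evaluate(rag, cases, threshold=85):
    """Return (mean score, passed) for a RAG callable over the test cases."""
    scores = [judge(q, expected, rag(q)) for q, expected in cases]
    mean = sum(scores) / len(scores)
    return mean, mean >= threshold

# Invented test cases: (question, expected answer).
CASES = [
    ("What port does the API use?", "8080"),
    ("Who owns billing?", "finance team"),
    ("What is the retry limit?", "3"),
    ("Where are logs stored?", "/var/log/app"),
]

# Two toy systems: one answers everything, one regressed on a case.
good = {q: f"The answer is {a}." for q, a in CASES}
regressed = dict(good, **{"Who owns billing?": "I don't know."})

print(evaluate(good.get, CASES))       # (100.0, True)
print(evaluate(regressed.get, CASES))  # (75.0, False)
```

Wired into CI, a `False` here would fail the build on any prompt or retrieval change that drops the score below the bar.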
nilirl 71 days ago
Why is it implicit that semantic search will outperform lexical search? Back in 2023 when I compared semantic search to lexical search (tantivy; BM25), I found the search results to be marginally different. Even if semantic search has slightly more recall, does the problem of context warrant this multi-component, homebrew search engine approach? By what important measure does it outperform a lexical search engine? Is the engineering time worth it?
- kgeist 71 days ago
 It depends on how you test it. I recently found that the way devs test it differs radically from how users actually use it. When we first built our RAG, it showed promising results (around 90% recall on large knowledge bases). However, when the first actual users tried it, it could barely answer anything (closer to 30%). It turned out we relied on exact keywords too much when testing it: we knew the test knowledge base, so we formulated our questions in a way that helped the RAG find what we expected it to find. Real users don't know the exact terminology used in the articles. We had to rethink the whole thing. Lexical search is certainly not enough. Sure, you can run an agent on top of it, but that blows up latency - users aren't happy when they have to wait more than a couple of seconds.
 - victorbuilds 70 days ago
 This is the gap that kills most AI features. Devs test with queries they already know the answer to. Users come in with vague questions using completely different words. I learned to test by asking my kids to use my app - they phrase things in ways I would never predict.
 - embedding-shape 70 days ago
 Ironically, pitting an LLM (ideally a completely different model) up against what you're testing, letting it write human "out of the ordinary" queries to use as test cases, tends to work well too, if you don't have kids you can use as a free workforce :)
 - scosman 70 days ago
 I built a system to do exactly this: https://docs.kiln.tech/docs/evaluations/evaluate-rag-accuracy-q-and-a-evals Basically it:
   - iterates over your docs to find knowledge specific to the content
   - generates hundreds of pairs of [synthetic query, correct answer]
   - evaluates different RAG configurations for recall
 - babelfish 71 days ago
 How did you end up changing it? Creating new evals to measure the actual user experience seems easy enough, how did that inform your stack?
- scosman 70 days ago
 Totally depends on use case. It solves some types of issues lexical search never will. For example if a user searches "Close account", but the article is named "Deleting Your Profile". But lexical solves issues semantic never will. Searching an invoice DB for "Initech" with semantic search is near useless. Pick a system that can do both, including a hybrid mode, then evaluate if the complexity is worth it for you.
- mips_avatar 71 days ago
 Depends on how important keyword matching vs something more ambiguous is to your app. In Wanderfugl there’s a bunch of queries where semantic search can find an important chunk that lacks a high bm25 score. The good news is you can get all the benefits of bm25 and semantic with a hybrid ranking. The answer isn’t one or the other.
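One common way to build such a hybrid ranking is reciprocal rank fusion (RRF). To be clear, the comment above doesn't say which fusion method is used, so this is a swapped-in illustration; the two input rankings are invented, and `k=60` is just the conventional default.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc IDs.
    Each doc scores sum(1 / (k + rank)) over the lists it appears in."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))

# Invented rankings from two retrievers over the same corpus.
bm25_ranking = ["d1", "d2", "d3"]
semantic_ranking = ["d3", "d1", "d4"]

print(reciprocal_rank_fusion([bm25_ranking, semantic_ranking]))
# ['d1', 'd3', 'd2', 'd4']
```

RRF only needs ranks, not scores, which sidesteps the problem that BM25 scores and cosine similarities live on different scales.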
- andoando 71 days ago
 The benefit I see is you can have queries like "conversations between two scientists". It's very dependent on use case imo
autogn0me 71 days ago
What we use: https://github.com/ggozad/haiku.rag
Why?
  - developer oriented (easy to read Python and uses pydantic-ai)
  - benchmarks available
  - docling with advanced citations (on branch)
  - supports deep research agent
  - real open source by long term committed developer not fly by night
mingodad 70 days ago
I did an experiment while learning about LLMs and llama.cpp: trying to create a Lua extension that uses the llama.cpp API to enhance LLMs with an agent/RAG written in Lua, with simple code to learn the basics. After more than 5 hours chatting with https://aistudio.google.com/prompts/new_chat?model=gemini-3-pro-preview (see the scraped output of the whole session attached) I got quite far in terms of learning how to use an LLM to help develop/debug/learn about a topic (in this case agent/RAG with the llama.cpp API using Lua). I'm posting it here just in case it can help others to see and comment/improve it (it was using around 100K tokens at the end and started getting noticeably slow but was still very helpful). You can see the scraped text for the whole session here: https://github.com/ggml-org/llama.cpp/discussions/17600
nh2 70 days ago
I'd like to have a local, fully offline and open-source software into which I can dump all our Emails, Slack, Gdrive contents, Code, and Wiki, and then query it with free form questions such as "with which customers did we discuss feature X?", producing references to the original sources. What are my options? I want to avoid building my own or customising a lot. Ideally it would also recommend which models work well and have good defaults for those.
- cbcoutinho 70 days ago
 This is why I built the Nextcloud MCP server, so that you can talk with your own data. Obviously this is Nextcloud-specific, but if you're using it already then this is possible now. https://github.com/cbcoutinho/nextcloud-mcp-server The default MCP server deployment supports simple CRUD operations on your data, but if you enable vector search the MCP server will begin embedding docs/notes/etc. Currently ollama and openai are the supported embedding providers. The MCP server then exposes tools you can use to search your docs based on semantic search and/or bm25 (via qdrant fusion) as well as generate responses using MCP sampling. Importantly, rather than generating responses itself, the server relies on MCP sampling so that you can use any LLM/MCP client. This MCP sampling/RAG pattern is extremely powerful and it wouldn't surprise me if there was something open source that generalizes this across other data sources.
- russdill 70 days ago
 Would love to see someone build an example using the offline wikipedia text.
 - fragmede 70 days ago
 Given the full text of Wikipedia is undoubtedly part of the training data, what would having it in a RAG add?
 - Zambyte 70 days ago
 High precision recall. It may also be cheaper to update the source (Wikipedia) with new information than to update the model.
davedx 70 days ago
> we use Sentence Transformers (all-MiniLM-L6-v2) as our default (solid all-around performer for speed and retrieval, English-only).
Huh, interesting. I might be building a German-language RAG at some point in my future and I never even considered that some models might not support German at all. Does anyone have any experience here? Do many models underperform or not support non-English languages?
- navar 70 days ago
 You can refer to https://huggingface.co/spaces/mteb/leaderboard and use that to guide your selection. Check under the "Retrieval" section, either RTEB Multilingual or RTEB German (under language specific). You may also want to filter for model sizes (under "Advanced Model Filters"). For instance if you are self-hosting and running on a CPU it may make sense to limit to something like <=100M parameter models.
 - davedx 68 days ago
 Thanks, that's really useful, I had no idea this table existed.
- yakkomajuri 70 days ago
 > Do many models underperform or not support non-English languages?
 Yes they do. However:
   1. German is one of the more common languages to train on so more models will support it than say, Bahasa
   2. There should still be a reasonable amount of multilingual models available. Particularly if you're OK with using proprietary models via API. AFAIK all the frontier embedding and reranking models (non open-source) are multilingual
- architectonic 70 days ago
 Yes, I can confirm that; we had resorted to a multilingual embedding model back in the day. https://link.springer.com/chapter/10.1007/978-3-031-77918-3_15
dmezzetti 71 days ago
Glad to see all the interest in the local RAG space, it's been something I've been pushing for a while. I just put this example together today: https://gist.github.com/davidmezzetti/d2854ed82f2d0665ec7efdd073d575d7
andai 70 days ago
When I started playing with this stuff in the GPT-4 days (8K context!), I wrote a script that would search for a relevant passage in a book, by shoving the whole book into GPT-4, in roughly context sized chunks. I think it was like a dollar per search or something in those days. We've come a long way! Anthropic, in their RAG article, actually say that if your thing fits in context, you should probably just put it there instead of using RAG. I don't know where the optimal cutoff is though, since quality does suffer with long contexts. (Not to mention price and speed.) https://www.anthropic.com/engineering/contextual-retrieval The context size and pricing have come so far! Now the whole book fits in context, and it's like 1 cent to put the whole thing in context. (Well, a little more with Anthropic's models ;)
urbandw311er 71 days ago
When it comes to the evals for this kind of thing, is there a standard set of test data out there that one can work with to benchmark against? I.e. a collection of documents with questions that should result in particular documents or chunks being cited as the most relevant match.
- autogn0me 70 days ago
 Yes, check out the haiku-rag benchmarks and evaluations
Oras 70 days ago
The hardest part in RAG is document parsing. If you only consider text then it should be ok, but once you start having tables, tables spanning multiple pages, charts, a TOC that should be ignored when present, footnotes … etc, that part becomes really hard, and the accuracy of the extracted context suffers regardless of what chunking you use. There are some patterns to help such as RAPTOR where you make ingestion content aware and instead of just ingesting content, you start using LLMs to question and summarise the content and save that to the vector database. But the reality is, having one size fits all for RAG is not an easy task.
- Royce-CMR 70 days ago
 Super noob in vector embeddings: I never considered that tables would be a complexifier (beyond defining in a parseable format for ingestion). Do vector databases do better with long grouped text vs table formats?
 - Oras 70 days ago
 The issue is the ingestion (extracting the right data in the right format). This is mainly an issue in PDFs and sometimes when there are tables added as images in Docx too. You need a mix of text and OCR extraction to get the data correctly first before you start chunking and adding embeddings
_joel 71 days ago
You can get local RAG with AnythingLLM if you want minimal effort too fwiw. Pretty much plug and play. Used it for simple testing for an idea before getting into the weeds of LangChain and agentic RAG.
mijoharas 71 days ago
I'm interested in the embeddings models suggested. I had some good results with nomic in a small embedding based tool I built. I also heard a few good things about qwen3-embedding, though the latency wasn't great for my usecase so I didn't pursue it much further. Similarly, I used sqlite-vec, and was very happy with it. (If I were already using postgres I'd have gone with that, but this was more of a cli tool.) If the author is here, did you try any of those models? How would you compare the ones you did use?
into_the_void 70 days ago
Interesting perspective on the use of full-text search over vector databases for RAG. I appreciate the insights on agentic tool loops and handling fuzzy searching.
barbazoo 71 days ago
> What that means is that when you're looking to build a fully local RAG setup, you'll need to substitute whatever SaaS providers you're using for a local option for each of those components.
Even starting with having "just" the documents and vector db locally is a huge first step and much more doable than going with a local LLM at the same time. I don't know anyone or any org that has the resources to run their own LLM at scale.
- mips_avatar 71 days ago
 It’s also just extremely viable to just host your own vector db. You just need a server with enough ram for your hnsw index.
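For a back-of-envelope feel for "enough RAM", here is a sketch based on the commonly quoted rule of thumb for HNSW indexes holding float32 vectors: roughly `dim * 4` bytes for the vector plus about `M * 2 * 4` bytes for graph links per element. That formula is an approximation I'm assuming here; real usage varies by library and ignores metadata and upper graph layers.

```python
def hnsw_memory_gib(num_vectors, dim, m=16, bytes_per_float=4, bytes_per_link=4):
    """Approximate HNSW index size in GiB.
    Assumes float32 vectors and ~2*M links per element (rule of thumb only)."""
    per_vector = dim * bytes_per_float + m * 2 * bytes_per_link
    return num_vectors * per_vector / 2**30

# Example: one million 384-dimensional vectors with M=16.
print(round(hnsw_memory_gib(1_000_000, 384), 2))  # 1.55
```

So a million 384-dimensional chunks lands at roughly 1.5 GiB, which fits comfortably on a small server; the number scales linearly with both corpus size and embedding dimension.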
- procaryote 71 days ago
 Aren't there a bunch of models that run OK on consumer hardware now?
 - lukan 71 days ago
 Hopefully my new GPU will arrive tomorrow, then I can confirm myself, but if you look around online, there are lots of private people out there running their own models. A 16 GB GPU starts at 270€, which lets you run something like deepseek r.14, 32 GB GPUs start at 1200 € and then it goes further up, in model quality and price. (Top models require something like 60-200 GB of GPU memory I think) So for sure any medium sized company could afford to run their own LLMs, also at scale if they want to make the investment. The question is, how much they value their confidential data. (I would not trust any of the big AI companies). And you don't usually need cutting edge reasoning and coding abilities to process basic information.
 - FuckButtons 70 days ago
 A 120 GB RAM MacBook Pro will run gpt-oss-120b at a very respectable clip and I’ve found it to be quite serviceable for a lot of tasks.
 - adastra22 70 days ago
 I bought one for this purpose, but LM Studio doesn't seem to want to run even the most quantized versions. Any suggestions?
 - FuckButtons 70 days ago
 Are you using the mix quantized versions? Also, there’s a setting that disables the memory allocation guardrails they put in place, which are irrelevant since macOS handles OOM quite gracefully.
spacecadet 70 days ago
You can vibe code a local RAG with or without vectors in 5 minutes. Like another commenter pointed out, unless your corpus is huge, you do not need vectors, but hey using vectors is fun so why not. For what it's worth, I run a local-first, small-model, private RAG that uses LangGraph and Neo4j knowledge graphs, and I swap the models around constantly. It mostly just gets called by agent tools now.
- okeuro49 70 days ago
 Do you run the models locally? No local model for me manages to get function calling right.
 - spacecadet 70 days ago
 Are you using LangGraph tool nodes? I run very small, non-RLHF instruction models with maybe a 2% failure rate on response format matching tool definition. I would also guess you do not have your orchestration and pipe configured correctly.
JKCalhoun 71 days ago
I kinda do want to build a local RAG? I want some significant subset of Wikipedia (I assume most people know about these) on a dedicated machine with a RAG front-end. I would have then an offline Wikipedia "librarian" I could query. But I'm lazy and assumed that someone has already built such a thing. I'm just not aware of this "Wikipedia-RAG-in-a-box".
- yakkomajuri 70 days ago
 In this case do you even need a RAG? Most models will have been trained on Wikipedia anyway. Give Jan (https://www.jan.ai/) a try for instance. You'll need to do a bit of research as to what model will give you the best perf on your system but one of the quantized Llama or Qwen models will probably suit you well.
 - JKCalhoun 70 days ago
 Thank you.
- jonwinstanley 70 days ago
 Standard Ollama probably covers a lot of this
kbrisso 71 days ago
I built this for local RAG: https://github.com/kbrisso/byte-vision It uses llama.cpp and Elasticsearch. On a laptop with an 8 GB GPU it can handle a 30K token size and summarize a fairly large PDF.
- busssard 71 days ago
 Elasticsearch is the true limitation of RAG systems...
 - kbrisso 71 days ago
 The vector search works great once you figure it out. I wanted to focus on writing the application and not have to rewrite a document store.
0xC45 71 days ago
For an open source, local (or cloud) vector DB, I would also recommend checking out Chroma (https://trychroma.com). It also supports full text search. Disclaimer: I work on Chroma cloud.
dwa3592 71 days ago
If you end up using any of the frontier models, don't forget to protect private information in your prompts - https://github.com/deepanwadhwa/zink
- cjonas 71 days ago
 Doesn't seem necessary if you are using Claude via Bedrock or GPT via Azure. At that point, it's no different than sending PII through a serverless function.
 - wanderingmind 70 days ago
 Care to explain more? I understand the prompt might not be used for training, but how about sanitizing the PII from tracking or logging or memory bugs in these serverless functions?
 - cjonas 68 days ago
 My point is there are plenty of cases where you would send the same PII through a serverless function or internal API (i.e. PATCH /user/profile). The concerns about logging or bugs are the same in both instances. You could make a case that using a masking tool like this would make it easier to share full production logs, but there are plenty of other ways to secure logs that don't involve modifying the runtime behavior.
johnebgd 71 days ago
Interesting stack. I’ve been working on doing something like this with Apple specific tech. SwiftData is not easy to work with.
ElasticBottle 71 days ago
How does this compare with Orama?
throwaway19343 70 days ago
No
adastra22 70 days ago
Rust API?
- yakkomajuri 70 days ago
 Post author (not OP) here. Would you want a Rust SDK for Skald?
 - adastra22 70 days ago
 Yes, none of the languages listed are very accessible from the frameworks I'm working with.
 - yakkomajuri 70 days ago
 Emailing you!