OP here.
I wrote this implementation to deeply understand the mechanics behind HNSW (layers, entry points, neighbor selection) without relying on external libraries.
While PHP isn't the typical choice for vector search engines, I found it surprisingly capable for this use case, especially with JIT enabled on PHP 8.x. It serves as a drop-in solution for PHP monoliths that need semantic search features without adding the complexity of a separate service like Qdrant or Pinecone.
If you want to jump straight to the code, the open-source repo is here: <a href="https://github.com/centamiv/vektor" rel="nofollow">https://github.com/centamiv/vektor</a>
Happy to answer any questions about the implementation details!
Thanks a lot, I liked the fantasy-based examples used to explain the concept.<p>Programming <i>is</i> chanting magic incantations and spells after all.
(And fighting against evil spirits and demons)
Great writeup. Thanks for taking the time to organise and share.<p>It's tempting to use this in projects that use PHP.<p>Is it usable with a corpus of, say, 1,000 Markdown files of 3 KB each? And 10,000 files?<p>Can I also index PHP files so that searches include function and class names? Perhaps comments?<p>How much RAM and disk space would we be talking about?<p>And the speed?<p>My first goal would be to index a PHP project and its documentation so that an LLM agent could perform semantic search using my MCP tool.
I tested it myself with 1k documents (about 1.5M vectors) and performance is solid (a few milliseconds per search). I haven't run more aggressive benchmarks yet.<p>Since it only stores the vectors, the actual size of the Markdown document is irrelevant; you just need to handle the embedding and chunking phases carefully (you can use a parser to extract code snippets).<p>RAM isn't an issue because I aim for random data access as much as possible. This avoids saturating PHP, since it wasn't exactly built for this kind of workload.<p>I'm glad you found the article and repo useful! If you use it and run into any problems, feel free to open an issue on GitHub.
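To make the chunking step a bit more concrete, here is a minimal sketch of heading-aware chunking with overlapping word windows. It is written in Python for brevity and is not part of Vektor: `chunk_markdown`, `max_words`, and `overlap` are hypothetical names, and the window sizes are arbitrary assumptions you would tune for your embedding model.

```python
import re


def chunk_markdown(text, max_words=200, overlap=40):
    """Split Markdown into heading-scoped chunks of at most max_words words.

    Hypothetical helper for illustration only; sizes are assumptions.
    """
    if overlap >= max_words:
        raise ValueError("overlap must be smaller than max_words")

    # Split right before each ATX heading (# to ######) so that a chunk
    # never mixes text from two different sections.
    sections = re.split(r"(?m)^(?=#{1,6}\s)", text)

    chunks = []
    step = max_words - overlap  # how far the window advances each time
    for section in sections:
        words = section.split()
        if not words:
            continue
        for start in range(0, len(words), step):
            chunks.append(" ".join(words[start:start + max_words]))
            # Stop once the window has reached the end of the section,
            # so no trailing chunk is fully contained in the previous one.
            if start + max_words >= len(words):
                break
    return chunks
```

Each returned chunk would then be embedded and stored as one vector, with the source file path kept alongside it so a search hit can be traced back to the document.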
Great article!
I also read your other post and love it! This is exactly my thinking: Locality of Behavior (LoB)<p>Never heard this term before, but I like it.<p><a href="https://centamori.com/index.php?slug=basics-of-web-development" rel="nofollow">https://centamori.com/index.php?slug=basics-of-web-developme...</a>
The only small thing you forgot to mention: it requires the use of AI, OpenAI to be specific. I got baited.
Apologies if it felt that way! I used OpenAI in the examples just because it's the quickest 'Hello World' for embeddings right now, but the library itself is completely agnostic.<p>HNSW is just the indexing algorithm. It doesn't care where the vectors come from. You can generate them using Ollama (locally), Hugging Face, Gemini...<p>As long as you feed it an array of floats, it will index it. The dependency on OpenAI is purely in the example code, not in the engine logic.
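As a rough illustration of that "array of floats" contract, here is a small sketch in Python. Nothing in it is Vektor's API: `Embedder`, `toy_embed`, `cosine`, and `search` are hypothetical names, `toy_embed` is a hashed bag-of-words stand-in for a real model (Ollama, Hugging Face, and so on), and the brute-force `search` only stands in for what an HNSW index does far more efficiently.

```python
import hashlib
import math
from typing import Callable, List, Sequence, Tuple

# Any provider fits this shape: text in, list of floats out.
Embedder = Callable[[str], List[float]]


def toy_embed(text: str, dims: int = 256) -> List[float]:
    """Hypothetical stand-in for a real embedding model.

    Hashes each word into one of `dims` buckets, then L2-normalises.
    It captures word overlap only, not meaning.
    """
    vec = [0.0] * dims
    for word in text.lower().split():
        digest = hashlib.sha256(word.encode("utf-8")).digest()
        vec[int.from_bytes(digest[:4], "big") % dims] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def search(query: str, docs: Sequence[str], embed: Embedder,
           k: int = 3) -> List[Tuple[float, str]]:
    """Brute-force nearest neighbours: score every doc, keep the top k."""
    q = embed(query)
    scored = [(cosine(q, embed(d)), d) for d in docs]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]
```

Swapping providers then means passing a different function as `embed`; the only constraint is that every vector in one index comes from the same model and has the same number of dimensions.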
I think you'd get a lot more people interested in trying your project out if you included a document with steps on how to generate the vectors for the search.<p>I love PHP, but I will realistically admit that most people interested in using PHP probably don't have the experience to know how to do such a thing offhand.