Quack-Cluster: A Serverless Distributed SQL Query Engine with DuckDB and Ray

(github.com)

80 points by tanelpoder 11 days ago

11 comments

fodkodrasz 8 days ago
So DuckDB was developed to finally allow queries on biggish data without the need for a cluster, to simplify data analysis... and now we put it on a cluster?
I think there are solutions for that scale of data already, and simplicity is the best feature of DuckDB (at least for me).
- augusteo 8 days ago
 > "So DuckDB was developed to allow queries for bigish data finally without the need for a cluster to simplify data analysis... and we now put it to a cluster?"
 This is a fair point, but I think there's a middle ground. DuckDB handles surprisingly large datasets on a single machine, but "surprisingly large" still has limits. If you're querying 10TB of parquet files across S3, even DuckDB needs help.
 The question is whether Ray is the right distributed layer for this. Curious what the alternative would be: Spark feels like overkill, but rolling your own coordination is painful.
- AnEro 8 days ago
 Big fan of this pushback, because there are a lot of projects that have that smell of over-engineering on the wrong base (especially with vibecoding now). Though there are use cases where some have lots of medium-sized data divided up. For compliance, I have a lot of reporting data split such that DuckDB instances running in separate processes work amazingly for us, especially with lower complexity than other compute engines in that environment. If I wanted to move everything somewhere that ClickHouse/Trino/Databricks/etc. would work well, the compliance complexity skyrockets, and we would need perfect configs and tons of extra time invested to get the same devex.
thenaturalist 8 days ago
> "Forget about managing complex server infrastructure for your database needs."
So what does this run on then?
No docs, and it's not possible to find any deployment guides for Ray using serverless solutions like Lambda, Cloud Functions or even your own Firecracker.
Instead, every other post mentions EKS or EC2.
The Ray team even expressly rejected Lambda support as far back as 2020 [0]. Uuuuuugh.
No thanks! *shiver*
I'd rather cut complexity for practically the same benefit and either do it on a single machine or have a thin, manageable layer on top of a truly serverless infra, like in this talk [1], "Processing Trillions of Records at Okta with Mini Serverless Databases".
[0]: https://github.com/ray-project/ray/issues/9983
[1]: https://www.youtube.com/watch?v=TrmJilG4GXk
rfonseca 8 days ago
What is the lifetime of the Ray workers, or, in other words, what is the scalability / scale-to-zero story that makes this serverless?
nevalainen 8 days ago
feels like a missed opportunity to call it cluster-quack xD
- chatmasta 8 days ago
  Surely “clusterduck” would be better…
  - neumann 7 days ago
    Agreed, but maybe that's what you call it when you get your configs wrong
dogman123 8 days ago
neat. i'm pretty novice in the guts of this kind of stuff, but how does this work under the hood for blocking operators where they "cannot output a single row until the last row of their input has been seen"?
i think this is where spark shuffling comes in? but how does it work here?
https://duckdb.org/docs/stable/guides/performance/how_to_tune_workloads#blocking-operators
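One common way distributed engines handle a blocking operator such as GROUP BY is two-phase aggregation: each worker reduces its own partition to a small partial state, and a coordinator merges those states once every worker has finished. The sketch below illustrates that general technique in plain Python. It is not taken from Quack-Cluster and may not match how it works; the function names `partial_aggregate` and `merge_partials` and the sample data are hypothetical.

```python
from collections import defaultdict


def partial_aggregate(rows):
    """Worker side: reduce one partition to per-key (sum, count) state."""
    state = defaultdict(lambda: [0.0, 0])
    for key, value in rows:
        state[key][0] += value
        state[key][1] += 1
    return {key: (total, count) for key, (total, count) in state.items()}


def merge_partials(partials):
    """Coordinator side: combine partial states, then finalize AVG per key.

    This step is still blocking: it cannot emit a final average until
    every partition has reported its partial state.
    """
    merged = defaultdict(lambda: [0.0, 0])
    for partial in partials:
        for key, (total, count) in partial.items():
            merged[key][0] += total
            merged[key][1] += count
    return {key: total / count for key, (total, count) in merged.items()}


# Hypothetical data: two partitions, as if read from two separate files.
partitions = [
    [("a", 1.0), ("b", 4.0)],
    [("a", 3.0), ("b", 6.0), ("c", 10.0)],
]

# Equivalent in spirit to: SELECT key, AVG(value) FROM t GROUP BY key
result = merge_partials(partial_aggregate(p) for p in partitions)
print(result)
```

Only the per-key partial state crosses the network, not the raw rows, which is why this works without a full shuffle for decomposable aggregates like SUM, COUNT, and AVG. Operators that are not decomposable this way (a large join or an exact median, for example) generally do need a shuffle or repartitioning step.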
mgaunard 8 days ago
In my experience ray clusters don't scale well and end up costing you more money. You need to run permanent per-user instances etc.
What you need is a multi-tenancy shared infrastructure that is elastic.
whattheheckheck 7 days ago
Why is everyone so scared of pyspark? Make it run in a local docker image and hand it off to a sagemaker processing job.
hexo 7 days ago
Serverless? So it runs on... nothing?
- Imustaskforhelp 7 days ago
  No it just runs on other people's servers.
pickleballcourt 8 days ago
Reminds me of smallpond from deepseek
- esafak 8 days ago
 Which, unfortunately, is not maintained: https://github.com/deepseek-ai/smallpond