5 comments

  • pulkitsh1234 1 hour ago
    (Not an expert in stream processing.) From the docs at https://sql-flow.com/docs/introduction/basics#output-sink it seems like this works on "batches" of data. How is this different from batch processing? Where is the "stream" here?
    • dm03514 55 minutes ago
      Ha, yes! A pipeline assumes a "batch" of data, which is backed by an ephemeral in-memory DuckDB table. The goal is to provide SQL table semantics and to implement pipelines so that the batch size can be changed without changing the pipeline logic.

      The stream is achieved by the continuous flow of data from Kafka.

      SQLFlow exposes a variable for batch size. With a batch size of 1, SQLFlow reads a Kafka message, applies the processor SQL logic, and then ensures it successfully commits the SQL results to the sink, one message after another.

      SQLFlow provides at-least-once delivery guarantees. It only commits the source message once it has successfully written to the pipeline output (sink).

      https://sql-flow.com/docs/operations/handling-errors

      The batch table is just a convention that allows for seamless batch size configuration. If your throughput is low, or if you require message-by-message processing, SQLFlow can be set to a batch size of 1. If you need higher throughput and can tolerate the latency, the batch size can be set higher.
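The read, process, write, then commit ordering described above can be sketched as follows. This is only an illustration under stated assumptions, not SQLFlow's actual code: `batches`, `run_pipeline`, `sink`, `commit`, and the message fields are hypothetical names, a plain Python list stands in for the Kafka consumer, and the standard-library `sqlite3` module stands in for DuckDB so the sketch runs without extra dependencies.

```python
import sqlite3

# Processor SQL: it only refers to the conventional "batch" table,
# so it does not need to change when the batch size changes.
SQL = "SELECT city, SUM(amount) FROM batch GROUP BY city ORDER BY city"


def batches(messages, batch_size):
    """Group an iterable of messages into lists of at most batch_size items."""
    batch = []
    for msg in messages:
        batch.append(msg)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        # Flush whatever is left when the (finite) input ends.
        yield batch


def run_pipeline(messages, batch_size, sql, sink, commit):
    """Apply sql to each batch, write to the sink, and only then commit."""
    for batch in batches(messages, batch_size):
        # Ephemeral in-memory table that lives only for this batch.
        db = sqlite3.connect(":memory:")
        db.execute("CREATE TABLE batch (msg_offset INTEGER, city TEXT, amount REAL)")
        db.executemany(
            "INSERT INTO batch VALUES (?, ?, ?)",
            [(m["offset"], m["city"], m["amount"]) for m in batch],
        )
        rows = db.execute(sql).fetchall()
        db.close()

        # If the sink raises, commit() is never reached, so the source
        # offset does not advance and the batch would be read again.
        sink(rows)
        commit(batch[-1]["offset"])
```

Because the offset is committed only after the sink write succeeds, a crash between the two steps replays the batch rather than losing it, which is what makes the guarantee at-least-once rather than exactly-once. A long-running consumer would presumably also flush a partial batch after a timeout; that is left out here.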
  • srameshc 2 hours ago
    This looks brilliant, thank you. I love DuckDB and use it for a lot of local data processing jobs. We have a data stream, but not at the size where we need to push to BigQuery or elsewhere. I was thinking of trying something like sql-flow, and I am glad this now makes the job very easy.
  • mihevc 2 hours ago
    How does this compare to https://github.com/Query-farm/tributary ?
    • rustyconover 2 hours ago
      The next major release of Tributary will support Avro, Protobuf and JSON along with the Schema Registry. It will also bring the ability to write to Kafka with transactions.

      But really you should get excited for DuckDB Labs to build out materialized views: materialized views where you can ingest more streaming data to update aggregates. This way you could just keep pushing rows from Kafka through aggregates.

      It is going to be a POWERHOUSE for streaming analytics.

      Contact DuckDB Labs if you want to sponsor the work on materialized views: https://duckdb.org/roadmap
      • buremba 2 minutes ago
        Exactly. I have also been playing with DuckDB for streaming use cases, but it feels hacky to issue micro-batching queries on streaming data in short intervals.

        DuckDB has everything that streaming engines such as Flink have; it just needs to support managing intermediate aggregate states and scheduling the materialized views itself.
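To make "managing intermediate aggregate states" concrete, here is a minimal sketch of the general idea: keep a small per-key state and fold each new micro-batch into it, instead of re-running the aggregate over all data seen so far. It is not how DuckDB, Flink, or any planned materialized-view feature is implemented; the `RunningAverage` class and its methods are hypothetical names chosen for illustration.

```python
from collections import defaultdict


class RunningAverage:
    """Incrementally maintained AVG(value) GROUP BY key."""

    def __init__(self):
        # Intermediate state per key: [count, sum]. The average itself is
        # not stored, because it cannot be merged with new rows directly.
        self._state = defaultdict(lambda: [0, 0.0])

    def ingest(self, rows):
        """Fold one micro-batch of (key, value) rows into the state."""
        for key, value in rows:
            entry = self._state[key]
            entry[0] += 1
            entry[1] += value

    def result(self):
        """Compute the current averages from the intermediate state."""
        return {key: total / count for key, (count, total) in self._state.items()}
```

Each call to `ingest` touches only the new rows, so the cost of an update depends on the batch size rather than on the total history. Scheduling when to call it is the other missing piece mentioned above.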
    • dm03514 2 hours ago
      Oh yes!! I've seen this a couple of times. I am far from an expert in Tributary, so please take this with a grain of salt.

      Based on the Tributary documentation, I understand that Tributary embeds Kafka consumers into DuckDB. This makes DuckDB the main process that you run to perform consumption. I think this makes creating stream processing POCs very accessible. It looks like it is quite easy to start streaming data into DuckDB. What I don't see is a full story around DevOps, operations, testing, configuration as code, etc.

      SQLFlow is a service that embeds DuckDB as the storage and processing brains. Because of this, we're able to offer metrics, testing utilities, pipelines as code, and all the other DevOps utilities that are necessary to run a huge number of streaming instances 24x7. SQLFlow was created as the tool that I wish I had for simple stream processing in production in high-availability contexts :)
      • mihevc 2 hours ago
        Nice! Thanks for the context, it's great to know!
  • itsfseven 2 hours ago
    It would be great if this supported Pulsar too!
  • mbay 2 hours ago
    I see an example with what looks like a lookup-type join against a Postgres DB. Are stream/stream joins supported, though?

    The DLQ and Prometheus integration out of the box are nice.
    • dm03514 2 hours ago
      Stream-to-stream joins are NOT currently supported. This is a regularly requested feature, and I'll look at prioritizing it.

      SQLFlow uses DuckDB internally for windowing and stream state storage :), and I'll look at extending it to support stream/stream joins.

      Could you describe a bit more about your use case? I'd really appreciate it if you could create an issue in the repo describing your use case and desired functionality!

      https://github.com/turbolytics/sql-flow/issues

      We were looking at solving some of the simpler use cases first before branching out into these more complicated ones :)
      • mbay 2 hours ago
        I worked on stream processing at my previous gig but don't have a need for it currently. Just curious.