Props to the DuckDB team. I've been using DuckDB for the last little while since I discovered it on HN and it's been simply amazing.
Before, I would reach for Apache Spark to run queries on local Parquet datasets (on a single machine), but I've started using DuckDB for that and it's super fast (much faster than Pandas) and unfussy to integrate in Python code (with PySpark you need all kinds of boilerplate code).
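For example, a grouped aggregation over a local Parquet file is just a few lines -- a minimal sketch, where the file and column names are made up:

```python
import duckdb

# Query a local Parquet file directly -- no cluster, no session setup.
# 'trips.parquet' and its columns are hypothetical.
df = duckdb.query("""
    SELECT passenger_count, AVG(fare) AS avg_fare
    FROM 'trips.parquet'
    GROUP BY passenger_count
    ORDER BY passenger_count
""").df()

print(df)
```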
DuckDB is so lightweight that it's also great for quick interactive work in Jupyter or IPython.
I also use it to do cross-format joins between Parquet (immutable) and CSV files (mutable) -- DuckDB can load both into the same environment -- which makes it easy to solve different kinds of programming problems. The dynamic stuff goes into the CSV files while the Parquet dataset remains static. For instance, when processing a large Parquet dataset, my code keeps track of the groups it has already processed in a CSV file. If the program is interrupted and I need to resume where I left off, I just do a DuckDB join between the Parquet and the CSV and exclude the already-processed groups (particularly handy when there's no single group key and you're grouping by several fields). Yes, you can do all this with Spark too, but the DuckDB code is much simpler and more compact.
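A minimal sketch of that resume pattern -- file and column names are hypothetical; the Parquet data stays untouched while the CSV checkpoint grows as groups finish:

```python
import duckdb

con = duckdb.connect()  # in-memory; nothing is persisted

# Exclude groups already recorded in the CSV checkpoint file.
remaining = con.execute("""
    SELECT p.*
    FROM 'events.parquet' AS p
    WHERE NOT EXISTS (
        SELECT 1
        FROM read_csv_auto('processed_groups.csv') AS done
        WHERE done.region = p.region
          AND done.day    = p.day
    )
""").df()
```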
For large Parquet datasets, I currently roll my own code to chunk them so I can process them out of core, but it sounds like this latest streaming feature in DuckDB takes care of that detail.
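If I understand the announcement correctly, the streaming path looks roughly like this -- fetch_record_batch is how I read the new API, and the batch size and file name here are made up:

```python
import duckdb

con = duckdb.connect()

# Stream the result as pyarrow RecordBatches instead of materializing it all.
reader = con.execute(
    "SELECT * FROM 'big_dataset.parquet' WHERE amount > 0"
).fetch_record_batch(1_000_000)  # rows per batch

for batch in reader:
    # each batch is a pyarrow.RecordBatch; do the real per-chunk work here
    print(batch.num_rows)
```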
Sure, DuckDB doesn't do distributed compute like Spark, but as a SQL engine for Parquet, I find it's so much more ergonomic than Spark.
How does this compare to Postgres + parquet FDW? Is zero copy feasible in Postgres with FDWs?
DuckDB is an embedded database. That is, it is a library one loads into their application. This is in contrast to a server database like Postgres, where the database and the application live in different processes (usually on different machines).
DuckDB is often compared to SQLite. But a more apt comparison might be KDB+, a proprietary, vector-oriented embedded database.
I am not very familiar with the Postgres Parquet FDW, but here is an educated guess!
Postgres is a row store engine rather than a column store, so I believe there will need to be quite a lot of translation for Postgres to be able to process parquet data (DuckDB and parquet are both columnar). My hypothesis is that DuckDB would be significantly faster! However, feel free to benchmark things!
I should also add that there is a DuckDB FDW, so you could have DuckDB read from your Parquet files and do faster transformations before you pull your data into Postgres!
https://github.com/alitrack/duckdb_fdw
This may be close to what you're thinking:
https://turbodbc.readthedocs.io/en/latest/pages/advanced_usa...
We saw speedups (20%+?), but that's orders of magnitude short of the perf we advocate DB vendors aim for when we do visual analytics integrations. Arrow opens up saturating networks & PCI cards for DB<>GPU, so think 10-50 GB/s.
"In-process, serverless" lol.. who's drinking this..
???
TLDR: Arrow got an SQL interface provided by DuckDB.
So you have a new way to run SQL on Parquet et al. through DuckDB -> Arrow -> Parquet. Of course, you still need to watch the memory usage of your SQL query if it contains JOINs or window functions, because the integration is designed around streaming rows.
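Roughly what the integration looks like from Python, as far as I can tell -- the path and column names are hypothetical, and register() accepting Arrow datasets is my reading of the new API:

```python
import duckdb
import pyarrow.dataset as ds

con = duckdb.connect()

# An Arrow dataset over a directory of Parquet files (hypothetical path).
taxi = ds.dataset("nyc_taxi/", format="parquet")
con.register("taxi", taxi)

# The scan streams Arrow record batches through DuckDB; only the columns and
# row groups needed for this aggregation are read.
result = con.execute("""
    SELECT passenger_count, COUNT(*) AS trips
    FROM taxi
    GROUP BY passenger_count
""").arrow()
```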
You could already run SQL on Parquet with DuckDB (even on Java).
I believe it already used Arrow under the hood to read the Parquet files (could be wrong, though), but this is a more memory-efficient integration, which is great.
DuckDB has its own Parquet reader