Estuary

More about Estuary and related technologies, straight from the team. Our blog breaks down basic concepts and takes you into the minds of our engineers. We also dig into the business principles that guide our company and allow us to build great solutions for yours.

38 posts
Estuary

The article argues that Change Data Capture (CDC) is deceptively complex in production and explains why many organizations choose to buy rather than build CDC. It surveys industry acquisitions, details technical challenges (exactly-once delivery, snapshot vs incremental sync, database-specific quirks, schema evolution, scaling and operational complexity), provides a TCO/build-vs-buy analysis, and concludes that unless CDC is a core differentiator, buying a battle-tested solution is usually the pragmatic choice.
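Several of the pitfalls listed above reduce to one discipline: applying changes idempotently. A minimal Python sketch, with invented names and event shapes (not from any particular CDC tool), of turning at-least-once delivery into effectively-exactly-once by tracking the last applied log position:

```python
# Minimal sketch of idempotent CDC apply: at-least-once delivery is
# upgraded to effectively-exactly-once by tracking the last applied
# log position (LSN/offset) and skipping replayed events.
# All names and event fields here are illustrative.

class IdempotentApplier:
    def __init__(self):
        self.state = {}          # target "table": key -> row
        self.last_lsn = -1       # last applied log position

    def apply(self, event):
        """Apply a change event unless it was already applied."""
        if event["lsn"] <= self.last_lsn:
            return False         # duplicate delivery: skip
        if event["op"] == "delete":
            self.state.pop(event["key"], None)
        else:                    # insert/update as upsert
            self.state[event["key"]] = event["row"]
        self.last_lsn = event["lsn"]
        return True

applier = IdempotentApplier()
events = [
    {"lsn": 1, "op": "insert", "key": "a", "row": {"v": 1}},
    {"lsn": 2, "op": "update", "key": "a", "row": {"v": 2}},
    {"lsn": 2, "op": "update", "key": "a", "row": {"v": 2}},  # redelivered
    {"lsn": 3, "op": "delete", "key": "a"},
]
applied = [applier.apply(e) for e in events]
print(applied)            # [True, True, False, True]
print(applier.state)      # {}
```

Real systems must also persist `last_lsn` atomically with the applied state; doing those two writes separately reintroduces the duplicate problem after a crash.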

Estuary

A practical walkthrough of an AI‑powered analytics engineering stack that turns meeting notes into production dbt models and docs. The author explains wiring VS Code to Roo (an agent framework), using Claude Code as the LLM, a CLAUDE.md project context file, and MCP servers to let the agent interact with dbt, MotherDuck (DuckDB in the cloud), and Notion. The post includes config examples, a demo workflow, encountered issues (Notion/MotherDuck quirks), and planned integrations (GitHub, Slack, BI tools).

Estuary

An introductory overview of data observability for the modern data stack: it defines data observability, outlines five core pillars (freshness, volume, schema, data quality, lineage), explains the scope across ingestion, ETL/ELT, warehouses, BI, and ML pipelines, and details how observability addresses pipeline failures, latency, silent data quality issues, missing lineage, fragmented debugging, and governance. The article argues observability is essential for reducing MTTD/MTTR and maintaining trust in data, and previews a Part 2 that will cover tools and standards.
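Of the five pillars, freshness is the easiest to make concrete: compare the newest record's load time against an SLA window. A minimal sketch (the SLA, timestamps, and function name are invented for illustration):

```python
from datetime import datetime, timedelta, timezone

# Illustrative freshness check, one of the five pillars above:
# a table is "fresh" if its newest record landed within the SLA window.
def freshness_status(latest_loaded_at, sla, now=None):
    now = now or datetime.now(timezone.utc)
    lag = now - latest_loaded_at
    return ("fresh" if lag <= sla else "stale", lag)

# Example: data last landed 3 hours ago against a 2-hour SLA.
now = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
status, lag = freshness_status(
    latest_loaded_at=datetime(2025, 1, 1, 9, 0, tzinfo=timezone.utc),
    sla=timedelta(hours=2),
    now=now,
)
print(status, lag)  # stale 3:00:00
```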

Estuary

The article analyzes recent Postgres-focused acquisitions (Databricks→Neon, Snowflake→Crunchy Data) and argues that Postgres has evolved into a strategic, extensible runtime for developer-centric, AI-enabled data platforms. It explains why Postgres — via extensions, serverless architectures, and OLTP/OLAP/AI convergence — is attractive to major platform vendors, describes what each acquisition brings, and outlines implications for startups, infra teams, and the open-source ecosystem (more integrated managed experiences but greater platform lock‑in risk).

Estuary

A comparative overview of pipe/flow SQL syntaxes in Databricks (|>), Snowflake (->>), and BigQuery (|>). The article explains each syntax's design philosophy, key operators and examples, trade-offs and community critiques, and guidance on when to use each approach—Databricks/BigQuery for linear, transformation-first single-query workflows and Snowflake for sequential, multi-statement scripting.

Estuary

A curated 2025 guide listing 11 free resources to learn SQL, including interactive sites, video courses, university material, and practice platforms (Codecademy, Khan Academy, freeCodeCamp, Stanford, W3Schools, SQLZoo, SQLBolt, Seattle Data Guy, DataCamp, Kaggle, HackerRank). It explains who each resource is best for, gives tips on choosing based on learning style, and mentions using AI tools like ChatGPT as a study aid.

Estuary

A 2025 buyer’s guide comparing the top eight enterprise data warehouse platforms (Snowflake, Amazon Redshift, Google BigQuery, Azure Synapse/Microsoft Fabric, Databricks Lakehouse, Oracle Autonomous Data Warehouse, SAP Datasphere, and Teradata Vantage). It explains what a data warehouse is, summarizes each platform’s architecture, strengths, and ideal use cases (scalability, performance, cloud integration, ML/BI support, governance), compares deployment and pricing trade-offs, and offers guidance on choosing the right solution based on ecosystem, workloads, and budget.

Estuary

Step-by-step tutorial showing how to use PyIceberg to manage Apache Iceberg tables from Python: install and configure a local SQLite-backed catalog, create schemas and tables, append data with PyArrow, run filtered and projected queries, perform updates and deletes, and handle schema evolution and partitioning. The guide contrasts PyIceberg with Spark/Trino/DuckDB and highlights Iceberg features like ACID transactions and decoupled storage and compute.

Estuary

Step-by-step technical guide to migrate Snowflake native tables to Snowflake Open Catalog (Polaris) Iceberg tables. Covers creating an external Polaris catalog, creating a Snowflake Iceberg table with continuous sync, copying data, using PySpark to register the Iceberg metadata into a Polaris internal catalog, cleanup, optional read-only exposure back to Snowflake, and post-migration optimizations (file rewrites, partitioning, RBAC).

Estuary

Practical guide to using Apache Iceberg time travel with PySpark. Shows how to create Iceberg tables, inspect snapshots, query historical data by snapshot ID or timestamp, perform incremental reads, use tags and branches, and roll back to prior snapshots. Covers real-world use cases (auditing, debugging, ML training, data recovery), outlines limitations (storage overhead, query performance, schema evolution) and offers mitigation strategies.
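The timestamp form of time travel resolves to a concrete snapshot before any data is read: pick the latest snapshot committed at or before the requested time. A hedged stdlib sketch of that resolution rule (snapshot IDs and commit times are made up; a real Iceberg reader does this against table metadata):

```python
# Conceptual sketch of timestamp-based time travel: resolve a query
# timestamp to the latest snapshot committed at or before it.
def snapshot_as_of(snapshots, ts):
    """snapshots: list of (snapshot_id, committed_at), any order."""
    eligible = [s for s in snapshots if s[1] <= ts]
    if not eligible:
        raise ValueError("no snapshot exists at or before this timestamp")
    return max(eligible, key=lambda s: s[1])[0]

# Illustrative snapshot history: (id, commit timestamp)
history = [(101, 1000), (102, 2000), (103, 3000)]
print(snapshot_as_of(history, 2500))  # 102
print(snapshot_as_of(history, 3000))  # 103
```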

Estuary

A technical comparison of Apache Iceberg’s Copy‑On‑Write (COW) and Merge‑On‑Read (MOR) update modes. The article explains mechanics, pros/cons, storage and metadata effects (snapshots, manifests, delete files), performance trade‑offs, and operational concerns like compaction. PySpark examples demonstrate table creation, inserts, updates, deletes, and how file layouts change. Recommendation: use COW for read‑heavy or bulk update workloads and MOR for update‑heavy/streaming ingestion, while running periodic compaction for MOR.
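The trade-off can be caricatured in a few lines: COW pays at write time, MOR pays at read time until compaction. A toy Python model of the behavior (not real Iceberg file handling; class and method names are invented):

```python
# Toy model of the COW/MOR trade-off: COW rewrites the data file on
# every delete; MOR just records the delete and merges while scanning.

class CowTable:
    def __init__(self, rows):
        self.rows = dict(rows)
    def delete(self, key):
        # copy-on-write: produce a fresh "file" without the deleted row
        self.rows = {k: v for k, v in self.rows.items() if k != key}
    def read(self):
        return dict(self.rows)

class MorTable:
    def __init__(self, rows):
        self.base = dict(rows)   # immutable base data "file"
        self.deletes = set()     # delete "file", applied lazily
    def delete(self, key):
        self.deletes.add(key)    # cheap write: just record the delete
    def read(self):
        # merge-on-read: filter deletes while scanning
        return {k: v for k, v in self.base.items() if k not in self.deletes}
    def compact(self):
        # periodic compaction folds delete files back into the base
        self.base, self.deletes = self.read(), set()

rows = {"a": 1, "b": 2, "c": 3}
cow, mor = CowTable(rows), MorTable(rows)
cow.delete("b")
mor.delete("b")
print(cow.read() == mor.read())  # True: same logical result
mor.compact()
print(mor.deletes)               # set(): read path is cheap again
```

The model makes the recommendation above tangible: every MOR read repeats the filtering work until `compact()` runs, which is why update-heavy streaming tables need periodic compaction.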

Estuary

A hands-on beginner tutorial on Apache Iceberg covering its core features (schema evolution, hidden partitioning, ACID transactions, time travel), architecture (metadata, data files, catalogs), common use cases (data lakehouse, analytics, governance), implementation steps and code examples (catalog setup, table creation, writes/queries with Spark/PySpark), comparisons with Delta Lake and Hudi, performance and best-practice guidance, and multi-cloud integration considerations.

Estuary

A technical comparison of Apache Polaris and Databricks Unity Catalog covering architecture, supported table formats (Iceberg vs Delta Lake), transaction models, schema evolution, governance/lineage, integration with query engines (Spark, Flink, Trino), and ingestion patterns. Polaris emphasizes open standards, multi-engine flexibility and Iceberg-native design; Unity Catalog emphasizes tight Databricks/Delta Lake integration with automated schema evolution, fine-grained access controls and built-in lineage. Both enable direct querying of cloud object storage (S3/ADLS/GCS) to reduce data movement and costs. The article includes Spark/Flink/Structured Streaming examples and guidance for choosing based on vendor lock-in, integration needs, and governance requirements.

Estuary

A step-by-step tutorial showing how to set up and use Apache Iceberg S3 Tables on AWS: create an S3 bucket, launch EMR/Spark with Iceberg support, create and populate Iceberg tables, run SQL queries, and manage table maintenance (compaction, snapshots, unreferenced-file removal). It also outlines limitations and optional integration with Estuary Flow for streaming and batch ingestion.

Estuary

A step-by-step tutorial to measure PostgreSQL WAL throughput using SQL: create a wal_lsn_history table and a record_current_wal_lsn() function, build wal_volume_analytics and wal_volume_summary views that compute WAL bytes and rates (using pg_wal_lsn_diff), and schedule periodic captures with cron or pg_cron. The guide includes example outputs, caveats about WAL contents vs CDC, and suggestions to export metrics and alert.
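The arithmetic behind `pg_wal_lsn_diff` is plain subtraction: an LSN such as `0/1000000` is a 64-bit byte position written as two hex halves (high 32 bits / low 32 bits). The same rate calculation in Python, with illustrative LSN values:

```python
# An LSN like '16/B374D848' is a 64-bit WAL byte position printed as
# two 32-bit hex halves; pg_wal_lsn_diff(a, b) is just a - b in bytes.
def lsn_to_bytes(lsn):
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) + int(lo, 16)

def wal_lsn_diff(a, b):
    return lsn_to_bytes(a) - lsn_to_bytes(b)

def wal_rate_mb_per_s(a, b, seconds):
    """WAL throughput between two captured LSNs, in MiB per second."""
    return wal_lsn_diff(a, b) / seconds / (1024 * 1024)

diff = wal_lsn_diff("0/2000000", "0/1000000")
print(diff)  # 16777216 bytes (16 MiB)
print(wal_rate_mb_per_s("0/2000000", "0/1000000", 60))  # ~0.27 MiB/s
```

This mirrors what the `wal_volume_analytics` view computes server-side with `pg_wal_lsn_diff` over consecutive `wal_lsn_history` rows.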

Estuary

A 2025 buyer's-guide roundup of nine top data visualization/BI tools (Tableau, Power BI, Looker, Qlik Sense, Kibana, Sisense, Grafana, Datawrapper, Zoho Analytics), describing each tool’s key features, pros/cons, integrations, and use cases, and noting future trends like AR/VR and AI-enhanced analytics.

Estuary

Estuary Flow explains a memory-efficient "2-pass write" for streaming row-oriented data into Parquet files: pass 1 appends small row-group Parquet scratch files to disk while keeping in-memory buffering minimal; pass 2 reads those scratch files column-by-column and compacts many small row groups into larger output row groups that are streamed to the final Parquet file (written to S3). Implemented in Go with the Apache Parquet file module, the approach trades extra encoding/decoding and some overhead for much lower memory use; metadata growth with very wide schemas is mitigated by heuristics.
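The shape of the two passes can be sketched without any Parquet encoding at all. A stdlib Python toy (buffer sizes and thresholds are arbitrary; the real implementation works on encoded Parquet row groups, not Python lists) showing pass 1 flushing tiny row groups and pass 2 re-chunking them column-by-column into larger groups:

```python
# Pass 1: flush tiny row groups to scratch as soon as a small buffer
# fills, keeping the memory ceiling low while rows stream in.
def pass1_scratch(rows, flush_every=2):
    scratch, buf = [], []
    for row in rows:
        buf.append(row)
        if len(buf) == flush_every:
            scratch.append(buf)   # one tiny row group
            buf = []
    if buf:
        scratch.append(buf)
    return scratch

# Pass 2: walk the scratch groups one column at a time and re-chunk
# them into a few large output row groups of `target_rows` rows each.
def pass2_compact(scratch, columns, target_rows=4):
    total = sum(len(g) for g in scratch)
    out = []
    for start in range(0, total, target_rows):
        group = {}
        for col in columns:
            values = [row[col] for g in scratch for row in g]
            group[col] = values[start:start + target_rows]
        out.append(group)
    return out

rows = [{"id": i, "v": i * i} for i in range(5)]
scratch = pass1_scratch(rows)
print([len(g) for g in scratch])          # [2, 2, 1] tiny row groups
compacted = pass2_compact(scratch, ["id", "v"])
print([len(g["id"]) for g in compacted])  # [4, 1] larger row groups
print(compacted[0]["v"])                  # [0, 1, 4, 9]
```

The toy also shows where the extra cost comes from: every value is materialized twice (once into scratch, once into the compacted output), which is the encode/decode overhead the post describes trading for lower peak memory.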

Estuary

A technical overview of open table formats (Iceberg, Delta Lake, Hudi), how they enable CRUD/ACID, performance and scalability improvements over legacy Hive lakes, and how catalogs (service vs file-system, Polaris, OSS Unity Catalog, and REST catalog proposals) plus compute engines (Trino, DuckDB) combine to form modern lakehouse architectures.

Estuary

A technical guide to Apache Parquet explaining its history, why data lakes and analytics engines adopted it, and the internal mechanics that make it efficient. The article describes Parquet's striping algorithm, repetition and definition levels, row groups/column chunks/pages, metadata, supported data types, and compression/encoding methods. It includes practical Python examples using pandas and pyarrow to write, inspect, and read Parquet files, and summarizes Parquet's advantages and limitations for analytical workloads.
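For a flat optional column, definition levels are the whole story: the level records whether the value is present, so nulls never occupy space in the data stream. A hedged Python sketch of that idea (real Parquet writers emit encoded level and data pages, not Python lists; nested and repeated fields need repetition levels too):

```python
# Sketch of Parquet definition levels for a single flat optional
# column: level 1 means the value is present, 0 means null, and only
# non-null values are stored in the data stream.
def encode_optional_column(values):
    def_levels = [1 if v is not None else 0 for v in values]
    data = [v for v in values if v is not None]
    return def_levels, data

def decode_optional_column(def_levels, data):
    it = iter(data)
    return [next(it) if d == 1 else None for d in def_levels]

col = [10, None, 30, None, None, 60]
levels, data = encode_optional_column(col)
print(levels)  # [1, 0, 1, 0, 0, 1]
print(data)    # [10, 30, 60]
print(decode_optional_column(levels, data) == col)  # True
```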

Estuary

A step-by-step tutorial demonstrating how to build a real-time fraud detection pipeline: spin up a Dockerized PostgreSQL data generator, configure Estuary Flow to capture CDC from transactions and users, materialize those collections into Databricks Delta tables, and run a SQL-based anomaly detection query and dashboard to flag suspicious transactions.
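The anomaly rule at the end of such a pipeline can be as simple as a z-score threshold. A hypothetical Python version (the threshold, fields, and sample amounts are invented; the tutorial implements its detection in SQL over the materialized Delta tables):

```python
import statistics

# Toy anomaly rule: flag transactions whose amount is far above the
# account's historical mean, measured in population standard deviations.
# Threshold and data are illustrative, not from the article.
def flag_suspicious(amounts, threshold=2.5):
    mean = statistics.fmean(amounts)
    stdev = statistics.pstdev(amounts)
    if stdev == 0:
        return []  # no variation, nothing to flag
    return [i for i, a in enumerate(amounts)
            if (a - mean) / stdev > threshold]

history = [20, 25, 22, 19, 24, 21, 23, 20, 22, 5000]
print(flag_suspicious(history))  # [9] — only the outlier is flagged
```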