Estuary

More about Estuary and related technologies, straight from the team. Our blog breaks down basic concepts and takes you into the minds of our engineers. We also dig into the business principles that guide our company and allow us to build great solutions for yours.

38 posts
Estuary

The article argues that Change Data Capture (CDC) is deceptively complex in production and explains why many organizations choose to buy rather than build CDC. It surveys industry acquisitions, details technical challenges (exactly-once delivery, snapshot vs incremental sync, database-specific quirks, schema evolution, scaling and operational complexity), provides a TCO/build-vs-buy analysis, and concludes that unless CDC is a core differentiator, buying a battle-tested solution is usually the pragmatic choice.
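Several of the pitfalls listed above reduce to one discipline: applying changes idempotently. A minimal Python sketch, with invented names and event shapes (not from any particular CDC tool), of turning at-least-once delivery into effectively-exactly-once by tracking the last applied log position:

```python
# Minimal sketch of idempotent CDC apply: at-least-once delivery is
# upgraded to effectively-exactly-once by tracking the last applied
# log position (LSN/offset) and skipping replayed events.
# All names and event fields here are illustrative.

class IdempotentApplier:
    def __init__(self):
        self.state = {}          # target "table": key -> row
        self.last_lsn = -1       # last applied log position

    def apply(self, event):
        """Apply a change event unless it was already applied."""
        if event["lsn"] <= self.last_lsn:
            return False         # duplicate delivery: skip
        if event["op"] == "delete":
            self.state.pop(event["key"], None)
        else:                    # insert/update as upsert
            self.state[event["key"]] = event["row"]
        self.last_lsn = event["lsn"]
        return True

applier = IdempotentApplier()
events = [
    {"lsn": 1, "op": "insert", "key": "a", "row": {"v": 1}},
    {"lsn": 2, "op": "update", "key": "a", "row": {"v": 2}},
    {"lsn": 2, "op": "update", "key": "a", "row": {"v": 2}},  # redelivered
    {"lsn": 3, "op": "delete", "key": "a"},
]
applied = [applier.apply(e) for e in events]
print(applied)            # [True, True, False, True]
print(applier.state)      # {}
```

Real systems must also persist `last_lsn` atomically with the applied state; doing those two writes separately reintroduces the duplicate problem after a crash.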

Estuary

A practical walkthrough of an AI‑powered analytics engineering stack that turns meeting notes into production dbt models and docs. The author explains wiring VS Code to Roo (an agent framework), using Claude Code as the LLM, a CLAUDE.md project context file, and MCP servers to let the agent interact with dbt, MotherDuck (DuckDB in the cloud), and Notion. The post includes config examples, a demo workflow, encountered issues (Notion/MotherDuck quirks), and planned integrations (GitHub, Slack, BI tools).

Estuary

An introductory overview of data observability for the modern data stack: it defines data observability, outlines five core pillars (freshness, volume, schema, data quality, lineage), explains the scope across ingestion, ETL/ELT, warehouses, BI, and ML pipelines, and details how observability addresses pipeline failures, latency, silent data quality issues, missing lineage, fragmented debugging, and governance. The article argues observability is essential for reducing MTTD/MTTR and maintaining trust in data, and previews a Part 2 that will cover tools and standards.
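Of the five pillars, freshness is the easiest to make concrete: compare the newest record's load time against an SLA window. A minimal sketch (the SLA, timestamps, and function name are invented for illustration):

```python
from datetime import datetime, timedelta, timezone

# Illustrative freshness check, one of the five pillars above:
# a table is "fresh" if its newest record landed within the SLA window.
def freshness_status(latest_loaded_at, sla, now=None):
    now = now or datetime.now(timezone.utc)
    lag = now - latest_loaded_at
    return ("fresh" if lag <= sla else "stale", lag)

# Example: data last landed 3 hours ago against a 2-hour SLA.
now = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
status, lag = freshness_status(
    latest_loaded_at=datetime(2025, 1, 1, 9, 0, tzinfo=timezone.utc),
    sla=timedelta(hours=2),
    now=now,
)
print(status, lag)  # stale 3:00:00
```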

Estuary

The article analyzes recent Postgres-focused acquisitions (Databricks→Neon, Snowflake→Crunchy Data) and argues that Postgres has evolved into a strategic, extensible runtime for developer-centric, AI-enabled data platforms. It explains why Postgres — via extensions, serverless architectures, and OLTP/OLAP/AI convergence — is attractive to major platform vendors, describes what each acquisition brings, and outlines implications for startups, infra teams, and the open-source ecosystem (more integrated managed experiences but greater platform lock‑in risk).

Estuary

A comparative overview of pipe/flow SQL syntaxes in Databricks (|>), Snowflake (->>), and BigQuery (|>). The article explains each syntax's design philosophy, key operators and examples, trade-offs and community critiques, and guidance on when to use each approach—Databricks/BigQuery for linear, transformation-first single-query workflows and Snowflake for sequential, multi-statement scripting.

Estuary

A curated 2025 guide listing 11 free resources to learn SQL, including interactive sites, video courses, university material, and practice platforms (Codecademy, Khan Academy, freeCodeCamp, Stanford, W3Schools, SQLZoo, SQLBolt, Seattle Data Guy, DataCamp, Kaggle, HackerRank). It explains who each resource is best for, gives tips on choosing based on learning style, and mentions using AI tools like ChatGPT as a study aid.

Estuary

A 2025 buyer’s guide comparing the top eight enterprise data warehouse platforms (Snowflake, Amazon Redshift, Google BigQuery, Azure Synapse/Microsoft Fabric, Databricks Lakehouse, Oracle Autonomous Data Warehouse, SAP Datasphere, and Teradata Vantage). It explains what a data warehouse is, summarizes each platform’s architecture, strengths, and ideal use cases (scalability, performance, cloud integration, ML/BI support, governance), compares deployment and pricing trade-offs, and offers guidance on choosing the right solution based on ecosystem, workloads, and budget.

Estuary

Step-by-step tutorial showing how to use PyIceberg to manage Apache Iceberg tables from Python: install and configure a local SQLite-backed catalog, create schemas and tables, append data with PyArrow, run filtered and projected queries, perform updates and deletes, and handle schema evolution and partitioning. The guide contrasts PyIceberg with Spark/Trino/DuckDB and highlights Iceberg features like ACID transactions and decoupled storage and compute.

Estuary

Step-by-step technical guide to migrate Snowflake native tables to Snowflake Open Catalog (Polaris) Iceberg tables. Covers creating an external Polaris catalog, creating a Snowflake Iceberg table with continuous sync, copying data, using PySpark to register the Iceberg metadata into a Polaris internal catalog, cleanup, optional read-only exposure back to Snowflake, and post-migration optimizations (file rewrites, partitioning, RBAC).

Estuary

Practical guide to using Apache Iceberg time travel with PySpark. Shows how to create Iceberg tables, inspect snapshots, query historical data by snapshot ID or timestamp, perform incremental reads, use tags and branches, and roll back to prior snapshots. Covers real-world use cases (auditing, debugging, ML training, data recovery), outlines limitations (storage overhead, query performance, schema evolution) and offers mitigation strategies.
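The timestamp form of time travel resolves to a concrete snapshot before any data is read: pick the latest snapshot committed at or before the requested time. A hedged stdlib sketch of that resolution rule (snapshot IDs and commit times are made up; a real Iceberg reader does this against table metadata):

```python
# Conceptual sketch of timestamp-based time travel: resolve a query
# timestamp to the latest snapshot committed at or before it.
def snapshot_as_of(snapshots, ts):
    """snapshots: list of (snapshot_id, committed_at), any order."""
    eligible = [s for s in snapshots if s[1] <= ts]
    if not eligible:
        raise ValueError("no snapshot exists at or before this timestamp")
    return max(eligible, key=lambda s: s[1])[0]

# Illustrative snapshot history: (id, commit timestamp)
history = [(101, 1000), (102, 2000), (103, 3000)]
print(snapshot_as_of(history, 2500))  # 102
print(snapshot_as_of(history, 3000))  # 103
```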

Estuary

A technical comparison of Apache Iceberg’s Copy‑On‑Write (COW) and Merge‑On‑Read (MOR) update modes. The article explains mechanics, pros/cons, storage and metadata effects (snapshots, manifests, delete files), performance trade‑offs, and operational concerns like compaction. PySpark examples demonstrate table creation, inserts, updates, deletes, and how file layouts change. Recommendation: use COW for read‑heavy or bulk update workloads and MOR for update‑heavy/streaming ingestion, while running periodic compaction for MOR.
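The trade-off can be caricatured in a few lines: COW pays at write time, MOR pays at read time until compaction. A toy Python model of the behavior (not real Iceberg file handling; class and method names are invented):

```python
# Toy model of the COW/MOR trade-off: COW rewrites the data file on
# every delete; MOR just records the delete and merges while scanning.

class CowTable:
    def __init__(self, rows):
        self.rows = dict(rows)
    def delete(self, key):
        # copy-on-write: produce a fresh "file" without the deleted row
        self.rows = {k: v for k, v in self.rows.items() if k != key}
    def read(self):
        return dict(self.rows)

class MorTable:
    def __init__(self, rows):
        self.base = dict(rows)   # immutable base data "file"
        self.deletes = set()     # delete "file", applied lazily
    def delete(self, key):
        self.deletes.add(key)    # cheap write: just record the delete
    def read(self):
        # merge-on-read: filter deletes while scanning
        return {k: v for k, v in self.base.items() if k not in self.deletes}
    def compact(self):
        # periodic compaction folds delete files back into the base
        self.base, self.deletes = self.read(), set()

rows = {"a": 1, "b": 2, "c": 3}
cow, mor = CowTable(rows), MorTable(rows)
cow.delete("b")
mor.delete("b")
print(cow.read() == mor.read())  # True: same logical result
mor.compact()
print(mor.deletes)               # set(): read path is cheap again
```

The model makes the recommendation above tangible: every MOR read repeats the filtering work until `compact()` runs, which is why update-heavy streaming tables need periodic compaction.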

Estuary

A hands-on beginner tutorial on Apache Iceberg covering its core features (schema evolution, hidden partitioning, ACID transactions, time travel), architecture (metadata, data files, catalogs), common use cases (data lakehouse, analytics, governance), implementation steps and code examples (catalog setup, table creation, writes/queries with Spark/PySpark), comparisons with Delta Lake and Hudi, performance and best-practice guidance, and multi-cloud integration considerations.

Estuary

A technical comparison of Apache Polaris and Databricks Unity Catalog covering architecture, supported table formats (Iceberg vs Delta Lake), transaction models, schema evolution, governance/lineage, integration with query engines (Spark, Flink, Trino), and ingestion patterns. Polaris emphasizes open standards, multi-engine flexibility and Iceberg-native design; Unity Catalog emphasizes tight Databricks/Delta Lake integration with automated schema evolution, fine-grained access controls and built-in lineage. Both enable direct querying of cloud object storage (S3/ADLS/GCS) to reduce data movement and costs. The article includes Spark/Flink/Structured Streaming examples and guidance for choosing based on vendor lock-in, integration needs, and governance requirements.

Estuary

A step-by-step tutorial showing how to set up and use Apache Iceberg S3 Tables on AWS: create an S3 bucket, launch EMR/Spark with Iceberg support, create and populate Iceberg tables, run SQL queries, and manage table maintenance (compaction, snapshots, unreferenced-file removal). It also outlines limitations and optional integration with Estuary Flow for streaming and batch ingestion.

Estuary

A step-by-step tutorial to measure PostgreSQL WAL throughput using SQL: create a wal_lsn_history table and a record_current_wal_lsn() function, build wal_volume_analytics and wal_volume_summary views that compute WAL bytes and rates (using pg_wal_lsn_diff), and schedule periodic captures with cron or pg_cron. The guide includes example outputs, caveats about WAL contents vs CDC, and suggestions to export metrics and alert.
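The arithmetic behind `pg_wal_lsn_diff` is plain subtraction: an LSN such as `0/1000000` is a 64-bit byte position written as two hex halves (high 32 bits / low 32 bits). The same rate calculation in Python, with illustrative LSN values:

```python
# An LSN like '16/B374D848' is a 64-bit WAL byte position printed as
# two 32-bit hex halves; pg_wal_lsn_diff(a, b) is just a - b in bytes.
def lsn_to_bytes(lsn):
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) + int(lo, 16)

def wal_lsn_diff(a, b):
    return lsn_to_bytes(a) - lsn_to_bytes(b)

def wal_rate_mb_per_s(a, b, seconds):
    """WAL throughput between two captured LSNs, in MiB per second."""
    return wal_lsn_diff(a, b) / seconds / (1024 * 1024)

diff = wal_lsn_diff("0/2000000", "0/1000000")
print(diff)  # 16777216 bytes (16 MiB)
print(wal_rate_mb_per_s("0/2000000", "0/1000000", 60))  # ~0.27 MiB/s
```

This mirrors what the `wal_volume_analytics` view computes server-side with `pg_wal_lsn_diff` over consecutive `wal_lsn_history` rows.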

Estuary

A 2025 buyer's-guide roundup of nine top data visualization/BI tools (Tableau, Power BI, Looker, Qlik Sense, Kibana, Sisense, Grafana, Datawrapper, Zoho Analytics), describing each tool’s key features, pros/cons, integrations, and use cases, and noting future trends like AR/VR and AI-enhanced analytics.

Estuary

Estuary Flow explains a memory-efficient "2-pass write" for streaming row-oriented data into Parquet files: pass 1 appends small row-group Parquet scratch files to disk while keeping in-memory buffering minimal; pass 2 reads those scratch files column-by-column and compacts many small row groups into larger output row groups that are streamed to the final Parquet file (written to S3). Implemented in Go with the Apache Parquet file module, the approach trades extra encoding/decoding and some overhead for much lower memory use; metadata growth with very wide schemas is mitigated by heuristics.
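The shape of the two passes can be sketched without any Parquet encoding at all. A stdlib Python toy (buffer sizes and thresholds are arbitrary; the real implementation works on encoded Parquet row groups, not Python lists) showing pass 1 flushing tiny row groups and pass 2 re-chunking them column-by-column into larger groups:

```python
# Pass 1: flush tiny row groups to scratch as soon as a small buffer
# fills, keeping the memory ceiling low while rows stream in.
def pass1_scratch(rows, flush_every=2):
    scratch, buf = [], []
    for row in rows:
        buf.append(row)
        if len(buf) == flush_every:
            scratch.append(buf)   # one tiny row group
            buf = []
    if buf:
        scratch.append(buf)
    return scratch

# Pass 2: walk the scratch groups one column at a time and re-chunk
# them into a few large output row groups of `target_rows` rows each.
def pass2_compact(scratch, columns, target_rows=4):
    total = sum(len(g) for g in scratch)
    out = []
    for start in range(0, total, target_rows):
        group = {}
        for col in columns:
            values = [row[col] for g in scratch for row in g]
            group[col] = values[start:start + target_rows]
        out.append(group)
    return out

rows = [{"id": i, "v": i * i} for i in range(5)]
scratch = pass1_scratch(rows)
print([len(g) for g in scratch])          # [2, 2, 1] tiny row groups
compacted = pass2_compact(scratch, ["id", "v"])
print([len(g["id"]) for g in compacted])  # [4, 1] larger row groups
print(compacted[0]["v"])                  # [0, 1, 4, 9]
```

The toy also shows where the extra cost comes from: every value is materialized twice (once into scratch, once into the compacted output), which is the encode/decode overhead the post describes trading for lower peak memory.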

Estuary

A technical overview of open table formats (Iceberg, Delta Lake, Hudi), how they enable CRUD/ACID, performance and scalability improvements over legacy Hive lakes, and how catalogs (service vs file-system, Polaris, OSS Unity Catalog, and REST catalog proposals) plus compute engines (Trino, DuckDB) combine to form modern lakehouse architectures.

Estuary

A technical guide to Apache Parquet explaining its history, why data lakes and analytics engines adopted it, and the internal mechanics that make it efficient. The article describes Parquet's striping algorithm, repetition and definition levels, row groups/column chunks/pages, metadata, supported data types, and compression/encoding methods. It includes practical Python examples using pandas and pyarrow to write, inspect, and read Parquet files, and summarizes Parquet's advantages and limitations for analytical workloads.
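For a flat optional column, definition levels are the whole story: the level records whether the value is present, so nulls never occupy space in the data stream. A hedged Python sketch of that idea (real Parquet writers emit encoded level and data pages, not Python lists; nested and repeated fields need repetition levels too):

```python
# Sketch of Parquet definition levels for a single flat optional
# column: level 1 means the value is present, 0 means null, and only
# non-null values are stored in the data stream.
def encode_optional_column(values):
    def_levels = [1 if v is not None else 0 for v in values]
    data = [v for v in values if v is not None]
    return def_levels, data

def decode_optional_column(def_levels, data):
    it = iter(data)
    return [next(it) if d == 1 else None for d in def_levels]

col = [10, None, 30, None, None, 60]
levels, data = encode_optional_column(col)
print(levels)  # [1, 0, 1, 0, 0, 1]
print(data)    # [10, 30, 60]
print(decode_optional_column(levels, data) == col)  # True
```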

Estuary

A step-by-step tutorial demonstrating how to build a real-time fraud detection pipeline: spin up a Dockerized PostgreSQL data generator, configure Estuary Flow to capture CDC from transactions and users, materialize those collections into Databricks Delta tables, and run a SQL-based anomaly detection query and dashboard to flag suspicious transactions.
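The anomaly rule at the end of such a pipeline can be as simple as a z-score threshold. A hypothetical Python version (the threshold, fields, and sample amounts are invented; the tutorial implements its detection in SQL over the materialized Delta tables):

```python
import statistics

# Toy anomaly rule: flag transactions whose amount is far above the
# account's historical mean, measured in population standard deviations.
# Threshold and data are illustrative, not from the article.
def flag_suspicious(amounts, threshold=2.5):
    mean = statistics.fmean(amounts)
    stdev = statistics.pstdev(amounts)
    if stdev == 0:
        return []  # no variation, nothing to flag
    return [i for i, a in enumerate(amounts)
            if (a - mean) / stdev > threshold]

history = [20, 25, 22, 19, 24, 21, 23, 20, 22, 5000]
print(flag_suspicious(history))  # [9] — only the outlier is flagged
```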