Track Awesome Data Engineering Updates Daily

A curated list of data engineering tools for software developers

igorbarinov/awesome-data-engineering · ⭐ 8.8K · 🏷️ Big Data

Jul 19, 2026

Data Ingestion

faucet-stream (⭐7) - Config-driven data-movement platform for Rust with pluggable source and sink connectors, running ETL, CDC, and streaming pipelines declaratively from YAML or embedded as a library.

Serialization format

ParquetKit - Browser-based viewer, SQL workbench and converter for Parquet files powered by DuckDB-WASM. Fully client-side, no upload.

Workflow

Nika (⭐25) - Intent-as-code workflow engine for AI data pipelines: reviewable YAML DAGs statically checked (schema, permits, cost floor) before execution, with tamper-evident run traces.

Jul 06, 2026

Data Ingestion

enrich-companies (⭐0) - CLI tool to enrich CSV files with company data (financials, contacts, metadata) from 250M+ company records. Available on npm.

Charts and Dashboards

LunaPad - Open-source analytics notebook for reusable SQL workflows, interactive reports, and AI-assisted data exploration.

Workflow

OneQuery (⭐17) - Self-hosted gateway for safe, auditable queries for agents across approved data sources.

Datasets / Data Dumps

LatAm Synth - Synthetic financial savings behavior generator for Latin America: users, savings goals, and transactions calibrated on 506K real records (2015–2024). Reproducible by seed, 100% synthetic.

Schema / Data Profiler

SchemaCrawler - Open-source and free relational database schema discovery and comprehension tool. Documents and diagrams relational database schemas from your Java programs, build tools and the command line. Find database design issues with lint, and write scripts against the database. Includes an MCP Server for use by AI agents.

Jun 22, 2026

Data Ingestion

Duckle (⭐783) - Local-first, open-source desktop ETL/ELT studio: drag a pipeline onto a canvas (or describe it to a built-in on-device AI assistant) and run it at native speed through DuckDB. 290+ connectors, a scheduler, and an MCP server for driving pipelines from an LLM. No cloud, no servers.

Rawbbit - Open-source self-hosted analytics pipeline that lands raw events as Parquet in your own object storage. Uses NATS JetStream for durable buffering and BigQuery external tables for querying. Designed for teams that want to own their raw event data.

Jun 16, 2026

Data Comparison

FutureSearch SDK (⭐46) - Python SDK that dispatches parallel web-research agents across table rows, synthesizing multi-agent findings into structured columns.

Charts and Dashboards

Dekart (⭐350) - Open-source SQL to map platform for BigQuery, Snowflake, and PostGIS.

Workflow

OrionBelt Semantic Layer (⭐64) - Open-source semantic sidecar that compiles YAML-defined dimensions, measures, and metrics into optimized SQL across 8 engines (BigQuery, ClickHouse, Databricks, Dremio, DuckDB, MySQL, PostgreSQL, Snowflake). Unified REST, MCP, and Postgres wire protocol; one model powers AI agents, analytics, DQ rules, and KPIs.

DataFlow (⭐6k) - Open-source platform for data preparation, synthetic data generation, and AI/data pipelines. Includes reusable skills for automating workflow steps across data and AI tasks.

Datasets / Realtime

Eventum - Data generation platform for producing synthetic event streams with complex correlations.

May 15, 2026

Data Ingestion

CARQ (⭐2) - Context-Aware RAG Processing Queue for high availability and adaptive rate-limiting.

Datasets / Data Dumps

The Quiet-Broke Index - A 30-metro composite of US household cost burdens (housing, taxes, childcare, healthcare, transport) aggregated from Census ACS, BLS Consumer Expenditure Survey, and HUD Fair Market Rents. Open methodology, free, no email gate.

Testing / Data Profiler

Fixzi - JSON/XML validation and API contract monitoring tool for debugging and testing structured data.

May 13, 2026

Data Ingestion

Enrich.sh - Managed event ingestion service that converts JSON sent to a REST API into Hive-partitioned Parquet on Cloudflare R2, queryable from DuckDB, ClickHouse, BigQuery, Snowflake, and Python.
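Hive partitioning encodes column values as `key=value` directories in the object path, so query engines can skip files that fall outside a filter. The sketch below only illustrates that naming convention. The `events` prefix, the `dt` and `hour` partition keys, and the `partition_path` helper are assumptions for illustration, not Enrich.sh's actual layout or API.

```python
from datetime import datetime, timezone


def partition_path(prefix: str, event_time: datetime, file_name: str) -> str:
    """Build a Hive-style object key: prefix/dt=YYYY-MM-DD/hour=HH/file_name."""
    # Normalize to UTC so the same event always lands in the same partition.
    ts = event_time.astimezone(timezone.utc)
    return f"{prefix}/dt={ts:%Y-%m-%d}/hour={ts:%H}/{file_name}"


if __name__ == "__main__":
    when = datetime(2026, 5, 13, 9, 30, tzinfo=timezone.utc)
    print(partition_path("events", when, "part-0000.parquet"))
```

A query filtered on `dt = '2026-05-13'` then only needs to read objects under that one directory.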

Testing / Data Profiler

Aegis DQ (⭐4) - Open-source agentic data quality framework with LLM-powered diagnosis, root-cause analysis, SQL auto-fix proposals, and 31 rule types — DuckDB, Postgres, BigQuery, Databricks, Athena, Snowflake.

May 08, 2026

Data Ingestion

drt - OSS Reverse ETL CLI. Sync data from warehouses to business tools via YAML.

May 06, 2026

Data Ingestion

DataSpoc Pipe (⭐2) - Data ingestion engine that connects 400+ Singer taps to Parquet files in cloud buckets (S3, GCS, Azure). Streaming, incremental, with auto-catalog.

DBConvert Streams - Self-hosted database migration and change data capture (CDC) tool with a built-in SQL IDE.

data-genie (⭐16) - High-performance, streaming-first ETL engine for Node.js and TypeScript with constant memory footprint.

pdfmux - Python PDF-to-Markdown orchestrator. Classifies each page and routes to the optimal backend (PyMuPDF, Docling, RapidOCR, Gemini Flash), emitting Markdown plus a per-page confidence score so ingestion pipelines can quarantine low-trust pages before feeding LLMs or retrieval.

LinkedIn Jobs Scraper - Crawlee-based actor extracting structured LinkedIn job listings at scale without API keys.

Serialization format

AKF (⭐13) - The AI native file format. Trust scores, source provenance, and compliance metadata that embed into 20+ formats (DOCX, PDF, images, code). EXIF for AI.

PFC-JSONL (⭐5) - Specialized JSONL log compressor with block-level timestamp indexing and DuckDB integration. Achieves ~9% compression ratio (better than gzip) with time-range random access queries.

Charts and Dashboards

stratif.io - Open-source, self-hosted, warehouse-native product analytics. Runs funnels, retention, and paths directly on DuckDB, Postgres, Snowflake, or ClickHouse.

AI for Database - Agentic AI platform to connect any database (PostgreSQL, MySQL, MongoDB, etc.) and query in plain English; includes self-refreshing intelligent dashboards and action workflows triggered by data changes.

Workflow

Dotflow (⭐9) - A lightweight Python library for building execution pipelines with retry, parallel execution, cron scheduling, and async support.

Data Lake Management

rawquery - Managed lakehouse platform on Apache Iceberg with DuckDB query compute, S3 storage, Postgres wire protocol, and SQL transforms.

Datasets / Realtime

Helium MCP (⭐11) - Remote MCP server for real-time financial data, 3.2M+ news articles, ML options pricing, and news bias analysis. Free, no API key.

Sorsa API - Real-time X (Twitter) data API providing tweets, profiles, search, communities and engagement metrics. Up to 50x cheaper than the official X API with 20 req/sec rate limit, JSON output.

Datasets / Data Dumps

Mindweave Synthetic Business Data - 42-table synthetic SME dataset with double-entry accounting, tax compliance (AU/US/UK), and temporal realism. CSV, SQL, Parquet, SQLite. Ideal for ETL pipeline testing.

Monitoring / Prometheus

Signals CLI - Intent signal monitoring CLI. Track LinkedIn engagers, keyword posters, job changers, funding events. JSON output for data pipelines.

Testing / Data Profiler

Scherlok - Zero-config data quality CLI. Profiles every table on first run, then auto-detects anomalies (volume drops, schema drift, freshness misses, distribution shifts) on subsequent runs. No YAML, no rules to write. Works with Postgres, BigQuery, Snowflake, and dbt.

Community / Forums

AI Dev Jobs - Job board focused on AI, ML, and data engineering roles with 7,400+ listings, salary data, and a free REST API.

Apr 06, 2026

Databases

Graph
- ArcadeDB - Open-source multi-model database with native graph, document, key-value, and vector support. SQL, Cypher, and Gremlin query languages. Apache 2.0 license.
- Neo4j - The world's leading graph database.
- Omnigraph (⭐730) - Typed graph database where agents branch and merge like Git. S3-native, Rust, traversal + vector + BM25 in one runtime.
- OrientDB - 2nd Generation Distributed Graph Database with the flexibility of Documents in one product with an Open Source commercial friendly license.
- ArangoDB - A distributed free and open-source database with a flexible data model for documents, graphs, and key-values.
- Titan - A scalable graph database optimized for storing and querying graphs containing hundreds of billions of vertices and edges distributed across a multi-machine cluster.
- FlockDB - A distributed, fault-tolerant graph database by Twitter. Deprecated.
- Actionbase - A database for user interactions (likes, views, follows) represented as graphs, with precomputed reads served in real-time.

Data Ingestion

DataRaven - Managed cloud object storage transfers for ingestion workflows.

Xquik - Real-time X (Twitter) data extraction platform with REST API (76 endpoints), 20 bulk extraction tools, account monitoring, HMAC-signed webhooks, and MCP server for AI agent integration.

Arpe.io - High-speed CLI tools for database export, import, replication and migration with parallel streaming to CSV, Parquet, JSON and cloud storage, supporting PostgreSQL, MySQL, Oracle, SQL Server and 80+ sources.

Crustdata - A real-time B2B data API for company and people intelligence, providing firmographics, headcount signals, job listings, web traffic, and funding events via REST API and webhooks for data enrichment pipelines.

crdt-merge (⭐3) - Conflict-free merge for DataFrames, JSON, ML models & distributed agents — powered by CRDTs.
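As a rough illustration of what "conflict-free" means here, the sketch below implements a last-writer-wins (LWW) map, one of the simplest CRDTs. It is a generic example and not crdt-merge's API: `lww_merge` and the `(timestamp, replica_id, value)` entry layout are hypothetical.

```python
from typing import Any, Dict, Tuple

# Each key maps to (timestamp, replica_id, value). The (timestamp, replica_id)
# pair gives a total order, so ties on timestamp are broken deterministically.
Entry = Tuple[int, str, Any]
LWWMap = Dict[str, Entry]


def lww_merge(a: LWWMap, b: LWWMap) -> LWWMap:
    """Merge two replicas; for each key keep the entry with the larger tag."""
    merged = dict(a)
    for key, entry in b.items():
        current = merged.get(key)
        if current is None or entry[:2] > current[:2]:
            merged[key] = entry
    return merged


if __name__ == "__main__":
    left = {"name": (1, "a", "Acme"), "city": (5, "a", "Lima")}
    right = {"name": (3, "b", "Acme Inc"), "tier": (2, "b", "gold")}
    # Merge order does not matter: both calls give the same result.
    print(lww_merge(left, right) == lww_merge(right, left))
```

Because the merge is commutative and idempotent, replicas can exchange state in any order and still converge.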

Batch Processing

dna-claude-analysis (⭐51) - Personal genome analysis toolkit with Python scripts analyzing raw DNA data across 17 categories (health risks, ancestry, pharmacogenomics, nutrition, psychology, etc.) and generating a terminal-style single-page HTML visualization.

Workflow

Bonnard - Governed, multi-tenant MCP access to your customers' data. Turn your warehouse, dbt, or semantic layer into a secure, per-customer MCP for AI agents.

Datasets / Realtime

DexPaprika - Free real-time DEX data via SSE streaming across 34 blockchains. 30M+ pools, 27M+ tokens, ~1 second price updates. No API key, no rate limits.

Datasets / Data Dumps

FirstData (⭐169) - The world's most comprehensive authoritative data source knowledge base. 160+ curated sources from governments, international organizations, and research institutions with MCP integration.

Testing / Data Profiler

Provero (⭐16) - A vendor-neutral, declarative data quality engine. Define checks in YAML, run anywhere. Includes 16 built-in check types, SQL batch optimizer, anomaly detection, and data contracts.

DataScreenIQ - Real-time data quality firewall for pipelines and APIs. Screens rows in milliseconds for schema drift, null spikes, type mismatches, and data anomalies with PASS / WARN / BLOCK decisions.

DataDriven - Interview practice with SQL query execution, Python, and data modeling exercises.

Community / Podcasts

Chain of Thought - Interviews with AI and data infrastructure leaders on building production systems.

Latent Space - Technical deep dives on AI engineering, from model training to deployment.

Practical AI - Making AI practical, productive, and accessible to everyone.

Software Engineering Daily - Daily interviews about technical software topics, including data infrastructure.

The Analytics Engineering Podcast - How analytics engineers build and maintain data pipelines at scale.

Feb 21, 2026

Data Comparison

dvt (⭐515) - Data Validation Tool compares data from source and target tables to ensure that they match. It provides column validation, row validation, schema validation, custom query validation, and ad hoc SQL exploration.

koala-diff (⭐7) - A high-performance Python library for comparing large datasets (CSV, Parquet) locally using Rust and Polars. It features zero-copy streaming to prevent OOM errors and generates interactive HTML data quality reports.

Feb 11, 2026

Data Ingestion

Kreuzberg - Polyglot document intelligence library with a Rust core and bindings for Python, TypeScript, Go, and more. Extracts text, tables, and metadata from 62+ document formats for data pipeline ingestion.

Jan 31, 2026

Data Ingestion

ingestr (⭐3.8k) - CLI tool to copy data between databases with a single command. Supports 50+ sources including PostgreSQL, MySQL, MongoDB, Salesforce, Shopify to any data warehouse.

Workflow

Bruin (⭐1.6k) - End-to-end data pipeline tool that combines ingestion, transformation (SQL + Python), and data quality in a single CLI. Connects to BigQuery, Snowflake, PostgreSQL, Redshift, and more. Includes VS Code extension with live previews.

Jan 06, 2026

Testing / Data Profiler

Snowflake Emulator (⭐43) - A Snowflake-compatible emulator for local development and testing.

Dec 30, 2025

Testing / Data Profiler

daffy (⭐58) - Decorator-first DataFrame contracts/validation (columns/dtypes/constraints) at function boundaries. Supports Pandas/Polars/PyArrow/Modin.
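The idea of a decorator-first contract can be sketched without any DataFrame library. The example below is not daffy's API: `requires_columns` is a hypothetical decorator, and a list of plain dictionaries stands in for a DataFrame.

```python
import functools
from typing import Callable, Dict, List

Row = Dict[str, float]


def requires_columns(*columns: str) -> Callable:
    """Reject calls whose first argument has rows missing any required column."""
    def decorator(func: Callable) -> Callable:
        @functools.wraps(func)
        def wrapper(rows: List[Row], *args, **kwargs):
            # Validate at the function boundary, before any business logic runs.
            for index, row in enumerate(rows):
                missing = [c for c in columns if c not in row]
                if missing:
                    raise ValueError(f"row {index} is missing columns: {missing}")
            return func(rows, *args, **kwargs)
        return wrapper
    return decorator


@requires_columns("price", "qty")
def total_revenue(rows: List[Row]) -> float:
    return sum(row["price"] * row["qty"] for row in rows)
```

Calling `total_revenue` with a row that lacks `qty` raises `ValueError` before the function body executes, which is the point of checking at the boundary.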

Dec 01, 2025

Data Lake Management

FlightPath Data - FlightPath is a gateway to a data lake's bronze layer, protecting it from invalid external data file feeds as a trusted publisher.

Profiling / Data Profiler

YData Profiling - A general-purpose open-source data profiler for high-level analysis of a dataset.

Desbordante (⭐490) - An open-source data profiler specifically focused on discovery and validation of complex patterns in data.

Nov 01, 2025

Stream Processing

Pathway (⭐63k) - Performant open-source Python ETL framework with Rust runtime, supporting 300+ data sources.

Testing / Data Profiler

Great Expectations - Open-source data validation framework for managing data quality. Users define and document “expectations”: rules about how data should look and behave.

Sep 23, 2025

Workflow

SQLMesh - An open-source data transformation framework for managing, testing, and deploying SQL and Python-based data pipelines with version control, environment isolation, and automatic dependency resolution.

Sep 15, 2025

Data Ingestion

db2lake (⭐2) - Lightweight Node.js ETL framework for databases → data lakes/warehouses.

Sep 12, 2025

Community / Books

Learn AI Data Engineering in a Month of Lunches - A fast, friendly guide to integrating large language models into your data workflows.

Aug 25, 2025

Charts and Dashboards

QueryGPT (⭐34) - Natural language database query interface with automatic chart generation, supporting Chinese and English queries.

Aug 15, 2025

Community / Books

Architecting an Apache Iceberg Lakehouse - A guide to designing an Apache Iceberg lakehouse from scratch.

Aug 01, 2025

Testing / Data Profiler

Spark Playground - Write, run, and test PySpark code on Spark Playground's online compiler. Access real-world sample datasets & solve interview questions to enhance your PySpark skills for data engineering roles.

Jul 08, 2025

Data Ingestion

Estuary Flow - No/low-code data pipeline platform that handles both batch and real-time data ingestion.

Jun 22, 2025

Data Lake Management

Gravitino (⭐3k) - An open-source, unified metadata management for data lakes, data warehouses, and external catalogs.

Apr 19, 2025

Charts and Dashboards

Seaborn - A Python visualization library based on matplotlib. It provides a high-level interface for drawing attractive statistical graphics.

Apr 08, 2025

Data Lake Management

Ilum - A modular Data Lakehouse platform that simplifies the management and monitoring of Apache Spark clusters across Kubernetes and Hadoop environments.

Community / Books

Best Data Science Books - This blog offers a curated list of top data science books, categorized by topics and learning stages, to aid readers in building foundational knowledge and staying updated with industry trends.

Mar 15, 2025

Batch Processing

Substation (⭐403) - A cloud native data pipeline and transformation toolkit written in Go.

Community / Books

Snowflake Data Engineering - A practical introduction to data engineering on the Snowflake cloud data platform.

Mar 14, 2025

Stream Processing

CocoIndex (⭐11k) - An open-source ETL framework for building fresh indexes for AI.

Testing / Data Profiler

RunSQL - Free online SQL playground for MySQL, PostgreSQL, and SQL Server. Create database structures, run queries, and share results instantly.

Feb 18, 2025

Data Ingestion

CsvPath Framework - A delimited data preboarding framework that fills the gap between MFT and the data lake.

Oct 25, 2024

Workflow

Hamilton (⭐2.5k) - A lightweight library to define data transformations as a directed-acyclic graph (DAG). If you like dbt for SQL transforms, you will like Hamilton for Python processing.
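To make the DAG idea concrete, here is a stdlib-only sketch in which each function is a node and its parameter names are its upstream dependencies. It illustrates the general style described above but is not Hamilton's API: `run_dag` and the example functions are hypothetical.

```python
import inspect
from graphlib import TopologicalSorter  # Python 3.9+
from typing import Any, Callable, Dict, List


def run_dag(functions: List[Callable], inputs: Dict[str, Any], target: str) -> Any:
    """Evaluate every node in dependency order and return the value of `target`."""
    nodes = {f.__name__: f for f in functions}
    # Map node name -> set of upstream names taken from the function signature.
    graph = {name: set(inspect.signature(f).parameters) for name, f in nodes.items()}
    results = dict(inputs)
    for name in TopologicalSorter(graph).static_order():
        if name in results:
            continue  # supplied directly as an input
        if name not in nodes:
            raise KeyError(f"no input or function provides {name!r}")
        args = {dep: results[dep] for dep in graph[name]}
        results[name] = nodes[name](**args)
    return results[target]


def spend_per_signup(spend: float, signups: int) -> float:
    return spend / signups


def spend_in_thousands(spend: float) -> float:
    return spend / 1000


def report(spend_per_signup: float, spend_in_thousands: float) -> str:
    return f"{spend_in_thousands:.1f}k total, {spend_per_signup:.2f} per signup"


if __name__ == "__main__":
    funcs = [spend_per_signup, spend_in_thousands, report]
    print(run_dag(funcs, {"spend": 2500.0, "signups": 100}, "report"))
```

The wiring lives entirely in the function signatures, so adding a transformation means adding a function rather than editing a separate graph definition.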

Sep 04, 2024

Data Ingestion

Artie - Real-time data ingestion tool leveraging change data capture.

Aug 01, 2024

Workflow

Mage - Open-source data pipeline tool for transforming and integrating data.

Jul 24, 2024

Stream Processing

SwimOS - A framework for building real-time streaming data processing applications that supports a wide range of ingestion sources.

Jul 16, 2024

Testing / Data Profiler

DataKitchen - Open-source data observability tooling covering end-to-end data journey observability, data profiling, anomaly detection, and auto-created data quality validation tests.

Jun 26, 2024

Data Ingestion

Google Sheets ETL (⭐22) - Live import all your Google Sheets to your data warehouse.

Workflow

Kestra (⭐27k) - Scalable, event-driven, language-agnostic orchestration and scheduling platform to manage millions of workflows declaratively in code.

Jun 01, 2024

Data Ingestion

AWS Data Wrangler (⭐4.1k) - Utility belt to handle data on AWS.

File System

JuiceFS (⭐14k) - A high-performance Cloud-Native file system driven by object storage for large-scale data storage.

May 30, 2024

Workflow

CronQ - A cron-like system for applications. Used with Luigi. Deprecated.

May 25, 2024

Data Ingestion

Meltano - CLI & code-first ELT.
- Singer SDK - The fastest way to build custom data extractors and loaders compliant with the Singer Spec.

Apr 09, 2024

Data Comparison

datacompy (⭐653) - A Python library that facilitates the comparison of two DataFrames in Pandas, Polars, Spark and more. The library goes beyond basic equality checks by providing detailed insights into discrepancies at both row and column levels.

Mar 27, 2024

Testing / Data Profiler

DQOps (⭐194) - An open-source data quality platform for the whole data platform lifecycle from profiling new data sources to applying full automation of data quality monitoring.

Mar 26, 2024

Data Ingestion

dlt - A fast and simple pipeline-building library for Python data developers. Runs in notebooks, cloud functions, Airflow, etc.

Mar 18, 2024

Data Lake Management

Project Nessie - A Transactional Catalog for Data Lakes with Git-like semantics. Works with Apache Iceberg tables.

Community / Podcasts

The Data Stack Show - Conversations with data engineers, analysts, and data scientists about building and maintaining data infrastructure, delivering data and data products, and driving better business outcomes with data.

Feb 29, 2024

Workflow

SuprSend - Create automated workflows and logic using APIs for your notification service. Add templates, batching, preferences, and an in-app inbox, with workflows that trigger notifications directly from your data warehouse.

Feb 21, 2024

Workflow

Multiwoven (⭐1.7k) - The open-source reverse ETL, data activation platform for modern data teams.

Feb 06, 2024

Profiling / Data Profiler

Data Profiler - The DataProfiler is a Python library designed to make data analysis, monitoring, and sensitive data detection easy.

Jan 29, 2024

Workflow

Prefect - An orchestration and observability platform. With it, developers can rapidly build and scale resilient code, and triage disruptions effortlessly.

Jan 13, 2024

Data Ingestion

Sling - CLI data integration tool specialized in moving data between databases, as well as storage systems.

Dec 08, 2023

Workflow

PACE (⭐39) - An open-source framework that lets you enforce agreements on how data should be accessed, used, and transformed, regardless of the data platform (Snowflake, BigQuery, Databricks, etc.).

Dec 01, 2023

Databases

Relational
- RQLite - Replicated SQLite using the Raft consensus protocol.
- MySQL - The world's most popular open source database.
  - TiDB (⭐40k) - A distributed NewSQL database compatible with MySQL protocol.
  - Percona XtraBackup - A free, open source, complete online backup solution for all versions of Percona Server, MySQL® and MariaDB®.
  - mysql_utils (⭐879) - Pinterest MySQL Management Tools.
- MariaDB - An enhanced, drop-in replacement for MySQL.
- PostgreSQL - The world's most advanced open source database.
- Rivestack - Managed PostgreSQL with pgvector for AI workloads. HNSW indexing, sub-4ms latency, and built-in SQL editor with automatic embedding generation.
- Amazon RDS - Makes it easy to set up, operate, and scale a relational database in the cloud.
- Crate.IO - Scalable SQL database with the NoSQL goodies.

Key-Value
- Redis - An open source, BSD licensed, advanced key-value cache and store.
- Riak - A distributed database designed to deliver maximum data availability by distributing data across multiple servers.
- AWS DynamoDB - A fast and flexible NoSQL database service for all applications that need consistent, single-digit millisecond latency at any scale.
- HyperDex (⭐1.4k) - A scalable, searchable key-value store. Deprecated.
- SSDB - A high performance NoSQL database supporting many data structures, an alternative to Redis.
- Kyoto Tycoon (⭐279) - A lightweight network server on top of the Kyoto Cabinet key-value database, built for high-performance and concurrency.
- IonDB (⭐594) - A key-value store for microcontroller and IoT applications.

Column
- Cassandra - The right choice when you need scalability and high availability without compromising performance.
  - Cassandra Calculator - This simple form allows you to try out different values for your Apache Cassandra cluster and see what the impact is for your application.
  - CCM (⭐1.2k) - A script to easily create and destroy an Apache Cassandra cluster on localhost.
  - ScyllaDB (⭐16k) - NoSQL data store using the seastar framework, compatible with Apache Cassandra.
- HBase - The Hadoop database, a distributed, scalable, big data store.
- AWS Redshift - A fast, fully managed, petabyte-scale data warehouse that makes it simple and cost-effective to analyze all your data using your existing business intelligence tools.
- FiloDB - Distributed. Columnar. Versioned. Streaming. SQL.
- Vertica - Distributed, MPP columnar database with extensive analytics SQL.
- ClickHouse - Distributed columnar DBMS for OLAP. SQL.

Document
- MongoDB - An open-source, document database designed for ease of development and scaling.
  - Percona Server for MongoDB - Percona Server for MongoDB® is a free, enhanced, fully compatible, open source, drop-in replacement for the MongoDB® Community Edition that includes enterprise-grade features and functionality.
  - MemDB (⭐593) - Distributed Transactional In-Memory Database (based on MongoDB).
- Elasticsearch - Search & Analyze Data in Real Time.
- Couchbase - The highest performing NoSQL distributed database.
- RethinkDB - The open-source database for the realtime web.
- RavenDB - Fully Transactional NoSQL Document Database.

Distributed
- Datomic - The fully transactional, cloud-ready, distributed database.
- Apache Geode - An open source, distributed, in-memory database for scale-out applications.
- Gaffer - A large-scale graph database.

Timeseries
- InfluxDB (⭐32k) - Scalable datastore for metrics, events, and real-time analytics.
- OpenTSDB - A scalable, distributed Time Series Database.
- QuestDB - A relational column-oriented database designed for real-time analytics on time series and event data.
- kairosdb (⭐1.8k) - Fast scalable time series database.
- Heroic (⭐846) - A scalable time series database based on Cassandra and Elasticsearch, by Spotify.
- Druid (⭐14k) - Column oriented distributed data store ideal for powering interactive applications.
- Riak-TS - Riak TS is the only enterprise-grade NoSQL time series database optimized specifically for IoT and Time Series data.
- Akumuli (⭐840) - A numeric time-series database. It can be used to capture, store, and process time-series data in real time. The word "akumuli" can be translated from Esperanto as "accumulate".
- Rhombus - A time-series object store for Cassandra that handles all the complexity of building wide row indexes.
- Dalmatiner DB - Fast distributed metrics database.
- Blueflood (⭐598) - A distributed system designed to ingest and process time series data.
- Timely (⭐393) - A time series database application that provides secure access to time series data based on Accumulo and Grafana.

Other
- Tarantool - An in-memory database and application server.
- Greenplum - The Greenplum Database (GPDB) is an advanced, fully featured, open source data warehouse. It provides powerful and rapid analytics on petabyte-scale data volumes.
- cayley (⭐15k) - An open-source graph database. Google.
- Snappydata - OLTP + OLAP Database built on Apache Spark.
- TimescaleDB - Built as an extension on top of PostgreSQL, TimescaleDB is a time-series SQL database providing fast analytics, scalability, with automated data management on a proven storage engine.
- DuckDB - A fast in-process analytical database that has zero external dependencies, runs on Linux/macOS/Windows, offers a rich SQL dialect, and is free and extensible.
- SlothDB (⭐439) - In-process analytical SQL database written in C++20. Reads Parquet, CSV, JSON, Avro, Arrow, SQLite, and Excel directly. Single binary, Python package, and 1.3 MB WASM build for the browser.
- chDB - Embedded ClickHouse — full ClickHouse SQL dialect, ~80 data formats, and 12+ source connectors (S3, Postgres, MongoDB, Kafka, Iceberg) in core. Python, Go, Rust, Node, Bun, Zig, and Ruby bindings.
- zvec (⭐15k) - An embedded vector database for on-device RAG and edge AI, the SQLite of vector databases.

Data Ingestion

Kafka - Publish-subscribe messaging rethought as a distributed commit log.
- BottledWater - Change data capture from PostgreSQL into Kafka. Deprecated.
- kafkat (⭐502) - Simplified command-line administration for Kafka brokers.
- kafkacat (⭐5.8k) - Generic command line non-JVM Apache Kafka producer and consumer.
- pg-kafka - A PostgreSQL extension to produce messages to Apache Kafka.
- librdkafka (⭐998) - The Apache Kafka C/C++ library.
- kafka-docker - Kafka in Docker.
- kafka-manager (⭐12k) - A tool for managing Apache Kafka.
- kafka-node (⭐2.7k) - Node.js client for Apache Kafka 0.8.
- Secor (⭐1.9k) - Pinterest's Kafka to S3 distributed consumer.
- Kafka-logger (⭐45) - Kafka-winston logger for Node.js from Uber.
- Kroxylicious (⭐290) - A Kafka Proxy, solving problems like encrypting your Kafka data at rest.

AWS Kinesis - A fully managed, cloud-based service for real-time data processing over large, distributed data streams.

RabbitMQ - Robust messaging for applications.

FluentD - An open source data collector for unified logging layer.

Embulk - An open source bulk data loader that helps data transfer between various databases, storages, file formats, and cloud services.

Apache Sqoop - A tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases.

Heka (⭐3.4k) - Data Acquisition and Processing Made Easy. Deprecated.

Gobblin (⭐2.3k) - Universal data ingestion framework for Hadoop from LinkedIn.

Nakadi - An open source event messaging platform that provides a REST API on top of Kafka-like queues.

Pravega - Provides a new storage abstraction - a stream - for continuous and unbounded data.

Apache Pulsar - An open-source distributed pub-sub messaging system.

Airbyte - Open-source data integration for modern data teams.

File System

HDFS - A distributed file system designed to run on commodity hardware.
- Snakebite (⭐857) - A pure Python HDFS client.

AWS S3 - Object storage built to retrieve any amount of data from anywhere.
- smart_open (⭐3.5k) - Utils for streaming large files (S3, HDFS, gzip, bz2).

Alluxio - A memory-centric distributed storage system enabling reliable data sharing at memory-speed across cluster frameworks, such as Spark and MapReduce.

CEPH - A unified, distributed storage system designed for excellent performance, reliability, and scalability.

OrangeFS - Orange File System is a branch of the Parallel Virtual File System.

SnackFS (⭐13) - A bite-sized, lightweight HDFS compatible file system built over Cassandra.

GlusterFS - Gluster Filesystem.

XtreemFS - Fault-tolerant distributed file system for all storage needs.

SeaweedFS (⭐30) - A simple and highly scalable distributed file system with two objectives: to store billions of files and to serve them fast. Instead of supporting full POSIX file system semantics, SeaweedFS chooses to implement only a key-to-file mapping. By analogy with "NoSQL", you can call it "NoFS".

S3QL (⭐1.3k) - A file system that stores all its data online using storage services like Google Storage, Amazon S3, or OpenStack.

LizardFS - Software-defined storage: a distributed, parallel, scalable, fault-tolerant, geo-redundant, and highly available file system.

Serialization format

Apache Avro - Apache Avro™ is a data serialization system.

Apache Parquet - A columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language.
- Snappy (⭐6.6k) - A fast compressor/decompressor. Used with Parquet.
- PigZ - A parallel implementation of gzip for modern multi-processor, multi-core machines.

Apache ORC - The smallest, fastest columnar storage for Hadoop workloads.

Apache Thrift - The Apache Thrift software framework, for scalable cross-language services development.

ProtoBuf (⭐71k) - Protocol Buffers - Google's data interchange format.

SequenceFile - A flat file consisting of binary key/value pairs. It is extensively used in MapReduce as input/output formats.

Kryo (⭐6.5k) - A fast and efficient object graph serialization framework for Java.

Stream Processing

Apache Beam - A unified programming model that implements both batch and streaming data processing jobs that run on many execution engines.

Spark Streaming - Makes it easy to build scalable fault-tolerant streaming applications.

Apache Flink - A streaming dataflow engine that provides data distribution, communication, and fault tolerance for distributed computations over data streams.

Apache Storm - A free and open source distributed realtime computation system.

Apache Samza - A distributed stream processing framework.

Apache NiFi - An easy to use, powerful, and reliable system to process and distribute data.

Apache Hudi - An open source framework for managing storage for real-time processing. One of its most interesting features is upsert support.

VoltDB - An ACID-compliant RDBMS which uses a shared nothing architecture.

PipelineDB (⭐2.7k) - The Streaming SQL Database.

Spring Cloud Dataflow - Streaming and tasks execution between Spring Boot apps.

Bonobo - A data-processing toolkit for Python 3.5+.

Robinhood's Faust (⭐1.9k) - Forever scalable event processing & in-memory durable K/V store as a library with asyncio & static typing.

HStreamDB (⭐721) - The streaming database built for IoT data storage and real-time processing.

Kuiper (⭐1.7k) - Lightweight IoT data analytics and streaming software for the edge, implemented in Go. It can run on all kinds of resource-constrained edge devices.

Zilla (⭐691) - An API gateway built for event-driven architectures and streaming that supports standard protocols such as HTTP, SSE, gRPC, MQTT, and the native Kafka protocol.

Batch Processing

Hadoop MapReduce - A software framework for easily writing applications which process vast amounts of data (multi-terabyte data sets) in parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.

Spark - A multi-language engine for executing data engineering, data science, and machine learning on single-node machines or clusters.
- Spark Packages - A community index of packages for Apache Spark.
- Deep Spark (⭐197) - Connecting Apache Spark with different data stores. Deprecated.
- Spark RDD API Examples - Examples by Zhen He.
- Livy - The REST Spark Server.
- Delight (⭐345) - A free, cross-platform monitoring tool (Spark UI / Spark History Server alternative).

AWS EMR - A web service that makes it easy to quickly and cost-effectively process vast amounts of data.

Data Mechanics - A cloud-based platform deployed on Kubernetes making Apache Spark more developer-friendly and cost-effective.

Tez - An application framework which allows for a complex directed-acyclic-graph of tasks for processing data.

Bistro - A lightweight engine for general-purpose data processing, including both batch and stream analytics. It is based on a novel data model that represents data via functions and processes data via column operations, as opposed to having only set operations as in conventional approaches like MapReduce or SQL.

Batch ML
- H2O - Fast scalable machine learning API for smarter applications.
- Mahout - An environment for quickly creating scalable performant machine learning applications.
- Spark MLlib - Spark's scalable machine learning library consisting of common learning algorithms and utilities, including classification, regression, clustering, collaborative filtering, dimensionality reduction, as well as underlying optimization primitives.
- Datatrax (⭐10) - Pure-Go classic machine learning toolkit and data engineering utilities. Eight algorithms with zero external dependencies.
- Zingg - Open source Master Data Management platform using machine learning for entity resolution at scale. Native to Databricks, Microsoft Fabric, Snowflake, AWS, and GCP. Golden records are maintained through a persistent Zingg ID across all systems and sources.

Batch Graph
- GraphLab Create - A machine learning platform that enables data scientists and app developers to easily create intelligent apps at scale.
- Giraph - An iterative graph processing system built for high scalability.
- Spark GraphX - Apache Spark's API for graphs and graph-parallel computation.

Batch SQL
- Presto - A distributed SQL query engine designed to query large data sets distributed over one or more heterogeneous data sources.
- Hive - Data warehouse software that facilitates querying and managing large datasets residing in distributed storage.
  - Hivemall (⭐313) - Scalable machine learning library for Hive/Hadoop.
  - PyHive (⭐1.7k) - Python interface to Hive and Presto.
- Drill - Schema-free SQL Query Engine for Hadoop, NoSQL and Cloud Storage.

Charts and Dashboards

Highcharts - A charting library written in pure JavaScript, offering an easy way of adding interactive charts to your web site or web application.

ZingChart - Fast JavaScript charts for any data set.

C3.js - D3-based reusable chart library.

D3.js - A JavaScript library for manipulating documents based on data.
- D3Plus - D3's simpler, easier to use cousin. Mostly predefined templates that you can just plug data into.

SmoothieCharts - A JavaScript Charting Library for Streaming Data.

PyXley (⭐2.3k) - Python helpers for building dashboards using Flask and React.

Plotly (⭐24k) - Flask, JS, and CSS boilerplate for interactive, web-based visualization apps in Python.

Apache Superset (⭐74k) - A modern, enterprise-ready business intelligence web application.

Redash - Make Your Company Data Driven. Connect to any data source, easily visualize and share your data.

Metabase (⭐48k) - The easy, open source way for everyone in your company to ask questions and learn from data.

PyQtGraph - A pure-Python graphics and GUI library built on PyQt4 / PySide and NumPy. It is intended for use in mathematics / scientific / engineering applications.

Workflow

Luigi (⭐19k) - A Python module that helps you build complex pipelines of batch jobs.

Cascading - Java based application development platform.

Airflow (⭐46k) - A system to programmatically author, schedule, and monitor data pipelines.

Azkaban - A batch workflow job scheduler created at LinkedIn to run Hadoop jobs. Azkaban resolves the ordering through job dependencies and provides an easy-to-use web user interface to maintain and track your workflows.

Oozie - A workflow scheduler system to manage Apache Hadoop jobs.

Pinball (⭐1k) - DAG-based workflow manager. Job flows are defined programmatically in Python. Supports output passing between jobs.

Dagster (⭐16k) - An open-source Python library for building data applications.

Kedro - A framework that makes it easy to build robust and scalable data pipelines by providing uniform project templates, data abstraction, configuration and pipeline assembly.

Dataform - An open-source framework and web-based IDE to manage datasets and their dependencies. SQLX extends your existing SQL warehouse dialect to add features that support dependency management, testing, documentation and more.

Census - A reverse-ETL tool that lets you sync data from your cloud data warehouse to SaaS applications like Salesforce, Marketo, HubSpot, Zendesk, etc. No engineering favors required, just SQL.

dbt - A command line tool that enables data analysts and engineers to transform data in their warehouses more effectively.

RudderStack (⭐4.5k) - A warehouse-first Customer Data Platform that enables you to collect data from every application, website and SaaS platform, and then activate it in your warehouse and business tools.

Data Lake Management

lakeFS - An open source platform that delivers resilience and manageability to object-storage based data lakes.

ELK Elastic Logstash Kibana

docker-logstash (⭐237) - A highly configurable Logstash (1.4.4) Docker image running Elasticsearch (1.7.0) and Kibana (3.1.2).

elasticsearch-jdbc - JDBC importer for Elasticsearch.

ZomboDB (⭐4.7k) - PostgreSQL Extension that allows creating an index backed by Elasticsearch.

Docker

Gockerize - Packages Go services into minimal Docker containers.

Flocker (⭐3.4k) - Easily manage Docker containers & their data.

Rancher - RancherOS is a 20 MB Linux distro that runs the entire OS as Docker containers.

Kontena - Application Containers for Masses.

Weave (⭐6.6k) - Weaving Docker containers into applications.

Zodiac - A lightweight tool for easy deployment and rollback of dockerized applications.

cAdvisor (⭐19k) - Analyzes resource usage and performance characteristics of running containers.

Micro S3 persistence - Docker microservice for saving/restoring volume data to S3.

Rocker-compose (⭐408) - Docker composition tool with idempotency features for deploying apps composed of multiple containers. Deprecated.

Nomad (⭐17k) - A cluster manager, designed for both long-lived services and short-lived batch processing workloads.

ImageLayers - Visualize Docker images and the layers that compose them.

Nov 22, 2023

Testing / Data Profiler

Grai (⭐315) - A data catalog tool that integrates into your CI system exposing downstream impact testing of data changes. These tests prevent data changes which might break data pipelines or BI dashboards from making it to production.

Feb 13, 2021

Community / Conferences

Data Council - The first technical conference that bridges the gap between data scientists, data engineers and data analysts.

Jan 28, 2019

Datasets / Realtime

Twitter Realtime - The Streaming APIs give developers low latency access to Twitter's global stream of Tweet data.

Datasets / Data Dumps

GitHub Archive - GitHub's public timeline since 2011, updated every hour.

Aug 23, 2017

Community / Podcasts

Data Engineering Podcast - The show about modern data infrastructure.

Apr 12, 2017

Community / Forums

/r/dataengineering - News, tips, and background on Data Engineering.

/r/etl - Subreddit focused on ETL.

Mar 24, 2017

Datasets / Realtime

Reddit - Real-time data is available, including comments, submissions, and links posted to Reddit.

Datasets / Data Dumps

Common Crawl - Open source repository of web crawl data.

Wikipedia - Wikipedia's complete copy of all wikis, in the form of Wikitext source and metadata embedded in XML. A number of raw database tables in SQL form are also available.

Sep 03, 2015

Datasets / Realtime

Eventsim - Event data simulator. Generates a stream of pseudo-random events from a set of users, designed to simulate web traffic.

Jul 17, 2015

Monitoring / Prometheus

HAProxy Exporter (⭐627) - Simple server that scrapes HAProxy stats and exports them via HTTP for Prometheus consumption.

Jul 16, 2015

Monitoring / Prometheus

Prometheus.io - An open-source service monitoring system and time series database.