igorbarinov/awesome-data-engineering

Big Data  4 months ago  4.4k
A curated list of data engineering tools for software developers

Apr 11th - Apr 17th, 2022

Stream Processing

  • Robinhood's Faust (623 stars) Forever scalable event processing & in-memory durable K/V store as a library with asyncio & static typing.

May 17th - May 23rd, 2021

Workflow

  • Dataform is an open-source framework and web based IDE to manage datasets and their dependencies. SQLX extends your existing SQL warehouse dialect to add features that support dependency management, testing, documentation and more.

Mar 22nd - Mar 28th, 2021

Stream Processing

  • HStreamDB (522 stars) The streaming database built for IoT data storage and real-time processing.
  • Kuiper (747 stars) Lightweight edge IoT data analytics/streaming software implemented in Golang that can run on all kinds of resource-constrained edge devices.

Mar 8th - Mar 14th, 2021

Workflow

  • Census is a reverse-ETL tool that lets you sync data from your cloud data warehouse to SaaS applications like Salesforce, Marketo, HubSpot, Zendesk, etc. No engineering favors required—just SQL.
  • dbt is a command line tool that enables data analysts and engineers to transform data in their warehouses more effectively.

Feb 8th - Feb 14th, 2021

Conferences

  • Data Council Data Council is the first technical conference that bridges the gap between data scientists, data engineers and data analysts.

Aug 3rd - Aug 9th, 2020

Data Lake Management

  • lakeFS (2.8k stars) lakeFS is an open source platform that delivers resilience and manageability to object-storage based data lakes.

May 18th - May 24th, 2020

Stream Processing

  • Apache Hudi Apache Hudi is an open source framework for managing storage for real-time processing; one of its most interesting features is upsert.

Oct 7th - Oct 13th, 2019

Data Ingestion

  • AWS Data Wrangler (3k stars) Utility belt to handle data on AWS.

Sep 30th - Oct 6th, 2019

Workflow

  • Dagster (5.2k stars) Dagster is an open-source Python library for building data applications.

Jan 28th - Feb 3rd, 2019

Data Ingestion

  • Kafka Publish-subscribe messaging rethought as a distributed commit log.
  • AWS Kinesis A fully managed, cloud-based service for real-time data processing over large, distributed data streams.
  • RabbitMQ Robust messaging for applications.
  • FluentD An open source data collector for unified logging layer.
  • Embulk An open source bulk data loader that helps data transfer between various databases, storages, file formats, and cloud services.
  • Heka (3.4k stars) Data Acquisition and Processing Made Easy. Deprecated.
  • Gobblin (2.1k stars) Universal data ingestion framework for Hadoop from LinkedIn
  • Nakadi Nakadi is an open source event messaging platform that provides a REST API on top of Kafka-like queues.
  • Pravega Pravega provides a new storage abstraction - a stream - for continuous and unbounded data.

File System

  • HDFS
  • AWS S3
  • Alluxio Alluxio is a memory-centric distributed storage system enabling reliable data sharing at memory-speed across cluster frameworks, such as Spark and MapReduce
  • CEPH Ceph is a unified, distributed storage system designed for excellent performance, reliability and scalability
  • OrangeFS Orange File System is a branch of the Parallel Virtual File System
  • GlusterFS Gluster Filesystem
  • S3QL (912 stars) S3QL is a file system that stores all its data online using storage services like Google Storage, Amazon S3, or OpenStack.

Stream Processing

  • Apache NiFi is an easy to use, powerful, and reliable system to process and distribute data
  • PipelineDB (2.5k stars) The Streaming SQL Database
  • Spring Cloud Dataflow Streaming and task execution between Spring Boot apps

Batch Processing

  • Hadoop MapReduce Hadoop MapReduce is a software framework for easily writing applications which process vast amounts of data (multi-terabyte data-sets) in-parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner
  • AWS EMR A web service that makes it easy to quickly and cost-effectively process vast amounts of data.

Charts and Dashboards

  • Highcharts A charting library written in pure JavaScript, offering an easy way of adding interactive charts to your web site or web application.
  • ZingChart Fast JavaScript charts for any data set.
  • C3.js D3-based reusable chart library.
  • D3.js A JavaScript library for manipulating documents based on data.
    • D3Plus D3's simpler, easier-to-use cousin. Mostly predefined templates that you can just plug data into.
  • Apache Superset (47.6k stars) Apache Superset (incubating) is a modern, enterprise-ready business intelligence web application

Workflow

  • Cascading Java based application development platform.
  • Airflow (27k stars) Airflow is a system to programmatically author, schedule and monitor data pipelines.
  • Oozie Oozie is a workflow scheduler system to manage Apache Hadoop jobs

Docker

  • Gockerize (663 stars) Package golang service into minimal docker containers
  • Rancher RancherOS is a 20 MB Linux distro that runs the entire OS as Docker containers
  • Kontena Application Containers for Masses
  • Weave (6.3k stars) Weaving Docker containers into applications
  • Micro S3 persistence (11 stars) Docker microservice for saving/restoring volume data to S3
  • Rocker-compose (409 stars) Docker composition tool with idempotency features for deploying apps composed of multiple containers. Deprecated.

Realtime

  • Twitter Realtime The Streaming APIs give developers low latency access to Twitter’s global stream of Tweet data.

Data Dumps

  • GitHub Archive GitHub's public timeline since 2011, updated every hour

Sep 3rd - Sep 9th, 2018

Stream Processing

  • VoltDB VoltDB is an ACID-compliant RDBMS which uses a shared-nothing architecture.

Apr 16th - Apr 22nd, 2018

Data Ingestion

  • Apache Pulsar Apache Pulsar is an open-source distributed pub-sub messaging system.

Batch Processing

  • Bistro (3 stars) is a lightweight engine for general-purpose data processing including both batch and stream analytics. It is based on a novel data model, which represents data via functions and processes data via column operations, as opposed to having only set operations as in conventional approaches like MapReduce or SQL.

Apr 9th - Apr 15th, 2018

Charts and Dashboards

  • PyQtGraph PyQtGraph is a pure-python graphics and GUI library built on PyQt4 / PySide and numpy. It is intended for use in mathematics / scientific / engineering applications.

Feb 26th - Mar 4th, 2018

Stream Processing

  • Apache Beam Apache Beam is a unified programming model that implements both batch and streaming data processing jobs that run on many execution engines.

Jan 8th - Jan 14th, 2018

Stream Processing

  • Bonobo Bonobo is a data-processing toolkit for Python 3.5+

Oct 23rd - Oct 29th, 2017

Charts and Dashboards

  • Redash Make Your Company Data Driven. Connect to any data source, easily visualize and share your data.

Sep 25th - Oct 1st, 2017

Charts and Dashboards

  • Metabase (29.4k stars) Metabase is the easy, open source way for everyone in your company to ask questions and learn from data.

Aug 21st - Aug 27th, 2017

Podcasts

  • Data Engineering Podcast The show about modern data infrastructure.

Apr 10th - Apr 16th, 2017

Forums

  • /r/dataengineering News, tips and background on Data Engineering
  • /r/etl Subreddit focused on ETL

Mar 20th - Mar 26th, 2017

Batch Processing

  • Spark
  • Tez An application framework which allows for a complex directed-acyclic-graph of tasks for processing data.

Workflow

  • Luigi (15.9k stars) Luigi is a Python module that helps you build complex pipelines of batch jobs.
    • CronQ An application cron-like system. Used with Luigi. Deprecated.
  • Azkaban Azkaban is a batch workflow job scheduler created at LinkedIn to run Hadoop jobs. Azkaban resolves the ordering through job dependencies and provides an easy to use web user interface to maintain and track your workflows.

Realtime

  • Reddit Real-time data is available including comments, submissions and links posted to reddit

Data Dumps

  • Common Crawl Open source repository of web crawl data
  • Wikipedia Wikipedia's complete copy of all wikis, in the form of wikitext source and metadata embedded in XML. A number of raw database tables in SQL form are also available.

Feb 27th - Mar 5th, 2017

File System

  • LizardFS LizardFS Software Defined Storage is a distributed, parallel, scalable, fault-tolerant, Geo-Redundant and highly available file system.

Sep 12th - Sep 18th, 2016

Workflow

  • Pinball (1k stars) DAG based workflow manager. Job flows are defined programmatically in Python. Supports output passing between jobs.

Feb 1st - Feb 7th, 2016

Charts and Dashboards

  • Plotly (17.1k stars) Flask, JS, and CSS boilerplate for interactive, web-based visualization apps in Python

Jan 11th - Jan 17th, 2016

Data Ingestion

  • Apache Sqoop A tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases.

Dec 21st - Dec 27th, 2015

Docker

  • ImageLayers Visualize Docker images and the layers that compose them

Oct 19th - Oct 25th, 2015

Databases

  • Other
    • Tarantool (3k stars) Tarantool is an in-memory database and application server.
    • GreenPlum (5.3k stars) The Greenplum Database (GPDB) is an advanced, fully featured, open source data warehouse. It provides powerful and rapid analytics on petabyte scale data volumes.
    • cayley (14.3k stars) An open-source graph database. Google.
    • Snappydata (1k stars) SnappyData: OLTP + OLAP Database built on Apache Spark
    • TimescaleDB: Built as an extension on top of PostgreSQL, TimescaleDB is a time-series SQL database providing fast analytics, scalability, with automated data management on a proven storage engine.

Sep 28th - Oct 4th, 2015

Docker

  • Nomad (12.1k stars) Nomad is a cluster manager, designed for both long lived services and short lived batch processing workloads

Aug 31st - Sep 6th, 2015

Realtime

  • Eventsim (411 stars) Event data simulator. Generates a stream of pseudo-random events from a set of users, designed to simulate web traffic.

Aug 24th - Aug 30th, 2015

Charts and Dashboards

  • SmoothieCharts A JavaScript Charting Library for Streaming Data.
  • PyXley (2.3k stars) Python helpers for building dashboards using Flask and React

ELK Elastic Logstash Kibana

  • docker-logstash (239 stars) A highly configurable logstash (1.4.4) docker image running Elasticsearch (1.7.0) and Kibana (3.1.2).

Aug 10th - Aug 16th, 2015

Docker

  • cAdvisor (13.7k stars) Analyzes resource usage and performance characteristics of running containers
  • Zodiac (195 stars) A lightweight tool for easy deployment and rollback of dockerized applications

File System

  • SeaweedFS (15k stars) Seaweed-FS is a simple and highly scalable distributed file system. There are two objectives: to store billions of files! to serve the files fast! Instead of supporting full POSIX file system semantics, Seaweed-FS chooses to implement only a key~file mapping. Similar to the word "NoSQL", you can call it "NoFS".

Aug 3rd - Aug 9th, 2015

Databases

  • Timeseries
    • InfluxDB (24k stars) Scalable datastore for metrics, events, and real-time analytics.
    • OpenTSDB (4.7k stars) A scalable, distributed Time Series Database.
    • QuestDB A relational column-oriented database designed for real-time analytics on time series and event data.
    • kairosdb (1.7k stars) Fast scalable time series database.
    • Heroic (843 stars) A scalable time series database based on Cassandra and Elasticsearch, by Spotify
    • Druid (12k stars) Column oriented distributed data store ideal for powering interactive applications
    • Riak-TS Riak TS is the only enterprise-grade NoSQL time series database optimized specifically for IoT and Time Series data
    • Akumuli (796 stars) Akumuli is a numeric time-series database. It can be used to capture, store and process time-series data in real-time. The word "akumuli" can be translated from Esperanto as "accumulate".
    • Rhombus A time-series object store for Cassandra that handles all the complexity of building wide row indexes.
    • Dalmatiner DB (701 stars) Fast distributed metrics database
    • Blueflood (593 stars) A distributed system designed to ingest and process time series data
    • Timely (364 stars) Timely is a time series database application that provides secure access to time series data based on Accumulo and Grafana.

Jul 20th - Jul 26th, 2015

ELK Elastic Logstash Kibana

  • ZomboDB (4.1k stars) Postgres Extension that allows creating an index backed by Elasticsearch

Jul 13th - Jul 19th, 2015

Prometheus

  • HAProxy Exporter (566 stars) Simple server that scrapes HAProxy stats and exports them via HTTP for Prometheus consumption
  • Prometheus.io (43.8k stars) An open-source service monitoring system and time series database

File System

  • XtreemFS fault-tolerant distributed file system for all storage needs

Jul 6th - Jul 12th, 2015

File System

  • SnackFS (13 stars) SnackFS is our bite-sized, lightweight HDFS compatible FileSystem built over Cassandra

Jun 29th - Jul 5th, 2015

Databases

  • Distributed
    • Datomic The fully transactional, cloud-ready, distributed database.
    • Apache Geode An open source, distributed, in-memory database for scale-out applications.
    • Gaffer (1.7k stars) A large-scale graph database

ELK Elastic Logstash Kibana

  • elasticsearch-jdbc (2.8k stars) JDBC importer for Elasticsearch

Jun 22nd - Jun 28th, 2015

Stream Processing

  • Apache Flink Apache Flink is a streaming dataflow engine that provides data distribution, communication, and fault tolerance for distributed computations over data streams.
  • Spark Streaming Spark Streaming makes it easy to build scalable fault-tolerant streaming applications.
  • Apache Storm Apache Storm is a free and open source distributed realtime computation system
  • Apache Samza Apache Samza is a distributed stream processing framework

Databases

  • Relational
    • RQLite (10.9k stars) Replicated SQLite using the Raft consensus protocol
    • MySQL The world's most popular open source database.
      • TiDB (32.1k stars) TiDB is a distributed NewSQL database compatible with MySQL protocol
      • Percona XtraBackup Percona XtraBackup is a free, open source, complete online backup solution for all versions of Percona Server, MySQL® and MariaDB®
      • mysql_utils (875 stars) Pinterest MySQL Management Tools
    • MariaDB An enhanced, drop-in replacement for MySQL.
    • PostgreSQL The world's most advanced open source database.
    • Amazon RDS Amazon RDS makes it easy to set up, operate, and scale a relational database in the cloud.
    • Crate.IO Scalable SQL database with the NoSQL goodies.
  • Key-Value
    • Redis An open source, BSD licensed, advanced key-value cache and store.
    • Riak A distributed database designed to deliver maximum data availability by distributing data across multiple servers.
    • AWS DynamoDB A fast and flexible NoSQL database service for all applications that need consistent, single-digit millisecond latency at any scale.
    • HyperDex (1.4k stars) HyperDex is a scalable, searchable key-value store. Deprecated.
    • SSDB A high performance NoSQL database supporting many data structures, an alternative to Redis
    • Kyoto Tycoon (254 stars) Kyoto Tycoon is a lightweight network server on top of the Kyoto Cabinet key-value database, built for high-performance and concurrency
    • IonDB (570 stars) A key-value store for microcontroller and IoT applications
  • Column
    • Cassandra The right choice when you need scalability and high availability without compromising performance.
      • Cassandra Calculator This simple form allows you to try out different values for your Apache Cassandra cluster and see what the impact is for your application.
      • CCM (1.2k stars) A script to easily create and destroy an Apache Cassandra cluster on localhost
      • ScyllaDB (8.2k stars) NoSQL data store using the seastar framework, compatible with Apache Cassandra https://www.scylladb.com/
    • HBase The Hadoop database, a distributed, scalable, big data store.
    • AWS Redshift A fast, fully managed, petabyte-scale data warehouse that makes it simple and cost-effective to analyze all your data using your existing business intelligence tools.
    • FiloDB (1.4k stars) Distributed. Columnar. Versioned. Streaming. SQL.
    • Vertica Distributed, MPP columnar database with extensive analytics SQL.
    • ClickHouse Distributed columnar DBMS for OLAP. SQL.
  • Document
    • MongoDB An open-source, document database designed for ease of development and scaling.
      • Percona Server for MongoDB Percona Server for MongoDB® is a free, enhanced, fully compatible, open source, drop-in replacement for the MongoDB® Community Edition that includes enterprise-grade features and functionality.
      • MemDB (596 stars) Distributed Transactional In-Memory Database (based on MongoDB)
    • Elasticsearch Search & Analyze Data in Real Time.
    • Couchbase The highest performing NoSQL distributed database.
    • RethinkDB The open-source database for the realtime web.
    • RavenDB Fully Transactional NoSQL Document Database.
  • Graph
    • Neo4j The world’s leading graph database.
    • OrientDB 2nd Generation Distributed Graph Database with the flexibility of Documents in one product with an Open Source commercial friendly license.
    • ArangoDB A distributed free and open-source database with a flexible data model for documents, graphs, and key-values.
    • Titan A scalable graph database optimized for storing and querying graphs containing hundreds of billions of vertices and edges distributed across a multi-machine cluster.
    • FlockDB (3.3k stars) A distributed, fault-tolerant graph database by Twitter. Deprecated.

Jun 15th - Jun 21st, 2015

Docker

  • Flocker (3.3k stars) Easily manage Docker containers & their data

Serialization format

  • Apache Avro Apache Avro™ is a data serialization system
  • Apache Parquet Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language.
    • Snappy (5.3k stars) A fast compressor/decompressor. Used with Parquet
    • PigZ A parallel implementation of gzip for modern

Last Checked At: 2022-08-15T14:51:40.631Z