0xnr/awesome-bigdata

Big Data  24 days ago  10.3k
A curated list of awesome big data frameworks, resources and other awesomeness.

Oct 1st

Time-Series Databases

  • InfluxDB - a time series database with optimised IO and queries, supports pgsql and influx wire protocols.
  • QuestDB - high-performance, open-source SQL database for applications in financial services, IoT, machine learning, DevOps and observability.
  • Mar 24th

    Internet of things and sensor data

  • Ably - Pub/sub messaging platform for IoT
  • Mar 11th

    Data Ingestion

  • Census - A reverse ETL product that lets you sync data from your data warehouse to SaaS applications. No engineering favors required, just SQL.
  • Business Intelligence

  • Dekart - Large scale geospatial analytics for Google BigQuery based on Kepler.gl.
  • Data Visualization

  • Dekart - Large scale geospatial analytics for Google BigQuery based on Kepler.gl.
  • Mar 1st

    Frameworks

  • Smooks (318 stars) - An extensible Java framework for building XML and non-XML (CSV, EDI, Java, etc.) streaming applications.
  • Feb 11th

    Scheduling

  • Cronicle (1.2k stars) - Distributed, easy to install, NodeJS based, task scheduler
  • Data Visualization

  • Dash (15.3k stars) - Analytical Web Apps for Python, R, Julia, and Jupyter. Built on top of plotly, no JS required
  • Feb 6th

    Other Awesome Lists

  • Google Bigtable (23 stars).
  • Feb 2nd

    Business Intelligence

  • Count - notebook-based analytics and visualisation platform using SQL or drag-and-drop.
  • Jan 1st

    Machine Learning

  • Shapley (122 stars) - A data-driven framework to quantify the value of classifiers in a machine learning ensemble.
  • Dec 17th, 2020

    Applications

  • HASH - open source simulation and visualization platform.
  • Nov 17th, 2020

    Machine Learning

  • PyTorch Geometric Temporal (1.1k stars) - a temporal extension library for PyTorch Geometric.
  • Nov 5th, 2020

    Books

    Streaming

  • Azure Data Engineering - A book about data engineering in general and the Azure platform specifically
  • Oct 2nd, 2020

    Key-value Data Model

  • Graviton (402 stars) - a simple, fast, versioned, authenticated, embeddable key-value store database in pure Go(lang).
  • Sep 17th, 2020

    Videos

  • Elasticsearch 7 and Elastic Stack - LiveVideo tutorial that covers searching, analyzing, and visualizing big data on a cluster with Elasticsearch, Logstash, Beats, Kibana, and more.
  • Sep 16th, 2020

    Scheduling

  • Dagster (3.9k stars) - a data orchestrator for machine learning, analytics, and ETL.
  • Aug 24th, 2020

    Videos

  • Data warehouse schema design - dimensional modeling and star schema - Introduction to schema design for data warehouse using the star schema method.
  • Aug 19th, 2020

    SQL-like processing

  • Invantive SQL - SQL engine for online and on-premise use with integrated local data replication and 70+ connectors.
  • Aug 7th, 2020

    SQL-like processing

  • Materialize (3.2k stars) - a streaming database for real-time applications using SQL for queries and supporting a large fraction of PostgreSQL.
  • Jul 18th, 2020

    Key-value Data Model

  • GhostDB (707 stars) - a distributed, in-memory, general purpose key-value data store that delivers microsecond performance at any scale.
  • Data Ingestion

  • Apache Pulsar (9.8k stars) - a distributed pub-sub messaging platform with a very flexible messaging model and an intuitive client API.
  • Jul 10th, 2020

    Search engine and framework

  • Weaviate (1.8k stars) - a GraphQL-based semantic search engine with built-in (word) embeddings.
  • Jun 12th, 2020

    Books

    Streaming

  • Grokking Streaming Systems - Grokking Streaming Systems helps you unravel what streaming systems are, how they work, and whether they’re right for your business. Written to be tool-agnostic, you’ll be able to apply what you learn no matter which framework you choose.
  • May 21st, 2020

    Data Ingestion

  • redpanda - A Kafka® replacement for mission critical systems; 10x faster. Written in C++.
  • May 18th, 2020

    Machine Learning

  • Little Ball of Fur (563 stars) - A subsampling library for graph structured data. Python
  • May 7th, 2020

    Data Ingestion

  • RudderStack (2.8k stars) - an open source customer data infrastructure (Segment, mParticle alternative) written in Go.
  • Apr 29th, 2020

    Data Ingestion

  • Gazette (285 stars) - Distributed streaming infrastructure built on cloud storage which makes it easy to mix and match batch and streaming paradigms.
  • Mar 8th, 2020

    Interesting Papers

    2001 - 2010

  • 2009 - HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads.
  • 2008 - AMPLab - Chukwa: A large-scale monitoring system.
  • NewSQL Databases

  • BayesDB (881 stars) - statistics-oriented SQL database.
  • Machine Learning

  • Oryx (1.8k stars) - Lambda architecture on Apache Spark, Apache Kafka for real-time large scale machine learning.
  • Lambdo (1 star) - Lambdo is a workflow engine which significantly simplifies the analysis process by unifying feature engineering and machine learning operations.
  • Frameworks

  • Bistro (1k stars) - general-purpose data processing engine for both batch and stream analytics. It is based on a novel data model, which represents data via functions and processes data via column operations as opposed to having only set operations in conventional approaches like MapReduce or SQL.
  • Jan 25th, 2020

    Machine Learning

  • Karate Club (1.4k stars) - An unsupervised machine learning library for graph structured data. Python
  • Jan 13th, 2020

    Data Visualization

  • DataSphere Studio (1.6k stars) - one-stop data application development management portal.
  • System Deployment

  • Linkis - Linkis helps easily connect to various back-end computation/storage engines.
  • Jan 10th, 2020

    Distributed Programming

  • Apache Spark Streaming - framework for stream processing, part of Spark.
  • Dec 26th, 2019

    NewSQL Databases

  • yugabyteDB (5.7k stars) - open source, high-performance, distributed SQL database compatible with PostgreSQL.
  • Dec 13th, 2019

    Other Awesome Lists

  • Monte Carlo Tree Search Papers - awesome-monte-carlo-tree-search-papers (431 stars).
  • Dec 4th, 2019

    Business Intelligence

  • Saiku Analytics - Open source analytics platform.
  • Time-Series Databases

  • TDengine (17k stars) - a time series database in C utilizing unique features of IoT to improve read/write throughput and reduce space needed to store data
  • Oct 8th, 2019

    NewSQL Databases

  • KarelDB (370 stars) - a relational database backed by Apache Kafka.
  • Oct 6th, 2019

    Business Intelligence

  • Knowage - open source business intelligence platform (formerly SpagoBI).
  • Oct 2nd, 2019

    Search engine and framework

  • Facebook Faiss (14.9k stars) - a library for efficient similarity search and clustering of dense vectors. It contains algorithms that search in sets of vectors of any size, up to ones that possibly do not fit in RAM. It also contains supporting code for evaluation and parameter tuning. Faiss is written in C++ with complete wrappers for Python/numpy.
  • Annoy (9.1k stars) - a C++ library with Python bindings to search for points in space that are close to a given query point. It also creates large read-only file-based data structures that are mmapped into memory so that many processes may share the same data.
  • Sep 17th, 2019

    Videos

  • Machine Learning, Data Science and Deep Learning with Python - LiveVideo tutorial that covers machine learning, Tensorflow, artificial intelligence, and neural networks.
  • Sep 14th, 2019

    Machine Learning

  • ML Workspace (2.3k stars) - All-in-one web-based IDE specialized for machine learning and data science.
  • Sep 9th, 2019

    Business Intelligence

  • intermix.io - Performance Monitoring for Amazon Redshift
  • Aug 30th, 2019

    SQL-like processing

  • Apache HCatalog - table and storage management layer for Hadoop.
  • Jul 17th, 2019

    Applications

  • Indicative - Web & mobile analytics tool, with data warehouse (AWS, BigQuery) integration.
  • Jul 14th, 2019

    Other Awesome Lists

  • Graph Classification - awesome-graph-classification (4.3k stars).
  • Jul 8th, 2019

    SQL-like processing

  • Dremio - an open-source, SQL-like Data-as-a-Service Platform based on Apache Arrow.
  • Jun 19th, 2019

    Other Awesome Lists

  • Kafka - awesome-kafka (138 stars).
  • Books

    Streaming

  • Spark in Action & Spark in Action 2nd Ed. - Spark in Action teaches you the theory and skills you need to effectively handle batch and streaming data using Spark. Fully updated for Spark 2.0.
  • Applications

  • Jupyter - Notebook and project application for interactive data science and scientific computing across all programming languages.
  • Time-Series Databases

  • IronDB - scalable, general-purpose time series database.
  • Business Intelligence

  • Blazer (3.1k stars) - business intelligence made simple.
  • May 28th, 2019

    Other Awesome Lists

  • Decision Tree Papers - awesome-decision-tree-papers (1.9k stars).
  • Fraud Detection Papers - awesome-fraud-detection-papers (977 stars).
  • Gradient Boosting Papers - awesome-gradient-boosting-papers (754 stars).
  • May 26th, 2019

    Time-Series Databases

  • VictoriaMetrics (5.3k stars) - fast, scalable and resource-effective open-source TSDB compatible with Prometheus. Single-node and cluster versions included
  • May 24th, 2019

    Distributed Programming

  • Ray (17.9k stars) - A fast and simple framework for building and running distributed applications.
  • Feb 1st, 2019

    Data Visualization

  • Vega (9.5k stars) - a visualization grammar.
  • Jan 31st, 2019

    Graph Data Model

  • JanusGraph - open-source, distributed graph database with multiple options for storage backends (Bigtable, HBase, Cassandra, etc.) and indexing backends (Elasticsearch, Solr, Lucene).
  • Jan 26th, 2019

    Other Awesome Lists

  • Network Embedding - awesome-network-embedding (2.4k stars).
  • Community Detection - awesome-community-detection (1.8k stars).
  • Jan 25th, 2019

    Frameworks

  • Polyaxon (2.9k stars) - A platform for reproducible and scalable machine learning and deep learning.
  • Jan 7th, 2019

    Machine Learning

  • Feast (2.4k stars) - A feature store for the management, discovery, and access of machine learning features. Feast provides a consistent view of feature data for both model training and model serving.
  • Nov 16th, 2018

    Business Intelligence

  • Numeracy - Fast, clean SQL client and business intelligence.
  • Oct 31st, 2018

    Books

    Streaming

  • Fusion in Action - Fusion in Action teaches you to build a full-featured data analytics pipeline, including document and data search and distributed data clustering.
  • Interesting Readings

  • Monitoring Cassandra performance - Guide to monitoring Cassandra, including native methods for metrics collection.
  • Oct 29th, 2018

    Service Programming

  • Mara (1.8k stars) - A lightweight opinionated ETL framework, halfway between plain scripts and Apache Airflow
  • Oct 27th, 2018

    Books

    Streaming

  • Data Science at Scale with Python and Dask - Data Science at Scale with Python and Dask teaches you how to build distributed data projects that can handle huge amounts of data.
  • Oct 21st, 2018

    Books

    Graph Based approach

  • Graph-Powered Machine Learning - Alessandro Negro. Combine graph theory and models to improve machine learning projects
  • Oct 6th, 2018

    Time-Series Databases

  • M3DB - a distributed time series database that can be used for storing realtime metrics at long retention.
  • Oct 2nd, 2018

    Data Visualization

  • Frappe Charts - GitHub-inspired simple and modern SVG charts for the web with zero dependencies.
  • Oct 1st, 2018

    Graph Data Model

  • Microsoft Graph Engine (2k stars) - a distributed in-memory data processing engine, underpinned by a strongly-typed in-memory key-value store and a general distributed computation engine.
  • Data Ingestion

  • Amazon Web Services Glue - serverless fully managed extract, transform, and load (ETL) service
  • Aug 25th, 2018

    Business Intelligence

  • Metabase (26.3k stars) - The simplest, fastest way to get business intelligence and analytics to everyone in your company.
  • NewSQL Databases

  • ActorDB (1.9k stars) - a distributed SQL database with the scalability of a KV store, while keeping the query capabilities of a relational database.
  • Map-D - GPU in-memory database, big data analysis and visualization platform.
  • VoltDB - claims to be fastest in-memory database.
  • Jul 13th, 2018

    Data Visualization

  • DevExtreme React Chart - High-performance plugin-based React chart for Bootstrap and Material Design.
  • Jul 9th, 2018

    Columnar Databases

  • Google BigQuery - Google's cloud offering backed by their pioneering work on Dremel.
  • Amazon Redshift - Amazon's cloud offering, also based on a columnar datastore backend.
  • IndexR (442 stars) - an open-source columnar storage format for fast, realtime analytics with big data.
  • LocustDB (1.3k stars) - an experimental analytics database aiming to set a new standard for query performance on commodity hardware.
  • Jun 21st, 2018

    Distributed Programming

  • Apache Beam - a unified model and set of language-specific SDKs for defining and executing data processing workflows.
  • May 20th, 2018

    Time-Series Databases

  • Thanos (9.6k stars) - Thanos is a set of components to create a highly available metric system with unlimited storage capacity using multiple (existing) Prometheus deployments.
  • Apr 20th, 2018

    Distributed Index

  • Pilosa (2.2k stars) - Open source distributed bitmap index that dramatically accelerates queries across multiple, massive data sets.
  • Feb 26th, 2018

    Other Awesome Lists

  • Public Datasets - awesome-public-datasets (46.2k stars).
  • Feb 19th, 2018

    System Deployment

  • Kubernetes - a system for automating deployment, scaling, and management of containerized applications.
  • Jan 12th, 2018

    Internet of things and sensor data

  • NetLytics (9 stars) - Analytics platform to process network data on Spark.
  • Dec 27th, 2017

    Videos

  • Spark in Motion - Spark in Motion teaches you how to use Spark for batch and streaming data analytics.
  • Dec 25th, 2017

    Key-value Data Model

  • Ignite - is an in-memory key-value data store providing full SQL-compliant data access that can optionally be backed by disk storage.
  • Dec 20th, 2017

    Data Ingestion

  • Apache NiFi - Apache NiFi is an integrated data logistics platform for automating the movement of data between disparate systems.
  • Dec 6th, 2017

    Graph Data Model

  • Neo4j - graph database written entirely in Java.
  • Nov 28th, 2017

    Distributed Programming

  • Baidu Bigflow - an interface that allows for writing distributed computing programs providing lots of simple, flexible, powerful APIs to easily handle data of any scale.
  • Nov 17th, 2017

    Business Intelligence

  • SparklineData SNAP - modern B.I platform powered by Apache Spark.
  • Nov 16th, 2017

    Books

    Streaming

  • Kafka in Action - Kafka in Action is a fast-paced introduction to every aspect of working with Kafka you need to really reap its benefits.
  • Reactive Data Handling - Reactive Data Handling is a collection of five hand-picked chapters, selected by Manuel Bernhardt, that introduce you to building reactive applications capable of handling real-time processing with large data loads--free eBook!
  • Oct 31st, 2017

    Search engine and framework

  • Vespa - is an engine for low-latency computation over large data sets. It stores and indexes your data such that queries, selection and processing over the data can be performed at serving time.
  • Oct 27th, 2017

    NewSQL Databases

  • SenseiDB - distributed, realtime, semi-structured database.
  • Oct 23rd, 2017

    Time-Series Databases

  • SiriDB (455 stars) - Highly-scalable, robust and fast, open source time series database with cluster functionality.
  • Oct 14th, 2017

    RDBMS

  • Teradata - high-performance MPP data warehouse platform.
  • Distributed Filesystem

  • Apache Kudu - Hadoop's storage layer to enable fast analytics on fast data.
  • SQL-like processing

  • Aster Database - SQL-like analytic processing for MapReduce.
  • Oct 13th, 2017

    Time-Series Databases

  • Axibase Time Series Database - Integrated time series database on top of HBase with built-in visualization, rule-engine and SQL support.
  • Oct 11th, 2017

    Applications

  • AthenaX (1.2k stars) - a streaming analytics platform that enables users to run production-quality, large scale streaming analytics using Structured Query Language (SQL).
  • Oct 8th, 2017

    Books

    Streaming

  • Kafka Streams in Action - Kafka Streams in Action teaches you everything you need to know to implement stream processing on data flowing into your Kafka platform, allowing you to focus on getting more from your data without sacrificing time or effort.
  • Big Data - Big Data teaches you to build big data systems using an architecture that takes advantage of clustered hardware along with new tools designed specifically to capture and analyze web-scale data.
  • Oct 2nd, 2017

    Graph Data Model

  • NodeXL - A free, open-source template for Microsoft® Excel® 2007, 2010, 2013 and 2016 that makes it easy to explore network graphs.
  • Sep 27th, 2017

    Distributed Programming

  • Wallaroo - The ultrafast and elastic data processing engine. Big or fast data - no fuss, no Java needed.
  • Sep 24th, 2017

    Security

  • BDA (103 stars) - The vulnerability detector for Hadoop and Spark
  • Aug 3rd, 2017

    Scheduling

  • Apache Airflow (23.5k stars) - a platform to programmatically author, schedule and monitor workflows.
  • Azure Data Factory - cloud-based pipeline orchestration for on-prem, cloud and HDInsight
  • Jul 21st, 2017

    PostgreSQL forks and evolutions

  • TimescaleDB - An open-source time-series database optimized for fast ingest and complex queries
  • PipelineDB - The Streaming SQL Database. An open-source relational database that runs SQL queries continuously on streams, incrementally storing results in tables
  • Jul 12th, 2017

    NewSQL Databases

  • Comdb2 (1k stars) - a clustered RDBMS built on optimistic concurrency control techniques.
  • Jul 6th, 2017

    RDBMS

  • MySQL - The world's most popular open source database.
  • PostgreSQL - The world's most advanced open source database.
  • Distributed Programming

  • Apache MapReduce - programming model for processing large data sets with a parallel, distributed algorithm on a cluster.
  • Apache S4 - framework for stream processing, implementation of S4.
  • Google Dataflow - create data pipelines to help them ingest, transform and analyze data.
  • Google MapReduce - map reduce framework.
  • Google MillWheel - fault tolerant stream processing framework.
  • Onyx - Distributed computation for the cloud.
  • Pinterest Pinlater - asynchronous job execution system.
  • Twitter TSAR - TimeSeries AggregatoR by Twitter.
  • Distributed Filesystem

  • BeeGFS - formerly FhGFS, parallel distributed file system.
  • Google Megastore - scalable, highly available storage.
  • GridGain - GGFS, Hadoop compliant in-memory file system.
  • Red Hat GlusterFS - scale-out network-attached storage file system.
  • Document Data Model

  • Actian Versant - commercial object-oriented database management system.
  • LinkedIn Espresso - horizontally scalable document-oriented NoSQL data store.
  • Microsoft Azure DocumentDB - NoSQL cloud database service with protocol support for MongoDB
  • MongoDB - Document-oriented database system.
  • RethinkDB - document database that supports queries like table joins and group by.
  • Key Map Data Model

  • Google BigTable - column-oriented distributed datastore.
  • Twitter Manhattan - real-time, multi-tenant distributed database for Twitter scale.
  • Key-value Data Model

  • Amazon DynamoDB - distributed key/value store, implementation of Dynamo paper.
  • Redis - in memory key value datastore.
  • Graph Data Model

  • GCHQ Gaffer (1.6k stars) - Gaffer by GCHQ is a framework that makes it easy to store large-scale graphs in which the nodes and edges have statistics.
  • Google Cayley (14k stars) - open-source graph database.
  • Twitter FlockDB (3.3k stars) - distributed graph database.
  • Columnar Databases

  • Pivotal Greenplum - purpose-built, dedicated analytic data warehouse that offers a columnar engine as well as a traditional row-based one.
  • SQream DB - A GPU powered big data database, designed for analytics and data warehousing, with ANSI-92 compliant SQL, suitable for data sets from 10TB to 1PB.
  • NewSQL Databases

  • Google F1 - distributed SQL database built on Spanner.
  • Google Spanner - globally distributed semi-relational database.
  • SAP HANA - is an in-memory, column-oriented, relational database management system.
  • Time-Series Databases

  • Prometheus - a time series database and service monitoring system.
  • SQL-like processing

  • Actian SQL for Hadoop - high performance interactive SQL access to all Hadoop data.
  • Cloudera Impala - framework for interactive analysis, Inspired by Dremel.
  • Google BigQuery - framework for interactive analysis, implementation of Dremel.
  • Splice Machine - a full-featured SQL-on-Hadoop RDBMS with ACID transactions.
  • Stinger - interactive query for Hive.
  • Data Ingestion

  • Amazon Kinesis - real-time processing of streaming data at massive scale.
  • Embulk - open-source bulk data loader that helps data transfer between various databases, storages, file formats, and cloud services.
  • Google Photon - geographically distributed system for joining multiple continuously flowing streams of data in real-time with high scalability and low latency.
  • LinkedIn Databus - stream of change capture events for a database.
  • Service Programming

  • Google Chubby - a lock service for loosely-coupled distributed systems.
  • Linkedin Norbert - cluster manager.
  • OpenMPI - message passing framework.
  • Serf - decentralized solution for service discovery and orchestration.
  • Machine Learning

  • MonkeyLearn - Text mining made easy. Extract and classify data from text.
  • PredictionIO - machine learning server built on Hadoop, Mahout and Cascading.
  • Security

  • Apache Eagle - real time monitoring solution
  • System Deployment

  • Apache YARN - Cluster manager.
  • Google Borg - job scheduling and monitoring system.
  • Hortonworks HOYA - application that can deploy HBase cluster on YARN.
  • Applications

  • Apache Metron - a platform that integrates a variety of open source big data technologies in order to offer a centralized tool for security monitoring and analysis.
  • Argus (478 stars) - Time series monitoring and alerting platform.
  • Hunk - Splunk analytics for Hadoop.
  • MADlib - data-processing library of an RDBMS to analyze data.
  • Splunk - analyzer for machine-generated data.
  • Sumo Logic - cloud based analyzer for machine-generated data.
  • Talend - unified open source environment for YARN, Hadoop, HBASE, Hive, HCatalog & Pig.
  • Search engine and framework

  • Elassandra (1.6k stars) - a fork of Elasticsearch modified to run on top of Apache Cassandra in a scalable and resilient peer-to-peer architecture.
  • Enigma.io – Freemium robust web application for exploring, filtering, analyzing, searching and exporting massive datasets scraped from across the Web.
  • Google Percolator - continuous indexing system.
  • LinkedIn Galene - search architecture at LinkedIn.
  • MySQL forks and evolutions

  • Amazon RDS - MySQL databases in Amazon's cloud.
  • Google Cloud SQL - MySQL databases in Google's cloud.
  • MySQL Cluster - MySQL implementation using NDB Cluster storage engine.
  • PostgreSQL forks and evolutions

  • Yahoo Everest - multi-petabyte database / MPP derived from PostgreSQL.
  • Embedded Databases

  • BerkeleyDB - a software library that provides a high-performance embedded database for key/value data.
  • LMDB - ultra-fast, ultra-compact key-value embedded data store developed by Symas.
  • Business Intelligence

  • BIME Analytics - business intelligence platform in the cloud.
  • datapine - self-service business intelligence tool in the cloud.
  • GoodData - platform for data products and embedded analytics.
  • Jedox Palo - customisable Business Intelligence platform.
  • Jethrodata - Interactive Big Data Analytics.
  • Microstrategy - software platforms for business intelligence, mobile intelligence, and network applications.
  • Qlik - business intelligence and analytics platform.
  • Redash - Open source business intelligence platform, supporting multiple data sources and planned queries.
  • Zoomdata - Big Data Analytics.
  • Data Visualization

  • D3 - JavaScript library for manipulating documents.
  • FnordMetric - write SQL queries that return SVG charts rather than tables
  • Grafana - graphite dashboard frontend, editor and graph composer.
  • Graphite - scalable Realtime Graphing.
  • Highcharts - simple and flexible charting API.
  • Metricsgraphic.js - a library built on top of D3 that is optimized for time-series data
  • Superset (41k stars) - a data exploration platform designed to be visual, intuitive and interactive, making it easy to slice, dice and visualize data and perform analytics at the speed of thought.
  • Zeppelin (420 stars) - a notebook-style collaborative data analysis tool.
  • Zing Charts - JavaScript charting library for big data.
  • Internet of things and sensor data

  • ThingWorx - Rapid development and connection of intelligent systems
  • Interesting Readings

  • NoSQL Comparison - Cassandra vs MongoDB vs CouchDB vs Redis vs Riak vs HBase vs Couchbase vs Neo4j vs Hypertable vs ElasticSearch vs Accumulo vs VoltDB vs Scalaris comparison.
  • Interesting Papers

    2013 - 2014

  • 2013 - Google - HyperLogLog in Practice: Algorithmic Engineering of a State of The Art Cardinality Estimation Algorithm.
  • 2013 - Google - F1: A Distributed SQL Database That Scales.
  • Interesting Papers

    2001 - 2010

  • 2010 - Google - Large-scale Incremental Processing Using Distributed Transactions and Notifications, the basis of Percolator and Caffeine.
  • 2010 - Google - Dremel: Interactive Analysis of Web-Scale Datasets.
  • 2006 - Google - The Chubby lock service for loosely-coupled distributed systems.
  • 2004 - Google - MapReduce: Simplified Data Processing on Large Clusters.
  • 2003 - Google - The Google File System.
  • Books

    Streaming

  • Unified Log Processing - Unified Log Processing is a practical guide to implementing a unified log of event streams (Kafka or Kinesis) in your business
  • Jun 27th, 2017

    Data Ingestion

  • Alooma - data pipeline as a service enabling moving data sources such as MySQL into data warehouses.
  • Jun 19th, 2017

    Key-value Data Model

  • BTDB (120 stars) - Key Value Database in .Net with Object DB Layer, RPC, dynamic IL and much more
  • Jun 4th, 2017

    SQL-like processing

  • PipelineDB - an open-source relational database that runs SQL queries continuously on streams, incrementally storing results in tables.
  • Jun 3rd, 2017

    Key-value Data Model

  • Badger - a fast, simple, efficient, and persistent key-value store written natively in Go.
  • May 27th, 2017

    Graph Data Model

  • AgensGraph - a new generation multi-model graph database for the modern complex data environment.
  • May 25th, 2017

    Search engine and framework

  • MG4J - MG4J (Managing Gigabytes for Java) is a full-text search engine for large document collections written in Java. It is highly customisable, high-performance and provides state-of-the-art features and new research algorithms.
  • Mar 31st, 2017

    Frameworks

  • IBM Streams - platform for distributed processing and real-time analytics. Integrates with many of the popular technologies in the Big Data ecosystem (Kafka, HDFS, Spark, etc.)
  • Distributed Programming

  • IBM Streams - platform for distributed processing and real-time analytics. Provides toolkits for advanced analytics like geospatial, time series, etc. out of the box.
  • streamsx.topology (27 stars) - Libraries to enable building IBM Streams applications in Java, Python or Scala.
  • Internet of things and sensor data

  • Apache Edgent (Incubating) - a programming model and micro-kernel style runtime that can be embedded in gateways and small footprint edge devices enabling local, real-time, analytics on the edge devices.
  • Mar 22nd, 2017

    Books

    Distributed systems

  • Distributed Systems for fun and profit – Theory of distributed systems. Includes parts about time and ordering, replication and impossibility results.
  • Feb 28th, 2017

    Distributed Filesystem

  • Microsoft Azure Data Lake Store - HDFS-compatible storage in Azure cloud
  • SQL-like processing

  • Pivotal HDB - SQL-like data warehouse system for Hadoop.
  • Machine Learning

  • Azure ML Studio - Cloud-based AzureML, R, Python Machine Learning platform
  • Security

  • Apache Ranger - Central security admin & fine-grained authorization for Hadoop
  • Internet of things and sensor data

  • Azure IoT Hub - Cloud-based bi-directional monitoring and messaging hub
  • Feb 23rd, 2017

    Distributed Programming

  • Rackerlabs Blueflood - multi-tenant distributed metric processing system
  • Key Map Data Model

  • ScyllaDB - column-oriented distributed datastore written in C++, totally compatible with Apache Cassandra.
  • Graph Data Model

  • GraphLab PowerGraph - a core C++ GraphLab API and a collection of high-performance machine learning and data mining toolkits built on top of the GraphLab API.
  • Benchmarking

  • Yahoo Gridmix3 - Hadoop cluster benchmarking from Yahoo engineer team.
  • Data Visualization

  • Lumify - open source big data analysis and visualization platform
  • Feb 8th, 2017

    Frameworks

  • Pachyderm - Pachyderm is a data storage platform built on Docker and Kubernetes to provide reproducible data processing and analysis.
  • Feb 2nd, 2017

    Service Programming

  • Hydrosphere Mist (316 stars) - a service for exposing Apache Spark analytics jobs and machine learning models as realtime, batch or reactive web services.
  • Jan 26th, 2017

    Key-value Data Model

  • Edis (461 stars) - a protocol-compatible server replacement for Redis.
  • Data Ingestion

  • Kestrel (6 stars) - distributed message queue system.
  • Dec 23rd, 2016

    Time-Series Databases

  • Beringei (3.1k stars) - Facebook's in-memory time-series database.
  • Nov 23rd, 2016

    Distributed Programming

  • Skale (395 stars) - High performance distributed data processing in NodeJS.
  • Nov 14th, 2016

    Applications

  • Rakam (785 stars) - open-source real-time custom analytics platform powered by PostgreSQL, Kinesis and PrestoDB.
  • Oct 25th, 2016

    Key-value Data Model

  • Tile38 (7.8k stars) - a geolocation data store, spatial index, and realtime geofence, supporting a variety of object types including latitude/longitude points, bounding boxes, XYZ tiles, Geohashes, and GeoJSON
  • SummitDB (1.3k stars) - an in-memory, NoSQL key/value database, with disk persistence and using the Raft consensus algorithm.
  • Applications

  • 411 (961 stars) - a web application for alert management resulting from scheduled searches into Elasticsearch.
  • Atlas (3k stars) - a backend for managing dimensional time series data.
  • Graph Data Model

  • EliasDB (844 stars) - a lightweight graph based database that does not require any third-party libraries.
  • NewSQL Databases

  • Bedrock - a simple, modular, networked and distributed transaction layer built atop SQLite.
  • Key Map Data Model

  • Baidu Tera (1.8k stars) - an Internet-scale database, inspired by BigTable.
  • Oct 24th, 2016

    Distributed Filesystem

  • Baidu File System (2.8k stars) - distributed filesystem.
  • Oct 21st, 2016

    Machine Learning

  • DataVec (281 stars) - A vectorization and data preprocessing library for deep learning in Java and Scala. Part of the Deeplearning4j ecosystem.
  • Deeplearning4j - Fast, open deep learning for the JVM (Java, Scala, Clojure). A neural network configuration layer powered by a C++ library. Uses Spark and Hadoop to train nets on multiple GPUs and CPUs.
  • H2O (5.6k stars) - statistical, machine learning and math runtime with Hadoop. R and Python.
  • Keras (52.9k stars) - An intuitive neural net API inspired by Torch that runs atop Theano and TensorFlow.
  • Mahout - An Apache-backed machine learning library for Hadoop.
  • ND4J (1.8k stars) - A matrix library for the JVM. Numpy for Java.
  • RL4J (338 stars) - Reinforcement learning for Java and Scala. Includes Deep-Q learning and A3C algorithms, and integrates with OpenAI's Gym. Runs in the Deeplearning4j ecosystem.
  • Sibyl - System for Large Scale Machine Learning at Google.
  • TensorFlow (160.1k stars) - Library from Google for machine learning using data flow graphs.
  • Theano - A Python-focused machine learning library supported by the University of Montreal.
  • Torch - A deep learning library with a Lua API, supported by NYU and Facebook.
  • Velox (111 stars) - System for serving machine learning predictions.
  • Benchmarking

  • Deeplearning4j Benchmarks (31 stars)
  • Sep 29th, 2016

    Time-Series Databases

  • Timely (361 stars) - a time series database application that provides secure access to time series data based on Accumulo and Grafana.
  • Blueflood (592 stars) - a distributed system designed to ingest and process time series data.
  • Dalmatiner DB (703 stars) - fast distributed metrics database.
  • Rhombus - a time-series object store for Cassandra that handles all the complexity of building wide row indexes.
  • Akumuli (771 stars) - a numeric time-series database. It can be used to capture, store and process time-series data in real-time. The word "akumuli" can be translated from Esperanto as "accumulate".
  • Riak-TS - Riak TS is the only enterprise-grade NoSQL time series database optimized specifically for IoT and Time Series data.
  • Druid (11.3k stars) - column-oriented distributed data store ideal for powering interactive applications.
  • Sep 23rd, 2016

    Interesting Readings

  • Monitoring Kafka performance - Guide to monitoring Apache Kafka, including native methods for metrics collection.
  • Monitoring Hadoop performance - Guide to monitoring Hadoop, with an overview of Hadoop architecture, and native methods for metrics collection.
  • Sep 14th, 2016

    Books

    Streaming

  • Streaming Data - Streaming Data introduces the concepts and requirements of streaming and real-time data systems.
  • Storm Applied - Storm Applied is a practical guide to using Apache Storm for the real-world tasks associated with processing and analyzing real-time data streams.
  • Fundamentals of Stream Processing: Application Design, Systems, and Analytics - This comprehensive, hands-on guide combining the fundamental building blocks and emerging research in stream processing is ideal for application designers, system builders, analytic developers, as well as students and researchers in the field.
  • Stream Data Processing: A Quality of Service Perspective - Presents a new paradigm suitable for stream and complex event processing.
  • Aug 30th, 2016

    SQL-like processing

  • Apache Calcite - framework that allows efficient translation of queries involving heterogeneous and federated data.
  • Aug 19th, 2016

    Applications

  • ElastAlert (7.6k stars) - ElastAlert is a simple framework for alerting on anomalies, spikes, or other patterns of interest from data in Elasticsearch.
  • Kapacitor (2.1k stars) - an open source framework for processing, monitoring, and alerting on time series data.
  • Columnar Databases

  • ClickHouse - an open-source column-oriented database management system that allows generating analytical data reports in real time.
  • EventQL - a distributed, column-oriented database built for large-scale event collection and analytics.
  • Distributed Filesystem

  • Ambry (1.5k stars) - a distributed object store that supports storage of trillions of small immutable objects as well as billions of large objects.
  • Key-value Data Model

  • BuntDB (3.5k stars) - a fast, embeddable, in-memory key/value database for Go with custom indexing and geospatial support.
  • Bolt (12.3k stars) - an embedded key-value database for Go.
  • HyperDex (1.4k stars) - a scalable, next-generation key-value and document store with a wide array of features, including consistency, fault tolerance and high performance.
  • Aug 17th, 2016

    Data Visualization

  • ReCharts - A composable charting library built on React components
  • Jul 2nd, 2016

    Time-Series Databases

  • Chronix - a time series store built for highly compressed storage and fast access times.
  • Jun 21st, 2016

    Distributed Programming

  • Apache Gearpump - real-time big data streaming engine based on Akka.
  • Jun 2nd, 2016

    Time-Series Databases

  • Cube - uses MongoDB to store time series data.
  • Newts - a time series database based on Apache Cassandra.
  • TrailDB - an efficient tool for storing and querying series of events.
  • Data Visualization

  • AnyChart - fast, simple and flexible JavaScript (HTML5) charting library featuring pure JS API.
  • May 28th, 2016

    Data Visualization

  • Bloomery (16 stars) - Web UI for Impala.
  • May 27th, 2016

    Time-Series Databases

  • KairosDB (1.6k stars) - similar to OpenTSDB but supports Cassandra.