0xnr/awesome-bigdata

Big Data  2 months ago  10.4k
A curated list of awesome big data frameworks, resources and other awesomeness.

Sep 27th - Oct 3rd, 2021

Time-Series Databases

  • InfluxDB - a time series database with optimised IO and queries, supports pgsql and influx wire protocols.
  • QuestDB - high-performance, open-source SQL database for applications in financial services, IoT, machine learning, DevOps and observability.
  • Mar 22nd - Mar 28th, 2021

    Internet of things and sensor data

  • Ably - Pub/sub messaging platform for IoT
  • Mar 8th - Mar 14th, 2021

    Data Ingestion

  • Census - A reverse ETL product that lets you sync data from your data warehouse to SaaS applications. No engineering favors required, just SQL.
  • Business Intelligence

  • Dekart - Large scale geospatial analytics for Google BigQuery based on Kepler.gl.
  • Data Visualization

  • Dekart - Large scale geospatial analytics for Google BigQuery based on Kepler.gl.
  • Mar 1st - Mar 7th, 2021

    Frameworks

  • Smooks (318 stars) - An extensible Java framework for building XML and non-XML (CSV, EDI, Java, etc.) streaming applications.
  • Feb 8th - Feb 14th, 2021

    Scheduling

  • Cronicle (1.2k stars) - Distributed, easy to install, NodeJS based, task scheduler
  • Data Visualization

  • Dash (15.5k stars) - Analytical Web Apps for Python, R, Julia, and Jupyter. Built on top of plotly, no JS required
  • Feb 1st - Feb 7th, 2021

    Other Awesome Lists

  • Google Bigtable (24 stars).
  • Business Intelligence

  • Count - notebook-based analytics and visualisation platform using SQL or drag-and-drop.
  • Dec 28th - Jan 3rd, 2020

    Machine Learning

  • Shapley (126 stars) - A data-driven framework to quantify the value of classifiers in a machine learning ensemble.
  • Dec 14th - Dec 20th, 2020

    Applications

  • HASH - open source simulation and visualization platform.
  • Nov 16th - Nov 22nd, 2020

    Machine Learning

  • PyTorch Geometric Temporal (1.2k stars) - a temporal extension library for PyTorch Geometric.
  • Nov 2nd - Nov 8th, 2020

    Books

    Streaming

  • Azure Data Engineering - A book about data engineering in general and the Azure platform specifically
  • Sep 28th - Oct 4th, 2020

    Key-value Data Model

  • Graviton (403 stars) - a simple, fast, versioned, authenticated, embeddable key-value store database in pure Go(lang).
  • Sep 14th - Sep 20th, 2020

    Videos

  • Elasticsearch 7 and Elastic Stack - LiveVideo tutorial that covers searching, analyzing, and visualizing big data on a cluster with Elasticsearch, Logstash, Beats, Kibana, and more.
  • Scheduling

  • Dagster (4k stars) - a data orchestrator for machine learning, analytics, and ETL.
  • Aug 24th - Aug 30th, 2020

    Videos

  • Data warehouse schema design - dimensional modeling and star schema - Introduction to schema design for data warehouse using the star schema method.
  • Aug 17th - Aug 23rd, 2020

    SQL-like processing

  • Invantive SQL - SQL engine for online and on-premise use with integrated local data replication and 70+ connectors.
  • Aug 3rd - Aug 9th, 2020

    SQL-like processing

  • Materialize (3.2k stars) - is a streaming database for real-time applications using SQL for queries and supporting a large fraction of PostgreSQL.
  • Jul 13th - Jul 19th, 2020

    Key-value Data Model

  • GhostDB (708 stars) - a distributed, in-memory, general purpose key-value data store that delivers microsecond performance at any scale.
  • Data Ingestion

  • Apache Pulsar (10k stars) - a distributed pub-sub messaging platform with a very flexible messaging model and an intuitive client API.
  • Jul 6th - Jul 12th, 2020

    Search engine and framework

  • Weaviate (1.9k stars) - Weaviate is a GraphQL-based semantic search engine with built-in (word) embeddings.
  • Jun 8th - Jun 14th, 2020

    Books

    Streaming

  • Grokking Streaming Systems - Grokking Streaming Systems helps you unravel what streaming systems are, how they work, and whether they’re right for your business. Written to be tool-agnostic, you’ll be able to apply what you learn no matter which framework you choose.
  • May 18th - May 24th, 2020

    Data Ingestion

  • redpanda - A Kafka® replacement for mission critical systems; 10x faster. Written in C++.
  • Machine Learning

  • Little Ball of Fur (570 stars) - A subsampling library for graph structured data. Python
  • May 4th - May 10th, 2020

    Data Ingestion

  • RudderStack (2.8k stars) - an open source customer data infrastructure (Segment, mParticle alternative) written in Go.
  • Apr 27th - May 3rd, 2020

    Data Ingestion

  • Gazette (291 stars) - Distributed streaming infrastructure built on cloud storage which makes it easy to mix and match batch and streaming paradigms.
  • Mar 2nd - Mar 8th, 2020

    Interesting Papers

    2001 - 2010

  • 2009 - HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads.
  • 2008 - AMPLab - Chukwa: A large-scale monitoring system.
  • NewSQL Databases

  • BayesDB (880 stars) - statistics-oriented SQL database.
  • Machine Learning

  • Oryx (1.8k stars) - Lambda architecture on Apache Spark, Apache Kafka for real-time large scale machine learning.
  • Lambdo (1 star) - Lambdo is a workflow engine which significantly simplifies the analysis process by unifying feature engineering and machine learning operations.
  • Frameworks

  • Bistro (1k stars) - general-purpose data processing engine for both batch and stream analytics. It is based on a novel data model, which represents data via functions and processes data via column operations as opposed to having only set operations in conventional approaches like MapReduce or SQL.
  • Jan 20th - Jan 26th, 2020

    Machine Learning

  • Karate Club (1.4k stars) - An unsupervised machine learning library for graph structured data. Python
  • Jan 13th - Jan 19th, 2020

    Data Visualization

  • DataSphere Studio (1.7k stars) - one-stop data application development management portal.
  • System Deployment

  • Linkis (2.3k stars) - Linkis helps easily connect to various back-end computation/storage engines.
  • Jan 6th - Jan 12th, 2020

    Distributed Programming

  • Apache Spark Streaming - framework for stream processing, part of Spark.
  • Dec 23rd - Dec 29th, 2019

    NewSQL Databases

  • yugabyteDB (5.8k stars) - open source, high-performance, distributed SQL database compatible with PostgreSQL.
  • Dec 9th - Dec 15th, 2019

    Other Awesome Lists

  • Monte Carlo Tree Search Papers - awesome-monte-carlo-tree-search-papers (438 stars).
  • Dec 2nd - Dec 8th, 2019

    Business Intelligence

  • Saiku Analytics - Open source analytics platform.
  • Time-Series Databases

  • TDengine (17.3k stars) - a time series database in C utilizing unique features of IoT to improve read/write throughput and reduce space needed to store data
  • Oct 7th - Oct 13th, 2019

    NewSQL Databases

  • KarelDB (369 stars) - a relational database backed by Apache Kafka.
  • Sep 30th - Oct 6th, 2019

    Business Intelligence

  • Knowage - open source business intelligence platform. (former SpagoBi)
  • Search engine and framework

  • Facebook Faiss (15.2k stars) - is a library for efficient similarity search and clustering of dense vectors. It contains algorithms that search in sets of vectors of any size, up to ones that possibly do not fit in RAM. It also contains supporting code for evaluation and parameter tuning. Faiss is written in C++ with complete wrappers for Python/numpy.
  • Annoy (9.2k stars) - is a C++ library with Python bindings to search for points in space that are close to a given query point. It also creates large read-only file-based data structures that are mmapped into memory so that many processes may share the same data.
  • Sep 16th - Sep 22nd, 2019

    Videos

  • Machine Learning, Data Science and Deep Learning with Python - LiveVideo tutorial that covers machine learning, TensorFlow, artificial intelligence, and neural networks.
  • Sep 9th - Sep 15th, 2019

    Machine Learning

  • ML Workspace (2.3k stars) - All-in-one web-based IDE specialized for machine learning and data science.
  • Business Intelligence

  • intermix.io - Performance Monitoring for Amazon Redshift
  • Aug 26th - Sep 1st, 2019

    SQL-like processing

  • Apache HCatalog - table and storage management layer for Hadoop.
  • Jul 15th - Jul 21st, 2019

    Applications

  • Indicative - Web & mobile analytics tool, with data warehouse (AWS, BigQuery) integration.
  • Jul 8th - Jul 14th, 2019

    Other Awesome Lists

  • Graph Classification - awesome-graph-classification (4.3k stars).
  • SQL-like processing

  • Dremio - an open-source, SQL-like Data-as-a-Service Platform based on Apache Arrow.
  • Jun 17th - Jun 23rd, 2019

    Other Awesome Lists

  • Kafka - awesome-kafka (138 stars).
  • Books

    Streaming

  • Spark in Action & Spark in Action 2nd Ed. - Spark in Action teaches you the theory and skills you need to effectively handle batch and streaming data using Spark. Fully updated for Spark 2.0.
  • Applications

  • Jupyter - Notebook and project application for interactive data science and scientific computing across all programming languages.
  • Time-Series Databases

  • IronDB - scalable, general-purpose time series database.
  • Business Intelligence

  • Blazer (3.1k stars) - business intelligence made simple.
  • May 27th - Jun 2nd, 2019

    Other Awesome Lists

  • Decision Tree Papers - awesome-decision-tree-papers (1.9k stars).
  • Fraud Detection Papers - awesome-fraud-detection-papers (1k stars).
  • Gradient Boosting Papers - awesome-gradient-boosting-papers (759 stars).
  • May 20th - May 26th, 2019

    Time-Series Databases

  • VictoriaMetrics (5.4k stars) - fast, scalable and resource-effective open-source TSDB compatible with Prometheus. Single-node and cluster versions included
  • Distributed Programming

  • Ray (18.3k stars) - A fast and simple framework for building and running distributed applications.
  • Jan 28th - Feb 3rd, 2019

    Data Visualization

  • Vega (9.5k stars) - a visualization grammar.
  • Graph Data Model

  • JanusGraph - open-source, distributed graph database with multiple options for storage backends (Bigtable, HBase, Cassandra, etc.) and indexing backends (Elasticsearch, Solr, Lucene).
  • Jan 21st - Jan 27th, 2019

    Other Awesome Lists

  • Network Embedding - awesome-network-embedding (2.4k stars).
  • Community Detection - awesome-community-detection (1.9k stars).
  • Frameworks

  • Polyaxon (3k stars) - A platform for reproducible and scalable machine learning and deep learning.
  • Jan 7th - Jan 13th, 2019

    Machine Learning

  • Feast (2.5k stars) - A feature store for the management, discovery, and access of machine learning features. Feast provides a consistent view of feature data for both model training and model serving.
  • Nov 12th - Nov 18th, 2018

    Business Intelligence

  • Numeracy - Fast, clean SQL client and business intelligence.
  • Oct 29th - Nov 4th, 2018

    Books

    Streaming

  • Fusion in Action - Fusion in Action teaches you to build a full-featured data analytics pipeline, including document and data search and distributed data clustering.
  • Interesting Readings

  • Monitoring Cassandra performance - Guide to monitoring Cassandra, including native methods for metrics collection.
  • Service Programming

  • Mara (1.8k stars) - A lightweight opinionated ETL framework, halfway between plain scripts and Apache Airflow
  • Oct 22nd - Oct 28th, 2018

    Books

    Streaming

  • Data Science at Scale with Python and Dask - Data Science at Scale with Python and Dask teaches you how to build distributed data projects that can handle huge amounts of data.
  • Oct 15th - Oct 21st, 2018

    Books

    Graph Based approach

  • Graph-Powered Machine Learning - Alessandro Negro. Combine graph theory and models to improve machine learning projects
  • Oct 1st - Oct 7th, 2018

    Time-Series Databases

  • M3DB - a distributed time series database that can be used for storing realtime metrics at long retention.
  • Data Visualization

  • Frappe Charts - GitHub-inspired simple and modern SVG charts for the web with zero dependencies.
  • Graph Data Model

  • Microsoft Graph Engine (2k stars) - a distributed in-memory data processing engine, underpinned by a strongly-typed in-memory key-value store and a general distributed computation engine.
  • Data Ingestion

  • Amazon Web Services Glue - serverless fully managed extract, transform, and load (ETL) service
  • Aug 20th - Aug 26th, 2018

    Business Intelligence

  • Metabase (26.6k stars) - The simplest, fastest way to get business intelligence and analytics to everyone in your company.
  • NewSQL Databases

  • ActorDB (1.9k stars) - a distributed SQL database with the scalability of a KV store, while keeping the query capabilities of a relational database.
  • Map-D - GPU in-memory database, big data analysis and visualization platform.
  • VoltDB - claims to be fastest in-memory database.
  • Jul 9th - Jul 15th, 2018

    Data Visualization

  • DevExtreme React Chart - High-performance plugin-based React chart for Bootstrap and Material Design.
  • Columnar Databases

  • Google BigQuery - Google's cloud offering backed by their pioneering work on Dremel.
  • Amazon Redshift - Amazon's cloud offering, also based on a columnar datastore backend.
  • IndexR (442 stars) - an open-source columnar storage format for fast & realtime analytic with big data.
  • LocustDB (1.3k stars) - an experimental analytics database aiming to set a new standard for query performance on commodity hardware.
  • Jun 18th - Jun 24th, 2018

    Distributed Programming

  • Apache Beam - a unified model and set of language-specific SDKs for defining and executing data processing workflows.
  • May 14th - May 20th, 2018

    Time-Series Databases

  • Thanos (9.7k stars) - Thanos is a set of components to create a highly available metric system with unlimited storage capacity using multiple (existing) Prometheus deployments.
  • Apr 16th - Apr 22nd, 2018

    Distributed Index

  • Pilosa (2.2k stars) - Open source distributed bitmap index that dramatically accelerates queries across multiple, massive data sets.
  • Feb 26th - Mar 4th, 2018

    Other Awesome Lists

  • Public Datasets - awesome-public-datasets (46.6k stars).
  • Feb 19th - Feb 25th, 2018

    System Deployment

  • Kubernetes - a system for automating deployment, scaling, and management of containerized applications.
  • Jan 8th - Jan 14th, 2018

    Internet of things and sensor data

  • NetLytics (9 stars) - Analytics platform to process network data on Spark.
  • Dec 25th - Dec 31st, 2017

    Videos

  • Spark in Motion - Spark in Motion teaches you how to use Spark for batch and streaming data analytics.
  • Key-value Data Model

  • Ignite - is an in-memory key-value data store providing full SQL-compliant data access that can optionally be backed by disk storage.
  • Dec 18th - Dec 24th, 2017

    Data Ingestion

  • Apache NiFi - Apache NiFi is an integrated data logistics platform for automating the movement of data between disparate systems.
  • Dec 4th - Dec 10th, 2017

    Graph Data Model

  • Neo4j - graph database written entirely in Java.
  • Nov 27th - Dec 3rd, 2017

    Distributed Programming

  • Baidu Bigflow - an interface that allows for writing distributed computing programs providing lots of simple, flexible, powerful APIs to easily handle data of any scale.
  • Nov 13th - Nov 19th, 2017

    Business Intelligence

  • SparklineData SNAP - modern B.I platform powered by Apache Spark.
  • Books

    Streaming

  • Kafka in Action - Kafka in Action is a fast-paced introduction to every aspect of working with Kafka you need to really reap its benefits.
  • Reactive Data Handling - Reactive Data Handling is a collection of five hand-picked chapters, selected by Manuel Bernhardt, that introduce you to building reactive applications capable of handling real-time processing with large data loads--free eBook!
  • Oct 30th - Nov 5th, 2017

    Search engine and framework

  • Vespa - is an engine for low-latency computation over large data sets. It stores and indexes your data such that queries, selection and processing over the data can be performed at serving time.
  • Oct 23rd - Oct 29th, 2017

    NewSQL Databases

  • SenseiDB - distributed, realtime, semi-structured database.
  • Time-Series Databases

  • SiriDB (455 stars) - Highly-scalable, robust and fast, open source time series database with cluster functionality.
  • Oct 9th - Oct 15th, 2017

    RDBMS

  • Teradata - high-performance MPP data warehouse platform.
  • Distributed Filesystem

  • Apache Kudu - Hadoop's storage layer to enable fast analytics on fast data.
  • SQL-like processing

  • Aster Database - SQL-like analytic processing for MapReduce.
  • Time-Series Databases

  • Axibase Time Series Database - Integrated time series database on top of HBase with built-in visualization, rule-engine and SQL support.
  • Applications

  • AthenaX (1.2k stars) - a streaming analytics platform that enables users to run production-quality, large scale streaming analytics using Structured Query Language (SQL).
  • Oct 2nd - Oct 8th, 2017

    Books

    Streaming

  • Kafka Streams in Action - Kafka Streams in Action teaches you everything you need to know to implement stream processing on data flowing into your Kafka platform, allowing you to focus on getting more from your data without sacrificing time or effort.
  • Big Data - Big Data teaches you to build big data systems using an architecture that takes advantage of clustered hardware along with new tools designed specifically to capture and analyze web-scale data.
  • Graph Data Model

  • NodeXL - A free, open-source template for Microsoft® Excel® 2007, 2010, 2013 and 2016 that makes it easy to explore network graphs.
  • Sep 25th - Oct 1st, 2017

    Distributed Programming

  • Wallaroo - The ultrafast and elastic data processing engine. Big or fast data - no fuss, no Java needed.
  • Sep 18th - Sep 24th, 2017

    Security

  • BDA (104 stars) - The vulnerability detector for Hadoop and Spark
  • Jul 31st - Aug 6th, 2017

    Scheduling

  • Apache Airflow (23.9k stars) - a platform to programmatically author, schedule and monitor workflows.
  • Azure Data Factory - cloud-based pipeline orchestration for on-prem, cloud and HDInsight
  • Jul 17th - Jul 23rd, 2017

    PostgreSQL forks and evolutions

  • TimescaleDB - An open-source time-series database optimized for fast ingest and complex queries
  • PipelineDB - The Streaming SQL Database. An open-source relational database that runs SQL queries continuously on streams, incrementally storing results in tables
  • Jul 10th - Jul 16th, 2017

    NewSQL Databases

  • Comdb2 (1k stars) - a clustered RDBMS built on optimistic concurrency control techniques.
  • Jul 3rd - Jul 9th, 2017

    RDBMS

  • MySQL - The world's most popular open source database.
  • PostgreSQL - The world's most advanced open source database.
  • Distributed Programming

  • Apache MapReduce - programming model for processing large data sets with a parallel, distributed algorithm on a cluster.
  • Apache S4 - framework for stream processing, implementation of S4.
  • Google Dataflow - create data pipelines to help ingest, transform and analyze data.
  • Google MapReduce - map reduce framework.
  • Google MillWheel - fault tolerant stream processing framework.
  • Onyx - Distributed computation for the cloud.
  • Pinterest Pinlater - asynchronous job execution system.
  • Twitter TSAR - TimeSeries AggregatoR by Twitter.
  • Distributed Filesystem

  • BeeGFS - formerly FhGFS, parallel distributed file system.
  • Google Megastore - scalable, highly available storage.
  • GridGain - GGFS, Hadoop compliant in-memory file system.
  • Red Hat GlusterFS - scale-out network-attached storage file system.
  • Document Data Model

  • Actian Versant - commercial object-oriented database management system.
  • LinkedIn Espresso - horizontally scalable document-oriented NoSQL data store.
  • Microsoft Azure DocumentDB - NoSQL cloud database service with protocol support for MongoDB
  • MongoDB - Document-oriented database system.
  • RethinkDB - document database that supports queries like table joins and group by.
  • Key Map Data Model

  • Google BigTable - column-oriented distributed datastore.
  • Twitter Manhattan - real-time, multi-tenant distributed database for Twitter scale.
  • Key-value Data Model

  • Amazon DynamoDB - distributed key/value store, implementation of Dynamo paper.
  • Redis - in memory key value datastore.
  • Graph Data Model

  • GCHQ Gaffer (1.6k stars) - Gaffer by GCHQ is a framework that makes it easy to store large-scale graphs in which the nodes and edges have statistics.
  • Google Cayley (14k stars) - open-source graph database.
  • Twitter FlockDB (3.3k stars) - distributed graph database.
  • Columnar Databases

  • Pivotal Greenplum - purpose-built, dedicated analytic data warehouse that offers a columnar engine as well as a traditional row-based one.
  • SQream DB - A GPU powered big data database, designed for analytics and data warehousing, with ANSI-92 compliant SQL, suitable for data sets from 10TB to 1PB.
  • NewSQL Databases

  • Google F1 - distributed SQL database built on Spanner.
  • Google Spanner - globally distributed semi-relational database.
  • SAP HANA - is an in-memory, column-oriented, relational database management system.
  • Time-Series Databases

  • Prometheus - a time series database and service monitoring system.
  • SQL-like processing

  • Actian SQL for Hadoop - high performance interactive SQL access to all Hadoop data.
  • Cloudera Impala - framework for interactive analysis, Inspired by Dremel.
  • Google BigQuery - framework for interactive analysis, implementation of Dremel.
  • Splice Machine - a full-featured SQL-on-Hadoop RDBMS with ACID transactions.
  • Stinger - interactive query for Hive.
  • Data Ingestion

  • Amazon Kinesis - real-time processing of streaming data at massive scale.
  • Embulk - open-source bulk data loader that helps data transfer between various databases, storages, file formats, and cloud services.
  • Google Photon - geographically distributed system for joining multiple continuously flowing streams of data in real-time with high scalability and low latency.
  • LinkedIn Databus - stream of change capture events for a database.
  • Service Programming

  • Google Chubby - a lock service for loosely-coupled distributed systems.
  • LinkedIn Norbert - cluster manager.
  • OpenMPI - message passing framework.
  • Serf - decentralized solution for service discovery and orchestration.
  • Machine Learning

  • MonkeyLearn - Text mining made easy. Extract and classify data from text.
  • PredictionIO - machine learning server built on Hadoop, Mahout and Cascading.
  • Security

  • Apache Eagle - real time monitoring solution
  • System Deployment

  • Apache YARN - Cluster manager.
  • Google Borg - job scheduling and monitoring system.
  • Hortonworks HOYA - application that can deploy HBase cluster on YARN.
  • Applications

  • Apache Metron - a platform that integrates a variety of open source big data technologies in order to offer a centralized tool for security monitoring and analysis.
  • Argus (481 stars) - Time series monitoring and alerting platform.
  • Hunk - Splunk analytics for Hadoop.
  • MADlib - data-processing library of an RDBMS to analyze data.
  • Splunk - analyzer for machine-generated data.
  • Sumo Logic - cloud based analyzer for machine-generated data.
  • Talend - unified open source environment for YARN, Hadoop, HBASE, Hive, HCatalog & Pig.
  • Search engine and framework

  • Elassandra (1.6k stars) - is a fork of Elasticsearch modified to run on top of Apache Cassandra in a scalable and resilient peer-to-peer architecture.
  • Enigma.io – Freemium robust web application for exploring, filtering, analyzing, searching and exporting massive datasets scraped from across the Web.
  • Google Percolator - continuous indexing system.
  • LinkedIn Galene - search architecture at LinkedIn.
  • MySQL forks and evolutions

  • Amazon RDS - MySQL databases in Amazon's cloud.
  • Google Cloud SQL - MySQL databases in Google's cloud.
  • MySQL Cluster - MySQL implementation using NDB Cluster storage engine.
  • PostgreSQL forks and evolutions

  • Yahoo Everest - multi-petabyte database / MPP derived from PostgreSQL.
  • Embedded Databases

  • BerkeleyDB - a software library that provides a high-performance embedded database for key/value data.
  • LMDB - ultra-fast, ultra-compact key-value embedded data store developed by Symas.
  • Business Intelligence

  • BIME Analytics - business intelligence platform in the cloud.
  • datapine - self-service business intelligence tool in the cloud.
  • GoodData - platform for data products and embedded analytics.
  • Jedox Palo - customisable Business Intelligence platform.
  • Jethrodata - Interactive Big Data Analytics.
  • Microstrategy - software platforms for business intelligence, mobile intelligence, and network applications.
  • Qlik - business intelligence and analytics platform.
  • Redash - Open source business intelligence platform, supporting multiple data sources and planned queries.
  • Zoomdata - Big Data Analytics.
  • Data Visualization

  • D3 - JavaScript library for manipulating documents.
  • FnordMetric - write SQL queries that return SVG charts rather than tables
  • Grafana - graphite dashboard frontend, editor and graph composer.
  • Graphite - scalable Realtime Graphing.
  • Highcharts - simple and flexible charting API.
  • MetricsGraphics.js - a library built on top of D3 that is optimized for time-series data
  • Superset (42.1k stars) - a data exploration platform designed to be visual, intuitive and interactive, making it easy to slice, dice and visualize data and perform analytics at the speed of thought.
  • Zeppelin (420 stars) - a notebook-style collaborative data analysis.
  • Zing Charts - JavaScript charting library for big data.
  • Internet of things and sensor data

  • ThingWorx - Rapid development and connection of intelligent systems
  • Interesting Readings

  • NoSQL Comparison - Cassandra vs MongoDB vs CouchDB vs Redis vs Riak vs HBase vs Couchbase vs Neo4j vs Hypertable vs ElasticSearch vs Accumulo vs VoltDB vs Scalaris comparison.
  • Interesting Papers

    2013 - 2014

  • 2013 - Google - HyperLogLog in Practice: Algorithmic Engineering of a State of The Art Cardinality Estimation Algorithm.
  • 2013 - Google - F1: A Distributed SQL Database That Scales.
  • Interesting Papers

    2001 - 2010

  • 2010 - Google - Large-scale Incremental Processing Using Distributed Transactions and Notifications, the basis of Percolator and Caffeine.
  • 2010 - Google - Dremel: Interactive Analysis of Web-Scale Datasets.
  • 2006 - Google - The Chubby lock service for loosely-coupled distributed systems.
  • 2004 - Google - MapReduce: Simplified Data Processing on Large Clusters.
  • 2003 - Google - The Google File System.
  • Books

    Streaming

  • Unified Log Processing - Unified Log Processing is a practical guide to implementing a unified log of event streams (Kafka or Kinesis) in your business
  • Jun 26th - Jul 2nd, 2017

    Data Ingestion

  • Alooma - data pipeline as a service enabling moving data sources such as MySQL into data warehouses.
  • Jun 19th - Jun 25th, 2017

    Key-value Data Model

  • BTDB (121 stars) - Key Value Database in .Net with Object DB Layer, RPC, dynamic IL and much more
  • May 29th - Jun 4th, 2017

    SQL-like processing

  • PipelineDB - an open-source relational database that runs SQL queries continuously on streams, incrementally storing results in tables.
  • Key-value Data Model

  • Badger - a fast, simple, efficient, and persistent key-value store written natively in Go.
  • May 22nd - May 28th, 2017

    Graph Data Model

  • AgensGraph - a new generation multi-model graph database for the modern complex data environment.
  • Search engine and framework

  • MG4J - MG4J (Managing Gigabytes for Java) is a full-text search engine for large document collections written in Java. It is highly customisable, high-performance and provides state-of-the-art features and new research algorithms.
  • Mar 27th - Apr 2nd, 2017

    Frameworks

  • IBM Streams - platform for distributed processing and real-time analytics. Integrates with many of the popular technologies in the Big Data ecosystem (Kafka, HDFS, Spark, etc.)
  • Distributed Programming

  • IBM Streams - platform for distributed processing and real-time analytics. Provides toolkits for advanced analytics like geospatial, time series, etc. out of the box.
  • streamsx.topology (27 stars) - Libraries to enable building IBM Streams application in Java, Python or Scala.
  • Internet of things and sensor data

  • Apache Edgent (Incubating) - a programming model and micro-kernel style runtime that can be embedded in gateways and small footprint edge devices enabling local, real-time, analytics on the edge devices.
  • Mar 20th - Mar 26th, 2017

    Books

    Distributed systems

  • Distributed Systems for fun and profit – Theory of distributed systems. Includes parts about time and ordering, replication and impossibility results.
  • Feb 27th - Mar 5th, 2017

    Distributed Filesystem

  • Microsoft Azure Data Lake Store - HDFS-compatible storage in Azure cloud
  • SQL-like processing

  • Pivotal HDB - SQL-like data warehouse system for Hadoop.
  • Machine Learning

  • Azure ML Studio - Cloud-based AzureML, R, Python Machine Learning platform
  • Security

  • Apache Ranger - Central security admin & fine-grained authorization for Hadoop
  • Internet of things and sensor data

  • Azure IoT Hub - Cloud-based bi-directional monitoring and messaging hub
  • Feb 20th - Feb 26th, 2017

    Distributed Programming

  • Rackerlabs Blueflood - multi-tenant distributed metric processing system
  • Key Map Data Model

  • ScyllaDB - column-oriented distributed datastore written in C++, totally compatible with Apache Cassandra.
  • Graph Data Model

  • GraphLab PowerGraph - a core C++ GraphLab API and a collection of high-performance machine learning and data mining toolkits built on top of the GraphLab API.
  • Benchmarking

  • Yahoo Gridmix3 - Hadoop cluster benchmarking from the Yahoo engineering team.
  • Data Visualization

  • Lumify - open source big data analysis and visualization platform
  • Feb 6th - Feb 12th, 2017

    Frameworks

  • Pachyderm - Pachyderm is a data storage platform built on Docker and Kubernetes to provide reproducible data processing and analysis.
  • Jan 30th - Feb 5th, 2017

    Service Programming

  • Hydrosphere Mist (317 stars) - a service for exposing Apache Spark analytics jobs and machine learning models as realtime, batch or reactive web services.
  • Jan 23rd - Jan 29th, 2017

    Key-value Data Model

  • Edis (461 stars) - is a protocol-compatible Server replacement for Redis.
  • Data Ingestion

  • Kestrel (6 stars) - distributed message queue system.
  • Dec 19th - Dec 25th, 2016

    Time-Series Databases

  • Beringei (3.1k stars) - Facebook's in-memory time-series database.
  • Nov 21st - Nov 27th, 2016

    Distributed Programming

  • Skale (396 stars) - High performance distributed data processing in NodeJS.
  • Nov 14th - Nov 20th, 2016

    Applications

  • Rakam (786 stars) - open-source real-time custom analytics platform powered by PostgreSQL, Kinesis and PrestoDB.
  • Oct 24th - Oct 30th, 2016

    Key-value Data Model

  • Tile38 (7.9k stars) - a geolocation data store, spatial index, and realtime geofence, supporting a variety of object types including latitude/longitude points, bounding boxes, XYZ tiles, Geohashes, and GeoJSON
  • SummitDB (1.3k stars) - an in-memory, NoSQL key/value database, with disk persistence and using the Raft consensus algorithm.
  • Applications

  • 411 (961 stars) - a web application for alert management resulting from scheduled searches into Elasticsearch.
  • Atlas (3k stars) - a backend for managing dimensional time series data.
  • Graph Data Model

  • EliasDB (853 stars) - a lightweight graph based database that does not require any third-party libraries.
  • NewSQL Databases

  • Bedrock - a simple, modular, networked and distributed transaction layer built atop SQLite.
  • Key Map Data Model

  • Baidu Tera (1.8k stars) - an Internet-scale database, inspired by BigTable.
  • Distributed Filesystem

  • Baidu File System (2.8k stars) - distributed filesystem.
  • Oct 17th - Oct 23rd, 2016

    Machine Learning

  • DataVec (282 stars) - A vectorization and data preprocessing library for deep learning in Java and Scala. Part of the Deeplearning4j ecosystem.
  • Deeplearning4j - Fast, open deep learning for the JVM (Java, Scala, Clojure). A neural network configuration layer powered by a C++ library. Uses Spark and Hadoop to train nets on multiple GPUs and CPUs.
  • H2O (5.6k stars) - statistical, machine learning and math runtime for Hadoop, with R and Python interfaces.
  • Keras (53.3k stars) - An intuitive neural net API inspired by Torch that runs atop Theano and TensorFlow.
  • Mahout - An Apache-backed machine learning library for Hadoop.
  • ND4J (1.8k stars) - A matrix library for the JVM. Numpy for Java.
  • RL4J (339 stars) - Reinforcement learning for Java and Scala. Includes Deep-Q learning and A3C algorithms, and integrates with OpenAI's Gym. Runs in the Deeplearning4j ecosystem.
  • Sibyl - System for Large Scale Machine Learning at Google.
  • TensorFlow (160.9k stars) - Library from Google for machine learning using data flow graphs.
  • Theano - A Python-focused machine learning library supported by the University of Montreal.
  • Torch - A deep learning library with a Lua API, supported by NYU and Facebook.
  • Velox (111 stars) - System for serving machine learning predictions.
  • Benchmarking

  • Deeplearning4j Benchmarks (31 stars)
  • Sep 26th - Oct 2nd, 2016

    Time-Series Databases

  • Timely (364 stars) - Timely is a time series database application that provides secure access to time series data based on Accumulo and Grafana.
  • Blueflood (591 stars) - A distributed system designed to ingest and process time series data
  • Dalmatiner DB (704 stars) - Fast distributed metrics database
  • Rhombus - A time-series object store for Cassandra that handles all the complexity of building wide row indexes.
  • Akumuli (772 stars) - Akumuli is a numeric time-series database. It can be used to capture, store and process time-series data in real-time. The word "akumuli" can be translated from Esperanto as "accumulate".
  • Riak-TS - Riak TS is the only enterprise-grade NoSQL time series database optimized specifically for IoT and Time Series data.
  • Druid (11.4k stars) - Column oriented distributed data store ideal for powering interactive applications
  • Sep 19th - Sep 25th, 2016

    Interesting Readings

  • Monitoring Kafka performance - Guide to monitoring Apache Kafka, including native methods for metrics collection.
  • Monitoring Hadoop performance - Guide to monitoring Hadoop, with an overview of Hadoop architecture, and native methods for metrics collection.
  • Sep 12th - Sep 18th, 2016

    Books

    Streaming

  • Streaming Data - Streaming Data introduces the concepts and requirements of streaming and real-time data systems.
  • Storm Applied - Storm Applied is a practical guide to using Apache Storm for the real-world tasks associated with processing and analyzing real-time data streams.
  • Fundamentals of Stream Processing: Application Design, Systems, and Analytics - This comprehensive, hands-on guide combining the fundamental building blocks and emerging research in stream processing is ideal for application designers, system builders, analytic developers, as well as students and researchers in the field.
  • Stream Data Processing: A Quality of Service Perspective - Presents a new paradigm suitable for stream and complex event processing.
  • Aug 29th - Sep 4th, 2016

    SQL-like processing

  • Apache Calcite - framework that allows efficient translation of queries involving heterogeneous and federated data.
  • Aug 15th - Aug 21st, 2016

    Applications

  • ElastAlert (7.6k stars) - ElastAlert is a simple framework for alerting on anomalies, spikes, or other patterns of interest from data in Elasticsearch.
  • Kapacitor (2.1k stars) - an open source framework for processing, monitoring, and alerting on time series data.
  • Columnar Databases

  • ClickHouse - an open-source column-oriented database management system that allows generating analytical data reports in real time.
  • EventQL - a distributed, column-oriented database built for large-scale event collection and analytics.
  • Distributed Filesystem

  • Ambry (1.5k stars) - a distributed object store that supports storage of trillions of small immutable objects as well as billions of large objects.
  • Key-value Data Model

  • BuntDB (3.5k stars) - a fast, embeddable, in-memory key/value database for Go with custom indexing and geospatial support.
  • Bolt (12.3k stars) - an embedded key-value database for Go.
  • HyperDex (1.4k stars) - a scalable, next generation key-value and document store with a wide array of features, including consistency, fault tolerance and high performance.
  • Data Visualization

  • ReCharts - A composable charting library built on React components
  • Jun 27th - Jul 3rd, 2016

    Time-Series Databases

  • Chronix - a time series storage built to store time series highly compressed and for fast access times.
  • Jun 20th - Jun 26th, 2016

    Distributed Programming

  • Apache Gearpump - real-time big data streaming engine based on Akka.
  • May 30th - Jun 5th, 2016

    Time-Series Databases

  • Cube - uses MongoDB to store time series data.
  • Newts - a time series database based on Apache Cassandra.
  • TrailDB - an efficient tool for storing and querying series of events.
  • Data Visualization

  • AnyChart - fast, simple and flexible JavaScript (HTML5) charting library featuring pure JS API.
  • May 23rd - May 29th, 2016

    Data Visualization

  • Bloomery (16 stars) - Web UI for Impala.
  • Time-Series Databases

  • Kairosdb (1.6k stars) - similar to OpenTSDB but allows for Cassandra.
  • Distributed Programming

  • Twitter Heron (3.6k stars) - Heron is a realtime, distributed, fault-tolerant stream processing engine from Twitter replacing Storm.
  • Machine Learning

  • MOA - MOA performs big data stream mining in real time, and large scale machine learning.
  • May 16th - May 22nd, 2016

    Distributed Programming

  • Apache APEX - a unified, enterprise platform for big data stream and batch processing.
  • Apr 18th - Apr 24th, 2016

    Data Visualization

  • chartd - responsive, retina-compatible charts with just an img tag.
  • Apr 4th - Apr 10th, 2016

    Key-value Data Model

  • TiKV (10.3k stars) - a distributed key-value database powered by Rust and inspired by Google Spanner and HBase.
  • Mar 28th - Apr 3rd, 2016

    Distributed Programming

  • Netflix PigPen (532 stars) - map-reduce for Clojure which compiles to Apache Pig.
  • Streamdrill - useful for counting activities of event streams over different time windows and finding the most active one.
  • Mar 21st - Mar 27th, 2016

    Graph Data Model

  • DGraph (17k stars) - A scalable, distributed, low latency, high throughput graph database aimed at providing Google production level scale and throughput, with low enough latency to be serving real time user queries, over terabytes of structured data.
  • Mar 14th - Mar 20th, 2016

    Distributed Filesystem

  • Alluxio - reliable file sharing at memory speed across cluster frameworks.
  • Mar 7th - Mar 13th, 2016

    Interesting Papers

    2015 - 2016

  • 2015 - Facebook - One Trillion Edges: Graph Processing at Facebook-Scale.
  • Feb 29th - Mar 6th, 2016

    Data Ingestion

  • Skizze (774 stars) - sketch data store to deal with all problems around counting and sketching using probabilistic data-structures.
  • Data Visualization

  • Shiny - a web application framework for R.
  • Feb 22nd - Feb 28th, 2016

    Key-value Data Model

  • GridDB (1.6k stars) - suitable for sensor data stored in a timeseries.
  • Applications

  • SnappyData (1k stars) - a distributed in-memory data store for real-time operational analytics, delivering stream analytics, OLTP (online transaction processing) and OLAP (online analytical processing) built on Spark in a single integrated cluster.
  • Jan 18th - Jan 24th, 2016

    Distributed Programming

  • Tuktustars