Azure Data Engineering - A book about data engineering in general and the Azure platform specifically
Sep 28th - Oct 4th, 2020
Key-value Data Model
Graviton - a simple, fast, versioned, authenticated, embeddable key-value store database in pure Go(lang).
Sep 14th - Sep 20th, 2020
Videos
Elasticsearch 7 and Elastic Stack - LiveVideo tutorial that covers searching, analyzing, and visualizing big data on a cluster with Elasticsearch, Logstash, Beats, Kibana, and more.
Scheduling
Dagster - a data orchestrator for machine learning, analytics, and ETL.
Invantive SQL - SQL engine for online and on-premise use with integrated local data replication and 70+ connectors.
Aug 3rd - Aug 9th, 2020
SQL-like processing
Materialize - a streaming database for real-time applications using SQL for queries and supporting a large fraction of PostgreSQL.
Jul 13th - Jul 19th, 2020
Key-value Data Model
GhostDB - a distributed, in-memory, general purpose key-value data store that delivers microsecond performance at any scale.
Data Ingestion
Apache Pulsar - a distributed pub-sub messaging platform with a very flexible messaging model and an intuitive client API.
Jul 6th - Jul 12th, 2020
Search engine and framework
Weaviate - Weaviate is a GraphQL-based semantic search engine with built-in (word) embeddings.
Jun 8th - Jun 14th, 2020
Books
Streaming
Grokking Streaming Systems - Grokking Streaming Systems helps you unravel what streaming systems are, how they work, and whether they’re right for your business. Written to be tool-agnostic, you’ll be able to apply what you learn no matter which framework you choose.
May 18th - May 24th, 2020
Data Ingestion
redpanda - A Kafka® replacement for mission critical systems; 10x faster. Written in C++.
Oryx - Lambda architecture on Apache Spark, Apache Kafka for real-time large scale machine learning.
Lambdo - Lambdo is a workflow engine which significantly simplifies the analysis process by unifying feature engineering and machine learning operations.
Frameworks
Bistro - general-purpose data processing engine for both batch and stream analytics. It is based on a novel data model, which represents data via functions and processes data via column operations as opposed to having only set operations in conventional approaches like MapReduce or SQL.
Jan 20th - Jan 26th, 2020
Machine Learning
Karate Club - An unsupervised machine learning library for graph-structured data, written in Python.
Jan 13th - Jan 19th, 2020
Data Visualization
DataSphere Studio - one-stop data application development management portal.
System Deployment
Linkis - Linkis helps easily connect to various back-end computation/storage engines.
TDengine - a time series database in C utilizing unique features of IoT to improve read/write throughput and reduce space needed to store data
Oct 7th - Oct 13th, 2019
NewSQL Databases
KarelDB - a relational database backed by Apache Kafka.
Sep 30th - Oct 6th, 2019
Business Intelligence
Knowage - open source business intelligence platform. (former SpagoBi)
Search engine and framework
Facebook Faiss - a library for efficient similarity search and clustering of dense vectors. It contains algorithms that search in sets of vectors of any size, up to ones that possibly do not fit in RAM. It also contains supporting code for evaluation and parameter tuning. Faiss is written in C++ with complete wrappers for Python/numpy.
Annoy - a C++ library with Python bindings to search for points in space that are close to a given query point. It also creates large read-only file-based data structures that are mmapped into memory so that many processes may share the same data.
Spark in Action & Spark in Action 2nd Ed. - Spark in Action teaches you the theory and skills you need to effectively handle batch and streaming data using Spark. Fully updated for Spark 2.0.
Applications
Jupyter - Notebook and project application for interactive data science and scientific computing across all programming languages.
Time-Series Databases
IronDB - scalable, general-purpose time series database.
VictoriaMetrics - fast, scalable and resource-effective open-source TSDB compatible with Prometheus. Single-node and cluster versions included
Distributed Programming
Ray - A fast and simple framework for building and running distributed applications.
Polyaxon - A platform for reproducible and scalable machine learning and deep learning.
Jan 7th - Jan 13th, 2019
Machine Learning
Feast - A feature store for the management, discovery, and access of machine learning features. Feast provides a consistent view of feature data for both model training and model serving.
Nov 12th - Nov 18th, 2018
Business Intelligence
Numeracy - Fast, clean SQL client and business intelligence.
Oct 29th - Nov 4th, 2018
Books
Streaming
Fusion in Action - Fusion in Action teaches you to build a full-featured data analytics pipeline, including document and data search and distributed data clustering.
Mara - A lightweight opinionated ETL framework, halfway between plain scripts and Apache Airflow
Oct 22nd - Oct 28th, 2018
Books
Streaming
Data Science at Scale with Python and Dask - Data Science at Scale with Python and Dask teaches you how to build distributed data projects that can handle huge amounts of data.
M3DB - a distributed time series database that can be used for storing realtime metrics at long retention.
Data Visualization
Frappe Charts - GitHub-inspired simple and modern SVG charts for the web with zero dependencies.
Graph Data Model
Microsoft Graph Engine - a distributed in-memory data processing engine, underpinned by a strongly-typed in-memory key-value store and a general distributed computation engine.
DevExtreme React Chart - High-performance plugin-based React chart for Bootstrap and Material Design.
Columnar Databases
Google BigQuery - Google's cloud offering backed by their pioneering work on Dremel.
Amazon Redshift - Amazon's cloud offering, also based on a columnar datastore backend.
IndexR - an open-source columnar storage format for fast, real-time analytics on big data.
LocustDB - an experimental analytics database aiming to set a new standard for query performance on commodity hardware.
Jun 18th - Jun 24th, 2018
Distributed Programming
Apache Beam - a unified model and set of language-specific SDKs for defining and executing data processing workflows.
May 14th - May 20th, 2018
Time-Series Databases
Thanos - Thanos is a set of components to create a highly available metric system with unlimited storage capacity using multiple (existing) Prometheus deployments.
Apr 16th - Apr 22nd, 2018
Distributed Index
Pilosa - Open source distributed bitmap index that dramatically accelerates queries across multiple, massive data sets.
Baidu Bigflow - an interface that allows for writing distributed computing programs providing lots of simple, flexible, powerful APIs to easily handle data of any scale.
Kafka in Action - Kafka in Action is a fast-paced introduction to every aspect of working with Kafka you need to really reap its benefits.
Reactive Data Handling - Reactive Data Handling is a collection of five hand-picked chapters, selected by Manuel Bernhardt, that introduce you to building reactive applications capable of handling real-time processing with large data loads--free eBook!
Oct 30th - Nov 5th, 2017
Search engine and framework
Vespa - an engine for low-latency computation over large data sets. It stores and indexes your data such that queries, selection and processing over the data can be performed at serving time.
SiriDB - Highly-scalable, robust and fast, open source time series database with cluster functionality.
Oct 9th - Oct 15th, 2017
RDBMS
Teradata - high-performance MPP data warehouse platform.
Distributed Filesystem
Apache Kudu - Hadoop's storage layer to enable fast analytics on fast data.
SQL-like processing
Aster Database - SQL-like analytic processing for MapReduce.
Time-Series Databases
Axibase Time Series Database - Integrated time series database on top of HBase with built-in visualization, rule-engine and SQL support.
Applications
AthenaX - a streaming analytics platform that enables users to run production-quality, large scale streaming analytics using Structured Query Language (SQL).
Oct 2nd - Oct 8th, 2017
Books
Streaming
Kafka Streams in Action - Kafka Streams in Action teaches you everything you need to know to implement stream processing on data flowing into your Kafka platform, allowing you to focus on getting more from your data without sacrificing time or effort.
Big Data - Big Data teaches you to build big data systems using an architecture that takes advantage of clustered hardware along with new tools designed specifically to capture and analyze web-scale data.
Graph Data Model
NodeXL - A free, open-source template for Microsoft® Excel® 2007, 2010, 2013 and 2016 that makes it easy to explore network graphs.
Sep 25th - Oct 1st, 2017
Distributed Programming
Wallaroo - The ultrafast and elastic data processing engine. Big or fast data - no fuss, no Java needed.
Sep 18th - Sep 24th, 2017
Security
BDA - The vulnerability detector for Hadoop and Spark
Jul 31st - Aug 6th, 2017
Scheduling
Apache Airflow - a platform to programmatically author, schedule and monitor workflows.
Azure Data Factory - cloud-based pipeline orchestration for on-prem, cloud and HDInsight
Jul 17th - Jul 23rd, 2017
PostgreSQL forks and evolutions
TimescaleDB - An open-source time-series database optimized for fast ingest and complex queries
PipelineDB - The Streaming SQL Database. An open-source relational database that runs SQL queries continuously on streams, incrementally storing results in tables
Jul 10th - Jul 16th, 2017
NewSQL Databases
Comdb2 - a clustered RDBMS built on optimistic concurrency control techniques.
Jul 3rd - Jul 9th, 2017
RDBMS
MySQL - The world's most popular open source database.
PostgreSQL - The world's most advanced open source database.
Distributed Programming
Apache MapReduce - programming model for processing large data sets with a parallel, distributed algorithm on a cluster.
Apache S4 - framework for stream processing, implementation of S4.
Google Dataflow - create data pipelines to help them ingest, transform and analyze data.
Pivotal Greenplum - purpose-built, dedicated analytic data warehouse that offers a columnar engine as well as a traditional row-based one.
SQream DB - A GPU powered big data database, designed for analytics and data warehousing, with ANSI-92 compliant SQL, suitable for data sets from 10TB to 1PB.
NewSQL Databases
Google F1 - distributed SQL database built on Spanner.
Google Spanner - globally distributed semi-relational database.
SAP HANA - an in-memory, column-oriented, relational database management system.
Time-Series Databases
Prometheus - a time series database and service monitoring system.
Amazon Kinesis - real-time processing of streaming data at massive scale.
Embulk - open-source bulk data loader that helps data transfer between various databases, storages, file formats, and cloud services.
Google Photon - geographically distributed system for joining multiple continuously flowing streams of data in real-time with high scalability and low latency.
LinkedIn Databus - stream of change capture events for a database.
Service Programming
Google Chubby - a lock service for loosely-coupled distributed systems.
Google Borg - job scheduling and monitoring system.
Hortonworks HOYA - application that can deploy HBase cluster on YARN.
Applications
Apache Metron - a platform that integrates a variety of open source big data technologies in order to offer a centralized tool for security monitoring and analysis.
Argus - Time series monitoring and alerting platform.
Metricsgraphic.js - a library built on top of D3 that is optimized for time-series data
Superset - a data exploration platform designed to be visual, intuitive and interactive, making it easy to slice, dice and visualize data and perform analytics at the speed of thought.
Zeppelin - a notebook-style collaborative data analysis tool.
Zing Charts - JavaScript charting library for big data.
Internet of things and sensor data
ThingWorx - Rapid development and connection of intelligent systems
Interesting Readings
NoSQL Comparison - Cassandra vs MongoDB vs CouchDB vs Redis vs Riak vs HBase vs Couchbase vs Neo4j vs Hypertable vs ElasticSearch vs Accumulo vs VoltDB vs Scalaris comparison.
Interesting Papers
2013 - 2014
2013 - Google - HyperLogLog in Practice: Algorithmic Engineering of a State of The Art Cardinality Estimation Algorithm.
2013 - Google - F1: A Distributed SQL Database That Scales.
Interesting Papers
2001 - 2010
2010 - Google - Large-scale Incremental Processing Using Distributed Transactions and Notifications (the basis of Percolator and Caffeine).
2010 - Google - Dremel: Interactive Analysis of Web-Scale Datasets.
2006 - Google - The Chubby lock service for loosely-coupled distributed systems.
2004 - Google - MapReduce: Simplified Data Processing on Large Clusters.
Unified Log Processing - Unified Log Processing is a practical guide to implementing a unified log of event streams (Kafka or Kinesis) in your business
Jun 26th - Jul 2nd, 2017
Data Ingestion
Alooma - data pipeline as a service enabling moving data sources such as MySQL into data warehouses.
Jun 19th - Jun 25th, 2017
Key-value Data Model
BTDB - Key Value Database in .Net with Object DB Layer, RPC, dynamic IL and much more
May 29th - Jun 4th, 2017
SQL-like processing
PipelineDB - an open-source relational database that runs SQL queries continuously on streams, incrementally storing results in tables.
Key-value Data Model
Badger - a fast, simple, efficient, and persistent key-value store written natively in Go.
May 22nd - May 28th, 2017
Graph Data Model
AgensGraph - a new generation multi-model graph database for the modern complex data environment.
Search engine and framework
MG4J - MG4J (Managing Gigabytes for Java) is a full-text search engine for large document collections written in Java. It is highly customisable, high-performance and provides state-of-the-art features and new research algorithms.
Mar 27th - Apr 2nd, 2017
Frameworks
IBM Streams - platform for distributed processing and real-time analytics. Integrates with many of the popular technologies in the Big Data ecosystem (Kafka, HDFS, Spark, etc.)
Distributed Programming
IBM Streams - platform for distributed processing and real-time analytics. Provides toolkits for advanced analytics like geospatial, time series, etc. out of the box.
streamsx.topology - Libraries to enable building IBM Streams applications in Java, Python or Scala.
Internet of things and sensor data
Apache Edgent (Incubating) - a programming model and micro-kernel style runtime that can be embedded in gateways and small footprint edge devices enabling local, real-time, analytics on the edge devices.
ScyllaDB - column-oriented distributed datastore written in C++, totally compatible with Apache Cassandra.
Graph Data Model
GraphLab PowerGraph - a core C++ GraphLab API and a collection of high-performance machine learning and data mining toolkits built on top of the GraphLab API.
Benchmarking
Yahoo Gridmix3 - Hadoop cluster benchmarking from the Yahoo engineering team.
Data Visualization
Lumify - open source big data analysis and visualization platform
Feb 6th - Feb 12th, 2017
Frameworks
Pachyderm - Pachyderm is a data storage platform built on Docker and Kubernetes to provide reproducible data processing and analysis.
Jan 30th - Feb 5th, 2017
Service Programming
Hydrosphere Mist - a service for exposing Apache Spark analytics jobs and machine learning models as realtime, batch or reactive web services.
Jan 23rd - Jan 29th, 2017
Key-value Data Model
Edis - a protocol-compatible server replacement for Redis.
Skale - High performance distributed data processing in NodeJS.
Nov 14th - Nov 20th, 2016
Applications
Rakam - open-source real-time custom analytics platform powered by Postgresql, Kinesis and PrestoDB.
Oct 24th - Oct 30th, 2016
Key-value Data Model
Tile38 - a geolocation data store, spatial index, and realtime geofence, supporting a variety of object types including latitude/longitude points, bounding boxes, XYZ tiles, Geohashes, and GeoJSON
SummitDB - an in-memory, NoSQL key/value database, with disk persistence and using the Raft consensus algorithm.
Applications
411 - a web application for alert management resulting from scheduled searches into Elasticsearch.
Atlas - a backend for managing dimensional time series data.
Graph Data Model
EliasDB - a lightweight graph based database that does not require any third-party libraries.
NewSQL Databases
Bedrock - a simple, modular, networked and distributed transaction layer built atop SQLite.
DataVec - A vectorization and data preprocessing library for deep learning in Java and Scala. Part of the Deeplearning4j ecosystem.
Deeplearning4j - Fast, open deep learning for the JVM (Java, Scala, Clojure). A neural network configuration layer powered by a C++ library. Uses Spark and Hadoop to train nets on multiple GPUs and CPUs.
H2O - statistical, machine learning and math runtime with Hadoop. R and Python.
Keras - An intuitive neural net API inspired by Torch that runs atop Theano and Tensorflow.
Mahout - An Apache-backed machine learning library for Hadoop.
ND4J - A matrix library for the JVM. Numpy for Java.
RL4J - Reinforcement learning for Java and Scala. Includes Deep-Q learning and A3C algorithms, and integrates with Open AI's Gym. Runs in the Deeplearning4j ecosystem.
Sibyl - System for Large Scale Machine Learning at Google.
TensorFlow - Library from Google for machine learning using data flow graphs.
Theano - A Python-focused machine learning library supported by the University of Montreal.
Torch - A deep learning library with a Lua API, supported by NYU and Facebook.
Velox - System for serving machine learning predictions.
Rhombus - A time-series object store for Cassandra that handles all the complexity of building wide row indexes.
Akumuli - Akumuli is a numeric time-series database. It can be used to capture, store and process time-series data in real-time. The word "akumuli" can be translated from Esperanto as "accumulate".
Riak-TS - Riak TS is the only enterprise-grade NoSQL time series database optimized specifically for IoT and Time Series data.
Druid - Column oriented distributed data store ideal for powering interactive applications
Sep 19th - Sep 25th, 2016
Interesting Readings
Monitoring Kafka performance - Guide to monitoring Apache Kafka, including native methods for metrics collection.
Monitoring Hadoop performance - Guide to monitoring Hadoop, with an overview of Hadoop architecture, and native methods for metrics collection.
Sep 12th - Sep 18th, 2016
Books
Streaming
Streaming Data - Streaming Data introduces the concepts and requirements of streaming and real-time data systems.
Storm Applied - Storm Applied is a practical guide to using Apache Storm for the real-world tasks associated with processing and analyzing real-time data streams.
Fundamentals of Stream Processing: Application Design, Systems, and Analytics - This comprehensive, hands-on guide combining the fundamental building blocks and emerging research in stream processing is ideal for application designers, system builders, analytic developers, as well as students and researchers in the field.
HyperDex - a scalable, next generation key-value and document store with a wide array of features, including consistency, fault tolerance and high performance.
Data Visualization
ReCharts - A composable charting library built on React components
Jun 27th - Jul 3rd, 2016
Time-Series Databases
Chronix - a time series storage built to store time series highly compressed and for fast access times.
Jun 20th - Jun 26th, 2016
Distributed Programming
Apache Gearpump - real-time big data streaming engine based on Akka.
Streamdrill - useful for counting activities of event streams over different time windows and finding the most active one.
Mar 21st - Mar 27th, 2016
Graph Data Model
DGraph - A scalable, distributed, low latency, high throughput graph database aimed at providing Google production level scale and throughput, with low enough latency to be serving real time user queries, over terabytes of structured data.
Mar 14th - Mar 20th, 2016
Distributed Filesystem
Alluxio - reliable file sharing at memory speed across cluster frameworks.
Mar 7th - Mar 13th, 2016
Interesting Papers
2015 - 2016
2015 - Facebook - One Trillion Edges: Graph Processing at Facebook-Scale.
Feb 29th - Mar 6th, 2016
Data Ingestion
Skizze - sketch data store to deal with all problems around counting and sketching using probabilistic data-structures.
GridDB - suitable for sensor data stored in a timeseries.
Applications
SnappyData - a distributed in-memory data store for real-time operational analytics, delivering stream analytics, OLTP (online transaction processing) and OLAP (online analytical processing) built on Spark in a single integrated cluster.
Jan 18th - Jan 24th, 2016
Distributed Programming
Tuktu - Easy-to-use platform for batch and streaming computation, built using Scala, Akka and Play!
Document Data Model
RavenDB - A transactional, open-source Document Database.
Key Map Data Model
Hypertable - column-oriented distributed datastore, inspired by BigTable.
Applications
Countly - open source mobile and web analytics platform, based on Node.js & MongoDB.
Kylin - open source Distributed Analytics Engine from eBay.
Data Visualization
Redash - open-source platform to query and visualize data.
Dec 14th - Dec 20th, 2015
Data Visualization
D3.compose - Compose complex, data-driven visualizations from reusable charts and components.
Dec 7th - Dec 13th, 2015
Machine Learning
BidMach - CPU and GPU-accelerated Machine Learning Library.
ENCOG - machine learning framework that supports a variety of advanced algorithms, as well as support classes to normalize and process data.
GraphLab Create - A machine learning platform in Python with a broad collection of ML toolkits, data engineering, and deployment tools.
Data Visualization
Bokeh - A powerful Python interactive visualization library that targets modern web browsers for presentation, with the goal of providing elegant, concise construction of novel graphics in the style of D3.js, but also delivering this capability with high-performance interactivity over very large or streaming datasets.
Plot.ly - Easy-to-use web service that allows for rapid creation of complex charts, from heatmaps to histograms. Upload data to create and style charts with Plotly's online spreadsheet. Fork others' plots.