youngwookim/awesome-hadoop

Big Data  8 months ago  1k
A curated list of amazingly awesome Hadoop and Hadoop ecosystem resources

Jan 24th

Packaging, Provisioning and Monitoring

  • Logit.io - Send logs from Hadoop to Elasticsearch for monitoring and alerting.

Realtime Data Processing

  • Apache Flink - Apache Flink is a platform for efficient, distributed, general-purpose data processing. It supports exactly-once stream processing.

Jan 14th, 2019

Hadoop

  • Apache Hadoop Ozone - An Object Store for Apache Hadoop

Machine learning and Big Data analytics

  • Apache Hivemall (incubating) - Apache Hivemall is a scalable machine learning library that runs on Apache Hive, Spark and Pig.

Realtime Data Processing

  • Apache Druid (incubating) - A high-performance, column-oriented, distributed data store.

Apr 12th, 2018

Data Management

  • Confluent Schema Registry for Kafka (1.8k stars) - Schema Registry provides a serving layer for your metadata. It provides a RESTful interface for storing and retrieving Avro schemas.
  • Hortonworks Schema Registry (216 stars) - Schema Registry is a framework to build metadata repositories.

Libraries and Tools

  • Schema Registry UI (397 stars) - Web tool for the Confluent Schema Registry in order to create / view / search / evolve / view history & configure Avro schemas of your Kafka cluster.

Dec 11th, 2017

Libraries and Tools

  • Apache Superset (incubating) - Apache Superset (incubating) is a modern, enterprise-ready business intelligence web application.
  • snakebite - A pure Python HDFS client
  • Apache Parquet - Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language.

Data Management

  • Apache Kudu - Kudu provides a combination of fast inserts/updates and efficient columnar scans to enable multiple real-time analytic workloads across a single storage layer, complementing HDFS and Apache HBase.

Machine learning and Big Data analytics

  • BigDL - BigDL is a distributed deep learning library for Apache Spark; with BigDL, users can write their deep learning applications as standard Spark programs, which can directly run on top of existing Spark or Hadoop clusters.

Search

  • Apache Solr - Apache Solr is an open source search platform built upon a Java library called Lucene.

Distributed Computing and Programming

  • Apache Livy (incubating) - Apache Livy (incubating) is a web service that exposes a REST interface for managing long-running Apache Spark contexts in your cluster. With Livy, new applications can be built on top of Apache Spark that require fine-grained interaction with many Spark contexts.

Hadoop and Big Data Events

  • Spark Summit
  • DataWorks Summit

Realtime Data Processing

  • Apache Pulsar (incubating) - Apache Pulsar (incubating) is a highly scalable, low-latency messaging platform running on commodity hardware. It provides simple pub-sub semantics over topics, guaranteed at-least-once delivery of messages, automatic cursor management for subscribers, and cross-datacenter replication.

SQL on Hadoop

  • Apache Impala - Apache Impala is an open source massively parallel processing (MPP) SQL query engine for data stored in a computer cluster running Apache Hadoop. Impala has been described as the open-source equivalent of Google F1, which inspired its development in 2012.

Jul 21st, 2016

Jun 2nd, 2016

SQL on Hadoop

  • Apache HAWQ (incubating) - Apache HAWQ is a Hadoop-native SQL query engine that combines the key technological advantages of an MPP database with the scalability and convenience of Hadoop.
  • Apache Hive - The Apache Hive data warehouse software facilitates reading, writing, and managing large datasets residing in distributed storage using SQL.
  • Apache Trafodion

Machine learning and Big Data analytics

  • Apache SINGA (incubating) - SINGA is a general distributed deep learning platform for training big deep learning models over large datasets.

Workflow, Lifecycle and Governance

  • Apache Airflow (27.7k stars) - Airflow is a workflow automation and scheduling system that can be used to author and manage data pipelines.

Jun 1st, 2016

SQL on Hadoop

  • Apache Drill - Schema-free SQL Query Engine

Feb 23rd, 2016

Hadoop and Big Data Events

  • ApacheCon
  • Strata + Hadoop World

Hadoop

  • Apache Tez - A Framework for YARN-based Data Processing Applications in Hadoop

Distributed Computing and Programming

  • Apache Apex (incubating) - Enterprise-grade unified stream and batch processing engine.

Nov 14th, 2015

Hadoop

  • Elasticsearch Hadoop (1.9k stars) - Elasticsearch real-time search and analytics natively integrated with Hadoop. Supports Map/Reduce, Cascading, Apache Hive and Apache Pig.
  • Apache Ignite - Distributed in-memory platform

YARN

  • mpich2-yarn (110 stars) - Running MPICH2 on YARN

SQL on Hadoop

  • Presto - Distributed SQL Query Engine for Big Data. Open sourced by Facebook.

Data Management

  • Apache Calcite - A Dynamic Data Management Framework

Workflow, Lifecycle and Governance

  • Apache Falcon - Data management and processing platform
  • Apache NiFi - A dataflow system

DSL

  • vahara (51 stars) - Machine learning and natural language processing with Apache Pig

Libraries and Tools

  • Elephant Bird (1.1k stars) - Twitter's collection of LZO and Protocol Buffer-related Hadoop, Pig, Hive, and HBase code.

Realtime Data Processing

  • Apache Storm
  • Apache Samza

Distributed Computing and Programming

  • SparkHub - A community site for Apache Spark
  • Apache Flink - Apache Flink is a platform for efficient, distributed, general-purpose data processing.

Search

  • Elasticsearch

Machine learning and Big Data analytics

  • Apache Lens

Oct 18th, 2015

Search Engine Framework

  • Apache Nutch - Apache Nutch is a highly extensible and scalable open source web crawler software project.

Websites

  • Hadoop360

Sep 9th, 2015

Libraries and Tools

  • Oozie Eclipse Plugin - A graphical editor for editing Apache Oozie workflows inside Eclipse.

Jul 28th, 2015

Machine learning and Big Data analytics

  • RHadoop including RHDFS, RHBase, RMR2, plyrmr

Jul 27th, 2015

NoSQL

  • Apache Phoenix - A SQL skin over HBase supporting secondary indices

SQL on Hadoop

  • Apache Phoenix - A SQL skin over HBase supporting secondary indices
  • Lingual - SQL interface for Cascading (MR/Tez job generator)

Data Management

  • Apache Atlas - Metadata tagging & lineage capture supporting complex business data taxonomies

Workflow, Lifecycle and Governance

  • Luigi - Python package that helps you build complex pipelines of batch jobs

Realtime Data Processing

  • Apache Spark

Jul 23rd, 2015

Distributed Computing and Programming

  • Spark Packages - A community index of packages for Apache Spark

Jul 9th, 2015

Benchmark

  • YCSB (4.3k stars) - The Yahoo! Cloud Serving Benchmark (YCSB) is an open-source specification and program suite for evaluating retrieval and maintenance capabilities of computer programs. It is often used to compare relative performance of NoSQL database management systems.

Jul 8th, 2015

Workflow, Lifecycle and Governance

  • Apache Oozie - Apache Oozie

Jul 1st, 2015

Data Ingestion and Integration

  • Gobblin from LinkedIn (2.1k stars) - Universal data ingestion framework for Hadoop

Jun 29th, 2015

Misc.

  • Hive Plugins
  • UDF
  • Storage Handler
  • Libraries and tools
  • Flume Plugins

Machine learning and Big Data analytics

  • Oryx 2 (1.8k stars) - Lambda architecture on Spark, Kafka for real-time large scale machine learning

Jun 18th, 2015

Hadoop

  • Apache Kylin - Apache Kylin is an open source Distributed Analytics Engine from eBay Inc. that provides a SQL interface and multi-dimensional analysis (OLAP) on Hadoop, supporting extremely large datasets.

May 15th, 2015

Machine learning and Big Data analytics

  • Apache Mahout

Apr 28th, 2015

Libraries and Tools

  • Apache Zeppelin - A web-based notebook that enables interactive data analytics

Jan 29th, 2015

Hadoop

  • Crunch (206 stars) - Go-based toolkit for ETL and feature extraction on Hadoop

SQL on Hadoop

  • Apache Tajo - Data warehouse system for Apache Hadoop

Libraries and Tools

  • hdfs (1.2k stars) - A native Go client for HDFS

Packaging, Provisioning and Monitoring

  • inviso (199 stars) - Inviso is a lightweight tool that provides the ability to search for Hadoop jobs, visualize the performance, and view cluster utilization.

Search

  • Banana (667 stars) - Kibana port for Apache Solr

Security

  • Apache Ranger - Ranger is a framework to enable, monitor and manage comprehensive data security across the Hadoop platform.
  • Apache Sentry - An authorization module for Hadoop
  • Apache Knox Gateway - A REST API Gateway for interacting with Hadoop clusters.

Websites

  • AWS BigData Blog

Oct 29th, 2014

Data Ingestion and Integration

  • Apache Sqoop - Apache Sqoop
  • Apache Kafka - Apache Kafka

YARN

  • Apache Twill - Apache Twill is an abstraction over Apache Hadoop® YARN that reduces the complexity of developing distributed applications, allowing developers to focus more on their application logic.
  • Apache Slider - Apache Slider is a project in incubation at the Apache Software Foundation with the goal of making it possible and easy to deploy existing applications onto a YARN cluster.

Presentations

  • Docker based Hadoop provisioning

Jul 14th, 2014

Libraries and Tools

  • Hue - A Web interface for analyzing data with Apache Hadoop.
  • Spring for Apache Hadoop
  • Apache Thrift
  • Apache Avro - Apache Avro is a data serialization system.

Presentations

  • Apache Hadoop In Theory And Practice
  • Hadoop Operations at LinkedIn
  • Hadoop Performance at LinkedIn

Books

  • Programming Pig
  • Programming Hive
  • Hadoop: The Definitive Guide
  • Hadoop Operations
  • Apache Hadoop YARN
  • HBase: The Definitive Guide

Machine learning and Big Data analytics

  • MLlib - MLlib is Apache Spark's scalable machine learning library.
  • R - R is a free software environment for statistical computing and graphics.

NoSQL

  • Apache Cassandra
  • Apache Accumulo - The Apache Accumulo™ sorted, distributed key/value store is a robust, scalable, high performance data storage and retrieval system.
  • OpenTSDB - The Scalable Time Series Database

Distributed Computing and Programming

  • Apache Spark
  • Apache Crunch
  • Cascading - Cascading is the proven application development platform for building data applications on Hadoop.

Packaging, Provisioning and Monitoring

  • Apache ZooKeeper - Apache ZooKeeper
  • Apache Curator - ZooKeeper client wrapper and rich ZooKeeper framework
  • Apache Ambari - Apache Ambari
  • ankush (21 stars) - A big data cluster management tool that creates and manages clusters of different technologies.

Websites

  • The Hadoop Ecosystem Table
  • Hadoop illuminated - Open Source Hadoop Book

Hadoop

  • Genie (1.6k stars) - Genie provides REST-ful APIs to run Hadoop, Hive and Pig jobs, and to manage multiple Hadoop resources and perform job submissions across them.

Data Ingestion and Integration

  • Suro (773 stars) - Netflix's distributed Data Pipeline

DSL

  • PigPen (541 stars) - PigPen is map-reduce for Clojure, or distributed Clojure. It compiles to Apache Pig, but you don't need to know much about Pig to use it.

Jul 9th, 2014

Packaging, Provisioning and Monitoring

  • Apache Bigtop - Apache Bigtop: Packaging and tests of the Apache Hadoop ecosystem

Hadoop

  • White Elephant (191 stars) - Hadoop log aggregator and dashboard
  • hadoopy (243 stars) - Python MapReduce library written in Cython.
  • mrjob (2.6k stars) - mrjob is a Python 2.5+ package that helps you write and run Hadoop Streaming jobs.
  • pydoop - Pydoop is a package that provides a Python API for Hadoop.
  • hdfs-du (228 stars) - HDFS-DU is an interactive visualization of the Hadoop distributed file system.

Websites

  • Hadoop Weekly

Jul 8th, 2014

Hadoop

  • SpatialHadoop - SpatialHadoop is a MapReduce extension to Apache Hadoop designed specially to work with spatial data.
  • GIS Tools for Hadoop - Big Data Spatial Analytics for the Hadoop Framework
  • Apache Hadoop - Apache Hadoop

NoSQL

  • Haeinsa (158 stars) - Haeinsa is a linearly scalable multi-row, multi-table transaction library for HBase
  • hindex (589 stars) - Secondary Index for HBase
  • Hannibal (170 stars) - Hannibal is a tool to help monitor and maintain HBase clusters that are configured for manual splitting.
  • happybase (594 stars) - A developer-friendly Python library to interact with Apache HBase.
  • Apache HBase - Apache HBase

Packaging, Provisioning and Monitoring

  • Ganglia Monitoring System

Libraries and Tools

  • gohadoop (307 stars) - Native Go clients for Apache Hadoop YARN.
  • Kite Software Development Kit - A set of libraries, tools, examples, and documentation

Benchmark

  • HiBench (1.3k stars)
  • Big Data Benchmark

Workflow, Lifecycle and Governance

  • Azkaban

DSL

  • Apache Pig - Apache Pig
  • akela (76 stars) - Mozilla's utility library for Hadoop, HBase, Pig, etc.
  • packetpig (301 stars) - Open Source Big Data Security Analytics
  • Lipstick (460 stars) - Pig workflow visualization tool. Introducing Lipstick on A(pache) Pig
  • seqpig - Simple and scalable scripting for large sequencing data sets (e.g., bioinformatics) in Hadoop
  • Apache DataFu - A collection of libraries for working with large-scale data in Hadoop

Data Ingestion and Integration

  • Apache Flume - Apache Flume

Last Checked At: 2022-09-30T06:23:53.005Z