
Awesome Hadoop

A curated list of amazingly awesome Hadoop and Hadoop ecosystem resources. Inspired by Awesome PHP, Awesome Python and Awesome Sysadmin.


Hadoop

  • Apache Hadoop - Apache Hadoop
  • Apache Hadoop Ozone - An Object Store for Apache Hadoop
  • Apache Tez - A Framework for YARN-based, Data Processing Applications In Hadoop
  • SpatialHadoop - SpatialHadoop is a MapReduce extension to Apache Hadoop designed specially to work with spatial data.
  • GIS Tools for Hadoop - Big Data Spatial Analytics for the Hadoop Framework
  • Elasticsearch Hadoop - Elasticsearch real-time search and analytics natively integrated with Hadoop. Supports Map/Reduce, Cascading, Apache Hive and Apache Pig.
  • hadoopy - Python MapReduce library written in Cython.
  • mrjob - mrjob is a Python 2.5+ package that helps you write and run Hadoop Streaming jobs.
  • pydoop - Pydoop is a package that provides a Python API for Hadoop.
  • hdfs-du - HDFS-DU is an interactive visualization of the Hadoop distributed file system.
  • White Elephant - Hadoop log aggregator and dashboard
  • Genie - Genie provides RESTful APIs to run Hadoop, Hive and Pig jobs, and to manage multiple Hadoop resources and perform job submissions across them.
  • Apache Kylin - Apache Kylin is an open source Distributed Analytics Engine from eBay Inc. that provides a SQL interface and multi-dimensional analysis (OLAP) on Hadoop, supporting extremely large datasets.
  • Crunch - Go-based toolkit for ETL and feature extraction on Hadoop
  • Apache Ignite - Distributed in-memory platform
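
Several of the Python entries above (hadoopy, mrjob, pydoop) wrap the MapReduce model. As a rough illustration of that model only, and not of any of those libraries' APIs, the sketch below runs a word count in the map, sort, reduce style that Hadoop Streaming jobs follow. The function names `mapper`, `reducer` and `run_local` are hypothetical, and the whole pipeline runs in a single local process rather than on a cluster.

```python
from itertools import groupby
from operator import itemgetter


def mapper(lines):
    """Emit a (word, 1) pair for every whitespace-separated token."""
    for line in lines:
        for word in line.split():
            yield word.lower(), 1


def reducer(pairs):
    """Sum the counts per word; pairs must already be sorted by word."""
    for word, group in groupby(pairs, key=itemgetter(0)):
        yield word, sum(count for _, count in group)


def run_local(lines):
    """Simulate map -> sort (the shuffle step) -> reduce in one process."""
    return dict(reducer(sorted(mapper(lines))))


if __name__ == "__main__":
    print(run_local(["Hadoop stores data", "Hadoop processes data"]))
```

On a real cluster the sort between the two phases is done by the framework, and the mapper and reducer typically run as separate processes reading from standard input.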


YARN

  • Apache Slider - Apache Slider is a project in incubation at the Apache Software Foundation with the goal of making it possible and easy to deploy existing applications onto a YARN cluster.
  • Apache Twill - Apache Twill is an abstraction over Apache Hadoop® YARN that reduces the complexity of developing distributed applications, allowing developers to focus more on their application logic.
  • mpich2-yarn - Running MPICH2 on YARN


NoSQL

Next-generation databases, mostly addressing some of the following points: being non-relational, distributed, open-source and horizontally scalable.

  • Apache HBase - Apache HBase
  • Apache Phoenix - A SQL skin over HBase supporting secondary indices
  • happybase - A developer-friendly Python library to interact with Apache HBase.
  • Hannibal - Hannibal is a tool to help monitor and maintain HBase clusters that are configured for manual splitting.
  • Haeinsa - Haeinsa is a linearly scalable multi-row, multi-table transaction library for HBase
  • hindex - Secondary Index for HBase
  • Apache Accumulo - The Apache Accumulo™ sorted, distributed key/value store is a robust, scalable, high performance data storage and retrieval system.
  • OpenTSDB - The Scalable Time Series Database
  • Apache Cassandra

SQL on Hadoop

  • Apache Hive - The Apache Hive data warehouse software facilitates reading, writing, and managing large datasets residing in distributed storage using SQL
  • Apache Phoenix - A SQL skin over HBase supporting secondary indices
  • Apache HAWQ (incubating) - Apache HAWQ is a Hadoop native SQL query engine that combines the key technological advantages of MPP database with the scalability and convenience of Hadoop
  • Lingual - SQL interface for Cascading (MR/Tez job generator)
  • Apache Impala - Apache Impala is an open source massively parallel processing (MPP) SQL query engine for data stored in a computer cluster running Apache Hadoop. Impala has been described as the open-source equivalent of Google F1, which inspired its development in 2012.
  • Presto - Distributed SQL Query Engine for Big Data. Open sourced by Facebook.
  • Apache Tajo - Data warehouse system for Apache Hadoop
  • Apache Drill - Schema-free SQL Query Engine
  • Apache Trafodion

Data Management

  • Apache Calcite - A Dynamic Data Management Framework
  • Apache Atlas - Metadata tagging & lineage capture supporting complex business data taxonomies
  • Apache Kudu - Kudu provides a combination of fast inserts/updates and efficient columnar scans to enable multiple real-time analytic workloads across a single storage layer, complementing HDFS and Apache HBase.
  • Confluent Schema Registry for Kafka - Schema Registry provides a serving layer for your metadata. It provides a RESTful interface for storing and retrieving Avro schemas.
  • Hortonworks Schema Registry - Schema Registry is a framework to build metadata repositories.
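
To make the "RESTful interface for storing and retrieving Avro schemas" a little more concrete, here is a minimal sketch of how a schema registration request for the Confluent Schema Registry might be assembled. The endpoint shape (`POST /subjects/<subject>/versions` with the schema passed as a JSON-encoded string) follows the Confluent documentation as far as we know, but the host, port, subject name and the `PageView` record are assumptions made up for illustration. The request is only built, never sent.

```python
import json
from urllib.request import Request

# Hypothetical values, for illustration only.
REGISTRY_URL = "http://localhost:8081"
SUBJECT = "page-views-value"

# A small Avro record schema (invented for this example).
PAGE_VIEW_SCHEMA = {
    "type": "record",
    "name": "PageView",
    "fields": [
        {"name": "url", "type": "string"},
        {"name": "user_id", "type": "long"},
    ],
}


def build_register_request(registry_url, subject, avro_schema):
    """Build (but do not send) a request that registers a schema version."""
    # The registry expects the Avro schema as a string inside a JSON object.
    body = json.dumps({"schema": json.dumps(avro_schema)}).encode("utf-8")
    return Request(
        url=f"{registry_url}/subjects/{subject}/versions",
        data=body,
        headers={"Content-Type": "application/vnd.schemaregistry.v1+json"},
        method="POST",
    )


if __name__ == "__main__":
    request = build_register_request(REGISTRY_URL, SUBJECT, PAGE_VIEW_SCHEMA)
    print(request.get_method(), request.full_url)
```

Passing the returned object to `urllib.request.urlopen` would perform the call against a running registry. The Hortonworks Schema Registry exposes a different API, so this sketch does not apply to it.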

Workflow, Lifecycle and Governance

Data Ingestion and Integration


Libraries and Tools

Realtime Data Processing

  • Apache Storm
  • Apache Samza
  • Apache Spark
  • Apache Flink - Apache Flink is a platform for efficient, distributed, general-purpose data processing. It supports exactly-once stream processing.
  • Apache Pulsar (incubating) - Apache Pulsar (incubating) is a highly scalable, low latency messaging platform running on commodity hardware. It provides simple pub-sub semantics over topics, guaranteed at-least-once delivery of messages, automatic cursor management for subscribers, and cross-datacenter replication.
  • Apache Druid (incubating) - A high-performance, column-oriented, distributed data store.
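
The entries above mention two delivery guarantees: at-least-once (Pulsar) and exactly-once (Flink). The sketch below is a toy illustration of the difference and not a description of how either project implements its guarantee. It assumes every message carries a unique ID, simulates a broker that redelivers some messages, and shows a consumer that ignores duplicates so that each message takes effect once. All names (`IdempotentConsumer`, `deliver_at_least_once`) are hypothetical.

```python
class IdempotentConsumer:
    """Applies each message once, even if it is delivered several times."""

    def __init__(self):
        self.seen_ids = set()
        self.total = 0

    def handle(self, message_id, amount):
        """Return True if the message was applied, False if it was a duplicate."""
        if message_id in self.seen_ids:
            return False
        self.seen_ids.add(message_id)
        self.total += amount
        return True


def deliver_at_least_once(messages, consumer, redeliver_ids):
    """Deliver every message, sending those in redeliver_ids a second time."""
    for message_id, amount in messages:
        consumer.handle(message_id, amount)
        if message_id in redeliver_ids:
            # Simulated redelivery, e.g. after a lost acknowledgement.
            consumer.handle(message_id, amount)


if __name__ == "__main__":
    consumer = IdempotentConsumer()
    deliver_at_least_once([("m1", 10), ("m2", 5)], consumer, {"m1"})
    print(consumer.total)
```

Without the `seen_ids` check, the duplicate delivery of `m1` would be counted twice, which is the failure mode at-least-once delivery leaves to the consumer.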

Distributed Computing and Programming

  • Apache Spark
  • Spark Packages - A community index of packages for Apache Spark
  • SparkHub - A community site for Apache Spark
  • Apache Crunch
  • Cascading - Cascading is the proven application development platform for building data applications on Hadoop.
  • Apache Flink - Apache Flink is a platform for efficient, distributed, general-purpose data processing.
  • Apache Apex (incubating) - Enterprise-grade unified stream and batch processing engine.
  • Apache Livy (incubating) - Apache Livy (incubating) is a web service that exposes a REST interface for managing long-running Apache Spark contexts in your cluster. With Livy, new applications can be built on top of Apache Spark that require fine-grained interaction with many Spark contexts.
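
As a rough sketch of what Livy's REST interface looks like from a client, the snippet below builds the two requests typically used for interactive work: one that creates a session and one that submits a code statement to it. The endpoint paths and JSON fields (`/sessions` with `kind`, `/sessions/<id>/statements` with `code`) follow the Livy REST documentation as far as we know. The server address, the session ID and the helper names are assumptions for illustration, and nothing is sent over the network.

```python
import json
from urllib.request import Request

# Hypothetical server address, for illustration only.
LIVY_URL = "http://localhost:8998"


def _json_post(url, payload):
    """Build (but do not send) a JSON POST request."""
    return Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def create_session_request(livy_url, kind="pyspark"):
    """Request that asks Livy to start a new interactive session."""
    return _json_post(f"{livy_url}/sessions", {"kind": kind})


def submit_statement_request(livy_url, session_id, code):
    """Request that runs a snippet of code inside an existing session."""
    return _json_post(f"{livy_url}/sessions/{session_id}/statements", {"code": code})


if __name__ == "__main__":
    print(create_session_request(LIVY_URL).full_url)
    print(submit_statement_request(LIVY_URL, 0, "sc.parallelize(range(10)).sum()").full_url)
```

In practice a client would send the first request, read the session ID from the response, wait for the session to become idle, and only then submit statements.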

Packaging, Provisioning and Monitoring

  • Apache Bigtop - Apache Bigtop: Packaging and tests of the Apache Hadoop ecosystem
  • Apache Ambari - Apache Ambari
  • Ganglia Monitoring System
  • ankush - A big data cluster management tool that creates and manages clusters of different technologies.
  • Apache Zookeeper - Apache Zookeeper
  • Apache Curator - ZooKeeper client wrapper and rich ZooKeeper framework
  • inviso - Inviso is a lightweight tool that provides the ability to search for Hadoop jobs, visualize the performance, and view cluster utilization.
  • - Send logs from Hadoop to Elasticsearch for monitoring and alerting.

Search Engine Framework

  • Apache Nutch - Apache Nutch is a highly extensible and scalable open source web crawler software project.


Security

  • Apache Ranger - Ranger is a framework to enable, monitor and manage comprehensive data security across the Hadoop platform.
  • Apache Sentry - An authorization module for Hadoop
  • Apache Knox Gateway - A REST API Gateway for interacting with Hadoop clusters.


Benchmark

  • Big Data Benchmark
  • HiBench
  • YCSB - The Yahoo! Cloud Serving Benchmark (YCSB) is an open-source specification and program suite for evaluating retrieval and maintenance capabilities of computer programs. It is often used to compare relative performance of NoSQL database management systems.

Machine learning and Big Data analytics

  • Apache Mahout
  • Oryx 2 - Lambda architecture on Spark, Kafka for real-time large scale machine learning
  • MLlib - MLlib is Apache Spark's scalable machine learning library.
  • R - R is a free software environment for statistical computing and graphics.
  • RHadoop - Includes RHDFS, RHBase, RMR2 and plyrmr
  • Apache Lens
  • Apache SINGA (incubating) - SINGA is a general distributed deep learning platform for training big deep learning models over large datasets
  • BigDL - BigDL is a distributed deep learning library for Apache Spark; with BigDL, users can write their deep learning applications as standard Spark programs, which can directly run on top of existing Spark or Hadoop clusters.
  • Apache Hivemall (incubating) - Apache Hivemall is a scalable machine learning library that runs on Apache Hive, Spark and Pig.



Resources

Various resources, such as books, websites and articles.


Useful websites and articles



Hadoop and Big Data Events

Other Awesome Lists

Other amazingly awesome lists can be found in the awesome-awesomeness and awesome lists.


  1. Awesome Hadoop
  2. Hadoop
  3. YARN
  4. NoSQL
  5. SQL on Hadoop
  6. Data Management
  7. Workflow, Lifecycle and Governance
  8. Data Ingestion and Integration
  9. DSL
  10. Libraries and Tools
  11. Realtime Data Processing
  12. Distributed Computing and Programming
  13. Packaging, Provisioning and Monitoring
  14. Search
  15. Search Engine Framework
  16. Security
  17. Benchmark
  18. Machine learning and Big Data analytics
  19. Misc.
  20. Resources
  21. Websites
  22. Presentations
  23. Books
  24. Hadoop and Big Data Events
  25. Other Awesome Lists
Last Checked At: 2022-08-15T14:51:22.890Z

