Track Awesome Spark Updates Weekly
A curated list of awesome Apache Spark packages and resources.
😺 awesome-spark/awesome-spark · ⭐ 1.5K · 🏷️ Big Data
Apr 17 - Apr 23, 2023
Packages / Middleware
- Apache Kyuubi (⭐1.6k)
- A distributed multi-tenant JDBC server for large-scale data processing and analytics, built on top of Apache Spark.
Resources / Docker Images
- apache/spark - Apache Spark Official Docker images.
Feb 27 - Mar 05, 2023
Packages / Language Bindings
- Kotlin for Apache Spark (⭐378)
- Kotlin API bindings and extensions.
Dec 27, 2021 - Jan 02, 2022
Packages / GIS
- Apache Sedona (⭐1.4k)
- Cluster computing system for processing large-scale spatial data.
Dec 20 - Dec 26, 2021
Packages / Machine Learning Extension
- MLflow
- Machine learning orchestration platform.
Dec 06 - Dec 12, 2021
Packages / General Purpose Libraries
- Joblib Apache Spark Backend (⭐221)
- joblib backend for running tasks on Spark clusters.
Packages / Storage
- lakeFS
- Integration with the lakeFS atomic versioned storage layer.
Resources / Docker Images
- datamechanics/spark - An easy-to-set-up Docker image for Apache Spark from Data Mechanics.
Nov 29 - Dec 05, 2021
Packages / Notebooks and IDEs
- Polynote
- Polynote: an IDE-inspired polyglot notebook. It supports mixing multiple languages in one notebook, and sharing data between them seamlessly. It encourages reproducible notebooks with its immutable data model. Originating from Netflix.
Packages / SQL Data Sources
- Spark CSV (⭐1.1k)
- CSV reader and writer (obsolete since Spark 2.0 [SPARK-12833]).
Packages / Machine Learning Extension
- Clustering4Ever (⭐124)
- Scala and Spark API to benchmark and analyse clustering algorithms on any vectorization you can generate.
- Microsoft ML for Apache Spark (⭐4k)
- A distributed ML library with support for LightGBM, Vowpal Wabbit, OpenCV, Deep Learning, Cognitive Services, and Model Deployment.
Packages / Natural Language Processing
- spark-nlp (⭐3.2k)
- Natural language processing library built on top of Apache Spark ML.
Resources / Papers
- Structured Streaming: A Declarative API for Real-Time Applications in Apache Spark - Structured Streaming is a new high-level streaming API; it is declarative and based on automatically incrementalizing a static relational query.
Aug 16 - Aug 22, 2021
Packages / General Purpose Libraries
- Apache DataFu (⭐103)
- A library of general purpose functions and UDFs.
Aug 09 - Aug 15, 2021
Packages / General Purpose Libraries
- spark-daria (⭐714)
- A Scala library with essential Spark functions and extensions to make you more productive.
- quinn (⭐445)
- A native PySpark implementation of spark-daria.
Resources / Books
- Learning Spark, 2nd Edition - Introduction to the Spark API, covering Spark 3.0. A good source of knowledge about basic concepts.
Mar 15 - Mar 21, 2021
Packages / Monitoring
- Data Mechanics Delight (⭐303)
- Cross-platform monitoring tool (Spark UI / Spark History Server replacement).
Feb 15 - Feb 21, 2021
Packages / General Purpose Libraries
- itachi (⭐44)
- A library that brings useful functions from modern database management systems to Apache Spark.
Nov 09 - Nov 15, 2020
Packages / Storage
- Delta Lake (⭐5.9k)
- Storage layer with ACID transactions.
Oct 12 - Oct 18, 2020
Packages / Interfaces
- Koalas (⭐3.3k)
- Pandas DataFrame API on top of Apache Spark.
Oct 05 - Oct 11, 2020
Packages / Language Bindings
- Mobius (⭐940)
- C# bindings (Deprecated in favor of .NET for Apache Spark).
- .NET for Apache Spark (⭐1.9k)
- .NET bindings.
Packages / Utilities
- pyspark-stubs (⭐113)
- Static type annotations for PySpark (obsolete since Spark 3.1. See SPARK-32681).
Sep 28 - Oct 04, 2020
Packages / Utilities
- Optimus (⭐1.4k)
- Data Cleansing and Exploration utilities with the goal of simplifying data cleaning.
Resources / Papers
- Large-Scale Intelligent Microservices - Microsoft paper that presents an Apache Spark-based micro-service orchestration framework that extends database operations to include web service primitives.
Jul 20 - Jul 26, 2020
Packages / Middleware
- Livy (⭐763)
- REST server with extensive language support (Python, R, Scala), ability to maintain interactive sessions and object sharing.
Dec 16 - Dec 22, 2019
Packages / Testing
- deequ (⭐2.7k)
- Deequ is a library built on top of Apache Spark for defining "unit tests for data", which measure data quality in large datasets.
Dec 02 - Dec 08, 2019
Packages / Notebooks and IDEs
Aug 26 - Sep 01, 2019
Packages / SQL Data Sources
- Spark Avro (⭐539)
- Apache Avro reader and writer (obsolete since Spark 2.4 [SPARK-24768]).
Jan 28 - Feb 03, 2019
Packages / Web Archives
- Archives Unleashed Toolkit (⭐124)
- Open-source toolkit for analyzing web archives.
Aug 13 - Aug 19, 2018
Packages / Language Bindings
- Flambo (⭐607)
- Clojure DSL.
- sparklyr (⭐903)
- An alternative R backend, using dplyr.
- sparkle (⭐442)
- Haskell on Apache Spark.
Packages / Notebooks and IDEs
- Apache Zeppelin
- Web-based notebook that enables interactive data analytics with pluggable backends, integrated plotting, and extensive Spark support out-of-the-box.
- Spark Notebook (⭐3.1k)
- Scalable and stable Scala and Spark focused notebook bridging the gap between JVM and Data Scientists (incl. extendable, typesafe and reactive charts).
- sparkmagic (⭐1.2k)
- Jupyter magics and kernels for interactively working with remote Spark clusters through Livy (⭐995) in Jupyter notebooks.
Packages / General Purpose Libraries
- Succinct
- Support for efficient queries on compressed data.
Packages / SQL Data Sources
- Spark XML (⭐441)
- XML parser and writer.
- Spark Cassandra Connector (⭐1.9k)
- Cassandra support including data source and API and support for arbitrary queries.
- Spark Riak Connector (⭐57)
- Riak TS & Riak KV connector.
- Mongo-Spark (⭐674)
- Official MongoDB connector.
- OrientDB-Spark (⭐19)
- Official OrientDB connector.
Packages / Bioinformatics
- ADAM (⭐949)
- Set of tools designed to analyse genomics data.
- Hail (⭐874)
- Genetic analysis framework.
Packages / GIS
- Magellan (⭐525)
- Geospatial analytics using Spark.
Packages / Time Series Analytics
- Spark-Timeseries (⭐1.2k)
- Scala / Java / Python library for interacting with time series data on Apache Spark.
- flint (⭐979)
- A time series library for Apache Spark.
Packages / Graph Processing
- Mazerunner (⭐377)
- Graph analytics platform on top of Neo4j and GraphX.
- GraphFrames (⭐919)
- Data frame based graph API.
- neo4j-spark-connector (⭐293)
- Bolt protocol based, Neo4j Connector with RDD, DataFrame and GraphX / GraphFrames support.
- SparklingGraph
- Library extending GraphX features with multiple functionalities useful in graph analytics (measures, generators, link prediction etc.).
Packages / Machine Learning Extension
- dbscan-on-spark (⭐178)
- An implementation of the DBSCAN clustering algorithm on top of Apache Spark by irvingc, based on the paper by He, Yaobin, et al., "MR-DBSCAN: a scalable MapReduce-based DBSCAN algorithm for heavily skewed data."
- Apache SystemML
- Declarative machine learning framework on top of Spark.
- Mahout Spark Bindings [status unknown] - linear algebra DSL and optimizer with R-like syntax.
- spark-sklearn (⭐1.1k)
- Scikit-learn integration with distributed model training.
- JPMML-Spark (⭐95)
- PMML transformer library for Spark ML.
- Distributed Keras (⭐623)
- Distributed deep learning framework with PySpark and Keras.
- ModelDB
- A system to manage machine learning models for spark.ml and scikit-learn.
- Sparkling Water (⭐940)
- H2O interoperability layer.
- BigDL (⭐4.2k)
- Distributed Deep Learning library.
- MLeap (⭐1.4k)
- Execution engine and serialization format which supports deployment of o.a.s.ml models without dependency on SparkSession.
Packages / Middleware
- spark-jobserver (⭐2.8k)
- Simple Spark as a Service which supports object sharing using so-called named objects. JVM only.
- Mist (⭐322)
- Service for exposing Spark analytical jobs and machine learning models as realtime, batch or reactive web services.
- Apache Toree (⭐711)
- IPython protocol based middleware for interactive applications.
Packages / Utilities
- silex (⭐18)
- Collection of tools varying from ML extensions to additional RDD methods.
- sparkly (⭐54)
- Helpers & syntactic sugar for PySpark.
- Flintrock (⭐622)
- A command-line tool for launching Spark clusters on EC2.
Packages / Natural Language Processing
- spark-corenlp (⭐425)
- DataFrame wrapper for Stanford CoreNLP.
Packages / Streaming
- Apache Bahir
- Collection of the streaming connectors excluded from Spark 2.0 (Akka, MQTT, Twitter, ZeroMQ).
Packages / Interfaces
- Apache Beam
- Unified data processing engine supporting both batch and streaming applications. Apache Spark is one of the supported execution environments.
- Blaze (⭐3.1k)
- Interface for querying larger-than-memory datasets using Pandas-like syntax. It supports both Spark DataFrames and RDDs.
Packages / Testing
- spark-testing-base (⭐1.4k)
- Collection of base test classes.
- spark-fast-tests (⭐383)
- A lightweight and fast testing framework.
Packages / Workflow Management
- Cromwell (⭐881)
- Workflow management system with Spark backend.
Apr 10 - Apr 16, 2017
Resources / Papers
- Spark SQL: Relational Data Processing in Spark - Paper introducing relational underpinnings, code generation and Catalyst optimizer.
Apr 03 - Apr 09, 2017
Resources / Papers
- Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing - Paper introducing a core distributed memory abstraction.
Feb 27 - Mar 05, 2017
Resources / Books
- Advanced Analytics with Spark - Useful collection of Spark processing patterns. Accompanying GitHub repository: sryza/aas (⭐1.5k).
- Mastering Apache Spark - Interesting compilation of notes by Jacek Laskowski. Focused on different aspects of Spark internals.
- Spark Gotchas (⭐346) - Subjective compilation of tips, tricks and common programming mistakes.
- Spark in Action - New book in Manning's "in action" family with 400+ pages. Starts gently, step by step, and covers a large number of topics. Free excerpt on how to set up Eclipse for Spark application development and how to bootstrap a new application using the provided Maven Archetype. You can find the accompanying GitHub repo here (⭐270).
Resources / MOOCS
- Data Science and Engineering with Apache Spark (edX XSeries) - Series of five courses (Introduction to Apache Spark, Distributed Machine Learning with Apache Spark, Big Data Analysis with Apache Spark, Advanced Apache Spark for Data Science and Data Engineering, Advanced Distributed Machine Learning with Apache Spark) covering different aspects of software engineering and data science. Python oriented.
Resources / Workshops
- AMP Camp - Periodic training event organized by the UC Berkeley AMPLab. A source of useful exercises and recorded workshops covering different tools from the Berkeley Data Analytics Stack.
Resources / Projects Using Spark
- Oryx 2 (⭐1.8k) - Lambda architecture platform built on Apache Spark and Apache Kafka with specialization for real-time large scale machine learning.
Resources / Miscellaneous
- Spark with Scala Gitter channel - "A place to discuss and ask questions about using Scala for Spark programming" started by @deanwampler.
Oct 17 - Oct 23, 2016
Resources / Miscellaneous
- Apache Spark User List and Apache Spark Developers List - Mailing lists dedicated to usage questions and development topics respectively.
Jun 20 - Jun 26, 2016
Resources / Projects Using Spark
- Photon ML (⭐793) - A machine learning library supporting classical Generalized Mixed Model and Generalized Additive Mixed Effect Model.
- Crossdata (⭐169) - Data integration platform with extended DataSource API and multi-user environment.
Resources / Docker Images
- jupyter/docker-stacks/pyspark-notebook (⭐7.2k) - PySpark with Jupyter Notebook and Mesos client.
- sequenceiq/docker-spark (⭐764) - Yarn images from SequenceIQ.
Jun 06 - Jun 12, 2016
Packages / Machine Learning Extension
- KeystoneML - Type safe machine learning pipelines with RDDs.
Resources / MOOCS
- Big Data Analysis with Scala and Spark (Coursera) - Scala oriented introductory course. Part of Functional Programming in Scala Specialization.
Feb 01 - Feb 07, 2016
Resources / Projects Using Spark
- PredictionIO - Machine Learning server for developers and data scientists to build and deploy predictive applications in a fraction of the time.