awesome-spark/awesome-spark

Big Data  9 months ago  1.4k
A curated list of awesome Apache Spark packages and resources.

Dec 27th - Jan 2nd, 2021

Packages

GIS

  • Apache Sedona (1.2k stars) - Cluster computing system for processing large-scale spatial data.
  • Dec 20th - Dec 26th, 2021

    Packages

    Machine Learning Extension

  • MLflow - Machine learning orchestration platform.
  • Dec 6th - Dec 12th, 2021

    Resources

    Docker Images

  • datamechanics/spark - An easy-to-set-up Docker image for Apache Spark from Data Mechanics.
  • Packages

    Storage

  • lakeFS - Integration with the lakeFS atomic versioned storage layer.
  • Packages

    General Purpose Libraries

  • Joblib Apache Spark Backend (209 stars) - joblib backend for running tasks on Spark clusters.
  • Nov 29th - Dec 5th, 2021

    Packages

    Notebooks and IDEs

  • Polynote - An IDE-inspired polyglot notebook. It supports mixing multiple languages in one notebook and sharing data between them seamlessly. It encourages reproducible notebooks with its immutable data model. Originating from Netflix.
  • Packages

    SQL Data Sources

  • Spark CSV (1k stars) - CSV reader and writer (obsolete since Spark 2.0 [SPARK-12833]).
  • Packages

    Machine Learning Extension

  • Clustering4Ever (124 stars) - Scala and Spark API to benchmark and analyse clustering algorithms on any vectorization you can generate.
  • Microsoft ML for Apache Spark (3.8k stars) - A distributed ML library with support for LightGBM, Vowpal Wabbit, OpenCV, Deep Learning, Cognitive Services, and Model Deployment.
  • Packages

    Natural Language Processing

  • spark-nlp (2.9k stars) - Natural language processing library built on top of Apache Spark ML.
  • Resources

    Papers

  • Structured Streaming: A Declarative API for Real-Time Applications in Apache Spark - Structured Streaming is a high-level, declarative streaming API based on automatically incrementalizing a static relational query.
  • Nov 1st - Nov 7th, 2021

    Packages

    Middleware

  • Apache Kyuubi (1.2k stars) - A distributed multi-tenant JDBC server for large-scale data processing and analytics, built on top of Apache Spark.
  • Aug 16th - Aug 22nd, 2021

    Packages

    General Purpose Libraries

  • Apache DataFu - A library of general-purpose functions and UDFs.
  • Aug 9th - Aug 15th, 2021

    Resources

    Books

  • Learning Spark, 2nd Edition - Introduction to Spark API with Spark 3.0 covered. Good source of knowledge about basic concepts.
  • Packages

    General Purpose Libraries

  • spark-daria (695 stars) - A Scala library with essential Spark functions and extensions to make you more productive.
  • quinn (357 stars) - A native PySpark implementation of spark-daria.
  • Mar 15th - Mar 21st, 2021

    Packages

    Monitoring

  • Data Mechanics Delight (268 stars) - Cross-platform monitoring tool (Spark UI / Spark History Server replacement).
  • Feb 15th - Feb 21st, 2021

    Packages

    General Purpose Libraries

  • itachi (36 stars) - A library that brings useful functions from modern database management systems to Apache Spark.
  • Nov 2nd - Nov 8th, 2020

    Packages

    Storage

  • Delta Lake (5.3k stars) - Storage layer with ACID transactions.
  • Oct 5th - Oct 11th, 2020

    Packages

    Interfaces

  • Koalas (3.2k stars) - Pandas DataFrame API on top of Apache Spark.
  • Sep 28th - Oct 4th, 2020

    Packages

    Utilities

  • pyspark-stubs (115 stars) - Static type annotations for PySpark (obsolete since Spark 3.1; see SPARK-32681).
  • Packages

    Language Bindings

  • Mobius (931 stars) - C# bindings (deprecated in favor of .NET for Apache Spark).
  • .NET for Apache Spark (1.8k stars) - .NET bindings.
  • Sep 21st - Sep 27th, 2020

    Packages

    Utilities

  • Optimus (1.3k stars) - Data cleansing and exploration utilities with the goal of simplifying data cleaning.
  • Resources

    Papers

  • Large-Scale Intelligent Microservices - Microsoft paper that presents an Apache Spark-based micro-service orchestration framework that extends database operations to include web service primitives.
  • Jul 13th - Jul 19th, 2020

    Packages

    Middleware

  • Livy (711 stars) - REST server with extensive language support (Python, R, Scala), ability to maintain interactive sessions and object sharing.
  • Dec 9th - Dec 15th, 2019

    Packages

    Testing

  • deequ (2.5k stars) - Deequ is a library built on top of Apache Spark for defining "unit tests for data", which measure data quality in large datasets.
  • Nov 25th - Dec 1st, 2019

    Packages

    Notebooks and IDEs

  • almond - A Scala kernel for Jupyter.
  • Aug 19th - Aug 25th, 2019

    Packages

    SQL Data Sources

  • Spark Avro (539 stars) - Apache Avro reader and writer (obsolete since Spark 2.4 [SPARK-24768]).
  • Jan 21st - Jan 27th, 2019

    Packages

    Web Archives

  • Archives Unleashed Toolkit (121 stars) - Open-source toolkit for analyzing web archives.
  • Aug 13th - Aug 19th, 2018

    Packages

    Language Bindings

  • Flambo (606 stars) - Clojure DSL.
  • sparklyr (879 stars) - An alternative R backend, using dplyr.
  • sparkle (439 stars) - Haskell on Apache Spark.
  • Packages

    Notebooks and IDEs

  • Apache Zeppelin - Web-based notebook that enables interactive data analytics with pluggable backends, integrated plotting, and extensive Spark support out of the box.
  • Spark Notebook (3.1k stars) - Scalable and stable Scala and Spark focused notebook bridging the gap between JVM and Data Scientists (incl. extendable, typesafe and reactive charts).
  • sparkmagic (1.2k stars) - Jupyter magics and kernels for working interactively with remote Spark clusters through Livy (990 stars).
  • Packages

    General Purpose Libraries

  • Succinct - Support for efficient queries on compressed data.
  • Packages

    SQL Data Sources

  • Spark XML (420 stars) - XML parser and writer.
  • Spark Cassandra Connector (1.9k stars) - Cassandra support including data source and API and support for arbitrary queries.
  • Spark Riak Connector (55 stars) - Riak TS & Riak KV connector.
  • Mongo-Spark (663 stars) - Official MongoDB connector.
  • OrientDB-Spark (19 stars) - Official OrientDB connector.
  • Packages

    Bioinformatics

  • ADAM (934 stars) - Set of tools designed to analyse genomics data.
  • Hail (822 stars) - Genetic analysis framework.
  • Packages

    GIS

  • Magellan (520 stars) - Geospatial analytics using Spark.
  • Packages

    Time Series Analytics

  • Spark-Timeseries (1.2k stars) - Scala / Java / Python library for interacting with time series data on Apache Spark.
  • flint (961 stars) - A time series library for Apache Spark.
  • Packages

    Graph Processing

  • Mazerunner (377 stars) - Graph analytics platform on top of Neo4j and GraphX.
  • GraphFrames (887 stars) - Data frame based graph API.
  • neo4j-spark-connector (293 stars) - Bolt protocol based Neo4j connector with RDD, DataFrame and GraphX / GraphFrames support.
  • SparklingGraph - Library extending GraphX features with multiple functionalities useful in graph analytics (measures, generators, link prediction etc.).
  • Packages

    Machine Learning Extension

  • dbscan-on-spark (178 stars) - An implementation of the DBSCAN clustering algorithm on top of Apache Spark by irvingc, based on the paper by He, Yaobin, et al., "MR-DBSCAN: a scalable MapReduce-based DBSCAN algorithm for heavily skewed data".
  • Apache SystemML - Declarative machine learning framework on top of Spark.
  • Mahout Spark Bindings [status unknown] - linear algebra DSL and optimizer with R-like syntax.
  • spark-sklearn (1.1k stars) - Scikit-learn integration with distributed model training.
  • JPMML-Spark (94 stars) - PMML transformer library for Spark ML.
  • Distributed Keras (620 stars) - Distributed deep learning framework with PySpark and Keras.
  • ModelDB - A system to manage machine learning models for spark.ml and scikit-learn.
  • Sparkling Water (933 stars) - H2O interoperability layer.
  • BigDL (4k stars) - Distributed Deep Learning library.
  • MLeap (1.4k stars) - Execution engine and serialization format which supports deployment of o.a.s.ml models without dependency on SparkSession.
  • Packages

    Middleware

  • spark-jobserver (2.8k stars) - Simple Spark as a Service which supports object sharing using so-called named objects. JVM only.
  • Mist (318 stars) - Service for exposing Spark analytical jobs and machine learning models as realtime, batch or reactive web services.
  • Apache Toree (702 stars) - IPython protocol based middleware for interactive applications.
  • Packages

    Utilities

  • silex (18 stars) - Collection of tools varying from ML extensions to additional RDD methods.
  • sparkly (52 stars) - Helpers & syntactic sugar for PySpark.
  • Flintrock (607 stars) - A command-line tool for launching Spark clusters on EC2.
  • Packages

    Natural Language Processing

  • spark-corenlp (429 stars) - DataFrame wrapper for Stanford CoreNLP.
  • Packages

    Streaming

  • Apache Bahir - Collection of the streaming connectors excluded from Spark 2.0 (Akka, MQTT, Twitter, ZeroMQ).
  • Packages

    Interfaces

  • Apache Beam - Unified data processing engine supporting both batch and streaming applications. Apache Spark is one of the supported execution environments.
  • Blaze (3.1k stars) - Interface for querying larger-than-memory datasets using Pandas-like syntax. It supports both Spark DataFrames and RDDs.
  • Packages

    Testing

  • spark-testing-base (1.4k stars) - Collection of base test classes.
  • spark-fast-tests (361 stars) - A lightweight and fast testing framework.
  • Packages

    Workflow Management

  • Cromwell (849 stars) - Workflow management system with Spark backend.
  • Apr 10th - Apr 16th, 2017

    Resources

    Papers

  • Spark SQL: Relational Data Processing in Spark - Paper introducing relational underpinnings, code generation and Catalyst optimizer.
  • Apr 3rd - Apr 9th, 2017

    Resources

    Papers

  • Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing - Paper introducing a core distributed memory abstraction.
  • Feb 27th - Mar 5th, 2017

    Resources

    Books

  • Advanced Analytics with Spark - Useful collection of Spark processing patterns. Accompanying GitHub repository: sryza/aas (1.5k stars).
  • Mastering Apache Spark - Interesting compilation of notes by Jacek Laskowski. Focused on different aspects of Spark internals.
  • Spark Gotchas (343 stars) - Subjective compilation of tips, tricks and common programming mistakes.
  • Spark in Action - Book in Manning's "in Action" family with 400+ pages. Starts gently, step by step, and covers a large number of topics. A free excerpt covers how to set up Eclipse for Spark application development and how to bootstrap a new application using the provided Maven Archetype. An accompanying GitHub repo (267 stars) is available.
  • Resources

    MOOCS

  • Data Science and Engineering with Apache Spark (edX XSeries) - Series of five courses (Introduction to Apache Spark, Distributed Machine Learning with Apache Spark, Big Data Analysis with Apache Spark, Advanced Apache Spark for Data Science and Data Engineering, Advanced Distributed Machine Learning with Apache Spark) covering different aspects of software engineering and data science. Python oriented.
  • Resources

    Workshops

  • AMP Camp - Periodic training event organized by the UC Berkeley AMPLab. A source of useful exercises and recorded workshops covering different tools from the Berkeley Data Analytics Stack.
  • Resources

    Projects Using Spark

  • Oryx 2 (1.8k stars) - Lambda architecture platform built on Apache Spark and Apache Kafka with specialization for real-time, large-scale machine learning.
  • Resources

    Miscellaneous

  • Spark with Scala Gitter channel - "A place to discuss and ask questions about using Scala for Spark programming" started by @deanwampler.
  • Oct 17th - Oct 23rd, 2016

    Resources

    Miscellaneous

  • Apache Spark User List and Apache Spark Developers List - Mailing lists dedicated to usage questions and development topics respectively.
  • Jun 20th - Jun 26th, 2016

    Resources

    Projects Using Spark

  • Crossdata (168 stars) - Data integration platform with extended DataSource API and multi-user environment.
  • Photon ML (790 stars) - A machine learning library supporting classical Generalized Mixed Model and Generalized Additive Mixed Effect Model.
  • Resources

    Docker Images

  • sequenceiq/docker-spark (767 stars) - YARN images from SequenceIQ.
  • jupyter/docker-stacks/pyspark-notebook - PySpark with Jupyter Notebook and Mesos client.
  • Jun 6th - Jun 12th, 2016

    Resources

    MOOCS

  • Big Data Analysis with Scala and Spark (Coursera) - Scala oriented introductory course. Part of Functional Programming in Scala Specialization.
  • Packages

    Machine Learning Extension

  • KeystoneML - Type safe machine learning pipelines with RDDs.
  • Feb 1st - Feb 7th, 2016

    Resources

    Projects Using Spark

  • PredictionIO - Machine Learning server for developers and data scientists to build and deploy predictive applications in a fraction of the time.
  • Last Checked At: 2022-09-30T06:24:57.475Z