PySpark drafts

Code and Ecosystem

Basic code snippets that may be useful for daily Spark programming tasks. A more advanced set of code examples will follow soon.

PySpark is the Python API for Apache Spark, exposing most of Spark's features as outlined below:

General Workflow

SparkSession builder pattern

DataFrame creation

A DataFrame can be created from many different sources, such as Hive tables, structured data files (Parquet, Avro, JSON), external databases, or existing RDDs.

Data Overview

A few data selection and access examples:

Applying functions with Pandas UDFs and Pandas Function APIs

Grouped data: split-apply-combine

Multiple data sources and formats

Mixing SQL, Python UDFs, and pandas