PySpark drafts

Code and Ecosystem

Basic code snippets that may be useful for daily Spark programming tasks. A more advanced set of code examples will follow soon.

PySpark is the Python API for Apache Spark, exposing most of Spark's features as outlined below:

General Workflow

SparkSession builder pattern

DataFrame creation

A DataFrame can be created from many different sources, such as Hive tables, structured data files (Parquet, Avro, JSON), external databases, or existing RDDs.

Data Overview

A few data selection and access examples:

Applying functions with Pandas UDFs and Pandas Function APIs

Grouped data: split-apply-combine

Multiple data sources and formats

Mixing SQL, Python UDFs, and pandas