PySpark.sql vs. Pandas


pyspark.sql and pandas are both powerful data analysis tools, providing fast and expressive data structures for working with relational or labeled data. Depending on the size of the data in your analysis, you could choose either one, as they offer quite similar functionality. These analyses form the basis for applying machine learning to the data. Today let's do a parallel comparison between pyspark.sql and pandas.

| Features       | pyspark.sql              | pandas                   |
|----------------|--------------------------|--------------------------|
| Data structure | DataFrame, Column, Row   | Panel, DataFrame, Series |
| I/O            | RDDs, Spark data sources | Text, binary, SQL        |
| Inspection     | Supported                | Supported                |
| Operations     | Fewer                    | More                     |
| Extensions     | GeoSpark                 | GeoPandas                |

Supported Spark data sources in pyspark.sql are listed here.

Data inspection methods in pyspark.sql:

  • df.dtypes
  • df.show()
  • df.head()
  • df.first()
  • df.take(n)
  • df.schema
  • df.describe().show()
  • df.columns
  • df.count()
  • df.distinct().count()
  • df.explain()

Data inspection methods in pandas:

  • df['column_name'].value_counts()
  • df.info()
  • df.describe()
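The same kind of inspection in pandas, sketched on an illustrative DataFrame (the column names are made up for the example):

```python
import pandas as pd

# A tiny DataFrame for illustration.
df = pd.DataFrame({"city": ["NY", "LA", "NY"], "pop": [8.4, 4.0, 8.4]})

print(df["city"].value_counts())  # counts of each distinct value: NY twice, LA once
df.info()                         # dtypes, non-null counts, memory usage
print(df.describe())              # summary statistics for the numeric columns
```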

Some useful functions in pandas:

  • df.sort_index(axis=1, ascending=False)
  • df.sort_values(by='')
  • df.groupby('').sum()
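A short sketch of these three functions; the DataFrame and column names are made up for the example:

```python
import pandas as pd

df = pd.DataFrame({"b": [3, 1, 2], "a": ["x", "y", "x"], "c": [10, 20, 30]})

# Sort the columns by label in descending order (axis=1 sorts column labels, not rows).
print(df.sort_index(axis=1, ascending=False))

# Sort the rows by the values in one column.
print(df.sort_values(by="b"))

# Group by a key column and sum the numeric columns within each group.
print(df.groupby("a").sum())
```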

Supported Operations on DataFrames in pyspark.sql:

  • Conversion
  • Indexing
  • Grouping/Aggregation
  • Selection/Subsetting
  • Filtering
  • Joining

Supported Operations on DataFrames in pandas:

  • Conversion
  • Indexing
  • Iteration
  • Grouping/Aggregation
  • Function application
  • Selection/Subsetting
  • Filtering
  • Reshaping
  • Combining/Joining
  • Plotting
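A few of the pandas operations above, sketched on illustrative frames (the keys and columns are made up for the example):

```python
import pandas as pd

left = pd.DataFrame({"key": ["a", "b"], "val": [1, 2]})
right = pd.DataFrame({"key": ["a", "c"], "other": [10, 30]})

# Filtering with a boolean mask
print(left[left["val"] > 1])

# Combining/joining: SQL-style merge on a shared key
merged = left.merge(right, on="key", how="inner")
print(merged)  # only key 'a' appears in both frames

# Reshaping: pivot long data into wide form
long = pd.DataFrame({"id": [1, 1], "var": ["x", "y"], "val": [5, 6]})
print(long.pivot(index="id", columns="var", values="val"))
```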
Written on October 26, 2017