PySpark.sql vs. Pandas


pyspark.sql and pandas are both powerful data analysis tools, providing fast and expressive data structures for working with relational or labeled data. Depending on the size of the data in your analysis, you could choose either one, as they offer quite similar functionality. These analyses form the basis for applying machine learning to the data. Today let's do a parallel comparison between pyspark.sql and pandas.

| Features       | pyspark.sql              | pandas                   |
|----------------|--------------------------|--------------------------|
| Data structure | DataFrame, Column, Row   | Panel, DataFrame, Series |
| I/O            | RDDs, Spark data sources | Text, binary, SQL        |
| Inspection     | Supported                | Supported                |
| Operations     | Fewer                    | More                     |
| Extensions     | GeoSpark                 | GeoPandas                |

Supported Spark data sources in pyspark.sql are listed here.

Data inspection methods in pyspark.sql:

  • df.dtypes
  • df.show()
  • df.head()
  • df.first()
  • df.take(n)
  • df.schema
  • df.describe().show()
  • df.columns
  • df.count()
  • df.distinct().count()
  • df.explain()

Data inspection methods in pandas:

  • df['column_name'].value_counts()
  • df.info()
  • df.describe()
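The same kind of inspection in pandas, sketched on an illustrative DataFrame (the column names are made up for the example):

```python
import pandas as pd

# A tiny DataFrame for illustration.
df = pd.DataFrame({"city": ["NY", "LA", "NY"], "pop": [8.4, 4.0, 8.4]})

print(df["city"].value_counts())  # counts of each distinct value: NY twice, LA once
df.info()                         # dtypes, non-null counts, memory usage
print(df.describe())              # summary statistics for the numeric columns
```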

Some useful functions in pandas:

  • df.sort_index(axis=1, ascending=False)
  • df.sort_values(by='')
  • df.groupby('').sum()
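A short sketch of these three functions; the DataFrame and column names are made up for the example:

```python
import pandas as pd

df = pd.DataFrame({"b": [3, 1, 2], "a": ["x", "y", "x"], "c": [10, 20, 30]})

# Sort the columns by label in descending order (axis=1 sorts column labels, not rows).
print(df.sort_index(axis=1, ascending=False))

# Sort the rows by the values in one column.
print(df.sort_values(by="b"))

# Group by a key column and sum the numeric columns within each group.
print(df.groupby("a").sum())
```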

Supported Operations on DataFrames in pyspark.sql:

  • Conversion
  • Indexing
  • Grouping/Aggregation
  • Selection/Subsetting
  • Filtering
  • Joining

Supported Operations on DataFrames in pandas:

  • Conversion
  • Indexing
  • Iteration
  • Grouping/Aggregation
  • Function application
  • Selection/Subsetting
  • Filtering
  • Reshaping
  • Combining/Joining
  • Plotting
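A few of the pandas operations above, sketched on illustrative frames (the keys and columns are made up for the example):

```python
import pandas as pd

left = pd.DataFrame({"key": ["a", "b"], "val": [1, 2]})
right = pd.DataFrame({"key": ["a", "c"], "other": [10, 30]})

# Filtering with a boolean mask
print(left[left["val"] > 1])

# Combining/joining: SQL-style merge on a shared key
merged = left.merge(right, on="key", how="inner")
print(merged)  # only key 'a' appears in both frames

# Reshaping: pivot long data into wide form
long = pd.DataFrame({"id": [1, 1], "var": ["x", "y"], "val": [5, 6]})
print(long.pivot(index="id", columns="var", values="val"))
```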
Written on October 26, 2017