PySpark.sql vs. Pandas
pyspark.sql and pandas are both excellent data analysis tools, providing fast and expressive data structures for working with relational or labeled data. Depending on the size of the data you are analyzing, you could choose either one, since they offer quite similar functionality, and the analyses they support form the basis for applying machine learning to the data. Today let's do a side-by-side comparison of pyspark.sql and pandas.
Features | pyspark.sql | pandas |
---|---|---|
Data Structure | DataFrame, Column, Row | DataFrame, Series (Panel is deprecated) |
I/O | RDDs, Spark data sources | Text, binary, SQL |
Inspection | Supported | Supported |
Operation | Less | More |
Extensions | GeoSpark | GeoPandas |
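To make the table concrete, here is a minimal sketch that builds the same small table in both libraries; the data and column names are made up for illustration.

```python
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("comparison").getOrCreate()

data = [("Alice", 34), ("Bob", 45)]
columns = ["name", "age"]

# pyspark.sql: a DataFrame of Rows, where each field is a Column
spark_df = spark.createDataFrame(data, columns)

# pandas: a DataFrame of Series
pandas_df = pd.DataFrame(data, columns=columns)
```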
Supported Spark data sources in pyspark.sql include the built-in readers for Parquet, ORC, JSON, CSV, text, JDBC, and Hive tables.
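A minimal loading sketch; the file paths under data/ are placeholders, not real resources:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Each reader below corresponds to a built-in Spark data source;
# the paths are placeholders for illustration only.
parquet_df = spark.read.parquet("data/events.parquet")
json_df = spark.read.json("data/events.json")
csv_df = spark.read.csv("data/events.csv", header=True, inferSchema=True)
orc_df = spark.read.orc("data/events.orc")
```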
Data inspection methods in pyspark.sql (a usage sketch follows the list):
- df.dtypes
- df.show()
- df.head()
- df.first()
- df.take(n)
- df.schema
- df.describe().show()
- df.columns
- df.count()
- df.distinct().count()
- df.explain()
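A quick sketch of these calls on a toy DataFrame (the data here is made up for illustration):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("Alice", 34), ("Bob", 45)], ["name", "age"])

df.dtypes                  # [('name', 'string'), ('age', 'bigint')]
df.show()                  # print the rows as a formatted table
df.head()                  # the first Row
df.first()                 # also the first Row
df.take(2)                 # list of the first 2 Rows
df.schema                  # StructType describing every column
df.describe().show()       # summary statistics for numeric columns
df.columns                 # ['name', 'age']
df.count()                 # number of rows
df.distinct().count()      # number of distinct rows
df.explain()               # the physical plan Spark will execute
```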
Data inspection methods in pandas (sketch after the list):
- df['column_name'].value_counts()
- df.info()
- df.describe()
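The same kind of toy example for the pandas calls:

```python
import pandas as pd

df = pd.DataFrame({"name": ["Alice", "Bob", "Alice"], "age": [34, 45, 34]})

df['name'].value_counts()  # frequency of each value in the column
df.info()                  # dtypes, non-null counts, memory usage
df.describe()              # summary statistics for numeric columns
```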
Some useful functions in pandas (sketch after the list):
- df.sort_index(axis=1, ascending=False)
- df.sort_values(by='column_name')
- df.groupby('column_name').sum()
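And a sketch of these helpers on the same toy data, with 'name' and 'age' standing in for your own column names:

```python
import pandas as pd

df = pd.DataFrame({"name": ["Bob", "Alice"], "age": [45, 34]})

df.sort_index(axis=1, ascending=False)  # order the columns Z..A
df.sort_values(by='age')                # sort rows by a column's values
df.groupby('name').sum()                # sum the numeric columns per group
```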
Supported Operations on DataFrames in pyspark.sql (sketch after the list):
- Conversion
- Indexing
- Grouping/Aggregation
- Selection/Subsetting
- Filtering
- Joining
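A minimal sketch touching each operation class, on two made-up DataFrames:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
people = spark.createDataFrame([("Alice", 34), ("Bob", 45)], ["name", "age"])
cities = spark.createDataFrame([("Alice", "Paris")], ["name", "city"])

people.toPandas()                           # conversion to a pandas DataFrame
people["age"]                               # indexing a Column by name
people.groupBy("name").agg(F.avg("age"))    # grouping/aggregation
people.select("name", "age")                # selection/subsetting
people.filter(people.age > 40)              # filtering
people.join(cities, on="name", how="left")  # joining
```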
Supported Operations on DataFrames in pandas (sketch after the list):
- Conversion
- Indexing
- Iteration
- Grouping/Aggregation
- Function application
- Selection/Subsetting
- Filtering
- Reshaping
- Combining/Joining
- Plotting
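And a pandas counterpart, again on made-up data; the plotting call assumes matplotlib is installed:

```python
import pandas as pd

people = pd.DataFrame({"name": ["Alice", "Bob"], "age": [34, 45]})
cities = pd.DataFrame({"name": ["Alice"], "city": ["Paris"]})

people.to_dict()                                # conversion
people.loc[0]                                   # indexing a row by label
for idx, row in people.iterrows():              # iteration over rows
    pass
people.groupby("name").agg({"age": "mean"})     # grouping/aggregation
people.apply(lambda col: col)                   # function application
people[["name"]]                                # selection/subsetting
people[people.age > 40]                         # filtering
people.pivot_table(index="name", values="age")  # reshaping
people.merge(cities, on="name", how="left")     # combining/joining
people.plot(x="name", y="age", kind="bar")      # plotting (via matplotlib)
```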