
PySpark to Pandas Dataframe conversion: Error in data types while converting

Let's assume I have a dataframe abc in Spark as follows:

ID    Trxn_Date    Order_Date    Sales_Rep  Order_Category   Sales_Amount   Discount
100   2021-03-24   2021-03-17    Mathew     DailyStaples       1000           1.50
133   2021-01-22   2021-01-12    Camelia    Medicines          2000           0.50

Objective:

Pick a column randomly for each data type and find its minimum and maximum value, column-wise.

  - For a `numerical` column it should also compute the sum or average.
  - For a `string` column it should compute the maximum and minimum length.

Create another dataframe with the following structure:

Table_Name   Column_Name    Min          Max          Sum
abc          Trxn_Date      2021-01-22   2021-03-24
abc          Sales_Rep      6            7                    <---- len('Mathew') = 6 and len('Camelia') = 7
abc          Sales_Amount   1000         2000         3000

I am using the following code, but it is picking up all the columns. Also, when I run this in a Databricks / PySpark environment, I get the error given below.

import pandas as pd

table_lst = ['table_1', 'table_2']
spark.conf.set("spark.sql.execution.arrow.enabled", "true")
df_list = []
for i in table_lst:
    sdf_i = spark.sql("SELECT * FROM schema_name.{0}".format(i))
    df_i = sdf_i.select("*").toPandas()
    df_list.append(df_i)
d = {}
for i, j in zip(table_lst, df_list):
    d[i] = j
df_concat = []
for k, v in d.items():
    val_df = {}
    for i, j in zip(v.columns, v.dtypes):
        if 'object' in str(j):
            max_s = v[i].map(len).max()
            min_s = v[i].map(len).min()
            val_df[k+'-'+i+'_'+'Max_String_L'] = max_s
            val_df[k+'-'+i+'_'+'Min_String_L'] = min_s
        elif 'int' or 'float' in str(j):
            max_int = v[i].max()      # <------ Error line as indicated in Databricks
            min_int = v[i].min()
            val_df[k+'-'+i+'Max_Num'] = max_int
            val_df[k+'-'+i+'_'+'Min_Num'] = min_int
        elif 'datetime' in str(j):
            max_date = v[i].max()
            min_date = v[i].min()
            val_df[k+'-'+i+'_'+'Max_Date'] = max_date
            val_df[k+'-'+i+'_'+'Min_Date'] = min_date
        else:
            print('anything left?')
    df_f_d = pd.DataFrame.from_dict(val_df, orient='index').reset_index()
    df_concat.append(df_f_d)

When I run this code on Databricks PySpark, I get the error below:

 TypeError: '>=' not supported between instances of 'float' and 'str' 

Besides, the above code does not produce the resultant dataframe indicated above.

My suspicion is that while converting the Spark DataFrame to pandas, all data types are being converted to string.
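A quick way to test that suspicion is to compare the dtypes on both sides of the conversion. A minimal check, reusing schema_name.table_1 from the code above:

sdf = spark.sql("SELECT * FROM schema_name.table_1")
print(sdf.dtypes)   # Spark-side types, e.g. [('ID', 'int'), ('Trxn_Date', 'string'), ...]
pdf = sdf.toPandas()
print(pdf.dtypes)   # pandas-side types; Spark string columns come back as dtype 'object'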

How, then, can I tackle this issue? And can the above code be modified so that the objective is fulfilled?

My solution is a bit long (probably expected, given the requirement); you can debug it step by step as you want. The overall idea is:

  1. Distinguish which columns are string and which are numerical.
  2. Use the describe function to get the min and max of each column.
  3. describe, however, doesn't calculate sum and average, so we have to aggregate those separately.
# a.csv
# ID,Trxn_Date,Order_Date,Sales_Rep,Order_Category,Sales_Amount,Discount
# 100,2021-03-24,2021-03-17,Mathew,DailyStaples,1000,1.50
# 133,2021-01-22,2021-01-12,Camelia,Medicines,2000,0.50

from pyspark.sql import functions as F

df = spark.read.csv('a.csv', header=True, inferSchema=True)
all_cols = [col[0] for col in df.dtypes]
date_cols = ['Trxn_Date', 'Order_Date'] # Spark doesn't infer DateType so I have to handle it manually. You can ignore if your original schema already has it.
str_cols = [col[0] for col in df.dtypes if col[1] == 'string' and col[0] not in date_cols]
num_cols = [col[0] for col in df.dtypes if col[1] in ['int', 'double']]
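# with inferSchema=True the sample csv comes in roughly as (illustrative):
# df.dtypes -> [('ID', 'int'), ('Trxn_Date', 'string'), ('Order_Date', 'string'),
#               ('Sales_Rep', 'string'), ('Order_Category', 'string'),
#               ('Sales_Amount', 'int'), ('Discount', 'double')]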

# replace actual string values with its length
for col in str_cols:
    df = df.withColumn(col, F.length(col))
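# sanity check: after the loop each string column holds its length, not its value
# (illustrative, based on the two sample rows above):
# df.select('Sales_Rep', 'Order_Category').show()
# +---------+--------------+
# |Sales_Rep|Order_Category|
# +---------+--------------+
# |        6|            12|
# |        7|             9|
# +---------+--------------+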

# calculate min max and transpose dataframe
df1 = (df
    .describe()
    .where(F.col('summary').isin('min', 'max'))
    .withColumn('keys', F.array([F.lit(c) for c in all_cols]))
    .withColumn('values', F.array([F.col(c) for c in all_cols]))
    .withColumn('maps', F.map_from_arrays('keys', 'values'))
    .select('summary', F.explode('maps').alias('col', 'value'))
    .groupBy('col')
    .agg(
        F.collect_list('summary').alias('keys'),
        F.collect_list('value').alias('values')
    )
    .withColumn('maps', F.map_from_arrays('keys', 'values'))
    .select('col', 'maps.min', 'maps.max')
)
df1.show(10, False)
# +--------------+----------+----------+
# |col           |min       |max       |
# +--------------+----------+----------+
# |Sales_Amount  |1000      |2000      |
# |Sales_Rep     |6         |7         |
# |Order_Category|9         |12        |
# |ID            |100       |133       |
# |Discount      |0.5       |1.5       |
# |Trxn_Date     |2021-01-22|2021-03-24|
# |Order_Date    |2021-01-12|2021-03-17|
# +--------------+----------+----------+
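The transpose in df1 hinges on pairing map_from_arrays with explode: the column names and their values are zipped into a single map column, and exploding that map emits one row per original column. A minimal standalone sketch of just that trick, on a made-up one-row dataframe:

tiny = spark.createDataFrame([(1, 2)], ['a', 'b'])
(tiny
    .withColumn('keys', F.array(F.lit('a'), F.lit('b')))      # the column names
    .withColumn('values', F.array(F.col('a'), F.col('b')))    # the column values
    .withColumn('maps', F.map_from_arrays('keys', 'values'))  # {'a': 1, 'b': 2}
    .select(F.explode('maps').alias('col', 'value'))          # one row per map entry
    .show())
# +---+-----+
# |col|value|
# +---+-----+
# |  a|    1|
# |  b|    2|
# +---+-----+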

# calculate sum and transpose dataframe
df2 = (df
    .groupBy(F.lit(1).alias('sum'))
    .agg(*[F.sum(c).alias(c) for c in num_cols])
    .withColumn('keys', F.array([F.lit(c) for c in num_cols]))
    .withColumn('values', F.array([F.col(c) for c in num_cols]))
    .withColumn('maps', F.map_from_arrays('keys', 'values'))
    .select(F.explode('maps').alias('col', 'sum'))
)
df2.show(10, False)
# +------------+------+
# |col         |sum   |
# +------------+------+
# |ID          |233.0 |
# |Sales_Amount|3000.0|
# |Discount    |2.0   |
# +------------+------+

# Join them together to get final dataframe
df1.join(df2, on=['col'], how='left').show()
# +--------------+----------+----------+------+
# |           col|       min|       max|   sum|
# +--------------+----------+----------+------+
# |  Sales_Amount|      1000|      2000|3000.0|
# |     Sales_Rep|         6|         7|  null|
# |Order_Category|         9|        12|  null|
# |            ID|       100|       133| 233.0|
# |      Discount|       0.5|       1.5|   2.0|
# |     Trxn_Date|2021-01-22|2021-03-24|  null|
# |    Order_Date|2021-01-12|2021-03-17|  null|
# +--------------+----------+----------+------+
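If you need the exact structure asked for in the question (Table_Name, Column_Name, Min, Max, Sum), one more select on top of the join gets there. A sketch, with 'abc' hard-coded as the table name since it isn't carried in the data:

result = (df1
    .join(df2, on=['col'], how='left')
    .select(
        F.lit('abc').alias('Table_Name'),   # table name taken from the question
        F.col('col').alias('Column_Name'),
        F.col('min').alias('Min'),
        F.col('max').alias('Max'),
        F.col('sum').alias('Sum'),
    ))
result.show()

Picking one column at random per data type, as the objective requires, can then be done by sampling one name from each of str_cols, num_cols and date_cols (e.g. with random.choice) and filtering Column_Name on those.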
