PySpark to pandas DataFrame conversion: error in data types while converting
Let's assume I have a dataframe abc in Spark as follows:
ID   Trxn_Date   Order_Date  Sales_Rep  Order_Category  Sales_Amount  Discount
100  2021-03-24  2021-03-17  Mathew     DailyStaples    1000          1.50
133  2021-01-22  2021-01-12  Camelia    Medicines       2000          0.50
Objective:
Pick a column at random for each data type and find its column-wise minimum and maximum value.
- For a `numerical` column it should also compute the sum or average.
- For a `string` column it should compute the maximum and minimum length.
Create another dataframe with the following structure:
Table_Name  Column_Name   Min         Max         Sum
abc         Trxn_Date     2021-01-22  2021-03-24
abc         Sales_Rep     6           7                 <-- len('Mathew') = 6, len('Camelia') = 7
abc         Sales_Amount  1000        2000        3000
I am using the following code, but it picks up all the columns. Also, when I run it in a Databricks/PySpark environment, I get the error given below.
import pandas as pd

table_lst = ['table_1', 'table_2']
spark.conf.set("spark.sql.execution.arrow.enabled", "true")
df_list = []
for i in table_lst:
    sdf_i = spark.sql("SELECT * FROM schema_name.{0}".format(i))
    df_i = sdf_i.select("*").toPandas()
    df_list.append(df_i)
d = {}
for i, j in zip(table_lst, df_list):
    d[i] = j
df_concat = []
for k, v in d.items():
    val_df = {}
    for i, j in zip(v.columns, v.dtypes):
        if 'object' in str(j):
            max_s = v[i].map(len).max()
            min_s = v[i].map(len).min()
            val_df[k+'-'+i+'_'+'Max_String_L'] = max_s
            val_df[k+'-'+i+'_'+'Min_String_L'] = min_s
        elif 'int' or 'float' in str(j):  # note: this condition is always truthy, so every non-object column lands here
            max_int = v[i].max()  # <-- error line as indicated in Databricks
            min_int = v[i].min()
            val_df[k+'-'+i+'Max_Num'] = max_int
            val_df[k+'-'+i+'_'+'Min_Num'] = min_int
        elif 'datetime' in str(j):
            max_date = v[i].max()
            min_date = v[i].min()
            val_df[k+'-'+i+'_'+'Max_Date'] = max_date
            val_df[k+'-'+i+'_'+'Min_Date'] = min_date
        else:
            print('anything left?')
    df_f_d = pd.DataFrame.from_dict(val_df, orient='index').reset_index()
    df_concat.append(df_f_d)
When I run this code on Databricks/PySpark, I get the error below:
TypeError: '>=' not supported between instances of 'float' and 'str'
Besides, the above code does not produce the resultant dataframe shown above.
My suspicion is that while converting the `sparkDF` to `pandas`, all data types are being converted to `string`.
How, then, can I tackle this issue? And can the above code be modified so that the objective is fulfilled?
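One quick way to test that suspicion is to compare the Spark schema with the pandas dtypes after conversion; a minimal check, reusing table_1 from the snippet above:
sdf = spark.sql("SELECT * FROM schema_name.table_1")
print(sdf.dtypes)  # Spark-side types, e.g. [('ID', 'int'), ('Sales_Rep', 'string'), ...]
pdf = sdf.toPandas()
print(pdf.dtypes)  # pandas-side types; string columns show up as 'object' here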
My solution is a bit long (probably expected given the requirement); you can debug each step as you want. The overall idea is to use the `describe` function to get the min and max of each column. `describe`, however, doesn't calculate sum and average, so we have to aggregate those separately.
# a.csv
# ID,Trxn_Date,Order_Date,Sales_Rep,Order_Category,Sales_Amount,Discount
# 100,2021-03-24,2021-03-17,Mathew,DailyStaples,1000,1.50
# 133,2021-01-22,2021-01-12,Camelia,Medicines,2000,0.50
from pyspark.sql import functions as F
df = spark.read.csv('a.csv', header=True, inferSchema=True)
all_cols = [col[0] for col in df.dtypes]
date_cols = ['Trxn_Date', 'Order_Date'] # Spark doesn't infer DateType so I have to handle it manually. You can ignore if your original schema already has it.
str_cols = [col[0] for col in df.dtypes if col[1] == 'string' and col[0] not in date_cols]
num_cols = [col[0] for col in df.dtypes if col[1] in ['int', 'double']]
# replace actual string values with its length
for col in str_cols:
df = df.withColumn(col, F.length(col))
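# At this point the string columns hold lengths instead of the original
# values; for the two sample rows, df.show() should look roughly like this:
# +---+----------+----------+---------+--------------+------------+--------+
# | ID| Trxn_Date|Order_Date|Sales_Rep|Order_Category|Sales_Amount|Discount|
# +---+----------+----------+---------+--------------+------------+--------+
# |100|2021-03-24|2021-03-17|        6|            12|        1000|     1.5|
# |133|2021-01-22|2021-01-12|        7|             9|        2000|     0.5|
# +---+----------+----------+---------+--------------+------------+--------+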
# calculate min max and transpose dataframe
df1 = (df
.describe()
.where(F.col('summary').isin('min', 'max'))
.withColumn('keys', F.array([F.lit(c) for c in all_cols]))
.withColumn('values', F.array([F.col(c) for c in all_cols]))
.withColumn('maps', F.map_from_arrays('keys', 'values'))
.select('summary', F.explode('maps').alias('col', 'value'))
.groupBy('col')
.agg(
F.collect_list('summary').alias('keys'),
F.collect_list('value').alias('values')
)
.withColumn('maps', F.map_from_arrays('keys', 'values'))
.select('col', 'maps.min', 'maps.max')
)
df1.show(10, False)
# +--------------+----------+----------+
# |col |min |max |
# +--------------+----------+----------+
# |Sales_Amount |1000 |2000 |
# |Sales_Rep |6 |7 |
# |Order_Category|9 |12 |
# |ID |100 |133 |
# |Discount |0.5 |1.5 |
# |Trxn_Date |2021-01-22|2021-03-24|
# |Order_Date |2021-01-12|2021-03-17|
# +--------------+----------+----------+
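# The transpose above may look dense: map_from_arrays pairs each column name
# with its value in a single map column, and explode turns every map entry
# into its own row. A minimal sketch of the same pattern on a toy dataframe
# (the names 'toy', 'a', 'b' are illustrative):
toy = spark.createDataFrame([(1, 2)], ['a', 'b'])
(toy
 .withColumn('m', F.map_from_arrays(F.array(F.lit('a'), F.lit('b')),
                                    F.array('a', 'b')))
 .select(F.explode('m').alias('col', 'value'))
 .show())
# +---+-----+
# |col|value|
# +---+-----+
# |  a|    1|
# |  b|    2|
# +---+-----+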
# calculate sum and transpose dataframe
df2 = (df
.groupBy(F.lit(1).alias('sum'))
.agg(*[F.sum(c).alias(c) for c in num_cols])
.withColumn('keys', F.array([F.lit(c) for c in num_cols]))
.withColumn('values', F.array([F.col(c) for c in num_cols]))
.withColumn('maps', F.map_from_arrays('keys', 'values'))
.select(F.explode('maps').alias('col', 'sum'))
)
df2.show(10, False)
# +------------+------+
# |col |sum |
# +------------+------+
# |ID |233.0 |
# |Sales_Amount|3000.0|
# |Discount |2.0 |
# +------------+------+
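# Note the sums for ID and Sales_Amount come back as 233.0 and 3000.0 rather
# than integers: F.array must settle on one element type for 'values', so the
# integer sums are coerced to double alongside Discount's double sum.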
# Join them together to get final dataframe
df1.join(df2, on=['col'], how='left').show()
# +--------------+----------+----------+------+
# | col| min| max| sum|
# +--------------+----------+----------+------+
# | Sales_Amount| 1000| 2000|3000.0|
# | Sales_Rep| 6| 7| null|
# |Order_Category| 9| 12| null|
# | ID| 100| 133| 233.0|
# | Discount| 0.5| 1.5| 2.0|
# | Trxn_Date|2021-01-22|2021-03-24| null|
# | Order_Date|2021-01-12|2021-03-17| null|
# +--------------+----------+----------+------+
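# To match the exact structure asked for (Table_Name, Column_Name, Min, Max,
# Sum), the joined result only needs relabelling; a minimal sketch, with the
# table name hard-coded as 'abc':
final = (df1.join(df2, on=['col'], how='left')
         .select(F.lit('abc').alias('Table_Name'),
                 F.col('col').alias('Column_Name'),
                 F.col('min').alias('Min'),
                 F.col('max').alias('Max'),
                 F.col('sum').alias('Sum')))
final.show()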