Merge dataframes in Pyspark with same column names
Is there any alternative to `suffixes` when joining in PySpark, or when using `spark.sql(query)`? The dataframes share the same column names, and I want to append each DataFrame's name as a suffix to the overlapping columns.

The code below is what I do in pandas:
```python
import pandas as pd

df = pd.merge(left=df1, right=df2, on='col1', how='inner', suffixes=('__df1', '__df2'))
df = pd.merge(left=df, right=df3, on='vin_17', how='inner', suffixes=('', '__df3'))
df = pd.merge(left=df, right=df4, on='vin_17', how='inner', suffixes=('', '__df4'))
```
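For reference, a minimal sketch with hypothetical toy data showing how pandas `suffixes` behaves: only the overlapping non-key columns are renamed, while the join key keeps its name.

```python
import pandas as pd

# Two toy frames sharing a join key and one overlapping column (hypothetical data)
df1 = pd.DataFrame({'vin_17': ['A', 'B'], 'price': [1, 2]})
df2 = pd.DataFrame({'vin_17': ['A', 'B'], 'price': [3, 4]})

merged = pd.merge(left=df1, right=df2, on='vin_17', how='inner',
                  suffixes=('__df1', '__df2'))

# The key column 'vin_17' is untouched; only the duplicated 'price' gets suffixed
print(list(merged.columns))  # ['vin_17', 'price__df1', 'price__df2']
```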
This is what I do in PySpark, but it renames every column. I only want the duplicated columns to get the `__suffix`:
```python
from pyspark.sql.functions import col

df1 = df1.select(*(col(x).alias(x + '__df1') for x in df1.columns))
df2 = df2.select(*(col(x).alias(x + '__df2') for x in df2.columns))
df3 = df3.select(*(col(x).alias(x + '__df3') for x in df3.columns))
```
When renaming, you can restrict the suffixing to just the columns that are common across the three dataframes:
```python
from pyspark.sql.functions import col

def get_common_cols(dfs: list):
    seen = set()
    repeated = set()
    for cols in [df.columns for df in dfs]:
        for c in set(cols):
            if c in seen:
                repeated.add(c)
            else:
                seen.add(c)
    return list(repeated)

common = get_common_cols([df1, df2, df3])

# rename only if x exists in common
df1 = df1.select(*[col(x).alias(f"{x}__df1" if x in common else x) for x in df1.columns])
df2 = df2.select(*[col(x).alias(f"{x}__df2" if x in common else x) for x in df2.columns])
df3 = df3.select(*[col(x).alias(f"{x}__df3" if x in common else x) for x in df3.columns])
```
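The common-column detection above can be checked without a Spark session. The sketch below runs the same logic on plain lists standing in for each DataFrame's `.columns` (the column names are hypothetical toy values), and sorts the result so the output is deterministic:

```python
# Pure-Python check of the common-column logic, using plain lists
# in place of DataFrame.columns (no Spark session needed).
def get_common_cols(col_lists):
    seen = set()
    repeated = set()
    for cols in col_lists:
        for c in set(cols):        # set() so a duplicate inside one frame doesn't count
            if c in seen:
                repeated.add(c)    # seen in an earlier frame -> common
            else:
                seen.add(c)
    return sorted(repeated)        # sorted for a stable, readable result

print(get_common_cols([
    ['vin_17', 'price'],
    ['vin_17', 'mileage'],
    ['vin_17', 'price'],
]))  # ['price', 'vin_17']
```

Only `price` and `vin_17` appear in more than one frame, so only those would be suffixed; `mileage` keeps its original name.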