How to add a column to a data frame?
I have the following code:

db_fields = ("id", "email", "status", "source")
df = DataFrame(results)
for col in db_fields:
    if col not in df.columns:
        # COLUMN IS MISSING - COMMAND TO ADD COLUMN
For example, if the status column is missing, it should be added to the data frame without any values, so that when I export df to csv I always get the same field schema.
I know that to delete a column I should do:

df = df.drop(col, 1)

But I don't know what the best way is to add a column with empty values.
This will add Null values in the status column:
import numpy as np
df['status'] = np.nan
Or:
df['status'] = None
So:
db_fields = ("id", "email", "status", "source")
for col in db_fields:
    if col not in df.columns:
        df[col] = None
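Run against a small frame, the loop above creates whatever columns are absent (a minimal sketch; the sample data is made up for illustration):

```python
import pandas as pd

# Sample frame missing the "email" and "status" columns
df = pd.DataFrame({"id": [1, 2], "source": ["web", "api"]})

db_fields = ("id", "email", "status", "source")
for col in db_fields:
    if col not in df.columns:
        df[col] = None  # new column filled entirely with nulls

print(list(df.columns))  # ['id', 'source', 'email', 'status']
```

One detail worth knowing: assigning None produces an object-dtype column, while assigning np.nan produces float64; either one is written out as an empty field by to_csv.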
You can build the list of missing columns and create them all at once with assign and a dictionary:
import numpy as np
import pandas as pd

df = pd.DataFrame({'id': ['a1','a2', 'b1'],
                   'a': ['a1','a2', 'b1'],
                   'source': ['a1','a2', 'b1']})
print (df)
   id   a source
0  a1  a1     a1
1  a2  a2     a2
2  b1  b1     b1
db_fields = ("id", "email", "status", "source")
#get missing columns
diff = np.setdiff1d(np.array(db_fields), df.columns)
print (diff)
['email' 'status']
#get original columns not existed in db_fields
diff1 = np.setdiff1d(df.columns, np.array(db_fields)).tolist()
print (diff1)
['a']
#add missing columns and reorder
d = dict.fromkeys(diff, np.nan)
df = df.assign(**d)[diff1 + list(db_fields)]
print (df)
    a  id  email  status source
0  a1  a1    NaN     NaN     a1
1  a2  a2    NaN     NaN     a2
2  b1  b1    NaN     NaN     b1
#if db_fields should come first
df = df.assign(**d)[list(db_fields) + diff1]
print (df)
   id  email  status source   a
0  a1    NaN     NaN     a1  a1
1  a2    NaN     NaN     a2  a2
2  b1    NaN     NaN     b1  b1
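An equivalent single call (not from the answer above, but standard pandas) is DataFrame.reindex, which adds any missing columns filled with NaN and reorders in one step; the sample frame here is made up:

```python
import pandas as pd

df = pd.DataFrame({'id': ['a1', 'a2'],
                   'a': ['x', 'y'],
                   'source': ['s1', 's2']})
db_fields = ("id", "email", "status", "source")

# Columns from db_fields come first; extra columns ('a') are appended after
extra = [c for c in df.columns if c not in db_fields]
df = df.reindex(columns=list(db_fields) + extra)

print(list(df.columns))  # ['id', 'email', 'status', 'source', 'a']
```

This avoids computing the set differences by hand: reindex fills every column it cannot find in the original frame with NaN.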
Here it is, plain and simple, in a single line:
import numpy as np

db_fields = ("id", "email", "status", "source")
df = DataFrame(results)
for col in db_fields:
    if col not in df.columns:
        # Add the column
        df[col] = np.nan
By the way: you can also delete a column with df.drop(inplace=True).
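A short sketch of that remark, using the keyword form of drop that current pandas accepts (the sample frame is made up):

```python
import pandas as pd

df = pd.DataFrame({'id': [1], 'status': [None]})
df.drop(columns='status', inplace=True)  # remove the column in place

print(list(df.columns))  # ['id']
```

With inplace=True, drop modifies the frame directly and returns None, so there is nothing to reassign.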