Python pandas to ensure each row based on column value has a set of data present, if not add row
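I am organizing AWS resources for tagging and have captured the data in a CSV file. A sample of the CSV file is shown below. For every resource_id, I need to make sure that a fixed set of tag_key values is present. That set is:
tag_key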
Application
Client
Environment
Name
Owner
Project
Purpose
I am new to pandas and have only managed to read the CSV file into a DataFrame:
import pandas as pd
file_name = "z.csv"
df = pd.read_csv(file_name, names=['resource_id', 'resource_type', 'tag_key', 'tag_value'])
print(df)
CSV file
vol-00441b671ca48ba41,volume,Environment,Development
vol-00441b671ca48ba41,volume,Name,Database Files
vol-00441b671ca48ba41,volume,Project,Application Development
vol-00441b671ca48ba41,volume,Purpose,Web Server
i-1234567890abcdef0,instance,Environment,Production
i-1234567890abcdef0,instance,Owner,Fast Company
I expect output like the following:
vol-00441b671ca48ba41,volume,Environment,Development
vol-00441b671ca48ba41,volume,Name,Database Files
vol-00441b671ca48ba41,volume,Project,Application Development
vol-00441b671ca48ba41,volume,Purpose,Web Server
vol-00441b671ca48ba41,volume,Client,
vol-00441b671ca48ba41,volume,Owner,
vol-00441b671ca48ba41,volume,Application,
i-1234567890abcdef0,instance,Environment,Production
i-1234567890abcdef0,instance,Owner,Fast Company
i-1234567890abcdef0,instance,Application,
i-1234567890abcdef0,instance,Client,
i-1234567890abcdef0,instance,Name,
i-1234567890abcdef0,instance,Project,
i-1234567890abcdef0,instance,Purpose,
Let's work with a slightly simpler example. I have a DataFrame df:
df = pd.DataFrame(data={'a': [1, 1, 2, 2], 'b': [[1, 2], [3, 5], [1, 2], [5]]})
which returns:
a b
0 1 [1, 2]
1 1 [3, 5]
2 2 [1, 2]
3 2 [5]
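The required values of b are 1, 2, 3, 4 and 5.
First we need to find out what we "already have" for each value of a. We do that like this:
def flatten(lsts):
    # Merge a sequence of lists into a single flat list.
    return [j for i in lsts for j in i]
df_new = df.groupby(by=['a'])['b'].apply(flatten)
which returns: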
a
1 [1, 2, 3, 5]
2 [1, 2, 5]
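Now we need to work out which values are missing for each a and add them as new rows:
df_new = df_new.reset_index()
lst_wanted = [1, 2, 3, 4, 5]
# Collect every missing (a, b) pair first, then add them all in one step.
# DataFrame.append was removed in pandas 2.0, so pd.concat is used here.
missing = [{'a': row.a, 'b': j}
           for row in df_new.itertuples()
           for j in lst_wanted
           if j not in row.b]
df = pd.concat([df, pd.DataFrame(missing)], ignore_index=True)
print(df)
which returns: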
a b
0 1 [1, 2]
1 1 [3, 5]
2 2 [1, 2]
3 2 [5]
4 1 4
5 2 3
6 2 4
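The same idea can be carried over to the columns from the question. The following is only a sketch under a few assumptions: the sample rows are inlined through io.StringIO instead of being read from z.csv, the names required, have and missing are illustrative, and the added rows get an empty string as tag_value.

```python
import io
import pandas as pd

# Sample rows copied from the question's CSV (inlined so the sketch is self-contained).
csv_text = """vol-00441b671ca48ba41,volume,Environment,Development
vol-00441b671ca48ba41,volume,Name,Database Files
vol-00441b671ca48ba41,volume,Project,Application Development
vol-00441b671ca48ba41,volume,Purpose,Web Server
i-1234567890abcdef0,instance,Environment,Production
i-1234567890abcdef0,instance,Owner,Fast Company
"""
cols = ['resource_id', 'resource_type', 'tag_key', 'tag_value']
df = pd.read_csv(io.StringIO(csv_text), names=cols)

required = ['Application', 'Client', 'Environment', 'Name',
            'Owner', 'Project', 'Purpose']

# What each resource "already has": a set of tag keys per (resource_id, resource_type).
have = df.groupby(['resource_id', 'resource_type'], sort=False)['tag_key'].apply(set)

# One new row per missing tag key, with an empty tag_value (an assumption).
missing = [
    {'resource_id': rid, 'resource_type': rtype, 'tag_key': key, 'tag_value': ''}
    for (rid, rtype), keys in have.items()
    for key in required
    if key not in keys
]
result = pd.concat([df, pd.DataFrame(missing, columns=cols)], ignore_index=True)

# Group the rows by resource again, in order of first appearance; the stable
# sort keeps each resource's original rows ahead of the rows that were added.
order = {rid: pos for pos, rid in enumerate(df['resource_id'].unique())}
result = result.sort_values('resource_id', key=lambda s: s.map(order),
                            kind='stable', ignore_index=True)
print(result.to_csv(index=False, header=False))
```

Building the list of missing rows first and concatenating once avoids growing the DataFrame inside a loop.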
One way is to use MultiIndex.from_product and reindex:
taglist = ['Application',
           'Client',
           'Environment',
           'Name',
           'Owner',
           'Project',
           'Purpose']
df_out = df.set_index(['resource_id','tag_key'])\
           .reindex(pd.MultiIndex.from_product([df['resource_id'].unique(), taglist],
                                               names=['resource_id','tag_key']))
df_out.assign(resource_type = df_out.groupby('resource_id')['resource_type']\
                                    .ffill().bfill()).reset_index()
Output:
              resource_id      tag_key  resource_type                tag_value
0   vol-00441b671ca48ba41  Application         volume                      NaN
1   vol-00441b671ca48ba41       Client         volume                      NaN
2   vol-00441b671ca48ba41  Environment         volume              Development
3   vol-00441b671ca48ba41         Name         volume           Database Files
4   vol-00441b671ca48ba41        Owner         volume                      NaN
5   vol-00441b671ca48ba41      Project         volume  Application Development
6   vol-00441b671ca48ba41      Purpose         volume               Web Server
7     i-1234567890abcdef0  Application       instance                      NaN
8     i-1234567890abcdef0       Client       instance                      NaN
9     i-1234567890abcdef0  Environment       instance               Production
10    i-1234567890abcdef0         Name       instance                      NaN
11    i-1234567890abcdef0        Owner       instance             Fast Company
12    i-1234567890abcdef0      Project       instance                      NaN
13    i-1234567890abcdef0      Purpose       instance                      NaN
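To get closer to the CSV layout the question asks for, the reindex approach could be wrapped in a small helper that restores the original column order and blanks out the missing values. This is a hedged sketch, not part of the answer above: ensure_tags is a hypothetical name, the sample rows are inlined via io.StringIO, and it assumes each (resource_id, tag_key) pair occurs at most once, since reindex raises an error on duplicate index labels.

```python
import io
import pandas as pd

def ensure_tags(df, taglist):
    """Hypothetical helper: return one row per (resource_id, tag) pair."""
    full_index = pd.MultiIndex.from_product(
        [df['resource_id'].unique(), taglist],
        names=['resource_id', 'tag_key'])
    out = (df.set_index(['resource_id', 'tag_key'])
             .reindex(full_index)
             .reset_index())
    # GroupBy 'first' returns the first non-null value per group, so this
    # copies each resource's type onto the rows that reindex just created.
    out['resource_type'] = out.groupby('resource_id')['resource_type'].transform('first')
    # Blank out missing tag values instead of leaving NaN.
    out['tag_value'] = out['tag_value'].fillna('')
    return out[['resource_id', 'resource_type', 'tag_key', 'tag_value']]

# Sample rows copied from the question's CSV.
csv_text = """vol-00441b671ca48ba41,volume,Environment,Development
vol-00441b671ca48ba41,volume,Name,Database Files
vol-00441b671ca48ba41,volume,Project,Application Development
vol-00441b671ca48ba41,volume,Purpose,Web Server
i-1234567890abcdef0,instance,Environment,Production
i-1234567890abcdef0,instance,Owner,Fast Company
"""
df = pd.read_csv(io.StringIO(csv_text),
                 names=['resource_id', 'resource_type', 'tag_key', 'tag_value'])
taglist = ['Application', 'Client', 'Environment', 'Name',
           'Owner', 'Project', 'Purpose']

out = ensure_tags(df, taglist)
print(out.to_csv(index=False, header=False))
```

Unlike the expected output in the question, the rows for each resource come out in taglist order rather than with the existing tags first.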