
CSV comparison with Python multiple index

I need to compare two CSV files and write the changed, unchanged, and deleted rows to a third CSV file. The first CSV file looks like this:

location locationid sitename siteid country price
zoo         1         xxx      490     US     5
hosp        2         yyy      590     CA     7
rose        3         ccc      389     UK     5
lily        4         bbb      255     UK     3

The second CSV file:

location locationid sitename siteid country price
zoo         1         xxx      490     US     4
hosp        2         yyy      590     CA     7
rose        3         ccc      389     ZW     2
zoo         1         sss      344     ME     3 
fol         9                          RU     11

In the end, this is the result I want to get:

location locationid sitename siteid country price status
zoo         1         xxx      490     US     4     changed
hosp        2         yyy      590     CA     7     same
rose        3         ccc      389     UK     5     new
lily        4         bbb      255     UK     3     deleted
zoo         9         sss      344     ME     3     new
fol         9                          RU     11    new

If a new country is added for a siteid, the row gets the status new. A location can have multiple siteids. I want to detect a new country for a specific combination of location and siteid, with both of them as the condition rather than just one. Some siteids in the dataset are NA, which is why I included location here, so in some cases the status has to be determined from the location.

Here is my code, but it is not working as I wanted. If you can help me, that would be really great :)

df1 = pd.read_csv(file1).fillna(0)
df2 = pd.read_csv(file2).fillna(0)
df1.set_index(['location','locationid','sitename','siteid','country'])
df2.set_index(['location','locationid','sitename','siteid','country'])
df3 = pd.concat([df1,df2],sort=False)
df3=df3.set_index(['location','locationid','sitename','siteid','country'])

df3.drop_duplicates()

df3a = df3.stack(dropna=False).groupby(level=[0,1]).unique().unstack().copy()

df3a.loc[~df3a.index.isin(df2.index),'status'] = 'deleted' # if not in df2 index then deleted
df3a.loc[~df3a.index.isin(df1.index),'status'] = 'new'     # if not in df1 index then new
idx = df3.stack().groupby(level=[0,1]).nunique() # get modified cells. 
df3a.loc[idx.mask(idx <= 1).dropna().index.get_level_values(0),'status'] = 'modified'
df3a['status'] = df3a['status'].fillna('same') # assume that anything not fulfilled by above rules is the same.

I'm not yet convinced this can be done exclusively with pandas operators. You do have several problems in your code. set_index returns a new data frame; it doesn't modify the frame in place. So you need:

df1 = df1.set_index(['location','locationid','sitename','siteid','country'])
df2 = df2.set_index(['location','locationid','sitename','siteid','country'])
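
To see why the reassignment matters, here is a small sketch with a made-up three-column frame (the data and variable names are illustrative only, not taken from the question's files):

```python
import pandas as pd

# Hypothetical miniature frame, just to show the behaviour.
df = pd.DataFrame(
    {"location": ["zoo", "hosp"], "siteid": [490, 590], "price": [5, 7]}
)

# Without assignment the returned frame is discarded and df is untouched.
df.set_index(["location", "siteid"])
columns_after_bare_call = list(df.columns)

# Assigning the result is what actually gives you the indexed frame.
indexed = df.set_index(["location", "siteid"])
```

After the bare call, location and siteid are still ordinary columns of df; only the assigned frame has them as index levels.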

Once you do that, you don't have to call set_index on df3. You really want to add the "status" value to df3, not df3a; after the grouping, df3a no longer looks like what you need. I'm not sure the grouping is really the answer; I'm afraid you're going to have to iterate over the rows that are in both files and compare the "price" value to the one in df1. You can find out which rows are in which file with:

indf1 = df3.index.isin(df1.index)
indf2 = df3.index.isin(df2.index)
inboth = indf1 & indf2
df3.loc[~indf2,'status'] = 'deleted'
df3.loc[~indf1,'status'] = 'new'

But after that, I think you'll need to iterate.
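
One possible way to finish that iteration is sketched below. It is only a sketch under several assumptions: the helper name compare_frames is hypothetical, the five key columns are the ones the question passes to set_index, "price" is the only value compared, missing key values have already been filled (as the question does with fillna(0)), and each key occurs at most once per file. With these keys, a country change shows up as one deleted row plus one new row.

```python
import pandas as pd

KEYS = ["location", "locationid", "sitename", "siteid", "country"]


def compare_frames(df1, df2):
    """Label each row as same, changed, new or deleted (hypothetical helper)."""
    old = df1.set_index(KEYS)
    new = df2.set_index(KEYS)

    # Rows of the second file first, then rows that exist only in the first.
    only_old = old.loc[~old.index.isin(new.index)]
    result = pd.concat([new, only_old])

    # Price lookups keyed by the full index tuple.
    old_price = old["price"].to_dict()
    new_price = new["price"].to_dict()

    def status_for(key):
        if key not in new_price:
            return "deleted"
        if key not in old_price:
            return "new"
        return "same" if old_price[key] == new_price[key] else "changed"

    # The explicit iteration over rows.
    result["status"] = [status_for(key) for key in result.index]
    return result.reset_index()


# Small made-up sample in the shape of the question's files.
old_df = pd.DataFrame(
    [["zoo", 1, "xxx", 490, "US", 5],
     ["hosp", 2, "yyy", 590, "CA", 7],
     ["lily", 4, "bbb", 255, "UK", 3]],
    columns=KEYS + ["price"],
)
new_df = pd.DataFrame(
    [["zoo", 1, "xxx", 490, "US", 4],
     ["hosp", 2, "yyy", 590, "CA", 7],
     ["zoo", 1, "sss", 344, "ME", 3]],
    columns=KEYS + ["price"],
)
report = compare_frames(old_df, new_df)
```

The resulting frame can then be written out as the third file with report.to_csv(path, index=False), where path is whatever output location you choose.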
