
Merge Pandas Dataframe under certain conditions

I have two dataframes. One is the "gold" one, which means I need to keep all of its rows after merging. The other is the reference. Below is a sneak peek of the two dataframes.

gold
    doc_name                              mention                                 id         
0    doc_1                                  US                             United States         
0    doc_1                                Georgia                               Atl  
0    doc_1                                 Bama                                Selma  
0    doc_1                                Europe                                UK
0    doc_2                                 HSBC                               HK Bank Central  
0    doc_2                                  NC                                Charlotte  
       :                                    :                                    :
       :                                    :                                    :
0    doc_n                                  CA                                San Jose  
reference
    doc_name                               text                                         
0    doc_1                                The US                                      
0    doc_1                          Georgia's Fried Chicken                                
0    doc_1                              Bama Football                                 
0    doc_1                                 HSBC                                
0    doc_1                             Bank of America                               
0    doc_1                               NC Panthers
0    doc_1                               MI Packers
0    doc_1                               NC Panthers                                
       :                                    :                                    
       :                                    :                                    
0    doc_n                               CA's apt                                  

I tried to merge the two dataframes with an outer join, df = pd.merge(gold, reference, right_on=['doc_name'], left_on=['doc_name'], how='outer'), and then filter the rows by checking whether the "text" column contains the string in the "mention" column. But if I do that I lose rows from the gold dataframe, which I do not want.

The output that I would like to have looks like this:

    doc_name                 mention                id                text    
0    doc_1                     US               United States        The US        
0    doc_1                   Georgia                Atl          Georgia's Fried Chicken
0    doc_1                     Bama                Selma          Bama Football
0    doc_1                    Europe                UK                 NaN
0    doc_2                    HSBC             HK Bank Central         HSBC
0    doc_2                    NC                 Charlotte           NC Panthers
       :                       :                     :                   :
       :                       :                     :                   :
0    doc_n                    CA                  San Jose           CA's apt

I basically want to keep all the gold dataframe rows, and add the "text" column from the reference dataframe wherever the text contains the string from gold's "mention" column. I have been trying to do that but still couldn't find a good way. It would be great if you have some ideas or suggestions. Thank you so much!

gold raw csv:

doc_name,mention,id
chtb_165.en,Xinhua News Agency,Xinhua News Agency
chtb_165.en,Shanghai,Shanghai
chtb_165.en,HSBC,HSBC
chtb_165.en,China Shipping Mansion,International Ocean Shipping Building
chtb_165.en,Pudong Lujiazui financial trading district,Lujaizui
chtb_165.en,Pudong,Pudong
chtb_165.en,US,United States
chtb_165.en,Citibank,Citibank
chtb_165.en,Hong Kong,Hong Kong
chtb_165.en,Japan,Japan
chtb_165.en,Tokyo Mitsubishi Bank,The Bank of Tokyo-Mitsubishi UFJ
VOA20001129.2000.036,Washington,"Washington, D.C."
VOA20001129.2000.036,Supreme Court,Supreme Court of the United States
VOA20001129.2000.036,Joe O'Grossman,Joel Grossman
VOA20001129.2000.036,Baltimore,Baltimore
VOA20001129.2000.036,Johns Hopkins University,Johns Hopkins University
VOA20001129.2000.036,Lawrence Tribe,Laurence Tribe
VOA20001129.2000.036,Gore,Al Gore
VOA20001129.2000.036,legislature,Florida Legislature
VOA20001129.2000.036,Congress,United States Congress

reference raw csv:

doc_name,text
VOA20001129.2000.036,the Bush
VOA20001129.2000.036,American election
VOA20001129.2000.036,Congress
VOA20001129.2000.036,George W Bush
chtb_165.en,Xinhua News Agency
chtb_165.en,Shanghai
chtb_165.en,HSBC
chtb_165.en,China Shipping
chtb_165.en,Mansion
chtb_165.en,RMB
chtb_165.en,the US
chtb_165.en,"Citibank , Hong Kong"
chtb_165.en,Japan
chtb_165.en,Tokyo Mitsubishi Bank
chtb_165.en,Industrial Bank
chtb_165.en,Branch
chtb_165.en,Chartered Bank
chtb_165.en,BNP
chtb_165.en,Paris
chtb_165.en,Bank
chtb_165.en,Dai-Ichi Kangyo Bank
chtb_165.en,Sanwa Bank
chtb_165.en,Financial Trading
chtb_165.en,District
chtb_165.en,Franklin Templeton
chtb_165.en,Company
chtb_165.en,California
chtb_165.en,US dollars
chtb_165.en,China
chtb_165.en,Asian
chtb_165.en,Securities
chtb_165.en,Building
chtb_165.en,Hong Kong
chtb_165.en,Japan Industrial Bank
chtb_165.en,Holland
chtb_165.en,Belgium
chtb_165.en,Credit Bank
chtb_165.en,Waitan

I have the answer you want here. It generates an "output.csv" which you can read with pandas as a dataframe to get the expected result.

Here is my "output.csv". Many "text" cells are empty because your sample input (reference.csv and gold.csv) is only a small subset. If you test on your full input CSVs, you should get more complete output:

doc_name,mention,id,text
VOA20001129.2000.036,Washington,Washington D.C.,
VOA20001129.2000.036,Supreme Court,Supreme Court of the United States,
VOA20001129.2000.036,Joe O'Grossman,Joel Grossman,
VOA20001129.2000.036,Baltimore,Baltimore,
VOA20001129.2000.036,Johns Hopkins University,Johns Hopkins University,
VOA20001129.2000.036,Lawrence Tribe,Laurence Tribe,
VOA20001129.2000.036,Gore,Al Gore,
VOA20001129.2000.036,legislature,Florida Legislature,
VOA20001129.2000.036,Congress,United States Congress,Congress
chtb_165.en,Xinhua News Agency,Xinhua News Agency,Xinhua News Agency
chtb_165.en,Shanghai,Shanghai,Shanghai
chtb_165.en,HSBC,HSBC,HSBC
chtb_165.en,China Shipping Mansion,International Ocean Shipping Building,
chtb_165.en,Pudong Lujiazui financial trading district,Lujaizui,
chtb_165.en,Pudong,Pudong,
chtb_165.en,US,United States,the US
chtb_165.en,Citibank,Citibank,Citibank  Hong Kong
chtb_165.en,Hong Kong,Hong Kong,Citibank  Hong Kong
chtb_165.en,Japan,Japan,Japan
chtb_165.en,Tokyo Mitsubishi Bank,The Bank of Tokyo-Mitsubishi UFJ,Tokyo Mitsubishi Bank

And finally, here is the code:

from collections import OrderedDict
import inspect

"""
Note: Only works on Python 3.6+
"""

class GoldClass:
    def __init__(self):
        self.mention = []
        self.id = []

def retrieve_name(var):
    callers_local_vars = inspect.currentframe().f_back.f_locals.items()
    return [var_name for var_name, var_val in callers_local_vars if var_val is var][0]

def get_nth_key(dictionary, n):
    if n < 0:
        n += len(dictionary)
    for i, key in enumerate(dictionary.keys()):
        if i == n:
            return key
    raise IndexError("dictionary index out of range")

with open("reference.csv") as reference_file:
    reference_list = reference_file.readlines()

with open("gold.csv") as gold_file:
    gold_list = gold_file.readlines()

reference_dict = OrderedDict()
for x in range(len(reference_list)):
    if x == 0:
        continue
    reference_list[x] = reference_list[x].strip()
    if reference_list[x].count(',') > 1:
        temp1 = reference_list[x].split(",")[0]
        temp2 = reference_list[x][len(temp1)+1:]
        temp2 = temp2.replace(",","").replace('"',"")
        reference_list[x] = temp1+","+temp2
    try:
        reference_dict[reference_list[x].split(",")[0]]
    except:
        reference_dict[reference_list[x].split(",")[0]] = []
    reference_dict[reference_list[x].split(",")[0]].append(reference_list[x].split(",")[1])

for x in range(len(gold_list)):
    if x == 0:
        continue
    gold_list[x] = gold_list[x].strip()
    if gold_list[x].count(',') > 2:
        temp1 = gold_list[x].split(",")[0]
        temp2 = gold_list[x].split(",")[1]
        temp3 = gold_list[x][len(temp1)+len(temp2)+2:]
        temp3 = temp3.replace(",","").replace('"',"")
        gold_list[x] = temp1+","+temp2+","+temp3
    temp_doc_name = gold_list[x].split(",")[0]
    temp_mention = gold_list[x].split(",")[1]
    temp_id = gold_list[x].split(",")[2]
    temp_index = list(reference_dict.keys()).index(temp_doc_name)
    try:
        exec("goldclass_"+str(temp_index))
    except:
        exec("goldclass_"+str(temp_index)+" = GoldClass()")
    exec("goldclass_"+str(temp_index)+".mention.append(temp_mention)")
    exec("goldclass_"+str(temp_index)+".id.append(temp_id)")

goldclass_objectlist = []
goldclass_iterator = 0
while True:
    try:
        exec("goldclass_objectlist.append(goldclass_"+str(goldclass_iterator)+")")
        goldclass_iterator = goldclass_iterator + 1
    except:
        break


final_lines = []
final_lines.append("doc_name,mention,id,text")
for temp4 in goldclass_objectlist:
    final_doc_name = get_nth_key(reference_dict,int(retrieve_name(temp4).split("_")[1]))
    for x in range(len(temp4.id)):
        final_mention = temp4.mention[x]
        final_id = temp4.id[x]
        final_text = ""
        for y in reference_dict[final_doc_name]:
            if final_mention in y:
                final_text = y
                break
        final_lines.append(final_doc_name+","+final_mention+","+final_id+","+final_text)

f = open("output.csv", "w")
for x in final_lines:
    f.write(x+"\n")
f.close()
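For comparison, the same "first reference text that contains the mention" lookup can be sketched with the standard csv module, which handles quoted fields such as "Washington, D.C." without stripping the comma. This is only a sketch under assumptions: the inline sample strings stand in for gold.csv and reference.csv, and attach_first_text is a hypothetical helper name, not part of the script above.

```python
import csv
import io
from collections import defaultdict

# Small inline samples standing in for gold.csv and reference.csv (assumption).
GOLD_CSV = """doc_name,mention,id
chtb_165.en,US,United States
chtb_165.en,Pudong,Pudong
VOA20001129.2000.036,Washington,"Washington, D.C."
VOA20001129.2000.036,Congress,United States Congress
"""

REFERENCE_CSV = """doc_name,text
VOA20001129.2000.036,Congress
chtb_165.en,the US
chtb_165.en,"Citibank , Hong Kong"
chtb_165.en,US dollars
"""


def attach_first_text(gold_file, reference_file):
    """Return every gold row plus the first reference text containing its mention."""
    # Group the reference texts by document, keeping file order.
    texts_by_doc = defaultdict(list)
    for row in csv.DictReader(reference_file):
        texts_by_doc[row["doc_name"]].append(row["text"])

    merged = []
    for row in csv.DictReader(gold_file):
        candidates = texts_by_doc.get(row["doc_name"], [])
        # First text that contains the mention as a substring, else empty string.
        text = next((t for t in candidates if row["mention"] in t), "")
        merged.append({**row, "text": text})
    return merged


rows = attach_first_text(io.StringIO(GOLD_CSV), io.StringIO(REFERENCE_CSV))

# Write the result as CSV text; swap io.StringIO for open("output.csv", "w", newline="") to save it.
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=["doc_name", "mention", "id", "text"])
writer.writeheader()
writer.writerows(rows)
print(out.getvalue())
```

Every gold row is kept whether or not a text matches, and a document that is missing from the reference simply yields empty "text" cells.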

How do you want to handle it when there are multiple texts in the reference with the same mentions from the gold? These would create repeated rows.


Given:

gold.csv

doc_name,mention,id
doc_1,US,United States         
doc_1,Georgia,Atl  
doc_1,Bama,Selma  
doc_1,Europe,UK
doc_2,HSBC,HK Bank Central  
doc_2,NC,Charlotte  
chtb_165.en,Xinhua News Agency,Xinhua News Agency
chtb_165.en,Shanghai,Shanghai
chtb_165.en,HSBC,HSBC
chtb_165.en,China Shipping Mansion,International Ocean Shipping Building
chtb_165.en,Pudong Lujiazui financial trading district,Lujaizui
chtb_165.en,Pudong,Pudong
chtb_165.en,US,United States
chtb_165.en,Citibank,Citibank
chtb_165.en,Hong Kong,Hong Kong
chtb_165.en,Japan,Japan
chtb_165.en,Tokyo Mitsubishi Bank,The Bank of Tokyo-Mitsubishi UFJ
VOA20001129.2000.036,Washington,"Washington, D.C."
VOA20001129.2000.036,Supreme Court,Supreme Court of the United States
VOA20001129.2000.036,Joe O'Grossman,Joel Grossman
VOA20001129.2000.036,Baltimore,Baltimore
VOA20001129.2000.036,Johns Hopkins University,Johns Hopkins University
VOA20001129.2000.036,Lawrence Tribe,Laurence Tribe
VOA20001129.2000.036,Gore,Al Gore
VOA20001129.2000.036,legislature,Florida Legislature
VOA20001129.2000.036,Congress,United States Congress

reference.csv

doc_name,text
doc_1,The US                                      
doc_1,Georgia's Fried Chicken                                
doc_1,Bama Football                                 
doc_1,HSBC                                
doc_1,Bank of America                               
doc_1,NC Panthers
doc_1,MI Packers
doc_1,NC Panthers
VOA20001129.2000.036,the Bush
VOA20001129.2000.036,American election
VOA20001129.2000.036,Congress
VOA20001129.2000.036,George W Bush
chtb_165.en,Xinhua News Agency
chtb_165.en,Shanghai
chtb_165.en,HSBC
chtb_165.en,China Shipping
chtb_165.en,Mansion
chtb_165.en,RMB
chtb_165.en,the US
chtb_165.en,"Citibank , Hong Kong"
chtb_165.en,Japan
chtb_165.en,Tokyo Mitsubishi Bank
chtb_165.en,Industrial Bank
chtb_165.en,Branch
chtb_165.en,Chartered Bank
chtb_165.en,BNP
chtb_165.en,Paris
chtb_165.en,Bank
chtb_165.en,Dai-Ichi Kangyo Bank
chtb_165.en,Sanwa Bank
chtb_165.en,Financial Trading
chtb_165.en,District
chtb_165.en,Franklin Templeton
chtb_165.en,Company
chtb_165.en,California
chtb_165.en,US dollars
chtb_165.en,China
chtb_165.en,Asian
chtb_165.en,Securities
chtb_165.en,Building
chtb_165.en,Hong Kong
chtb_165.en,Japan Industrial Bank
chtb_165.en,Holland
chtb_165.en,Belgium
chtb_165.en,Credit Bank
chtb_165.en,Waitan

Create a column that looks for those mentions in the text, using a pattern that joins all mentions with the | ("or") operator. Once each text is matched up with the mention it contains, you can merge on that column.

import re

import pandas as pd

gold = pd.read_csv('C:/test/gold.csv')
reference = pd.read_csv('C:/test/reference.csv')

# Join all mentions into one alternation; re.escape keeps regex metacharacters literal
pat = '|'.join(re.escape(x) for x in gold.mention)
reference['mention_test'] = reference.text.str.extract('('+ pat + ')', expand=False)
df = pd.merge(gold, reference, how='left', left_on= ['doc_name','mention'], right_on=['doc_name','mention_test']).drop('mention_test', axis=1)

df.to_csv('output.csv', index=False)

Output:

print(df.to_string())
                doc_name                                     mention                                     id                                                     text
0                  doc_1                                          US                 United States                      The US                                      
1                  doc_1                                     Georgia                                  Atl    Georgia's Fried Chicken                                
2                  doc_1                                        Bama                                Selma             Bama Football                                 
3                  doc_1                                      Europe                                     UK                                                      NaN
4                  doc_2                                        HSBC                      HK Bank Central                                                        NaN
5                  doc_2                                          NC                            Charlotte                                                        NaN
6            chtb_165.en                          Xinhua News Agency                     Xinhua News Agency                                       Xinhua News Agency
7            chtb_165.en                                    Shanghai                               Shanghai                                                 Shanghai
8            chtb_165.en                                        HSBC                                   HSBC                                                     HSBC
9            chtb_165.en                      China Shipping Mansion  International Ocean Shipping Building                                                      NaN
10           chtb_165.en  Pudong Lujiazui financial trading district                               Lujaizui                                                      NaN
11           chtb_165.en                                      Pudong                                 Pudong                                                      NaN
12           chtb_165.en                                          US                          United States                                                   the US
13           chtb_165.en                                          US                          United States                                               US dollars
14           chtb_165.en                                    Citibank                               Citibank                                     Citibank , Hong Kong
15           chtb_165.en                                   Hong Kong                              Hong Kong                                                Hong Kong
16           chtb_165.en                                       Japan                                  Japan                                                    Japan
17           chtb_165.en                                       Japan                                  Japan                                    Japan Industrial Bank
18           chtb_165.en                       Tokyo Mitsubishi Bank       The Bank of Tokyo-Mitsubishi UFJ                                    Tokyo Mitsubishi Bank
19  VOA20001129.2000.036                                  Washington                       Washington, D.C.                                                      NaN
20  VOA20001129.2000.036                               Supreme Court     Supreme Court of the United States                                                      NaN
21  VOA20001129.2000.036                              Joe O'Grossman                          Joel Grossman                                                      NaN
22  VOA20001129.2000.036                                   Baltimore                              Baltimore                                                      NaN
23  VOA20001129.2000.036                    Johns Hopkins University               Johns Hopkins University                                                      NaN
24  VOA20001129.2000.036                              Lawrence Tribe                         Laurence Tribe                                                      NaN
25  VOA20001129.2000.036                                        Gore                                Al Gore                                                      NaN
26  VOA20001129.2000.036                                 legislature                    Florida Legislature                                                      NaN
27  VOA20001129.2000.036                                    Congress                 United States Congress                                                 Congress
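To see why some gold rows repeat in the output above, here is a minimal sketch of what the extracted column looks like on a few of the sample texts. The short mentions and texts lists are picked from the sample data for illustration; re.escape is used so the mentions are matched literally.

```python
import re

import pandas as pd

# A few mentions and reference texts taken from the sample data.
mentions = ["US", "Citibank", "Hong Kong", "Japan"]
texts = pd.Series(
    ["the US", "US dollars", "Citibank , Hong Kong", "Japan Industrial Bank", "RMB"]
)

# One alternation pattern; str.extract returns the first match in each text.
pat = "|".join(re.escape(m) for m in mentions)
matched = texts.str.extract("(" + pat + ")", expand=False)

print(pd.DataFrame({"text": texts, "mention_test": matched}))
```

Two things follow from this. "the US" and "US dollars" both map to the mention "US", which is what produces the two US rows after the merge. And because only the first match per text is kept, "Citibank , Hong Kong" is credited to "Citibank" only; the "Hong Kong" row in the output gets its text from the separate "Hong Kong" line in reference.csv.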

ADDITIONAL:

To combine those additional rows into one row each (keeping the same number of rows that gold.csv starts with):

import re

import pandas as pd

# gold and reference are the dataframes read in the previous snippet
pat = '|'.join(re.escape(x) for x in gold.mention)
reference['mention_test'] = reference.text.str.extract('('+ pat + ')', expand=False)
df = pd.merge(gold, reference, how='left', left_on= ['doc_name','mention'], right_on=['doc_name','mention_test']).drop('mention_test', axis=1)

duplicates = df[df.duplicated(subset=['doc_name','mention','id'], keep=False)]
aux = duplicates.groupby(['doc_name','mention','id'])['text'].apply('; '.join).reset_index()

df = df.drop(duplicates.index)
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
df = pd.concat([df, aux]).reset_index(drop=True)

df.to_csv('output.csv', index=False)

Output:

print(df.to_string())
                doc_name                                     mention                                     id                                                     text
0                  doc_1                                          US                 United States                      The US                                      
1                  doc_1                                     Georgia                                  Atl    Georgia's Fried Chicken                                
2                  doc_1                                        Bama                                Selma             Bama Football                                 
3                  doc_1                                      Europe                                     UK                                                      NaN
4                  doc_2                                        HSBC                      HK Bank Central                                                        NaN
5                  doc_2                                          NC                            Charlotte                                                        NaN
6            chtb_165.en                          Xinhua News Agency                     Xinhua News Agency                                       Xinhua News Agency
7            chtb_165.en                                    Shanghai                               Shanghai                                                 Shanghai
8            chtb_165.en                                        HSBC                                   HSBC                                                     HSBC
9            chtb_165.en                      China Shipping Mansion  International Ocean Shipping Building                                                      NaN
10           chtb_165.en  Pudong Lujiazui financial trading district                               Lujaizui                                                      NaN
11           chtb_165.en                                      Pudong                                 Pudong                                                      NaN
12           chtb_165.en                                    Citibank                               Citibank                                     Citibank , Hong Kong
13           chtb_165.en                                   Hong Kong                              Hong Kong                                                Hong Kong
14           chtb_165.en                       Tokyo Mitsubishi Bank       The Bank of Tokyo-Mitsubishi UFJ                                    Tokyo Mitsubishi Bank
15  VOA20001129.2000.036                                  Washington                       Washington, D.C.                                                      NaN
16  VOA20001129.2000.036                               Supreme Court     Supreme Court of the United States                                                      NaN
17  VOA20001129.2000.036                              Joe O'Grossman                          Joel Grossman                                                      NaN
18  VOA20001129.2000.036                                   Baltimore                              Baltimore                                                      NaN
19  VOA20001129.2000.036                    Johns Hopkins University               Johns Hopkins University                                                      NaN
20  VOA20001129.2000.036                              Lawrence Tribe                         Laurence Tribe                                                      NaN
21  VOA20001129.2000.036                                        Gore                                Al Gore                                                      NaN
22  VOA20001129.2000.036                                 legislature                    Florida Legislature                                                      NaN
23  VOA20001129.2000.036                                    Congress                 United States Congress                                                 Congress
24           chtb_165.en                                       Japan                                  Japan                             Japan; Japan Industrial Bank
25           chtb_165.en                                          US                          United States                                       the US; US dollars
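Note that in the output above the combined rows end up at the bottom of the frame. If the original gold order matters, one possible alternative is a single groupby with sort=False, sketched below. The small merged frame and the join_texts helper are hypothetical stand-ins for the left-merge result, not part of the answer's code.

```python
import pandas as pd

# Hypothetical frame shaped like the left-merge result (assumption).
merged = pd.DataFrame({
    "doc_name": ["chtb_165.en"] * 4,
    "mention": ["US", "US", "Pudong", "Japan"],
    "id": ["United States", "United States", "Pudong", "Japan"],
    "text": ["the US", "US dollars", None, "Japan"],
})


def join_texts(texts):
    """Join the non-missing texts of one gold row; NaN when nothing matched."""
    found = texts.dropna()
    return "; ".join(found) if len(found) else float("nan")


# sort=False keeps groups in order of first appearance, i.e. gold's order.
collapsed = (
    merged.groupby(["doc_name", "mention", "id"], sort=False)["text"]
    .agg(join_texts)
    .reset_index()
)
print(collapsed.to_string())
```

This yields one row per gold row, with "the US; US dollars" for the US mention and NaN where no text matched.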

ADDITION 2:

Finally, to keep only the first match, just drop the duplicates and keep the first instance:

import re

import pandas as pd

gold = pd.read_csv('C:/test/gold.csv')
reference = pd.read_csv('C:/test/reference.csv')

# Join all mentions into one alternation; re.escape keeps regex metacharacters literal
pat = '|'.join(re.escape(x) for x in gold.mention)
reference['mention_test'] = reference.text.str.extract('('+ pat + ')', expand=False)
df = pd.merge(gold, reference, how='left', left_on= ['doc_name','mention'], right_on=['doc_name','mention_test']).drop('mention_test', axis=1)

df = df.drop_duplicates(subset=['doc_name','mention','id'], keep='first')
df.to_csv('output.csv', index=False)
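One limitation of the extract-based approach is that each reference text is credited to a single mention. If a text such as "Citibank , Hong Kong" should count for both "Citibank" and "Hong Kong", a plain substring check per (gold row, text) pair is one possible alternative. The sketch below uses small inline frames in place of the CSV files, and gold_row is a helper column name introduced here for illustration.

```python
import pandas as pd

# Inline stand-ins for gold.csv and reference.csv (assumption).
gold = pd.DataFrame({
    "doc_name": ["chtb_165.en"] * 4,
    "mention": ["Citibank", "Hong Kong", "US", "Pudong"],
    "id": ["Citibank", "Hong Kong", "United States", "Pudong"],
})
reference = pd.DataFrame({
    "doc_name": ["chtb_165.en"] * 3,
    "text": ["the US", "Citibank , Hong Kong", "US dollars"],
})

# Pair every gold row with every text from the same document.
gold_keyed = gold.reset_index().rename(columns={"index": "gold_row"})
pairs = gold_keyed.merge(reference, on="doc_name", how="left")

# Keep only pairs whose text contains the mention as a plain substring.
contains = [
    isinstance(text, str) and mention in text
    for mention, text in zip(pairs["mention"], pairs["text"])
]
first_hit = pairs[contains].drop_duplicates(subset="gold_row", keep="first")

# Left-join the first matching text back onto gold; unmatched rows get NaN.
result = gold.join(first_hit.set_index("gold_row")["text"])
print(result.to_string())
```

The intermediate pairs frame grows with the number of gold rows times the number of texts per document, so this trades memory for the ability to match one text against several mentions.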
