How can I join two dataframes where one column holds two or more values

I have two dataframes similar to this:

import pandas as pd
A = pd.DataFrame(data={"number": [123, 345], "subject_ids": [None, None]})  # placeholder column, to be filled in
B = pd.DataFrame(data={"number": [123, 123, 345, 345], "subject_id": [222, 333, 444, 555]})

Meaning: every number has at least two subject IDs.

How would you merge these dataframes so that the A dataframe gets a "subject_ids" column holding, in a single cell per row, the list of all IDs for that number?

"number": [123, 345], "subject_ids": [[222, 333], [444, 555]]

I've tried lots of methods like this:

A.merge(B, how='left', on='number')

But nothing seems to work. (I couldn't find an answer to this either.)

The number is the key, and the keys are identical in both dataframes; the second dataframe stores the subjects for those numbers. I want the A dataframe to hold, for each number, a list of the associated subject IDs in that number's row. One number can have many subjects.
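To see why a plain merge falls short, here is a minimal sketch on the toy frames above, assuming A carries only the number column. A left merge returns one row per matching subject ID rather than one list per number.

```python
import pandas as pd

# Toy frames modelled on the question; A is assumed to hold only the key column.
A = pd.DataFrame({"number": [123, 345]})
B = pd.DataFrame({"number": [123, 123, 345, 345],
                  "subject_id": [222, 333, 444, 555]})

# Each number in A is repeated once for every matching row in B.
merged = A.merge(B, how="left", on="number")
print(merged)
```

The result has four rows, so the IDs still need to be collapsed into one list per number before or after the merge.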

Complaint dataframe, where I want a column holding the list of all subject IDs associated with the number:

          number  total_complaint_count first_complaint_on last_complaint_on
0     0000000000                     77         2021-10-29        2021-12-05
77   00000000000                      1         2021-11-12        2021-11-12
78  000000000000                      1         2021-11-07        2021-11-07
79   00020056234                      1         2021-11-23        2021-11-23
80    0002266648                      1         2021-11-02        2021-11-02

Subject dataframe, which contains the number to associate with, the subject, and the subject ID:

          number                                            subject  \
787   0000000000                                              Other   
4391  0000000000  Calls pretending to be government, businesses,...   
694   0000000000                     Warranties  & protection plans   
1106  0000000000                                              Other   
4682  0000000000                         Dropped call or no message   

                                subject_id  
787   38d1177e-51e8-4cec-aef8-0112f425091b  
4391  1964fb22-bd20-4d49-beaf-51322a5f5bad  
694   07819535-41b0-44f3-a497-ac2cee16dd1a  
1106  2f348025-3f9f-4861-b151-fbb8a1ac14a3  
4682  15d33ca0-6d90-42ba-9a1d-74e0dcf28539  

Info of both dataframes:

<class 'pandas.core.frame.DataFrame'>
Int64Index: 230122 entries, 0 to 281716
Data columns (total 4 columns):
 #   Column                 Non-Null Count   Dtype 
---  ------                 --------------   ----- 
 0   number                 230122 non-null  object
 1   total_complaint_count  230122 non-null  int64 
 2   first_complaint_on     19 non-null      object
 3   last_complaint_on      19 non-null      object
dtypes: int64(1), object(3)
memory usage: 8.8+ MB
---------------------------------------------------
<class 'pandas.core.frame.DataFrame'>
Int64Index: 281720 entries, 787 to 9377
Data columns (total 3 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   number      281720 non-null  object
 1   subject     281720 non-null  object
 2   subject_id  281720 non-null  object
dtypes: object(3)
memory usage: 8.6+ MB

Pretty sure I have the answer for you this time.

import numpy as np
import pandas as pd

dfa = pd.DataFrame(columns=['number'],
                   data=np.array([[123],
                                  [345]]))
dfb = pd.DataFrame(columns=['number', 'subject ids'],
                   data=np.array([[123, 222],
                                  [123, 333],
                                  [345, 444],
                                  [345, 555]]))

dfa['ids'] = ''  # create an empty 'ids' column in dfa

for x in dfa.itertuples():
    ids = []  # named `ids` to avoid shadowing the built-in `list`
    for a in dfb.itertuples():
        # x[1] is the 'number' value from dfa, a[1] the 'number' value from dfb
        if x[1] == a[1]:
            # on a match, collect the value from dfb's 'subject ids' column
            ids.append(a[2])

    # store the list as a string in 'ids' at the row index where the values match
    dfa.loc[x[0], 'ids'] = str(ids)

print(dfa)

I don't know how to quote the table output here, but the code above is a straight copy-paste of what I ran; it only needs pandas and numpy imported.

I tried to keep the list format, to no avail. I tried using lambda functions, setting the column with .astype(object), and passing dtype=object to the dataframe, along with a bunch of other approaches.

If someone else knows how to keep the list as a list and add it to the dataframe using the code above, I would love to know as well.
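One possible way to keep real lists, sketched under the assumption that the frames look like dfa and dfb above: collect the IDs into a dictionary keyed by number, then map that dictionary onto dfa. The name ids_by_number is a hypothetical helper chosen for this illustration.

```python
import pandas as pd

dfa = pd.DataFrame({"number": [123, 345]})
dfb = pd.DataFrame({"number": [123, 123, 345, 345],
                    "subject ids": [222, 333, 444, 555]})

# Gather the subject IDs per number in a plain dict of lists.
ids_by_number = {}
for row in dfb.itertuples(index=False):
    number, subject_id = row[0], row[1]
    ids_by_number.setdefault(number, []).append(subject_id)

# Mapping the dict onto the key column yields an object column of lists.
dfa["ids"] = dfa["number"].map(ids_by_number)
print(dfa)
```

Because the lists are never converted with str, each cell in the ids column stays a Python list.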

I've always disliked seeing dataframes written out that way because it's difficult to read, so I changed that in mine. If you use pd.concat, it should do what you're asking.

import pandas as pd

# plain nested lists keep each tuple in a single cell;
# recent NumPy versions reject these ragged rows when passed to np.array
a = pd.DataFrame(columns=['number', 'subject ids'],
                 data=[[123, (222, 333)],
                       [345, (444, 555)]])
b = pd.DataFrame(columns=['number', 'subject ids'],
                 data=[[657, (666, 777)],
                       [789, (888, 999)]])

dataframes = [a, b]
a = pd.concat(dataframes)
print(a)

Use a = pd.concat(dataframes, ignore_index=True) if you want to reset the index.
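As a small illustration of the ignore_index option (the names stacked and renumbered are made up for this sketch), the two calls differ only in the resulting index:

```python
import pandas as pd

a = pd.DataFrame({"number": [123, 345],
                  "subject ids": [(222, 333), (444, 555)]})
b = pd.DataFrame({"number": [657, 789],
                  "subject ids": [(666, 777), (888, 999)]})

stacked = pd.concat([a, b])                        # keeps each frame's own index
renumbered = pd.concat([a, b], ignore_index=True)  # builds a fresh 0..n-1 index

print(stacked.index.tolist())     # [0, 1, 0, 1]
print(renumbered.index.tolist())  # [0, 1, 2, 3]
```

Note that pd.concat stacks the rows of frames that already have the same columns; it does not group several rows for one number into a single cell.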

I found a solution in this post: How to implode (reverse of pandas explode) based on a column

I simply grouped by the number column, collected the values of each group into a list, and merged the data frames.

Here is the code if somebody needs it:

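def create_subject_id_column(complaint_df, subject_df, subject_column="subject", number_column="number"):
    # drop() returns a new dataframe, so the caller's subject_df is left untouched
    subject_df = subject_df.drop(columns=subject_column)
    # collapse to one row per number, with all of its subject IDs gathered into a list
    subject_df = (subject_df.groupby(number_column)
                  .agg({'subject_id': lambda x: x.tolist()})
                  .reset_index())
    # an outer merge keeps numbers that appear in only one of the two dataframes
    combined_df = complaint_df.merge(subject_df, how="outer", on=number_column)
    return combined_df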
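For reference, a compact variant of the same idea on the toy frames from the question. This is a sketch, not a drop-in replacement: the names grouped and result and the extra number 999 are illustrative, and it uses a left merge so that only numbers present in A are kept.

```python
import pandas as pd

A = pd.DataFrame({"number": [123, 345, 999]})  # 999 has no subjects
B = pd.DataFrame({"number": [123, 123, 345, 345],
                  "subject_id": [222, 333, 444, 555]})

# One row per number, with that number's subject IDs collected into a list.
grouped = (B.groupby("number")["subject_id"]
             .agg(list)
             .reset_index(name="subject_ids"))

# Left merge: every row of A is kept; numbers without subjects get NaN.
result = A.merge(grouped, how="left", on="number")
print(result)
```

Numbers without any subject end up with NaN in subject_ids, which may need to be replaced with an empty list depending on how the column is used later.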
