How can I join two dataframes where one column holds two or more values
I have two dataframes similar to this:
A = pd.DataFrame(data={"number": [123, 345], "subject_ids": [None, None]})  # placeholder column, to be filled
B = pd.DataFrame(data={"number": [123, 123, 345, 345], "subject_id": [222, 333, 444, 555]})
Meaning: every number has at least two subject ids.
How would you go about merging these dataframes so that the A dataframe has a "subject_ids" column containing, in one cell, the list of ids that belong to each number?
"number": [123, 345], "subject_ids": [[222, 333], [444, 555]]
I've tried lots of methods like this:
A.merge(B, how='left', on='number')
But nothing seems to work. (I couldn't find an answer to this either.)
The number is a key, and the keys are identical in both dataframes; the second dataframe stores the subjects for those numbers. I want the A dataframe to hold, in a single row per number, a list of the subject IDs associated with that number. One number can have many subjects.
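For reference, here is a minimal sketch of what the plain left merge does with the toy frames above (the placeholder "subject_ids" column is left out of A here for simplicity). It seems to return one row per matching subject rather than one list per number, which is why it does not give the desired shape on its own:

```python
import pandas as pd

A = pd.DataFrame({"number": [123, 345]})
B = pd.DataFrame({"number": [123, 123, 345, 345],
                  "subject_id": [222, 333, 444, 555]})

# A left merge repeats each row of A once per matching row in B,
# so the result has four rows instead of two.
merged = A.merge(B, how="left", on="number")
print(merged)
```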
Complaint dataframe, where I want a column with the list of all subject IDs associated with the number:
          number  total_complaint_count first_complaint_on last_complaint_on
0     0000000000                     77         2021-10-29        2021-12-05
77   00000000000                      1         2021-11-12        2021-11-12
78  000000000000                      1         2021-11-07        2021-11-07
79   00020056234                      1         2021-11-23        2021-11-23
80    0002266648                      1         2021-11-02        2021-11-02
Subject dataframe, which contains the number to associate with, the subject, and the subject ID:
          number                                            subject                            subject_id
787   0000000000                                              Other  38d1177e-51e8-4cec-aef8-0112f425091b
4391  0000000000  Calls pretending to be government, businesses,...  1964fb22-bd20-4d49-beaf-51322a5f5bad
694   0000000000                      Warranties & protection plans  07819535-41b0-44f3-a497-ac2cee16dd1a
1106  0000000000                                              Other  2f348025-3f9f-4861-b151-fbb8a1ac14a3
4682  0000000000                         Dropped call or no message  15d33ca0-6d90-42ba-9a1d-74e0dcf28539
Info of both dataframes:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 230122 entries, 0 to 281716
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 number 230122 non-null object
1 total_complaint_count 230122 non-null int64
2 first_complaint_on 19 non-null object
3 last_complaint_on 19 non-null object
dtypes: int64(1), object(3)
memory usage: 8.8+ MB
---------------------------------------------------
<class 'pandas.core.frame.DataFrame'>
Int64Index: 281720 entries, 787 to 9377
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 number 281720 non-null object
1 subject 281720 non-null object
2 subject_id 281720 non-null object
dtypes: object(3)
memory usage: 8.6+ MB
Pretty sure I have the answer for you this time.
# assumes: import numpy as np; import pandas as pd
dfa = pd.DataFrame(columns=['number'],
                   data=np.array([[123],
                                  [345]]))
dfb = pd.DataFrame(columns=['number', 'subject ids'],
                   data=np.array([[123, 222],
                                  [123, 333],
                                  [345, 444],
                                  [345, 555]]))
dfa['ids'] = ''  # create the ids column in dfa
for x in dfa.itertuples():
    ids = []  # renamed from `list` to avoid shadowing the built-in
    for a in dfb.itertuples():
        # x[1] is the 'number' value from dfa, a[1] is the one from dfb
        if x[1] == a[1]:
            # the values match: take the 'subject ids' value from dfb
            # and add it to the list
            print(a[2])
            ids.append(a[2])
    slist = str(ids)  # change the list to a string
    dfa.loc[x[0], 'ids'] = slist  # write to the 'ids' column at the row where the values match
print(dfa)
I don't know how to quote the table output, but the code above is a direct copy-paste, aside from the imports.
I tried to keep the list format, to no avail. I tried using lambda functions, setting the column with .astype(object), and passing dtype=object to the dataframe, along with a bunch of other ways.
If someone else knows how to keep the list as a list and add it to the dataframe using the code above, I would love to know as well.
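One possible way to keep real lists, sketched under the assumption that the target column has object dtype: write each list with .at, which sets a single cell, instead of .loc. The frames are rebuilt from plain dicts here, and the boolean-mask lookup stands in for the inner loop; both are choices made for this illustration, not part of the answer above.

```python
import pandas as pd

dfa = pd.DataFrame({"number": [123, 345]})
dfb = pd.DataFrame({"number": [123, 123, 345, 345],
                    "subject ids": [222, 333, 444, 555]})

# An object-dtype column can hold arbitrary Python objects, including lists.
dfa["ids"] = pd.Series([None] * len(dfa), index=dfa.index, dtype=object)

for idx, number in dfa["number"].items():
    # Collect every subject id in dfb whose number matches this row.
    ids = dfb.loc[dfb["number"] == number, "subject ids"].tolist()
    # .at assigns to exactly one cell, so the list is stored as a single value.
    dfa.at[idx, "ids"] = ids

print(dfa)
```

Looping row by row is likely to be slow on frames with hundreds of thousands of rows, so this is best treated as a fix for the loop above rather than a recommendation for large data.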
I've always disliked seeing dataframes written out that way because it's difficult to read, so I changed that in mine. If you use pd.concat it should do what you're asking.
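# assumes: import pandas as pd
# Plain nested lists are used instead of np.array, because recent NumPy
# versions refuse to build an array from rows that mix scalars and tuples.
a = pd.DataFrame(columns=['number', 'subject ids'],
                 data=[[123, (222, 333)],
                       [345, (444, 555)]])
b = pd.DataFrame(columns=['number', 'subject ids'],
                 data=[[657, (666, 777)],
                       [789, (888, 999)]])
dataframes = [a, b]
a = pd.concat(dataframes)
print(a)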
Use a = pd.concat(dataframes, ignore_index=True) if you want to reset the index.
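A small sketch of the difference, using two made-up single-column frames: without ignore_index the original row labels are kept, so they repeat; with it, the result is renumbered from zero.

```python
import pandas as pd

a = pd.DataFrame({"number": [123, 345]})
b = pd.DataFrame({"number": [657, 789]})

kept = pd.concat([a, b])                            # index labels: 0, 1, 0, 1
renumbered = pd.concat([a, b], ignore_index=True)   # index labels: 0, 1, 2, 3

print(kept)
print(renumbered)
```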
I found a solution in this post: How to implode (reverse of pandas explode) based on a column
I simply grouped by the number column, added the values to the list, and merged the data frames.我只是按数字列分组,将值添加到列表中,然后合并数据框。
Here is the code if somebody needs it:
def create_subject_id_column(complaint_df, subject_df, subject_column="subject", number_column="number"):
    subject_df = subject_df.copy()
    subject_df.drop(subject_column, axis=1, inplace=True)
    subject_df = (subject_df.groupby(number_column)
                  .agg({'subject_id': lambda x: x.tolist()})
                  .reset_index())
    combined_df = complaint_df.merge(subject_df, how="outer", on=number_column)
    return combined_df
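The same idea applied to the toy frames from the question, as a minimal sketch. It uses a left merge rather than an outer one, on the assumption that only the numbers already present in A should appear in the result; the name subject_ids is taken from the question's expected output.

```python
import pandas as pd

A = pd.DataFrame({"number": [123, 345]})
B = pd.DataFrame({"number": [123, 123, 345, 345],
                  "subject_id": [222, 333, 444, 555]})

# Collapse B to one row per number, gathering its subject ids into a list.
subject_ids = (B.groupby("number")["subject_id"]
                .agg(list)
                .rename("subject_ids")
                .reset_index())

# Attach the lists to A; numbers with no subjects would get NaN.
result = A.merge(subject_ids, how="left", on="number")
print(result)
```

With how="outer", numbers that appear only in the subject dataframe would also be added as extra rows, so the choice depends on whether those are wanted.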