How to join a column of lists in one dataframe with a column of strings in another dataframe?

I have two dataframes. The first one (let's call it A) has a column (let's call it 'col1') whose elements are lists of strings. The other one (let's call it B) has a column (let's call it 'col2') whose elements are strings. I want to join these two dataframes on the condition that B.col2 is in the list in A.col1. This is a one-to-many join.

Also, I need the solution to be scalable, since I want to join two dataframes with hundreds of thousands of rows.

I have tried concatenating the values in A.col1 into a new column (let's call it 'col3') and joining on this condition: A.col3.contains(B.col2). However, my understanding is that this condition triggers a cartesian product between the two dataframes, which I cannot afford given their size.

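from pyspark.sql.functions import udf

# Concatenate the list of ids into one delimited string
def joinIds(IdList):
  return "__".join(IdList)
joinIds_udf = udf(joinIds)

pnr_corr = pnr_corr.withColumn('joinedIds', joinIds_udf(pnr_corr.pnrCorrelations.correlationPnrSchedule.scheduleIds))

# Non-equi join: keep pairs where joinedIds contains skd.id as a substring
pnr_corr_skd = pnr_corr.join(skd, pnr_corr.joinedIds.contains(skd.id), how='inner')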
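
To make the cost of the substring condition concrete, here is a minimal pure-Python sketch (no Spark involved). The sample ids and the helper name `substring_matches` are made up for illustration and are not from the question. It only emulates the idea: a `contains`-style test compares every row of one side with every row of the other, and it can also match ids that are not actual elements of the list.

```python
# Hypothetical sample data, not taken from the question.
list_rows = [["ab", "c"], ["a", "d"]]   # plays the role of A.col1
ids = ["a", "c"]                        # plays the role of B.col2

def substring_matches(list_rows, ids, sep="__"):
    """Emulate a join on joined.contains(id) with nested loops."""
    pairs, comparisons = [], 0
    for row in list_rows:
        joined = sep.join(row)
        for value in ids:               # every row is compared with every id
            comparisons += 1
            if value in joined:         # substring test, not list membership
                pairs.append((row, value))
    return pairs, comparisons

pairs, comparisons = substring_matches(list_rows, ids)
print(comparisons)  # equals len(list_rows) * len(ids)
print(pairs)
```

In this sketch `"a"` is paired with `["ab", "c"]` because `"a"` is a substring of `"ab__c"`, even though it is not an element of that list. So the concatenation trick may produce wrong matches in addition to being expensive.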

This is a sample of the join that I have in mind:

dataframe A:
listColumn
["a","b","c"]
["a","b"]
["d","e"]

dataframe B:
valueColumn
a
b
d

output:
listColumn      valueColumn
["a","b","c"]   a
["a","b","c"]   b
["a","b"]       a
["a","b"]       b
["d","e"]       d

I don't know if there is an efficient way to do it, but this gives the correct output:

import pandas as pd
from itertools import chain

df1 = pd.Series([["a","b","c"],["a","b"],["d","e"]])
df2 = pd.Series(["a","b","d"])

result = [ [ [el2,list1] for el2 in df2.values if el2 in list1 ] 
                         for list1 in df1.values ]
result_flat = list(chain(*result))

result_df = pd.DataFrame(result_flat)

You get:

In [26]: result_df
Out[26]:
   0          1
0  a  [a, b, c]
1  b  [a, b, c]
2  a     [a, b]
3  b     [a, b]
4  d     [d, e]
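
The nested comprehension above tests every element of `df2` against every list in `df1`. A possible variation, sketched under the assumption that the values in `df2` are hashable and unique, is to put them in a `set` first and walk each list once, so the work grows with the total number of list elements rather than with the product of the two sizes. The names `lookup` and `pairs` and the output column names are illustrative only.

```python
import pandas as pd

df1 = pd.Series([["a", "b", "c"], ["a", "b"], ["d", "e"]])
df2 = pd.Series(["a", "b", "d"])

# Build the lookup once; set membership tests are O(1) on average.
# Assumption: df2 has no duplicates (a set would silently collapse them).
lookup = set(df2)

# One pass over every list element; keep only elements present in df2.
pairs = [(el, list1) for list1 in df1 for el in list1 if el in lookup]

result_df = pd.DataFrame(pairs, columns=["valueColumn", "listColumn"])
print(result_df)
```

This still runs in a single Python process, so it is only a sketch of the idea and not a distributed solution.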

Another approach is to use the explode() method introduced in pandas 0.25 and merge like this:

import pandas as pd

df1 = pd.DataFrame({'col1': [["a","b","c"],["a","b"],["d","e"]]})
df2 = pd.DataFrame({'col2': ["a","b","d"]})

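# One row per list element; the original row label is kept in an 'index' column
df1_flat = df1.col1.explode().reset_index()
df_merged = pd.merge(df1_flat,df2,left_on='col1',right_on='col2')

# Overwrite 'col2' with the original list each matched element came from
df_merged['col2'] = df1['col1'].loc[df_merged['index']].values
df_merged.drop('index',axis=1, inplace=True)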

This gives the same rows (the row order can differ between pandas versions):

  col1       col2
0    a  [a, b, c]
1    a     [a, b]
2    b  [a, b, c]
3    b     [a, b]
4    d     [d, e]
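
A slightly different way to write the same idea, offered as a sketch and not as the answerer's method, is to explode a copy of the list column so that every output row still carries its original list. The helper column name `key` is made up here; `listColumn` and `valueColumn` follow the naming of the question's sample. PySpark has a function of the same name, `pyspark.sql.functions.explode`, so the same shape of solution may carry over there, though that is not tested here.

```python
import pandas as pd

df1 = pd.DataFrame({"col1": [["a", "b", "c"], ["a", "b"], ["d", "e"]]})
df2 = pd.DataFrame({"col2": ["a", "b", "d"]})

# Copy the lists into a helper column and explode only that copy,
# so 'col1' keeps the full list on every exploded row.
exploded = df1.assign(key=df1["col1"]).explode("key")

# Plain equality join between the exploded elements and the strings.
joined = exploded.merge(df2, left_on="key", right_on="col2", how="inner")
joined = joined.drop(columns="key").rename(
    columns={"col1": "listColumn", "col2": "valueColumn"}
)
print(joined)
```

Because the join condition is now a plain equality on one column, no substring test and no all-pairs comparison is needed.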

How about:

df['col1'] = [df['col1'].values[i] + [df['col2'].values[i]] for i in range(len(df))]

Where 'col1' is the column of lists of strings and 'col2' is the column of strings.

You can also drop 'col2' if you don't need it anymore with:

df = df.drop('col2',axis=1)
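
As a small illustration of what this last snippet does, here is a sketch on a made-up frame (the sample values are hypothetical). It assumes both columns already sit side by side in a single dataframe, so it appends each row's string to that row's list and does not match rows across two separate dataframes.

```python
import pandas as pd

# Hypothetical single frame holding both columns side by side.
df = pd.DataFrame({"col1": [["a", "b"], ["d"]], "col2": ["c", "e"]})

# Append each row's string to that row's list (this builds new lists).
df["col1"] = [lst + [s] for lst, s in zip(df["col1"], df["col2"])]
df = df.drop("col2", axis=1)
print(df["col1"].tolist())
```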
