How to partially match a list and write matched characters in a data frame in Python

I have two data frames, df1 and df2. I want to match them so that the df2 values that match a column of df1 appear in the same row as that df1 entry. Here is some sample data I made:

import pandas as pd

# initialize list of lists
data = [["AA", 'ABC_111' ], ["BB", 'ABC_112'], ["CC", 'ABC_113']]
data1= [['ABC_111_12'], ['ABC_112_45'], ['ABC_112_89'],['ABC_113_06'], ['ABC_113_25'], ['ABC_113_89']]
result= [['AA' ,'ABC_111', 'ABC_111_12','ABC_111_19'], ['BB', 'ABC_112', "ABC_112_45",'ABC_112_89' ],
         ['CC','ABC_113', 'ABC_113_89','ABC_113_06', 'ABC_113_25', 'ABC_113_29']]

# Create the pandas DataFrame
df1= pd.DataFrame(data, columns = [0, 1])
df2= pd.DataFrame(data1, columns = [0])
result_df = pd.DataFrame(result, columns = [0, 1, 2, 3, 4,5])

# Print the DataFrames.
print("df1: \n", df1)
print("df2: \n", df2)
print("expected_result: \n", result_df)


df1: 
     0        1
0  AA  ABC_111
1  BB  ABC_112
2  CC  ABC_113

df2: 
             0
0  ABC_111_12
1  ABC_112_45
2  ABC_112_89
3  ABC_113_06
4  ABC_113_25
5  ABC_113_89

So my expected result is something like this:

expected_result: 
     0        1           2           3           4           5
0  AA  ABC_111  ABC_111_12  ABC_111_19        None        None
1  BB  ABC_112  ABC_112_45  ABC_112_89        None        None
2  CC  ABC_113  ABC_113_89  ABC_113_06  ABC_113_25  ABC_113_29

I'm going to answer on the assumption that the structure of the text entries reflects the real data you want to work with, since that structure is central to how I would solve this.

The most important step is to identify which parts of each value are static and which are variable.

If they are something like:

AAA_NNN_NN

A=alpha
N=numeric

The number of As or Ns can vary, as long as the number of underscores is fixed. If every entry in the data1 list always contains exactly 2 underscores, then we are in business.
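That structural assumption can be checked before any matching is attempted. The following is only a minimal sketch: the helper name split_key and its expected_underscores parameter are hypothetical, not part of the original post, and it assumes the root is everything before the last underscore.

```python
# Sample values copied from the question.
data1 = [['ABC_111_12'], ['ABC_112_45'], ['ABC_112_89'],
         ['ABC_113_06'], ['ABC_113_25'], ['ABC_113_89']]

def split_key(value, expected_underscores=2):
    """Hypothetical helper: return (root, suffix), or None if the
    value does not contain the expected number of underscores."""
    if value.count("_") != expected_underscores:
        return None
    # Split once from the right: 'ABC_111_12' -> ('ABC_111', '12')
    root, suffix = value.rsplit("_", 1)
    return root, suffix

parsed = [split_key(row[0]) for row in data1]
print(parsed[0])  # ('ABC_111', '12')
```

Any entry for which split_key returns None does not fit the assumed pattern and would need separate handling.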

There are a few ways to go about this, and the final structure would depend on how well you know the data and what shortcuts you can build into the solution, but a slow method would be to do some exact matching on string splits.

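results = []
for i, j in data:
    tmp = [i, j]  # start each row with the label and its root key
    for k in data1:
        h = k[0].split("_")  # e.g. 'ABC_111_12' -> ['ABC', '111', '12']
        if h[1] in j:  # substring test on the numeric part
            tmp.append(k[0])
        else:
            tmp.append(None)  # placeholder for every non-matching entry
    results.append(tmp)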

for i in results:
    print(i)

The condition if h[1] in j: checks whether '111' is in the string 'ABC_111' in the first case, which it is.

This is very crude, but it should give you an idea of exact matching within structured strings. What matters here is the structure. There are many ways to match things, but a lot of it comes down to the data you are working with.
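As one possible refinement of the loop above, the comparison can be made on the whole root rather than on a substring, with the padding applied only at the end of each row. This is a sketch under the assumption that the root is everything before the last underscore; the variable names rows, width, and padded are illustrative.

```python
# Sample values copied from the question.
data = [["AA", "ABC_111"], ["BB", "ABC_112"], ["CC", "ABC_113"]]
data1 = [['ABC_111_12'], ['ABC_112_45'], ['ABC_112_89'],
         ['ABC_113_06'], ['ABC_113_25'], ['ABC_113_89']]

rows = []
for label, root in data:
    # Keep only entries whose text before the last '_' equals the root.
    matches = [k[0] for k in data1 if k[0].rsplit("_", 1)[0] == root]
    rows.append([label, root] + matches)

# Pad shorter rows with None so every row has the same length.
width = max(len(r) for r in rows)
padded = [r + [None] * (width - len(r)) for r in rows]

for r in padded:
    print(r)
```

Comparing full roots avoids accidental hits such as '11' matching inside 'ABC_111', at the cost of assuming the last underscore always separates root from suffix.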

I hope this helps guide you to a solution.

This works for the data provided.

  1. Split the data into the 'root' and the remaining value with rsplit()

  2. Use groupby and agg to put the remaining values in a list

  3. Use apply() to piece the data back together, then expand the lists to columns

  4. Concatenate with df1

    # Split each value once from the right: 'ABC_111_12' -> 'ABC_111', '12'
    df2[['cola', 'colb']] = df2[0].str.rsplit('_', n=1, expand=True)
    # Collect the suffixes for each root into a list
    df3 = df2[['cola', 'colb']].groupby('cola').agg(list).reset_index()
    # Rebuild the full strings from root and suffix
    df3['colb'] = df3.apply(lambda x: [x.cola + '_' + i for i in x.colb], axis=1)
    # Expand each list into its own columns, then join with df1
    df4 = pd.DataFrame(df3['colb'].tolist(), index=df3.index)
    pd.concat([df1, df4], axis=1)

        0        1           0           1           2
    0  AA  ABC_111  ABC_111_12        None        None
    1  BB  ABC_112  ABC_112_45  ABC_112_89        None
    2  CC  ABC_113  ABC_113_06  ABC_113_25  ABC_113_89

Try this using rsplit, groupby, cumcount, set_index and unstack:

dfm = (df2.assign(keystr=df2[0].str.rsplit('_', n=1).str[0])
          .merge(df1, left_on='keystr', right_on=1))
df_out = (dfm.set_index(['0_y', 
                        'keystr', 
                        dfm.groupby(['0_y','keystr'])['0_x'].cumcount()])['0_x']
            .unstack().reset_index())
print(df_out)

Output:

  0_y   keystr           0           1           2
0  AA  ABC_111  ABC_111_12         NaN         NaN
1  BB  ABC_112  ABC_112_45  ABC_112_89         NaN
2  CC  ABC_113  ABC_113_06  ABC_113_25  ABC_113_89
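To see what cumcount contributes here, the following sketch numbers each value within its root group and then uses that number as the column position. It uses pivot instead of set_index and unstack purely for illustration, and the column names value, root, and pos are hypothetical rather than taken from the answer.

```python
import pandas as pd

# Sample values copied from the question; column names are illustrative.
df2 = pd.DataFrame({"value": ["ABC_111_12", "ABC_112_45", "ABC_112_89",
                              "ABC_113_06", "ABC_113_25", "ABC_113_89"]})

# Root key: everything before the last underscore.
df2["root"] = df2["value"].str.rsplit("_", n=1).str[0]

# cumcount numbers the rows inside each group: 0, 1, 2, ...
df2["pos"] = df2.groupby("root").cumcount()
print(df2["pos"].tolist())  # [0, 0, 1, 0, 1, 2]

# Use the per-group position as the column label of the wide table.
wide = df2.pivot(index="root", columns="pos", values="value")
print(wide)
```

Groups with fewer values than the largest group are filled with NaN in the wide table, which matches the NaN cells in the output above.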
