Create a new column in a DataFrame if a column contains a string from another DataFrame's column

Question

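I want to create a new column in a DataFrame that holds the matching value whenever the existing column contains any value from a column of a second DataFrame.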

First DataFrame

WXYnineZAB
EFGsixHIJ
QRSeightTUV
GHItwoJKL
YZAfiveBCD
EFGsixHIJ
MNOthreePQR
ABConeDEF
MNOthreePQR
MNOthreePQR
YZAfiveBCD
WXYnineZAB
GHItwoJKL
KLMsevenNOP
EFGsixHIJ
ABConeDEF
KLMsevenNOP
QRSeightTUV
STUfourVWX
STUfourVWX
KLMsevenNOP
WXYnineZAB
CDEtenFGH
YZAfiveBCD
CDEtenFGH
QRSeightTUV
ABConeDEF
STUfourVWX
CDEtenFGH
GHItwoJKL

Second DataFrame

one
three
five
seven
nine

Output DataFrame

WXYnineZAB,nine
EFGsixHIJ,***
QRSeightTUV,***
GHItwoJKL,***
YZAfiveBCD,five
EFGsixHIJ,***
MNOthreePQR,three
ABConeDEF,one
MNOthreePQR,three
MNOthreePQR,three
YZAfiveBCD,five
WXYnineZAB,nine
GHItwoJKL,***
KLMsevenNOP,seven
EFGsixHIJ,***
ABConeDEF,one
KLMsevenNOP,seven
QRSeightTUV,***
STUfourVWX,***
STUfourVWX,***
KLMsevenNOP,seven
WXYnineZAB,nine
CDEtenFGH,***
YZAfiveBCD,five
CDEtenFGH,***
QRSeightTUV,***
ABConeDEF,one
STUfourVWX,***
CDEtenFGH,***
GHItwoJKL,***

For ease of explanation, I built the first DataFrame as 3 chars + search string + 3 chars, but my actual file has no such consistency.

Answer 1

Source DataFrames:

In [172]: d1
Out[172]:
            txt
0    WXYnineZAB
1     EFGsixHIJ
2   QRSeightTUV
3     GHItwoJKL
4    YZAfiveBCD
..          ...
25  QRSeightTUV
26    ABConeDEF
27   STUfourVWX
28    CDEtenFGH
29    GHItwoJKL

[30 rows x 1 columns]

In [173]: d2
Out[173]:
    word
0    one
1  three
2   five
3  seven
4   nine

Build a RegEx pattern from the second DataFrame:

In [174]: pat = r'({})'.format(d2['word'].str.cat(sep='|'))

In [175]: pat
Out[175]: '(one|three|five|seven|nine)'
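
If the search words can contain RegEx metacharacters (for example `.` or `+`), the pattern built above would treat them as operators rather than literal text. A minimal sketch, assuming hypothetical words such as `c++` that do not appear in the question, escapes each word with `re.escape` and tries longer words first so that a short word does not shadow a longer one starting with it:

```python
import re
import pandas as pd

# Hypothetical search words containing RegEx metacharacters (not from the question)
d2 = pd.DataFrame({'word': ['c++', 'c', 'a.b']})

# Escape every word and put longer words first, so 'c++' is tried before 'c'
words = sorted(d2['word'], key=len, reverse=True)
pat = '({})'.format('|'.join(re.escape(w) for w in words))

# Hypothetical text column; 'XXaxbYY' must not match the literal word 'a.b'
d1 = pd.DataFrame({'txt': ['XXc++YY', 'XXcYY', 'XXaxbYY']})
d1['new'] = d1['txt'].str.extract(pat, expand=False).fillna('***')
print(d1)
```

For plain alphabetic words like the ones in the question, escaping changes nothing, so this variant should be safe to use in either case.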

Extract the word that matches the RegEx pattern and assign it as a new column:

In [176]: d1['new'] = d1['txt'].str.extract(pat, expand=False)

In [177]: d1
Out[177]:
            txt   new
0    WXYnineZAB  nine
1     EFGsixHIJ   NaN
2   QRSeightTUV   NaN
3     GHItwoJKL   NaN
4    YZAfiveBCD  five
..          ...   ...
25  QRSeightTUV   NaN
26    ABConeDEF   one
27   STUfourVWX   NaN
28    CDEtenFGH   NaN
29    GHItwoJKL   NaN

[30 rows x 2 columns]

You can also fill the NaN values in the same step:

In [178]: d1['new'] = d1['txt'].str.extract(pat, expand=False).fillna('***')

In [179]: d1
Out[179]:
            txt   new
0    WXYnineZAB  nine
1     EFGsixHIJ   ***
2   QRSeightTUV   ***
3     GHItwoJKL   ***
4    YZAfiveBCD  five
..          ...   ...
25  QRSeightTUV   ***
26    ABConeDEF   one
27   STUfourVWX   ***
28    CDEtenFGH   ***
29    GHItwoJKL   ***

[30 rows x 2 columns]
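
The session above does not show how `d1` and `d2` were constructed. The following is a self-contained sketch of the same steps, assuming the column names `txt` and `word` from the session and using only a few rows from the question's data:

```python
import pandas as pd

# A few rows taken from the question's first DataFrame
d1 = pd.DataFrame({'txt': ['WXYnineZAB', 'EFGsixHIJ', 'YZAfiveBCD', 'ABConeDEF']})
d2 = pd.DataFrame({'word': ['one', 'three', 'five', 'seven', 'nine']})

# Join the search words into a single alternation group: '(one|three|...)'
pat = r'({})'.format(d2['word'].str.cat(sep='|'))

# Extract the first matching word per row; rows without a match become '***'
d1['new'] = d1['txt'].str.extract(pat, expand=False).fillna('***')
print(d1)
```

Note that `str.extract` returns only the first match in each string, so a row containing several of the search words gets the one that appears earliest in the text.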

Answer 2

If you want to avoid RegEx, here is a purely list-based solution:

import pandas as pd

# Sample DataFrames (structure is borrowed from MaxU)
d1 = pd.DataFrame({'txt':['WXYnineZAB','EFGsixHIJ','QRSeightTUV','GHItwoJKL']})
d2 = pd.DataFrame({'word':['two','six']})
# For each txt, take the matching word if exactly one word matches, else '***' (1-liner).
exists = [list(d2.word[[word in txt for word in d2.word]])[0] if sum([word in txt for word in d2.word]) == 1 else '***' for txt in d1.txt]
# Resulting output
res = pd.DataFrame(list(zip(d1.txt, exists)), columns=['text', 'word'])

Result:

          text word
0   WXYnineZAB  ***
1    EFGsixHIJ  six
2  QRSeightTUV  ***
3    GHItwoJKL  two
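
The one-liner is dense, so a more readable variant may help. This is a sketch, not the answer's exact logic: the helper name `first_match` is hypothetical, and it returns the first word found, whereas the one-liner yields `'***'` when more than one word matches.

```python
import pandas as pd

def first_match(txt, words, default='***'):
    """Return the first word that occurs in txt as a substring, else default."""
    return next((w for w in words if w in txt), default)

d1 = pd.DataFrame({'txt': ['WXYnineZAB', 'EFGsixHIJ', 'QRSeightTUV', 'GHItwoJKL']})
d2 = pd.DataFrame({'word': ['two', 'six']})

# Convert the search column to a plain list once, then scan each text
words = d2['word'].tolist()
d1['word'] = [first_match(txt, words) for txt in d1['txt']]
print(d1)
```

Because `next` stops at the first hit, each text is scanned only until a word is found instead of being tested against every word twice.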
