Create a new column in a dataframe if the column contains a string from a column of another dataframe
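I want to create a new column in a dataframe that holds the matching value whenever the existing column contains any of the values from a column of a second dataframe.
First dataframe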
WXYnineZAB
EFGsixHIJ
QRSeightTUV
GHItwoJKL
YZAfiveBCD
EFGsixHIJ
MNOthreePQR
ABConeDEF
MNOthreePQR
MNOthreePQR
YZAfiveBCD
WXYnineZAB
GHItwoJKL
KLMsevenNOP
EFGsixHIJ
ABConeDEF
KLMsevenNOP
QRSeightTUV
STUfourVWX
STUfourVWX
KLMsevenNOP
WXYnineZAB
CDEtenFGH
YZAfiveBCD
CDEtenFGH
QRSeightTUV
ABConeDEF
STUfourVWX
CDEtenFGH
GHItwoJKL
Second dataframe
one
three
five
seven
nine
Output dataframe
WXYnineZAB,nine
EFGsixHIJ,***
QRSeightTUV,***
GHItwoJKL,***
YZAfiveBCD,five
EFGsixHIJ,***
MNOthreePQR,three
ABConeDEF,one
MNOthreePQR,three
MNOthreePQR,three
YZAfiveBCD,five
WXYnineZAB,nine
GHItwoJKL,***
KLMsevenNOP,seven
EFGsixHIJ,***
ABConeDEF,one
KLMsevenNOP,seven
QRSeightTUV,***
STUfourVWX,***
STUfourVWX,***
KLMsevenNOP,seven
WXYnineZAB,nine
CDEtenFGH,***
YZAfiveBCD,five
CDEtenFGH,***
QRSeightTUV,***
ABConeDEF,one
STUfourVWX,***
CDEtenFGH,***
GHItwoJKL,***
For ease of explanation, I laid out the first dataframe as 3 chars + search string + 3 chars, but my actual file has no such consistency.
Source DFs:
In [172]: d1
Out[172]:
            txt
0    WXYnineZAB
1     EFGsixHIJ
2   QRSeightTUV
3     GHItwoJKL
4    YZAfiveBCD
..          ...
25  QRSeightTUV
26    ABConeDEF
27   STUfourVWX
28    CDEtenFGH
29    GHItwoJKL
[30 rows x 1 columns]
In [173]: d2
Out[173]:
    word
0    one
1  three
2   five
3  seven
4   nine
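The session above shows `d1` and `d2` without the code that created them. A minimal sketch of how they could be built from the data in the question follows; the column names `txt` and `word` are taken from the printed output, and everything else is an assumption about how the frames were constructed.

```python
import pandas as pd

# The 30 strings from the question's first dataframe, in the same order.
d1 = pd.DataFrame({"txt": [
    "WXYnineZAB", "EFGsixHIJ", "QRSeightTUV", "GHItwoJKL", "YZAfiveBCD",
    "EFGsixHIJ", "MNOthreePQR", "ABConeDEF", "MNOthreePQR", "MNOthreePQR",
    "YZAfiveBCD", "WXYnineZAB", "GHItwoJKL", "KLMsevenNOP", "EFGsixHIJ",
    "ABConeDEF", "KLMsevenNOP", "QRSeightTUV", "STUfourVWX", "STUfourVWX",
    "KLMsevenNOP", "WXYnineZAB", "CDEtenFGH", "YZAfiveBCD", "CDEtenFGH",
    "QRSeightTUV", "ABConeDEF", "STUfourVWX", "CDEtenFGH", "GHItwoJKL",
]})

# The five search words from the question's second dataframe.
d2 = pd.DataFrame({"word": ["one", "three", "five", "seven", "nine"]})
```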
Generate a RegEx pattern from the second DataFrame:
In [174]: pat = r'({})'.format(d2['word'].str.cat(sep='|'))
In [175]: pat
Out[175]: '(one|three|five|seven|nine)'
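Joining the words with `|` works for plain alphabetic words like these. If the real search strings might contain RegEx metacharacters (`.`, `+`, `(` and so on), or if one word is a prefix of another, a slightly more defensive pattern builder may help. The sketch below is an assumption about such data, not part of the answer; `build_pattern` is a hypothetical helper name.

```python
import re
import pandas as pd

def build_pattern(words):
    """Return one capture group that matches any of `words` literally."""
    # Drop duplicates while keeping the original order.
    unique = list(dict.fromkeys(words))
    # Try longer words first, so a longer word wins over its own prefix.
    ordered = sorted(unique, key=len, reverse=True)
    # re.escape neutralises characters that have a special meaning in RegEx.
    return "({})".format("|".join(re.escape(w) for w in ordered))

d2 = pd.DataFrame({"word": ["one", "three", "five", "seven", "nine"]})
pat = build_pattern(d2["word"])
```

The resulting `pat` can be passed to `str.extract` in the same way as the pattern built with `str.cat`.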
Extract the word that matches the RegEx pattern and assign it as a new column:
In [176]: d1['new'] = d1['txt'].str.extract(pat, expand=False)
In [177]: d1
Out[177]:
            txt   new
0    WXYnineZAB  nine
1     EFGsixHIJ   NaN
2   QRSeightTUV   NaN
3     GHItwoJKL   NaN
4    YZAfiveBCD  five
..          ...   ...
25  QRSeightTUV   NaN
26    ABConeDEF   one
27   STUfourVWX   NaN
28    CDEtenFGH   NaN
29    GHItwoJKL   NaN
[30 rows x 2 columns]
You can also fill the NaNs in the same step:
In [178]: d1['new'] = d1['txt'].str.extract(pat, expand=False).fillna('***')
In [179]: d1
Out[179]:
            txt   new
0    WXYnineZAB  nine
1     EFGsixHIJ   ***
2   QRSeightTUV   ***
3     GHItwoJKL   ***
4    YZAfiveBCD  five
..          ...   ...
25  QRSeightTUV   ***
26    ABConeDEF   one
27   STUfourVWX   ***
28    CDEtenFGH   ***
29    GHItwoJKL   ***
[30 rows x 2 columns]
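Note that `str.extract` returns only the first match in each row. Since the question says the real file is less regular than the sample, a row might contain more than one of the words. A possible variation, sketched here under that assumption, collects every match with `str.findall`; the third row below is hypothetical and does not appear in the question's data.

```python
import pandas as pd

d1 = pd.DataFrame({"txt": [
    "WXYnineZAB",
    "EFGsixHIJ",
    "ABConeXYZnine",  # hypothetical row containing two of the search words
]})
pat = "(one|three|five|seven|nine)"

# findall yields a list of matches per row; join them into one string,
# then replace the empty string (no match at all) with the placeholder.
d1["new"] = d1["txt"].str.findall(pat).str.join(",").replace("", "***")
```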
If you want to avoid RegEx, here is a purely list-based solution:
import pandas as pd

# Sample DataFrames (structure is borrowed from MaxU)
d1 = pd.DataFrame({'txt': ['WXYnineZAB', 'EFGsixHIJ', 'QRSeightTUV', 'GHItwoJKL']})
d2 = pd.DataFrame({'word': ['two', 'six']})
# For each txt, collect the words it contains.
matches = [[word for word in d2.word if word in txt] for txt in d1.txt]
# Keep the word only when exactly one matches; otherwise use the placeholder.
exists = [m[0] if len(m) == 1 else '***' for m in matches]
# Resulting output
res = pd.DataFrame(list(zip(d1.txt, exists)), columns=['text', 'word'])
Result:
          text word
0   WXYnineZAB  ***
1    EFGsixHIJ  six
2  QRSeightTUV  ***
3    GHItwoJKL  two
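The list-based solution can probably be made easier to read with `next` and a generator expression. This is only a sketch: `first_match` is a hypothetical helper name, and its behaviour differs from the answer above in one respect. It returns the first matching word even when several words match, whereas the answer's version falls back to `'***'` unless exactly one word matches.

```python
import pandas as pd

d1 = pd.DataFrame({"txt": ["WXYnineZAB", "EFGsixHIJ", "QRSeightTUV", "GHItwoJKL"]})
d2 = pd.DataFrame({"word": ["two", "six"]})

def first_match(txt, words, default="***"):
    """Return the first word found inside txt, or default if none is found."""
    return next((w for w in words if w in txt), default)

# Convert once to a plain list so the Series is not re-read for every row.
words = d2["word"].tolist()
d1["word"] = [first_match(txt, words) for txt in d1["txt"]]
```

Because `next` stops at the first hit, this avoids scanning the remaining words once a match has been found.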