Create a new column in a dataframe if the column contains a string from a column of another dataframe

I want to add a new column to my dataframe: for each row, if the value in the existing column contains any of the values from a column of a second dataframe, the new column should hold the value that matched.

First dataframe

WXYnineZAB
EFGsixHIJ
QRSeightTUV
GHItwoJKL
YZAfiveBCD
EFGsixHIJ
MNOthreePQR
ABConeDEF
MNOthreePQR
MNOthreePQR
YZAfiveBCD
WXYnineZAB
GHItwoJKL
KLMsevenNOP
EFGsixHIJ
ABConeDEF
KLMsevenNOP
QRSeightTUV
STUfourVWX
STUfourVWX
KLMsevenNOP
WXYnineZAB
CDEtenFGH
YZAfiveBCD
CDEtenFGH
QRSeightTUV
ABConeDEF
STUfourVWX
CDEtenFGH
GHItwoJKL

Second dataframe

one
three
five
seven
nine

Output dataframe

WXYnineZAB,nine
EFGsixHIJ,***
QRSeightTUV,***
GHItwoJKL,***
YZAfiveBCD,five
EFGsixHIJ,***
MNOthreePQR,three
ABConeDEF,one
MNOthreePQR,three
MNOthreePQR,three
YZAfiveBCD,five
WXYnineZAB,nine
GHItwoJKL,***
KLMsevenNOP,seven
EFGsixHIJ,***
ABConeDEF,one
KLMsevenNOP,seven
QRSeightTUV,***
STUfourVWX,***
STUfourVWX,***
KLMsevenNOP,seven
WXYnineZAB,nine
CDEtenFGH,***
YZAfiveBCD,five
CDEtenFGH,***
QRSeightTUV,***
ABConeDEF,one
STUfourVWX,***
CDEtenFGH,***
GHItwoJKL,***

To keep the example simple, I built every value in the first dataframe as 3 characters + search string + 3 characters, but my actual file has no such consistency.

Source DFs:

In [172]: d1
Out[172]:
            txt
0    WXYnineZAB
1     EFGsixHIJ
2   QRSeightTUV
3     GHItwoJKL
4    YZAfiveBCD
..          ...
25  QRSeightTUV
26    ABConeDEF
27   STUfourVWX
28    CDEtenFGH
29    GHItwoJKL

[30 rows x 1 columns]

In [173]: d2
Out[173]:
    word
0    one
1  three
2   five
3  seven
4   nine

Generate a RegEx pattern from the second DataFrame:

In [174]: pat = r'({})'.format(d2['word'].str.cat(sep='|'))

In [175]: pat
Out[175]: '(one|three|five|seven|nine)'
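
The pattern above joins the words as-is, which works for plain alphabetic words. As a hedged sketch, assuming the real word list might contain RegEx metacharacters such as `.` or `+`, each word can be passed through `re.escape` before joining. The sample values below are hypothetical and not from the original data:

```python
import re
import pandas as pd

# Hypothetical data: the words contain '.' and '+', which are special in RegEx.
d1 = pd.DataFrame({'txt': ['xxa.byy', 'xxaXbyy', 'zzc+dzz']})
d2 = pd.DataFrame({'word': ['a.b', 'c+d']})

# Escape every word so it is matched literally, then join with '|'.
pat = '({})'.format('|'.join(re.escape(w) for w in d2['word']))

# Same extraction step as in the answer; rows without a match become NaN.
d1['new'] = d1['txt'].str.extract(pat, expand=False)
print(d1)
```

Without the escaping, `a.b` would also match `aXb`, because an unescaped `.` matches any character.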

Extract the words matching the RegEx pattern and assign them to a new column:

In [176]: d1['new'] = d1['txt'].str.extract(pat, expand=False)

In [177]: d1
Out[177]:
            txt   new
0    WXYnineZAB  nine
1     EFGsixHIJ   NaN
2   QRSeightTUV   NaN
3     GHItwoJKL   NaN
4    YZAfiveBCD  five
..          ...   ...
25  QRSeightTUV   NaN
26    ABConeDEF   one
27   STUfourVWX   NaN
28    CDEtenFGH   NaN
29    GHItwoJKL   NaN

[30 rows x 2 columns]

You can also fill the NaNs in the same step if you want:

In [178]: d1['new'] = d1['txt'].str.extract(pat, expand=False).fillna('***')

In [179]: d1
Out[179]:
            txt   new
0    WXYnineZAB  nine
1     EFGsixHIJ   ***
2   QRSeightTUV   ***
3     GHItwoJKL   ***
4    YZAfiveBCD  five
..          ...   ...
25  QRSeightTUV   ***
26    ABConeDEF   one
27   STUfourVWX   ***
28    CDEtenFGH   ***
29    GHItwoJKL   ***

[30 rows x 2 columns]
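
The session above does not show how `d1` and `d2` were built. A minimal self-contained sketch of the same steps, assuming the two frames are constructed by hand from the first five rows of the question's data, might look like this:

```python
import pandas as pd

# Assumed construction of the source frames (first five rows of the question's data).
d1 = pd.DataFrame({'txt': ['WXYnineZAB', 'EFGsixHIJ', 'QRSeightTUV',
                           'GHItwoJKL', 'YZAfiveBCD']})
d2 = pd.DataFrame({'word': ['one', 'three', 'five', 'seven', 'nine']})

# Build '(one|three|five|seven|nine)' from the second frame.
pat = r'({})'.format(d2['word'].str.cat(sep='|'))

# Extract the first matching word per row; rows with no match get '***'.
d1['new'] = d1['txt'].str.extract(pat, expand=False).fillna('***')
print(d1)
```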

If you want to avoid RegEx, here is a purely list-based solution:

import pandas as pd

# Sample DataFrames (structure is borrowed from MaxU's answer)
d1 = pd.DataFrame({'txt':['WXYnineZAB','EFGsixHIJ','QRSeightTUV','GHItwoJKL']})
d2 = pd.DataFrame({'word':['two','six']})

# Return the word contained in txt when exactly one word matches, else '***'.
def match_word(txt, words, default='***'):
    hits = [word for word in words if word in txt]
    return hits[0] if len(hits) == 1 else default

exists = [match_word(txt, d2.word) for txt in d1.txt]
# Resulting output
res = pd.DataFrame({'text': d1.txt, 'word': exists})

Result:

          text word
0   WXYnineZAB  ***
1    EFGsixHIJ  six
2  QRSeightTUV  ***
3    GHItwoJKL  two
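
Note that `str.extract` keeps only the first match in each row, and the list-based version falls back to '***' when more than one word matches. If rows in the real file can contain several of the words, one possible variation (a sketch on hypothetical sample rows, not part of either answer) is to collect all matches with `str.findall` and join them:

```python
import pandas as pd

# Hypothetical rows: the first one contains two of the search words.
d1 = pd.DataFrame({'txt': ['ABConeDEFfiveGHI', 'EFGsixHIJ', 'WXYnineZAB']})
d2 = pd.DataFrame({'word': ['one', 'three', 'five', 'seven', 'nine']})

# Plain alternation without a capture group: findall returns the full matches.
pat = '|'.join(d2['word'])

# One list of matches per row, joined into a comma-separated string;
# rows with no match yield an empty string, which is replaced by '***'.
matches = d1['txt'].str.findall(pat)
d1['new'] = matches.str.join(',').replace('', '***')
print(d1)
```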
