match strings in list and DF column and put into new DF column

using python, pandas

I have a dataframe with three columns and about a million rows. The third column contains strings. I want to select a subset of these strings that match the strings in a list and put them in a fourth column.

Here is an example of a string from the dataframe:

"BW - Jl 8 '79 - pE2 CCB-B -vl9-Ja '66-p83 LJ - v91 - Ja 15 -66 - p426 
NYRB - v5 - D 9 '65 - p39 NYTBR - v70 - N 21 '65 - p60 Nat R - vl7 - 
D14 '65-pll65 y"

Here is a sample of my list:

['AAA', 'A Anth', 'AAPSS-A', 'A Anth', 'A Arch', 'A Art', 'AB', 'ABA 
Jour', 'ABC', 'ABR', 'AC', 'ACSB', 'Adult L', 'Advocate', 'AE', 'AER', 
'AF', 'Africa T', 'Afterimage', 'Aging', 'AH', 'AHR', 'A Hy R', 'AIQ', 
'AJA', 'AJES', 'AJMD', 'AJMR', 'AJP', 'A J Psy', 'AJS', 'AL', 'A Lead', 
'A Lib', 'Am', 'Am Ant', 'Am Arts', 'Am Craft', 'Amer R', 'Am Ethol', 
'Am Film', 'Am Mus Teach', 'Am Q', 'Ams', 'Am Sci', 'Am Spect', 'Am 
Threat', 'Analog', 'ANQ', 'ANQ:QJ', 'Ant & Col Hob', 'Antiq', 'Antiq 
J', 'Ant R', 'Apo', 'APR', 'APSR', 'AR', 'ARBA', 'Arch', 'Archt R', 
'ARG', 'Armchair Det', 'Art Am', 'Art Bull', 'Art Dir', 'Art J', 'Art 
N', 'AS', 'ASBYP', 'Aspen A', 'Aspen J', 'ASR', 'Astron', 'Ath J', 
'Atl', 'Atl Pro Bk R', 'Atl PBR', 'Aud', 'AW', 'BALF', 'Ballet N', 
"Barron's", 'BAS', 'BB', 'B&B', 'BC', 'BCM', 'B Ent', 'Belles Let', 
'BF', 'BFYC', 'B Hor', 'BHR', 'BIC', 'Biography', 'BksW', 'Bks for 
Keeps', 'Bks for YP', 'BL', 'Bloom Rev']

From the string in the dataframe, I want to select 'BW', 'CCB-B', 'LJ', 'NYRB', 'NYTBR', and 'Nat R' (all of which are in the full list) and put them in a new column in the same row.

My code looks like this:

s = df65['Review'].str.extractall(reviews_list).squeeze()
s = s.unstack(level=-1)
df65['Reviews'] = s

But extractall doesn't take lists as arguments in this way.

Help?

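str.extractall expects a regex pattern, not a list. You can build a pattern that matches any item in the list by joining the items with the alternation operator:

'|'.join(reviews_list)

Some characters have a special meaning in a regex and must be escaped, so import re and apply re.escape to each item:

[re.escape(item) for item in reviews_list]

extractall also requires at least one capture group in the pattern, so wrap the alternation in parentheses. Your new call will be:

import re
pattern = '(' + '|'.join(re.escape(item) for item in reviews_list) + ')'
s = df65['Review'].str.extractall(pattern).squeeze()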
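For reference, here is a minimal end-to-end sketch of this approach on toy data. The three-row dataframe and the shortened reviews_list are made up for illustration; only the column names 'Review' and 'Reviews' come from the question. Two details are my own additions rather than part of the question: the alternation is wrapped in parentheses because extractall needs a capture group, and the items are sorted longest first so that, for example, 'ANQ:QJ' is tried before 'ANQ'.

```python
import re
import pandas as pd

# Hypothetical, shortened stand-ins for the real list and dataframe.
reviews_list = ['BW', 'CCB-B', 'LJ', 'NYRB', 'NYTBR', 'Nat R', 'B&B', 'ANQ', 'ANQ:QJ']
df65 = pd.DataFrame({
    'Review': [
        "BW - Jl 8 '79 - pE2 CCB-B -vl9-Ja '66-p83 LJ - v91 - Ja 15 -66 - p426",
        "NYRB - v5 - D 9 '65 - p39 NYTBR - v70 - N 21 '65 - p60 Nat R - vl7",
        "no known abbreviation here",
    ]
})

# Deduplicate, then sort longest first so longer abbreviations are
# tried before shorter ones that are their prefixes.
alternatives = sorted(set(reviews_list), key=len, reverse=True)

# One capture group containing every escaped abbreviation.
pattern = '(' + '|'.join(re.escape(item) for item in alternatives) + ')'

# extractall returns one row per match, indexed by (original row, match number).
matches = df65['Review'].str.extractall(pattern)[0]

# Collapse the matches for each original row into a list. Rows with no
# match are absent from the result and become NaN on assignment.
df65['Reviews'] = matches.groupby(level=0).agg(list)

print(df65['Reviews'])
```

Note that this matches substrings anywhere in the text, so a short abbreviation such as 'AB' would also be found inside a longer word. If that matters for your data, you may need to add boundaries around the group.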
