Match strings in a list against a DataFrame column and put the matches into a new column

Question

Using Python and pandas.

I have a dataframe with three columns and about a million rows. The third column contains strings. I want to select a subset of these strings that match the strings in a list and put them in a fourth column.

Here is an example of a string from the dataframe:

"BW - Jl 8 '79 - pE2 CCB-B -vl9-Ja '66-p83 LJ - v91 - Ja 15 -66 - p426 
NYRB - v5 - D 9 '65 - p39 NYTBR - v70 - N 21 '65 - p60 Nat R - vl7 - 
D14 '65-pll65 y"

Here is a sample of my list:

['AAA', 'A Anth', 'AAPSS-A', 'A Anth', 'A Arch', 'A Art', 'AB', 'ABA 
Jour', 'ABC', 'ABR', 'AC', 'ACSB', 'Adult L', 'Advocate', 'AE', 'AER', 
'AF', 'Africa T', 'Afterimage', 'Aging', 'AH', 'AHR', 'A Hy R', 'AIQ', 
'AJA', 'AJES', 'AJMD', 'AJMR', 'AJP', 'A J Psy', 'AJS', 'AL', 'A Lead', 
'A Lib', 'Am', 'Am Ant', 'Am Arts', 'Am Craft', 'Amer R', 'Am Ethol', 
'Am Film', 'Am Mus Teach', 'Am Q', 'Ams', 'Am Sci', 'Am Spect', 'Am 
Threat', 'Analog', 'ANQ', 'ANQ:QJ', 'Ant & Col Hob', 'Antiq', 'Antiq 
J', 'Ant R', 'Apo', 'APR', 'APSR', 'AR', 'ARBA', 'Arch', 'Archt R', 
'ARG', 'Armchair Det', 'Art Am', 'Art Bull', 'Art Dir', 'Art J', 'Art 
N', 'AS', 'ASBYP', 'Aspen A', 'Aspen J', 'ASR', 'Astron', 'Ath J', 
'Atl', 'Atl Pro Bk R', 'Atl PBR', 'Aud', 'AW', 'BALF', 'Ballet N', 
"Barron's", 'BAS', 'BB', 'B&B', 'BC', 'BCM', 'B Ent', 'Belles Let', 
'BF', 'BFYC', 'B Hor', 'BHR', 'BIC', 'Biography', 'BksW', 'Bks for 
Keeps', 'Bks for YP', 'BL', 'Bloom Rev']

From the string in the dataframe, I want to select 'BW', 'CCB-B', 'LJ', 'NYRB', 'NYTBR', and 'Nat R' (all of which are in the full list) and put them in a new column in the same row.

My code looks like this:

s = df65['Review'].str.extractall(reviews_list).squeeze()
s = s.unstack(level=-1)
df65['Reviews'] = s

But extractall doesn't take lists as arguments in this way.

Help?

Answer 1

str.extractall expects a regex pattern as a parameter. You can make this regex with

'|'.join(reviews_list)

But some characters need to be escaped to be used with regex, so import re and use re.escape like this:

[re.escape(item) for item in reviews_list]
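To see why the escaping matters, here is a small sketch. The entry 'A.B' is a made-up abbreviation (not from your list), chosen only because it contains a regex metacharacter; the other two entries are from your sample.

```python
import re

# 'A.B' is a hypothetical entry used only to illustrate escaping.
items = ['B&B', 'ANQ:QJ', 'A.B']

unescaped = '|'.join(items)
escaped = '|'.join(re.escape(item) for item in items)

# Without escaping, '.' means "any character", so 'AxB' is a false match.
assert re.fullmatch(unescaped, 'AxB') is not None

# With escaping, only the literal text 'A.B' matches.
assert re.fullmatch(escaped, 'AxB') is None
assert re.fullmatch(escaped, 'A.B') is not None
```

Most of the entries in your sample have no metacharacters, but escaping everything costs nothing and protects you if the full list contains '.', '+', '(' and so on.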

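extractall also requires at least one capture group in the pattern (otherwise it raises "pattern contains no capture groups"), so wrap the alternation in parentheses. Your new call will be

import re
s = df65['Review'].str.extractall('(' + '|'.join([re.escape(item) for item in reviews_list]) + ')').squeeze()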
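For completeness, here is a minimal end-to-end sketch of one way to get the matches into a new column. It rests on several assumptions of mine rather than anything in your post: the tiny `df65` and shortened `reviews_list` are stand-ins for your data, the alternation is wrapped in a capture group (which extractall needs), longer abbreviations are tried first so that 'Am Ant' is not cut short to 'Am', the `(?<!\w)` / `(?!\w)` guards keep 'AB' from matching inside a longer word, and the matches for each row are collected into a list.

```python
import re
import pandas as pd

# Stand-in data; replace with your real list and dataframe.
reviews_list = ['BW', 'CCB-B', 'LJ', 'NYRB', 'NYTBR', 'Nat R', 'Am', 'Am Ant']
df65 = pd.DataFrame({
    'Review': [
        "BW - Jl 8 '79 - pE2 CCB-B -vl9-Ja '66-p83 LJ - v91 - Ja 15 -66 - p426",
        "NYRB - v5 - D 9 '65 - p39 NYTBR - v70 - N 21 '65 - p60 Nat R - vl7",
        "Am Ant - v1 - p3",
        "nothing to match here",
    ]
})

# Longest alternatives first, so 'Am Ant' is tried before 'Am'.
alternatives = sorted(set(reviews_list), key=len, reverse=True)

# One capture group around the escaped alternation, guarded so that
# an abbreviation is not matched in the middle of a longer word.
pattern = r'(?<!\w)(' + '|'.join(re.escape(a) for a in alternatives) + r')(?!\w)'

# extractall returns one row per match, indexed by (original row, match number);
# column 0 holds the text captured by the single group.
matches = df65['Review'].str.extractall(pattern)[0]

# Collapse back to one entry per original row; rows with no match become NaN.
df65['Reviews'] = matches.groupby(level=0).apply(list)
```

If you would rather have a single string per row than a list, `', '.join` can be used in place of `list` in the last line.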
