简体   繁体   中英

Create new column using str.contains and based on if-else condition

I have a list of names 'pattern' that I wish to match with strings in column 'url_text'. If there is a match ie True the name should be printed in a new column 'pol_names_block' and if False leave the row empty.

pattern = '|'.join(pol_names_list) 

print(pattern)
'Jon Kyl|Doug Jones|Tim Kaine|Lindsey Graham|Cory Booker|Kamala Harris|Orrin Hatch|Bernie Sanders|Thom Tillis|Jerry Moran|Shelly Moore Capito|Maggie Hassan|Tom Carper|Martin Heinrich|Steve Daines|Pat Toomey|Todd Young|Bill Nelson|John Barrasso|Chris Murphy|Mike Rounds|Mike Crapo|John Thune|John. McCain|Susan Collins|Patty Murray|Dianne Feinstein|Claire McCaskill|Lamar Alexander|Jack Reed|Chuck Grassley|Catherine Masto|Pat Roberts|Ben Cardin|Dean Heller|Ron Wyden|Dick Durbin|Jeanne Shaheen|Tammy Duckworth|Sheldon Whitehouse|Tom Cotton|Sherrod Brown|Bob Corker|Tom Udall|Mitch McConnell|James Lankford|Ted Cruz|Mike Enzi|Gary Peters|Jeff Flake|Johnny Isakson|Jim Inhofe|Lindsey Graham|Marco Rubio|Angus King|Kirsten Gillibrand|Bob Casey|Chris Van Hollen|Thad Cochran|Richard Burr|Rob Portman|Jon Tester|Bob Menendez|John Boozman|Mazie Hirono|Joe Manchin|Deb Fischer|Michael Bennet|Debbie Stabenow|Ben Sasse|Brian Schatz|Jim Risch|Mike Lee|Elizabeth Warren|Richard Blumenthal|David Perdue|Al Franken|Bill Cassidy|Cory Gardner|Lisa Murkowski|Maria Cantwell|Tammy Baldwin|Joe Donnelly|Roger Wicker|Amy Klobuchar|Joel Heitkamp|Joni Ernst|Chris Coons|Mark Warner|John Cornyn|Ron Johnson|Patrick Leahy|Chuck Schumer|John Kennedy|Jeff Merkley|Roy Blunt|Richard Shelby|John Hoeven|Rand Paul|Dan Sullivan|Tim Scott|Ed Markey'

I am using the following code df['url_text'].str.contains(pattern) which results in True in case a name in 'pattern' is present in a row in column 'url_text' and False otherwise. With that I have tried the following code:

df['pol_name_block'] = df.apply(
    lambda row: pol_names_list if df['url_text'].str.contains(pattern) in row['url_text'] else ' ',
    axis=1
)

I get the error:

TypeError: 'in <string>' requires string as left operand, not Series

From this toy Dataframe:

>>> import pandas as pd
>>> from io import StringIO

>>> df = pd.read_csv(StringIO("""
... id,url_text
... 1,Tim Kaine
... 2,Tim Kain
... 3,Tim
... 4,Lindsey Graham.com
... """), sep=',')
>>> df
    id  url_text
0   1   Tim Kaine
1   2   Tim Kain
2   3   Tim
3   4   Lindsey Graham.com

From pol_names_list , we build patterns by formating it like so:

patterns = '(%s)' % '|'.join(pol_names_list)

Then, we can use the extract method to assign the value to the column pol_name_block to get the expected result:

df['pol_name_block'] = df['url_text'].str.extract(patterns)

Output:

    id  url_text            pol_name_block
0   1   Tim Kaine           Tim Kaine
1   2   Tim Kain            NaN
2   3   Tim                 NaN
3   4   Lindsey Graham.com  Lindsey Graham

Change your pattern to enclose it around a capture group () and use extract :

pattern = fr"({'|'.join(pol_names_list)})"
df['pol_name_block'] = df['url_text'].str.extract(pattern)
print(df)

# Output <- with the sample of @tlentali
   id        url_text  pol_name_block
0   1       Tim Kaine       Tim Kaine
1   2        Tim Kain             NaN
2   3             Tim             NaN
3   4  Lindsey Graham  Lindsey Graham

Important : you can extract only one element even there are multiple matches. If you want to extract all elements you have to use findall or extractall (only the output format will change)

# New sample, same pattern
>>> df
   id                      url_text
0   1  Tim Kaine and Lindsey Graham
1   2                      Tim Kain
2   3                           Tim
3   4                Lindsey Graham

# findall
>>> df['url_text'].str.findall(pattern)
0    [Tim Kaine, Lindsey Graham]
1                             []
2                             []
3               [Lindsey Graham]
Name: url_text, dtype: object

# extractall
>>> df['url_text'].str.extractall(pattern)
                      0
  match                
0 0           Tim Kaine
  1      Lindsey Graham
3 0      Lindsey Graham

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM