
Constructing a dataframe with multiple columns based on str conditions using a loop - python

I have a webscraped Twitter DataFrame that includes user location. The location variable looks like this:

2          Crockett, Houston County, Texas, 75835, USA
3                                   NYC, New York, USA
4                            Warszawa, mazowieckie, RP
5                                           Texas, USA
6                 Virginia Beach, Virginia, 23451, USA
7          Louisville, Jefferson County, Kentucky, USA

I would like to construct state dummies for all USA states by using a loop.

I have managed to extract users from the USA using

location_usa = location_df['location'].str.contains('usa', case = False)

However, the code would be too bulky if I wrote this for every single state. I have a list of the states as strings. I am also unable to use

pd.Series.str.get_dummies()

as there are different locations within the same state and each entry is a whole sentence.

I would like the output to look something like this:

   Alabama   Alaska  Arizona
1        0        0        1
2        0        1        0
3        1        0        0 
4        0        0        0

Or the same with Boolean values.

Use .str.extract to get a Series of the matched states, then call pd.get_dummies on that Series. You will need to define a list of all 50 states (only four are shown here):

import pandas as pd

states = ['Texas', 'New York', 'Kentucky', 'Virginia']
pd.get_dummies(df.col1.str.extract('(' + '|'.join(x+',' for x in states)+ ')')[0].str.strip(','))

   Kentucky  New York  Texas  Virginia
0         0         0      1         0
1         0         1      0         0
2         0         0      0         0
3         0         0      1         0
4         0         0      0         1
5         1         0      0         0

Note that I matched on state names followed by a ',' since that seems to be the pattern. This avoids false matches such as 'Virginia' in 'Virginia Beach', or more problematic cases like 'Washington County, Minnesota'.
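To illustrate why the trailing comma matters, here is a minimal sketch comparing a bare alternation with the comma-anchored one. The two sample rows and the names naive_pat and anchored_pat are made up for this illustration, and re.escape is added only as a precaution for names containing regex metacharacters.

```python
import re
import pandas as pd

# Hypothetical sample rows chosen to trigger the false matches described above.
locations = pd.Series([
    "Washington County, Minnesota, USA",
    "Virginia Beach, Virginia, 23451, USA",
])
states = ["Washington", "Minnesota", "Virginia"]

# Bare alternation: matches the first state name found anywhere in the string.
naive_pat = "(" + "|".join(re.escape(s) for s in states) + ")"
naive = locations.str.extract(naive_pat)[0]

# Comma-anchored alternation: the name must be immediately followed by ','.
anchored_pat = "(" + "|".join(re.escape(s) + "," for s in states) + ")"
anchored = locations.str.extract(anchored_pat)[0].str.strip(",")

print(naive.tolist())     # ['Washington', 'Virginia']
print(anchored.tolist())  # ['Minnesota', 'Virginia']
```

The bare pattern picks up 'Washington' from the county name, while the anchored pattern skips it because 'Washington' is followed by a space rather than a comma.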

If you expect multiple states to match on a single line, use .str.extractall instead, summing across the 0th index level:

# .sum(level=0) was removed in pandas 2.0; groupby(level=0).sum() is the equivalent
pd.get_dummies(df.col1.str.extractall('(' + '|'.join(x+',' for x in states)+ ')')[0].str.strip(',')).groupby(level=0).sum().clip(upper=1)
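As a rough, self-contained sketch of the multi-match case: extractall returns one row per match, so rows without any match disappear unless they are restored. The sample data and the name multi are assumptions for illustration. Taking the maximum per original row is used here as an alternative to summing and clipping, and the final reindex step is an addition that brings back unmatched rows as all zeros.

```python
import re
import pandas as pd

# Hypothetical sample data; the last row names two states.
df = pd.DataFrame({"col1": [
    "Texas, USA",
    "Warszawa, mazowieckie, RP",
    "California, Pennsylvania, USA",
]})
states = ["Texas", "California", "Pennsylvania"]

pattern = "(" + "|".join(re.escape(s) + "," for s in states) + ")"

# One row per match, indexed by (original row label, match number).
matches = df["col1"].str.extractall(pattern)[0].str.strip(",")

multi = (
    pd.get_dummies(matches)
    .astype(int)
    .groupby(level=0).max()  # collapse back to one row per original row
    .reindex(index=df.index, columns=states, fill_value=0)  # restore unmatched rows
)
print(multi)
```

Row 2 ends up flagged for both California and Pennsylvania, and row 1 is kept as a row of zeros.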

Edit:

Perhaps there are better ways, but as suggested by @BradSolomon, a somewhat safer approach is to match on 'State,(optional 5-digit ZIP,) USA':

states = ['Texas', 'New York', 'Kentucky', 'Virginia', 'California', 'Pennsylvania']
# raw string so that \s and \d reach the regex engine without invalid-escape warnings
pat = '(' + '|'.join(x + r',?(\s\d{5},)?\sUSA' for x in states) + ')'

s = df.col1.str.extract(pat)[0].str.split(',').str[0]

Output: s

0           Texas
1        New York
2             NaN
3           Texas
4        Virginia
5        Kentucky
6    Pennsylvania
Name: 0, dtype: object

from Input

                                          col1
0  Crockett, Houston County, Texas, 75835, USA
1                           NYC, New York, USA
2                    Warszawa, mazowieckie, RP
3                                   Texas, USA
4         Virginia Beach, Virginia, 23451, USA
5  Louisville, Jefferson County, Kentucky, USA
6                California, Pennsylvania, USA
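Putting the pieces together, the following sketch rebuilds the input above and goes from the ZIP-aware pattern all the way to dummy columns. It departs from the snippet above in a few labeled ways: the ZIP group is written as non-capturing, re.escape is applied to each name, and reindex is added so that every state in the list gets a column even if it never matches. The name dummies is introduced only for this example.

```python
import re
import pandas as pd

df = pd.DataFrame({"col1": [
    "Crockett, Houston County, Texas, 75835, USA",
    "NYC, New York, USA",
    "Warszawa, mazowieckie, RP",
    "Texas, USA",
    "Virginia Beach, Virginia, 23451, USA",
    "Louisville, Jefferson County, Kentucky, USA",
    "California, Pennsylvania, USA",
]})
states = ["Texas", "New York", "Kentucky", "Virginia", "California", "Pennsylvania"]

# State name, optional comma, optional 5-digit ZIP plus comma, then " USA".
pat = "(" + "|".join(re.escape(x) + r",?(?:\s\d{5},)?\sUSA" for x in states) + ")"

# Keep only the state name from the matched text.
s = df["col1"].str.extract(pat)[0].str.split(",").str[0]

# Unmatched rows (NaN) become all zeros; reindex adds never-matched states.
dummies = pd.get_dummies(s).reindex(columns=states, fill_value=0).astype(int)
print(dummies)
```

In the last row only Pennsylvania is flagged, since 'California' is not directly followed by the ZIP/USA suffix, which matches the Series shown above.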
