
Filter by condition if string contains certain substring

I have a dataframe full of email addresses. Since Gmail usernames must be at least six characters long, I want to drop every Gmail address whose username (the part before the @) is shorter than six characters. So the dataframe df

>> print(df)

        email          
1   a@gmail.com             
2   real.email@gmail.com      
3   no.email@email.com        
4   real@yahoo.com              
5   poo@gmail.com              

would become:

        email                     
2   real.email@gmail.com      
3   no.email@email.com        
4   real@yahoo.com              

Using

df = df[
        (len(df['email'].str.split('@').str[0]) >= 6) &
        (df['email'].str.split('@').str[1] == 'gmail.com')
       ]

will filter out everything that isn't @gmail.com, so I can't use that. What I want is essentially the following (which obviously doesn't work and raises TypeError: 'method' object is not subscriptable):

if df['email'].str.split['@'].str[1] == 'gmail.com':
    len(df['email'].str.split['@'].str[0]) >= 6

How do I accomplish this in a vectorized operation?

You can use:

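a = df['email'].str.endswith('@gmail.com')  # True where the address is a Gmail address
b = df['email'].str.split('@').str[0].str.len().ge(6)  # True where the part before "@" has at least 6 characters
out = df[(a & b) | ~a]  # keep valid Gmail addresses and every non-Gmail address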

print(out)

                  email
2  real.email@gmail.com
3    no.email@email.com
4        real@yahoo.com
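
One thing worth checking with this kind of mask is the boundary: a six-character username meets the stated minimum, so the length comparison has to be inclusive (`>= 6`, i.e. `.ge(6)`) rather than strict. A minimal sketch, using made-up addresses that are not part of the question's data:

```python
import pandas as pd

# Hypothetical addresses, chosen to exercise the six-character boundary.
emails = pd.Series([
    "abcde@gmail.com",    # 5 characters: too short for Gmail
    "abcdef@gmail.com",   # exactly 6 characters: meets the minimum
    "abc@yahoo.com",      # not Gmail, so the length is irrelevant
])

parts = emails.str.split("@")
is_gmail = parts.str[1].eq("gmail.com")
long_enough = parts.str[0].str.len().ge(6)   # inclusive comparison: >= 6

# Keep a row if it is not a Gmail address, or if its username is long enough.
keep = ~is_gmail | long_enough
print(emails[keep].tolist())
# ['abcdef@gmail.com', 'abc@yahoo.com']
```

With a strict `.gt(6)` the second address would be dropped as well, even though it satisfies the six-character rule.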

See this:

>>> df[(df["email"].str.split("@").str[0].str.len() >= 6) | (df["email"].str.split("@").str[1] != 'gmail.com')]
                  email
1  real.email@gmail.com
2    no.email@email.com
3        real@yahoo.com

Your statement that it "will filter everything that isn't @gmail.com" is not inherent to the approach; you just need to get the boolean logic right, as above: keep a row if its username has at least six characters or its domain is not gmail.com. Also, to measure string lengths in a column, use .str.len() rather than len(): calling len() on a Series returns the number of rows in it (a single integer), not the length of each string.
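
To make the difference between the two length calls concrete, here is a small sketch. It assumes the question's sample data rebuilt with a default 0-based index (the question's own index starts at 1); the variable names are only illustrative.

```python
import pandas as pd

# Sample addresses from the question, rebuilt with a default 0-based index (an assumption).
df = pd.DataFrame({"email": [
    "a@gmail.com",
    "real.email@gmail.com",
    "no.email@email.com",
    "real@yahoo.com",
    "poo@gmail.com",
]})

usernames = df["email"].str.split("@").str[0]

# len() on a Series counts rows: one integer for the whole column.
n_rows = len(usernames)

# .str.len() measures each string: one integer per row.
per_row = usernames.str.len()

print(n_rows)            # 5
print(per_row.tolist())  # [1, 10, 8, 4, 3]
```

Because `len(...) >= 6` compares that single integer, it yields one plain boolean rather than a per-row mask, which is why the original attempt could not filter row by row.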

You can do:

df=df.loc[~df.email.str.contains(r"^.{0,5}@gmail\.com$")]

Outputs:

                  email
1  real.email@gmail.com
2    no.email@email.com
3        real@yahoo.com
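
To see what the pattern does and does not match, here is a short sketch using Python's `re` module directly on a few made-up strings (the inputs are hypothetical, picked to sit on either side of the five/six-character boundary):

```python
import re

# Same pattern as in the answer: start of string, 0 to 5 arbitrary characters,
# then the literal "@gmail.com", then end of string.
pattern = re.compile(r"^.{0,5}@gmail\.com$")

# Hypothetical inputs around the boundary.
samples = ["poo@gmail.com", "abcde@gmail.com", "abcdef@gmail.com", "poo@yahoo.com"]

# Addresses the pattern flags as "Gmail with a too-short username".
too_short = [s for s in samples if pattern.search(s)]
print(too_short)
# ['poo@gmail.com', 'abcde@gmail.com']
```

The `~` in the answer then inverts this match, so only the flagged addresses are removed.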

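One non-vectorized way is to collect the index labels of the rows to keep in a list and then select just those rows:

keep = []
for i in range(len(df)):
    # split each address into the part before and after "@"
    user, domain = df['email'].iloc[i].split('@')
    # keep every non-Gmail row, and Gmail rows with a username of at least 6 characters
    if domain != 'gmail.com' or len(user) >= 6:
        keep.append(df.index[i])

df[df.index.isin(keep)]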

Output:

                  email
1  real.email@gmail.com
2    no.email@email.com
3        real@yahoo.com


 