If column X contains String then find position of substring in column Y - PYTHON

I'm trying to find the starting position of a substring in a URL stored in df['url'] when df['Existe'] contains "F" or "D". I'm new to Python and am trying to replicate an Excel workflow. After an hour of trying approaches with lambda, numpy.where, and numpy.select, and searching the web, I had to ask for help.

I've tried the following code, but it only tells me whether the value exists; it doesn't give me the position in the string. What I currently have is:

df['Start']= ["/t/" in x[0] and "F" in x[1] for x in zip(df['url'],df['Existe'])]

Basically, the result it gives me is the following:

       order     id       date      time  URL                    typedCount transition  Existe  Start
0          0  14438   1/3/2021  14:49:37  messenger.com/t/xxxxx          0       link       F   True
1          1  14437   1/3/2021  14:49:18  messenger.com/t/xxxxx          0       link       F   True

What I'm trying to do is find the starting position of "/t/" in df['url'] if "F" exists in df['Existe'], and place the result in a new column, df['Start']. I need this conditional because df['Existe'] contains both "F" and "D": it has to look for "/t/" if the value is "F", and for "/@me/" if it's "D".

The result I'm looking for is:

       order     id       date      time  URL                    typedCount transition  Existe  Start
0          0  14438   1/3/2021  14:49:37  messenger.com/t/xxxxx          0       link       F   14
1          1  14437   1/3/2021  14:49:18  messenger.com/t/xxxxx          0       link       F   14

Does anyone know a way of doing this?

Thanks

Avoid Looping Over Rows

When manipulating data with pandas, it is typically best to avoid looping over rows. When your logic only operates on certain rows, it is better to begin by explicitly identifying those rows. The subset of rows where the value of column Existe is equal to "F" is:

has_f = df["Existe"] == "F"

Now you can use has_f to select only the rows you care about in df.
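
As a small illustration of that selection step, here is a minimal sketch. The two-row frame and the example.com URL are made up for the demo (only the column names come from the question), so treat it as an assumption rather than the asker's real data:

```python
import pandas as pd

# Hypothetical sample data modeled on the question's columns.
df = pd.DataFrame({
    "URL": ["messenger.com/t/xxxxx", "example.com/@me/xxxx"],
    "Existe": ["F", "D"],
})

has_f = df["Existe"] == "F"   # Boolean Series: True where Existe is exactly "F"
f_rows = df.loc[has_f]        # only the rows flagged "F"
other_rows = df.loc[~has_f]   # every remaining row
```

Note that == only matches rows whose value is exactly "F"; a value such as "FD" would need df["Existe"].str.contains("F") instead.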

When working in pandas, try to use built-in pandas (or numpy) functions as much as possible. While you might not notice the difference when working with small DataFrames, any raw Python code you write and apply with df.apply() will perform poorly compared to the optimized code included in the pandas and numpy packages. Fortunately, pandas has vectorized string functions that can help you here. To find the location of a substring in each row of a column of strings, try the following:

t_locations = df["URL"].str.find("/t/")

This produces a Series of integer locations of the first occurrence of the substring "/t/" in the column URL. You can do the same for "/@me/".
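
To make the return values concrete, here is a short sketch of str.find() on a standalone Series. The second URL is a hypothetical value added only to show the "not found" case:

```python
import pandas as pd

# First URL is from the question; the second is a made-up example.
urls = pd.Series(["messenger.com/t/xxxxx", "example.com/@me/xxxx"])

t_locations = urls.str.find("/t/")     # 0-based index of "/t/", or -1 if absent
me_locations = urls.str.find("/@me/")  # same idea for "/@me/"

print(t_locations.tolist())   # [13, -1]
print(me_locations.tolist())  # [-1, 11]
```

The positions are 0-based, and -1 marks rows where the substring does not occur.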

Combining these two features of pandas requires using the df.loc indexer to select the rows and columns you care about and only applying the str.find() function to those values:

df["Start"] = -1  # some default value
has_f = df["Existe"] == "F"

df.loc[has_f, "Start"] = df.loc[has_f, "URL"].str.find("/t/")
# The "~" here returns the inverse of the Boolean Series
df.loc[~has_f, "Start"] = df.loc[~has_f, "URL"].str.find("/@me/")
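
Since the question mentions numpy.select, the same conditional could probably also be written that way. This is only a sketch under assumptions: the sample rows are invented for the demo, and rows that are neither "F" nor "D" fall back to -1 instead of being searched for "/@me/":

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; only the column names come from the question.
df = pd.DataFrame({
    "URL": [
        "messenger.com/t/xxxxx",
        "example.com/@me/xxxx",
        "messenger.com/nothing/xxxx",
    ],
    "Existe": ["F", "D", "F"],
})

# Compute both candidate positions for every row (vectorized, 0-based).
t_pos = df["URL"].str.find("/t/")
me_pos = df["URL"].str.find("/@me/")

# Pick the position that matches each row's flag; -1 when neither flag applies.
df["Start"] = np.select(
    [df["Existe"] == "F", df["Existe"] == "D"],
    [t_pos, me_pos],
    default=-1,
)

print(df["Start"].tolist())  # [13, 11, -1]
```

This does a little extra work because both substrings are searched in every row, but it keeps the whole conditional in one expression.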

You could achieve this by using the apply() routine of DataFrame:

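def process_row(row):
    # str.find() is 0-based; adding 1 gives a 1-based position
    # (note that a "not found" result of -1 therefore becomes 0)
    if row['Existe'] == 'F':
        row['Start'] = row['URL'].find('/t/') + 1
    elif row['Existe'] == 'D':
        row['Start'] = row['URL'].find('/@me/') + 1
    return row

# apply() returns a new DataFrame, so assign the result back
df = df.apply(process_row, axis=1)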

I'm not sure what you meant about 'F' and 'D', so I assumed the example data below; the output that goes with it is shown further down.

order     id      date      time                         URL  typedCount  transition  Existe
    0  14438  1/3/2021  14:49:37       messenger.com/t/xxxxx           0        link       F
    1  14437  1/3/2021  14:49:18       messenger.com/t/xxxxx           0        link       F
    2  14437  1/3/2021  14:49:18    messenger.com/t/@me/xxxx           0        link       D
    3  14437  1/3/2021  14:49:18    messenger.com/t/@me/xxxx           0        link      FD
    4  14437  1/3/2021  14:49:18  messenger.com/nothing/xxxx           0        link      FD

Basically, you can use df.apply to execute a function on each row of your DataFrame:

import pandas as pd

def myfunc(row):
    # -1 means "not searched or not found", matching what str.find() returns
    f_pos = -1
    d_pos = -1
    if 'F' in row['Existe']:
        f_pos = row['URL'].find('/t/')
    if 'D' in row['Existe']:
        d_pos = row['URL'].find('/@me/')
    return f_pos, d_pos

df = pd.read_csv("yourfile", sep=",") # change the field separator
df['Start'] = df.apply(myfunc, axis=1)

And the output is:

    order     id      date      time                         URL  typedCount transition Existe     Start
0      0  14438  1/3/2021  14:49:37       messenger.com/t/xxxxx           0       link      F  (13, -1)
1      1  14437  1/3/2021  14:49:18       messenger.com/t/xxxxx           0       link      F  (13, -1)
2      2  14437  1/3/2021  14:49:18    messenger.com/t/@me/xxxx           0       link      D  (-1, 15)
3      3  14437  1/3/2021  14:49:18    messenger.com/t/@me/xxxx           0       link     FD  (13, 15)
4      4  14437  1/3/2021  14:49:18  messenger.com/nothing/xxxx           0       link     FD  (-1, -1)

As you can see, you can edit myfunc to better fit your needs, since I'm not sure what you meant by rows having both F and D.

It is worth noting that the str.find() method returns -1 if the substring can't be found.

I've also set myfunc to return this same -1 when no F or D is found, but you can use anything else. For instance, if you set it to -2 or False, you can quickly tell whether nothing was found because the substring wasn't in the URL or because no F or D was in the Existe column.

EDIT: In Python, indexes start at 0, meaning the 14th character is at index 13.
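
If you want the 1-based position from the question (14 rather than 13), one possible approach is to add 1 only where a match was found, so that -1 keeps meaning "not found". A minimal sketch, using one URL from the example data and one non-matching URL:

```python
import pandas as pd

urls = pd.Series(["messenger.com/t/xxxxx", "messenger.com/nothing/xxxx"])

zero_based = urls.str.find("/t/")  # 0-based index, -1 when "/t/" is absent

# Series.where keeps values where the condition is True and uses the
# replacement elsewhere: misses stay -1, hits are shifted to 1-based.
one_based = zero_based.where(zero_based < 0, zero_based + 1)

print(one_based.tolist())  # [14, -1]
```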
