I'm trying to find the starting position of a substring in a URL contained in df['url'], but only if df['Existe'] contains "F" or "D". I'm new to Python and am trying to replicate an Excel workflow; after an hour of trying approaches with lambda, numpy.where and numpy.select, and searching the web, I had to ask for help.
I've tried the following code, but it only tells me whether the value exists; it doesn't give me its position in the string. What I currently have is:
df['Start']= ["/t/" in x[0] and "F" in x[1] for x in zip(df['url'],df['Existe'])]
Basically, the result it gives me is the following:
order id date time URL typedCount transition Existe Start
0 0 14438 1/3/2021 14:49:37 messenger.com/t/xxxxx 0 link F True
1 1 14437 1/3/2021 14:49:18 messenger.com/t/xxxxx 0 link F True
What I'm trying to do is to find the starting position of "/t/" in df['url'] if "F" exists in df['Existe'] and placing the result in a new column, df['Start']. I have to use this conditional because df['Existe'] contains both "F" and "D", and it has to look for "/t/" if it's "F", and "/@me/" if it's "D".
The result I'm looking for is:
order id date time URL typedCount transition Existe Start
0 0 14438 1/3/2021 14:49:37 messenger.com/t/xxxxx 0 link F 14
1 1 14437 1/3/2021 14:49:18 messenger.com/t/xxxxx 0 link F 14
Does anyone know a way of doing this?
Thanks
When manipulating data with pandas, it is typically best to avoid looping over rows. When your logic only operates on certain rows, it is better to begin by explicitly identifying those rows. The subset of rows where the value of column Existe is equal to "F" is:
has_f = df["Existe"] == "F"
Now you can use has_f to select only the rows you care about in df.
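As a minimal sketch of that row selection, assuming a small made-up DataFrame modelled on the question's table (the sample values are illustrative, not the asker's data):

```python
import pandas as pd

# Hypothetical sample data modelled on the question's table
df = pd.DataFrame({
    "URL": [
        "messenger.com/t/xxxxx",
        "messenger.com/t/@me/xxxx",
        "messenger.com/t/yyyyy",
    ],
    "Existe": ["F", "D", "F"],
})

has_f = df["Existe"] == "F"  # Boolean Series, one True/False per row
f_rows = df[has_f]           # keeps only the rows where has_f is True
print(f_rows)
```

The mask keeps the original index, so the selected rows can still be matched back to df later.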
When working in pandas, try to use built-in pandas (or numpy) functions as much as possible. While you might not notice the difference when working with small DataFrames, any raw Python code you write and apply with df.apply() will perform poorly compared to the optimized code included in the pandas and numpy packages. Fortunately, pandas has vectorized string functions that can help you here. To find the location of a substring in each row of a column of strings, try the following:
t_locations = df["URL"].str.find("/t/")
This produces a Series of integer locations of the first occurrence of the substring "/t/" in the column URL. You can do the same for "/@me/".
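To see what str.find() gives back, here is a small sketch on a standalone Series. The URLs are made up for illustration; rows where the substring is missing come back as -1:

```python
import pandas as pd

# Hypothetical URLs, not the asker's data
urls = pd.Series([
    "messenger.com/t/xxxxx",
    "messenger.com/@me/xxxx",
    "example.com/none",
])

t_locations = urls.str.find("/t/")     # 0-based index of "/t/", or -1
me_locations = urls.str.find("/@me/")  # 0-based index of "/@me/", or -1

print(t_locations.tolist())   # [13, -1, -1]
print(me_locations.tolist())  # [-1, 13, -1]
```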
Combining these two features of pandas requires using the df.loc indexer to select the rows and columns you care about and only applying the str.find() function to those values:
df["Start"] = -1 # some default value
has_f = df["Existe"] == "F"
df.loc[has_f, "Start"] = df.loc[has_f, "URL"].str.find("/t/")
# The "~" here returns the inverse of the Boolean Series
df.loc[~has_f, "Start"] = df.loc[~has_f, "URL"].str.find("/@me/")
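Note that ~has_f covers every row that is not "F", not only the "D" rows. If Existe might hold other values, a variant with one explicit mask per flag may be safer. This is a sketch under that assumption; the "X" value and the sample URLs are made up:

```python
import pandas as pd

df = pd.DataFrame({
    "URL": [
        "messenger.com/t/xxxxx",
        "messenger.com/@me/xxxx",
        "messenger.com/other",
    ],
    # "X" is a hypothetical value for a row that is neither F nor D
    "Existe": ["F", "D", "X"],
})

df["Start"] = -1  # default for rows that match neither flag
is_f = df["Existe"] == "F"
is_d = df["Existe"] == "D"

# Each assignment only touches the rows selected by its own mask
df.loc[is_f, "Start"] = df.loc[is_f, "URL"].str.find("/t/")
df.loc[is_d, "Start"] = df.loc[is_d, "URL"].str.find("/@me/")

print(df["Start"].tolist())  # [13, 13, -1]
```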
You could achieve this by using the apply() routine of DataFrame:
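def process_row(row):
    # find() is 0-based; adding 1 gives a 1-based position
    if row['Existe'] == 'F':
        row['Start'] = row['URL'].find('/t/') + 1
    elif row['Existe'] == 'D':
        row['Start'] = row['URL'].find('/@me/') + 1
    return row

# apply() returns a new DataFrame, so assign the result back
df = df.apply(process_row, axis=1)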
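A possible variation, shown only as a sketch: have the function return just the position and assign it to the new column, so every row gets a value. The name start_position and the sample data are made up for illustration. Because of the + 1, a substring that is not found (-1) ends up as 0 here:

```python
import pandas as pd

def start_position(row):
    """Return the 1-based position of the marker, or 0 if there is none."""
    if row["Existe"] == "F":
        return row["URL"].find("/t/") + 1
    if row["Existe"] == "D":
        return row["URL"].find("/@me/") + 1
    return 0  # neither F nor D

# Hypothetical sample data
df = pd.DataFrame({
    "URL": [
        "messenger.com/t/xxxxx",
        "messenger.com/@me/xxxx",
        "messenger.com/other",
    ],
    "Existe": ["F", "D", "F"],
})

df["Start"] = df.apply(start_position, axis=1)
print(df["Start"].tolist())  # [14, 14, 0]
```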
I wasn't sure what you meant by 'F' and 'D', so I assumed the following example data (the corresponding output is shown further down):
order id date time URL typedCount transition Existe
0 14438 1/3/2021 14:49:37 messenger.com/t/xxxxx 0 link F
1 14437 1/3/2021 14:49:18 messenger.com/t/xxxxx 0 link F
2 14437 1/3/2021 14:49:18 messenger.com/t/@me/xxxx 0 link D
3 14437 1/3/2021 14:49:18 messenger.com/t/@me/xxxx 0 link FD
4 14437 1/3/2021 14:49:18 messenger.com/nothing/xxxx 0 link FD
Basically you can use df.apply to execute a function on each row of your DataFrame:
import pandas as pd

def myfunc(row):
    # -1 is the default when the flag is absent from Existe
    f_pos = -1
    d_pos = -1
    if 'F' in row['Existe']:
        f_pos = row['URL'].find('/t/')
    if 'D' in row['Existe']:
        d_pos = row['URL'].find('/@me/')
    return f_pos, d_pos

df = pd.read_csv("yourfile", sep=",")  # change the field separator
df['Start'] = df.apply(myfunc, axis=1)
And the output is:
order id date time URL typedCount transition Existe Start
0 0 14438 1/3/2021 14:49:37 messenger.com/t/xxxxx 0 link F (13, -1)
1 1 14437 1/3/2021 14:49:18 messenger.com/t/xxxxx 0 link F (13, -1)
2 2 14437 1/3/2021 14:49:18 messenger.com/t/@me/xxxx 0 link D (-1, 15)
3 3 14437 1/3/2021 14:49:18 messenger.com/t/@me/xxxx 0 link FD (13, 15)
4 4 14437 1/3/2021 14:49:18 messenger.com/nothing/xxxx 0 link FD (-1, -1)
As you can see, you can edit myfunc to better fit your needs, as I'm not sure what you meant by Existe containing both F and D.
It is worth noting that the str.find() method returns -1 if the substring can't be found.
I've also set myfunc to return this same -1 when no F or D is found in Existe, but you can use any other value. For instance, if you set it to -2 or False, you can quickly tell whether nothing was found because the substring wasn't in the URL or because no F or D was in the Existe column.
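A rough sketch of that idea in plain Python, using -2 as the "flag absent" value. The names NOT_APPLICABLE and find_positions are made up for this example:

```python
NOT_APPLICABLE = -2  # hypothetical sentinel: the flag was not in Existe

def find_positions(url, existe):
    """Return (f_pos, d_pos); -1 means 'not in URL', -2 means 'flag absent'."""
    f_pos = url.find("/t/") if "F" in existe else NOT_APPLICABLE
    d_pos = url.find("/@me/") if "D" in existe else NOT_APPLICABLE
    return f_pos, d_pos

print(find_positions("messenger.com/t/xxxxx", "F"))        # (13, -2)
print(find_positions("messenger.com/nothing/xxxx", "FD"))  # (-1, -1)
```

In the first call, -2 says that D was never looked for; in the second, both flags were present but neither substring is in the URL.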
EDIT: In Python, indexes start at 0, meaning the 14th character is at index 13.
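If you want the 1-based position from the question (14 rather than 13), one way is to add 1 only when something was found, so that the "not found" value stays -1. A small sketch; the helper name one_based_find is made up:

```python
def one_based_find(text, sub):
    """Like str.find(), but 1-based; still returns -1 when sub is missing."""
    idx = text.find(sub)  # 0-based index, or -1
    return idx + 1 if idx >= 0 else -1

print(one_based_find("messenger.com/t/xxxxx", "/t/"))    # 14
print(one_based_find("messenger.com/t/xxxxx", "/@me/"))  # -1
```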