If column X contains String then find position of substring in column Y - PYTHON
I'm trying to find the starting position of a string in a URL contained in column ['url'] if column ['Existe'] contains "F" or "D". I'm new to Python and I'm trying to replicate a workflow from Excel in Python. After an hour of trying methods with lambda, numpy.where or numpy.select, and searching the web, I had to ask for help.
I've tried applying the following code, but it only tells me whether the value exists; it doesn't actually give me the position in the string. What I currently have is:
df['Start']= ["/t/" in x[0] and "F" in x[1] for x in zip(df['url'],df['Existe'])]
Basically, the result it gives me is the following:
order id date time URL typedCount transition Existe Start
0 0 14438 1/3/2021 14:49:37 messenger.com/t/xxxxx 0 link F True
1 1 14437 1/3/2021 14:49:18 messenger.com/t/xxxxx 0 link F True
What I'm trying to do is find the starting position of "/t/" in df['url'] if "F" exists in df['Existe'], and place the result in a new column, df['Start']. I have to use this conditional because df['Existe'] contains both "F" and "D", and it has to look for "/t/" if it's "F", and "/@me/" if it's "D".
The result I'm looking for is:
order id date time URL typedCount transition Existe Start
0 0 14438 1/3/2021 14:49:37 messenger.com/t/xxxxx 0 link F 14
1 1 14437 1/3/2021 14:49:18 messenger.com/t/xxxxx 0 link F 14
Does anyone know a way of doing this?
Thanks
When manipulating data with pandas, it is typically best to avoid looping over rows. When your logic only operates on certain rows, it is better to begin by explicitly identifying those rows. The subset of rows where the value of column Existe is equal to "F" is:
has_f = df["Existe"] == "F"
Now you can use has_f to select only the rows you care about in df.
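As a minimal sketch of that selection step, the snippet below builds a small DataFrame and uses the Boolean mask to pick rows. The sample data and the names f_rows and d_rows are assumptions made up for illustration, not part of the original question.

```python
import pandas as pd

# Hypothetical sample data modelled on the question's columns
df = pd.DataFrame({
    "URL": ["messenger.com/t/xxxxx", "messenger.com/@me/xxxx"],
    "Existe": ["F", "D"],
})

has_f = df["Existe"] == "F"  # Boolean Series: True where Existe is "F"
f_rows = df[has_f]           # only the rows where the mask is True
d_rows = df[~has_f]          # "~" inverts the mask, selecting every other row
```

Note that ~has_f selects every row that is not "F", which only equals the "D" rows if the column holds nothing else.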
When working in pandas, try to use built-in pandas (or numpy) functions as much as possible. While you might not notice the difference when working with small DataFrames, any raw Python code you write and apply with df.apply() will perform poorly compared to the optimized code included in the pandas and numpy packages. Fortunately, pandas has vectorized string functions that can help you here. To find the location of a substring in each row of a column of strings, try the following:
t_locations = df["URL"].str.find("/t/")
This produces a Series of integer locations of the first occurrence of the substring "/t/" in the column URL. You can do the same for "/@me/".
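To see what Series.str.find() returns, here is a small self-contained sketch. The three URLs are made-up sample values (the third deliberately contains neither substring), and me_locations is a name chosen here for illustration.

```python
import pandas as pd

# Hypothetical URLs: one with "/t/", one with "/@me/", one with neither
urls = pd.Series([
    "messenger.com/t/xxxxx",
    "messenger.com/@me/xxxx",
    "messenger.com/nothing",
])

t_locations = urls.str.find("/t/")     # 0-based index of "/t/", or -1 if absent
me_locations = urls.str.find("/@me/")  # same for "/@me/"

print(t_locations.tolist())   # [13, -1, -1]
print(me_locations.tolist())  # [-1, 13, -1]
```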
Combining these two features of pandas requires using the df.loc indexer to select the rows and columns you care about and only applying the str.find() function to those values:
df["Start"] = -1 # some default value
has_f = df["Existe"] == "F"
df.loc[has_f, "Start"] = df.loc[has_f, "URL"].str.find("/t/")
# The "~" here returns the inverse of the Boolean Series
df.loc[~has_f, "Start"] = df.loc[~has_f, "URL"].str.find("/@me/")
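The code above treats every row that is not "F" as a "D" row. If the column may hold other values, one possible alternative is numpy.select, which the question mentions having tried. The following is only a sketch under that assumption: the sample data, the "X" value, and the names is_f and is_d are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; "X" stands for any value that is neither "F" nor "D"
df = pd.DataFrame({
    "URL": ["messenger.com/t/xxxxx", "messenger.com/@me/xxxx", "messenger.com/other"],
    "Existe": ["F", "D", "X"],
})

is_f = df["Existe"] == "F"
is_d = df["Existe"] == "D"

# The first matching condition decides which choice is used for each row;
# rows that match no condition get the default
df["Start"] = np.select(
    [is_f, is_d],
    [df["URL"].str.find("/t/"), df["URL"].str.find("/@me/")],
    default=-1,
)

print(df["Start"].tolist())  # [13, 13, -1]
```

Both str.find() calls run over the whole column here, which is simple but does slightly more work than the df.loc version.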
You could achieve this by using the apply() routine of DataFrame:
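def process_row(row):
    # str.find() is 0-based; the + 1 turns it into a 1-based position
    # (so a substring that is not found gives 0 instead of -1)
    if row['Existe'] == 'F':
        row['Start'] = row['URL'].find('/t/') + 1
    elif row['Existe'] == 'D':
        row['Start'] = row['URL'].find('/@me/') + 1
    return row

# apply() returns a new DataFrame, so keep the result
df = df.apply(process_row, axis=1)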
I am not sure what you meant by 'F' and 'D', so I assumed the example data below; the output that goes with it follows further down.
order id date time URL typedCount transition Existe
0 14438 1/3/2021 14:49:37 messenger.com/t/xxxxx 0 link F
1 14437 1/3/2021 14:49:18 messenger.com/t/xxxxx 0 link F
2 14437 1/3/2021 14:49:18 messenger.com/t/@me/xxxx 0 link D
3 14437 1/3/2021 14:49:18 messenger.com/t/@me/xxxx 0 link FD
4 14437 1/3/2021 14:49:18 messenger.com/nothing/xxxx 0 link FD
Basically you can use df.apply to execute a function on your dataframe:
import pandas as pd

def myfunc(row):
    # -1 means "not searched for, or not found"
    f_pos = -1
    d_pos = -1
    if 'F' in row['Existe']:
        f_pos = row['URL'].find('/t/')
    if 'D' in row['Existe']:
        d_pos = row['URL'].find('/@me/')
    return f_pos, d_pos

df = pd.read_csv("yourfile", sep=",")  # change the field separator
df['Start'] = df.apply(myfunc, axis=1)
And the output is:
order id date time URL typedCount transition Existe Start
0 0 14438 1/3/2021 14:49:37 messenger.com/t/xxxxx 0 link F (13, -1)
1 1 14437 1/3/2021 14:49:18 messenger.com/t/xxxxx 0 link F (13, -1)
2 2 14437 1/3/2021 14:49:18 messenger.com/t/@me/xxxx 0 link D (-1, 15)
3 3 14437 1/3/2021 14:49:18 messenger.com/t/@me/xxxx 0 link FD (13, 15)
4 4 14437 1/3/2021 14:49:18 messenger.com/nothing/xxxx 0 link FD (-1, -1)
As you can see, you can edit myfunc to fit your needs better, as I'm not sure what you meant by having both F and D.
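As one example of such an adjustment, the sketch below stores the two positions in separate columns instead of a single tuple, using the result_type="expand" option of DataFrame.apply. The sample data and the column names F_pos and D_pos are assumptions made up for illustration.

```python
import pandas as pd

def myfunc(row):
    # Same logic as above: -1 means "not searched for, or not found"
    f_pos = -1
    d_pos = -1
    if 'F' in row['Existe']:
        f_pos = row['URL'].find('/t/')
    if 'D' in row['Existe']:
        d_pos = row['URL'].find('/@me/')
    return f_pos, d_pos

# Hypothetical sample data modelled on the example table
df = pd.DataFrame({
    "URL": ["messenger.com/t/xxxxx", "messenger.com/t/@me/xxxx"],
    "Existe": ["F", "FD"],
})

# result_type="expand" turns each returned tuple into columns 0 and 1
expanded = df.apply(myfunc, axis=1, result_type="expand")
df["F_pos"] = expanded[0]
df["D_pos"] = expanded[1]
```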
It is worth noting that the str.find() method will return -1 if the string can't be found.
I've also set myfunc to return this same -1 in case no F or D is found, but you can set anything else. For instance, if you set it to -2 or False, you can quickly tell whether nothing was found because the substring wasn't in the URL or because no F or D was in the Existe column.
EDIT: In Python, indexes start at 0, meaning the 14th character is at index 13.
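If you want the 1-based position shown in the question's expected output (14), a small helper can convert the index. This is only a sketch: one_based_position is a hypothetical name, and returning -1 for a missing substring is a convention chosen here.

```python
def one_based_position(text, substring):
    """Return the 1-based position of substring in text, or -1 if absent."""
    index = text.find(substring)  # 0-based index, -1 when not found
    return index + 1 if index != -1 else -1

print(one_based_position("messenger.com/t/xxxxx", "/t/"))       # 14
print(one_based_position("messenger.com/nothing/xxxx", "/t/"))  # -1
```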