Find the position of a substring in column Y if column X contains a string - Python

Question

I'm trying to find the starting position of a string in a URL contained in the df['url'] column if the df['Existe'] column contains "F" or "D". I'm new to Python and I'm trying to replicate a workflow from Excel. After an hour of trying methods with lambda, numpy.where or numpy.select, and searching the web, I had to ask for help.

I've tried applying the following code, but it only tells me whether the value exists; it doesn't give me its position in the string. What I currently have is:

df['Start']= ["/t/" in x[0] and "F" in x[1] for x in zip(df['url'],df['Existe'])]

Basically, the result it gives me is the following:

       order     id       date      time  URL                    typedCount transition  Existe  Start
0          0  14438   1/3/2021  14:49:37  messenger.com/t/xxxxx          0       link       F   True
1          1  14437   1/3/2021  14:49:18  messenger.com/t/xxxxx          0       link       F   True

What I'm trying to do is find the starting position of "/t/" in df['url'] if "F" exists in df['Existe'], and place the result in a new column, df['Start']. I have to use this conditional because df['Existe'] contains both "F" and "D", and it has to look for "/t/" if it's "F", and "/@me/" if it's "D".

The result I'm looking for is:

       order     id       date      time  URL                    typedCount transition  Existe  Start
0          0  14438   1/3/2021  14:49:37  messenger.com/t/xxxxx          0       link       F   14
1          1  14437   1/3/2021  14:49:18  messenger.com/t/xxxxx          0       link       F   14

Does anyone know a way of doing this?

Thanks

Answer 1

Avoid Looping Over Rows

When manipulating data with pandas, it is typically best to avoid looping over rows. When the logic only operates on certain rows, it is better to begin by explicitly identifying those rows. The subset of rows where the value of column Existe is equal to "F" is:

has_f = df["Existe"] == "F"

Now you can use has_f to select only the rows you care about in df.
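
As a minimal sketch of that selection step, assuming a small made-up DataFrame that reuses the question's column names (the sample rows, including the example.com URL, are illustrative and not the asker's data):

```python
import pandas as pd

# Hypothetical sample data using the column names from the question.
df = pd.DataFrame({
    "URL": ["messenger.com/t/xxxxx", "example.com/channels/@me/yyyyy"],
    "Existe": ["F", "D"],
})

has_f = df["Existe"] == "F"  # Boolean Series: True where Existe equals "F"
f_rows = df[has_f]           # keeps only the rows flagged "F"
```

Here has_f is a Boolean Series aligned with df, so it can be reused wherever a row filter is needed.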

When working in pandas, try to use built-in pandas (or numpy) functions as much as possible. While you might not notice the difference when working with small DataFrames, any raw Python code you write and apply with df.apply() will perform poorly compared to the optimized code included in the pandas and numpy packages. Fortunately, pandas has vectorized string functions that can help you here. To find the location of a substring in each row of a column of strings, try the following:

t_locations = df["URL"].str.find("/t/")

This produces a Series of integer locations of the first occurrence of the substring "/t/" in each value of the URL column. You can do the same for "/@me/".
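
To make the return values concrete, here is a small sketch on a made-up Series (the second URL is a hypothetical example, not taken from the question). Positions are 0-based, and -1 marks a value where the substring does not occur:

```python
import pandas as pd

# Hypothetical URLs for illustration only.
urls = pd.Series(["messenger.com/t/xxxxx", "example.com/channels/@me/yyyyy"])

t_locations = urls.str.find("/t/")      # 0-based index of "/t/", -1 if absent
me_locations = urls.str.find("/@me/")   # 0-based index of "/@me/", -1 if absent
```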

Combining these two features of pandas requires using the df.loc indexer to select the rows and columns you care about and applying the str.find() function only to those values:

df["Start"] = -1  # some default value
has_f = df["Existe"] == "F"

df.loc[has_f, "Start"] = df.loc[has_f, "URL"].str.find("/t/")
# The "~" here returns the inverse of the Boolean Series
df.loc[~has_f, "Start"] = df.loc[~has_f, "URL"].str.find("/@me/")
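
Note that the ~has_f branch also covers rows whose Existe value is neither "F" nor "D". If that matters for your data, one alternative is numpy.select, which the question mentions trying. The following is a minimal sketch under assumed sample rows (the URLs and the "X" flag are made up for illustration) and is not part of the original answer:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; "X" stands for any flag other than "F" or "D".
df = pd.DataFrame({
    "URL": [
        "messenger.com/t/xxxxx",
        "example.com/channels/@me/yyyyy",
        "example.com/other",
    ],
    "Existe": ["F", "D", "X"],
})

is_f = df["Existe"] == "F"
is_d = df["Existe"] == "D"

# str.find() runs on the whole column; np.select then keeps, for each row,
# the result that matches its flag, or -1 when neither condition holds.
df["Start"] = np.select(
    [is_f, is_d],
    [df["URL"].str.find("/t/"), df["URL"].str.find("/@me/")],
    default=-1,
)
```

This computes both searches for every row, which is slightly more work than the df.loc version but keeps the conditions explicit.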

Answer 2

You could achieve this by using the apply() method of DataFrame:

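def process_row(row):
    # str.find() is 0-based; adding 1 gives the 1-based position the question expects
    if row['Existe'] == 'F':
        row['Start'] = row['URL'].find('/t/') + 1
    elif row['Existe'] == 'D':
        row['Start'] = row['URL'].find('/@me/') + 1
    return row

# apply() returns a new DataFrame, so assign the result back
df = df.apply(process_row, axis=1)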

Answer 3

I am not sure what you meant by 'F' and 'D', so I assumed the example below; the output that goes with it is shown further down.

order  id     date      time      URL                         typedCount  transition  Existe
0      14438  1/3/2021  14:49:37  messenger.com/t/xxxxx       0           link        F
1      14437  1/3/2021  14:49:18  messenger.com/t/xxxxx       0           link        F
2      14437  1/3/2021  14:49:18  messenger.com/t/@me/xxxx    0           link        D
3      14437  1/3/2021  14:49:18  messenger.com/t/@me/xxxx    0           link        FD
4      14437  1/3/2021  14:49:18  messenger.com/nothing/xxxx  0           link        FD

Basically, you can use df.apply to execute a function on each row of your DataFrame:

import pandas as pd

def myfunc(row):
    f_pos = -1  # default when 'F' is not in Existe
    d_pos = -1  # default when 'D' is not in Existe
    if 'F' in row['Existe']:
        f_pos = row['URL'].find('/t/')
    if 'D' in row['Existe']:
        d_pos = row['URL'].find('/@me/')
    return f_pos, d_pos


df = pd.read_csv("yourfile", sep=",")  # change the field separator
df['Start'] = df.apply(myfunc, axis=1)

And the output is:

    order     id      date      time                         URL  typedCount transition Existe     Start
0      0  14438  1/3/2021  14:49:37       messenger.com/t/xxxxx           0       link      F  (13, -1)
1      1  14437  1/3/2021  14:49:18       messenger.com/t/xxxxx           0       link      F  (13, -1)
2      2  14437  1/3/2021  14:49:18    messenger.com/t/@me/xxxx           0       link      D  (-1, 15)
3      3  14437  1/3/2021  14:49:18    messenger.com/t/@me/xxxx           0       link     FD  (13, 15)
4      4  14437  1/3/2021  14:49:18  messenger.com/nothing/xxxx           0       link     FD  (-1, -1)

As you can see, you can edit myfunc to better fit your needs, as I'm not sure what you meant by having both F and D.

It is worth noting that the str.find() method returns -1 if the substring can't be found.

I've also set myfunc to return this same -1 when no F or D is found, but you can use any other value. For instance, if you set it to -2 or False, you can quickly tell whether nothing was found because the substring wasn't in the URL or because there was no F or D in the Existe column.

EDIT: In Python, indexes start at 0, meaning the 14th character is at index 13.
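
If you want the 1-based position shown in the question (14 rather than 13), one option is to add 1 only where a match was found, since adding 1 unconditionally would turn the -1 "not found" marker into 0. A minimal sketch, assuming a made-up Series of str.find() results:

```python
import pandas as pd

# Hypothetical 0-based results from str.find(); -1 means "not found".
start0 = pd.Series([13, -1, 20])

# Series.where keeps values where the condition is True (the -1 markers)
# and replaces the rest with start0 + 1, giving 1-based positions.
start1 = start0.where(start0 < 0, start0 + 1)
```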
