If column X contains a string, find the position of a substring in column Y - PYTHON

Question

If column ['Existe'] contains "F" or "D", I want to find the starting position of a string within the URL held in column ['url']. I'm new to Python and am trying to replicate an Excel workflow. After an hour of trying approaches with lambda, numpy.where and numpy.select, and searching the web, I had to ask for help.

I tried the following code, but it only reports whether the value exists; it does not give me the position within the string. What I currently have is:

df['Start']= ["/t/" in x[0] and "F" in x[1] for x in zip(df['url'],df['Existe'])]

Basically, it gives me a result like this:

       order     id       date      time  URL                    typedCount transition  Existe  Start
0          0  14438   1/3/2021  14:49:37  messenger.com/t/xxxxx          0       link       F   True
1          1  14437   1/3/2021  14:49:18  messenger.com/t/xxxxx          0       link       F   True

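If "F" is present in df['Existe'], find the starting position of "/t/" in df['url'] and put the result in a new column, df['Start']. I need this condition because df['Existe'] contains both "F" and "D": if it is "F", the search must look for "/t/", and if it is "D", it must look for "/@me/".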

The result I'm looking for is:

       order     id       date      time  URL                    typedCount transition  Existe  Start
0          0  14438   1/3/2021  14:49:37  messenger.com/t/xxxxx          0       link       F   14
1          1  14437   1/3/2021  14:49:18  messenger.com/t/xxxxx          0       link       F   14

Does anyone know a way to do this?

Thanks

Answer 1

Avoid looping over rows

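When manipulating data with pandas, it is usually best to avoid looping over rows. For logic that operates on only some rows, start by explicitly identifying those rows. The rows whose Existe column equals "F" are identified by this Boolean mask: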

has_f = df["Existe"] == "F"

Now you can use has_f to select only the rows of df that you care about.
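As a minimal sketch of how such a mask behaves, assuming a tiny DataFrame built from two of the URLs that appear in this thread (the frame itself is illustrative, not the asker's data):

```python
import pandas as pd

# Illustrative rows; only the column names and URL shapes come from the thread.
df = pd.DataFrame({
    "URL": ["messenger.com/t/xxxxx", "messenger.com/t/@me/xxxx"],
    "Existe": ["F", "D"],
})

has_f = df["Existe"] == "F"   # Boolean Series, one value per row
f_rows = df[has_f]            # only the rows where Existe is "F"
other_rows = df[~has_f]       # "~" inverts the mask: every other row

print(has_f.tolist())         # [True, False]
```

The mask keeps the original index, so it can be reused later with df.loc to assign back into the same rows.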

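When working in pandas, use built-in pandas (or numpy) functions wherever possible. You may not notice a difference on small DataFrames, but any plain Python code you write and apply with df.apply() will perform poorly compared with the optimized code in the pandas and numpy packages. Fortunately, pandas has vectorized string functions that can help here. To find the position of a substring in each row of a string column, try the following: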

t_locations = df["URL"].str.find("/t/")

This produces a Series holding, for each row, the position of the first occurrence of the substring "/t/" in the URL column. You can do the same for "/@me/".
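A short sketch of what Series.str.find returns, using three URLs of the same shape as those in this thread. Positions are 0-based, and rows without a match get -1:

```python
import pandas as pd

urls = pd.Series([
    "messenger.com/t/xxxxx",
    "messenger.com/t/@me/xxxx",
    "messenger.com/nothing/xxxx",
])

t_pos = urls.str.find("/t/")     # 0-based index of the first match, -1 if absent
me_pos = urls.str.find("/@me/")

print(t_pos.tolist())   # [13, 13, -1]
print(me_pos.tolist())  # [-1, 15, -1]
```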

Combining these two pandas features means using the df.loc indexer to select the rows and columns you care about, and applying the str.find() function only to those values:

df["Start"] = -1  # some default value
has_f = df["Existe"] == "F"

df.loc[has_f, "Start"] = df.loc[has_f, "URL"].str.find("/t/")
# The "~" here returns the inverse of the Boolean Series
df.loc[~has_f, "Start"] = df.loc[~has_f, "URL"].str.find("/@me/")
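Note that the snippet above treats every row that is not "F" as if it were "D". If Existe can hold other values, one possible variant, sketched here on made-up rows, builds a separate mask for "D" and leaves the default in place everywhere else. The value "X" is a hypothetical stand-in for anything that is neither "F" nor "D".

```python
import pandas as pd

# Illustrative rows; "X" is a hypothetical value that is neither "F" nor "D".
df = pd.DataFrame({
    "URL": [
        "messenger.com/t/xxxxx",
        "messenger.com/t/@me/xxxx",
        "messenger.com/t/xxxxx",
    ],
    "Existe": ["F", "D", "X"],
})

df["Start"] = -1  # default for rows that match neither condition
has_f = df["Existe"] == "F"
has_d = df["Existe"] == "D"

df.loc[has_f, "Start"] = df.loc[has_f, "URL"].str.find("/t/")
df.loc[has_d, "Start"] = df.loc[has_d, "URL"].str.find("/@me/")

print(df["Start"].tolist())  # [13, 15, -1]
```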

Answer 2

You can achieve this with the DataFrame's apply() routine:

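def process_row(row):
    # Adding 1 converts Python's 0-based index to a 1-based position
    if row['Existe'] == 'F':
        row['Start'] = row['URL'].find('/t/') + 1
    elif row['Existe'] == 'D':
        row['Start'] = row['URL'].find('/@me/') + 1
    return row

# apply() returns a new DataFrame, so assign the result back
df = df.apply(process_row, axis=1)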
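A variant of the same idea, sketched under the assumption that only the Start value is needed: the function returns a single number and the result of apply() is assigned to the new column. The names NEEDLES and start_position are hypothetical. Unlike a bare "+ 1", this version keeps -1 for "not found" instead of turning it into 0.

```python
import pandas as pd

# Illustrative rows shaped like the ones in this thread.
df = pd.DataFrame({
    "URL": ["messenger.com/t/xxxxx", "messenger.com/t/@me/xxxx"],
    "Existe": ["F", "D"],
})

# Substring to look for, keyed by the value in Existe (hypothetical helper).
NEEDLES = {"F": "/t/", "D": "/@me/"}

def start_position(row):
    needle = NEEDLES.get(row["Existe"])
    if needle is None:
        return -1                        # Existe is neither "F" nor "D"
    pos = row["URL"].find(needle)
    return pos + 1 if pos >= 0 else -1   # 1-based position, -1 if not found

df["Start"] = df.apply(start_position, axis=1)
print(df["Start"].tolist())  # [14, 16]
```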

Answer 3

I'm not sure what you mean by 'F' and 'D', so I'll assume the example data looks like this:

order   id  date    time    URL typedCount  transition  Existe
0   14438   1/3/2021    14:49:37    messenger.com/t/xxxxx   0   link    F
1   14437   1/3/2021    14:49:18    messenger.com/t/xxxxx   0   link    F
2   14437   1/3/2021    14:49:18    messenger.com/t/@me/xxxx    0   link    D
3   14437   1/3/2021    14:49:18    messenger.com/t/@me/xxxx    0   link    FD
4   14437   1/3/2021    14:49:18    messenger.com/nothing/xxxx  0   link    FD

Basically, you can use df.apply to run a function over your dataframe:

import pandas as pd

def myfunc(row):
    # -1 means "not searched for" or "not found"
    f_pos = -1
    d_pos = -1
    if 'F' in row['Existe']:
        f_pos = row['URL'].find('/t/')
    if 'D' in row['Existe']:
        d_pos = row['URL'].find('/@me/')
    return f_pos, d_pos
 

df = pd.read_csv("yourfile", sep=",") # change the field separator
df['Start'] = df.apply(myfunc, axis=1)

The output is:

    order     id      date      time                         URL  typedCount transition Existe     Start
0      0  14438  1/3/2021  14:49:37       messenger.com/t/xxxxx           0       link      F  (13, -1)
1      1  14437  1/3/2021  14:49:18       messenger.com/t/xxxxx           0       link      F  (13, -1)
2      2  14437  1/3/2021  14:49:18    messenger.com/t/@me/xxxx           0       link      D  (-1, 15)
3      3  14437  1/3/2021  14:49:18    messenger.com/t/@me/xxxx           0       link     FD  (13, 15)
4      4  14437  1/3/2021  14:49:18  messenger.com/nothing/xxxx           0       link     FD  (-1, -1)

As you can see, you can edit myfunc to better suit your needs, since I'm not sure what it means for a row to have both F and D.

It's worth noting that the str.find() method returns -1 if the substring is not found.

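I also set myfunc to return the same -1 when Existe contains neither F nor D, but you can use any other value. For example, if you set it to -2 or False, you can quickly tell whether nothing was found because the substring is not in the URL or because the Existe column has no F or D.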
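A minimal sketch of that idea in plain Python. The helper find_if_flagged and the constant NOT_REQUESTED are hypothetical names, and -2 is just one possible sentinel:

```python
NOT_REQUESTED = -2  # hypothetical sentinel: the flag is absent from Existe

def find_if_flagged(url, existe, flag, needle):
    """Return the 0-based position of needle in url, or a sentinel."""
    if flag not in existe:
        return NOT_REQUESTED      # not searched at all
    return url.find(needle)       # -1 if searched but not present

print(find_if_flagged("messenger.com/t/xxxxx", "F", "F", "/t/"))          # 13
print(find_if_flagged("messenger.com/nothing/xxxx", "FD", "D", "/@me/"))  # -1
print(find_if_flagged("messenger.com/t/xxxxx", "F", "D", "/@me/"))        # -2
```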

Edit: in Python, indexing starts at 0, which means the 14th character is at index 13.
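To get the 1-based position shown in the question's expected output (14), one option is to add 1 only where a match was found, so that -1 keeps meaning "not found". This is a sketch on two URLs from the example data, not part of the original answer:

```python
import pandas as pd

urls = pd.Series(["messenger.com/t/xxxxx", "messenger.com/nothing/xxxx"])

pos0 = urls.str.find("/t/")             # 0-based, -1 when not found
# Series.where keeps values where the condition is True (the -1 rows)
# and takes pos0 + 1 everywhere else.
pos1 = pos0.where(pos0 < 0, pos0 + 1)

print(pos0.tolist())  # [13, -1]
print(pos1.tolist())  # [14, -1]
```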

Solution 1
1 Accepted 2021-01-06 15:45:15

Avoid looping over rows

Solution 2
0 2021-01-06 15:13:44

Solution 3
0 2021-01-06 15:20:43
