Finding all occurrences of a substring in a string and keeping the n characters after each one (Python pandas DataFrame)

Question

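For a dataframe, I am trying to extract every occurrence of "cash" and then the n characters that follow it (which contain the cash amount). I have tried JSON and regular expressions, but neither worked because this dataframe is very inconsistent.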

For example,

import pandas as pd

sample = pd.DataFrame({'LongString': [
    "I am trying to find out how much cash 15906810 \nand this needs to be consistent cash :  69105060",
    "other words that are wrong cash : 11234 and more words cash 1526\n",
]})

The resulting dataframe should then look like

sample_resolved = pd.DataFrame({
    'LongString': [
        "I am trying to find out how much cash 15906810 \nand this needs to be consistent cash :  69105060",
        "other words that are wrong cash : 11234 and more words cash 1526\n",
    ],
    'cash_string': ["cash  15906810 cash : 69105060", "cash : 11234 cash 1526"],
})

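Every row of the dataframe is inconsistent. The end goal is to create a new column containing every instance of "cash" followed by the next 8-10 characters.

Ideally this would be a single line such as

df['cash_string'] = df['LongString'].str.findall('cash')

(but also including the n characters after each instance of "cash")

Thanks!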

Answer 1

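To add to @JCThomas's answer, I would change the str_after_substr function as follows:

import re

def cash_finder(s, substr='cash', offset=10):
    ss = s.split(substr)
    cashlist = []
    for i in ss[1:]:
        # keep only the digits among the first `offset` characters after the keyword
        digits = ''.join(x for x in i[:offset] if re.match(r'\d', x))
        if digits:  # int('') would raise ValueError when no digit is nearby
            cashlist.append(int(digits))
    return cashlist

This gives you every cash instance in a sentence. Note that offset has to be large enough to cover both the separator and the whole number, otherwise the amount is cut short.

The df operation then goes as follows.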

ddf['cashstring'] = ddf['LongString'].apply(lambda x: [{'cash':i} for i in cash_finder(x)])
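As a rough alternative to splitting and slicing, the same amounts can be pulled out with a single regular expression. This is only a sketch: the function name cash_amounts and the max_gap parameter are hypothetical, and it assumes the amount is the first run of digits separated from the keyword by at most max_gap non-digit characters.

```python
import re

def cash_amounts(s, substr="cash", max_gap=5):
    """Return the integer that follows each occurrence of `substr`.

    Assumption: at most `max_gap` non-digit characters (spaces, a colon, ...)
    sit between the keyword and the number.
    """
    pattern = re.escape(substr) + r"\D{0,%d}(\d+)" % max_gap
    # findall returns only the captured digit run for every match
    return [int(m) for m in re.findall(pattern, s)]

row = ("I am trying to find out how much cash 15906810 \n"
       "and this needs to be consistent cash :  69105060")
print(cash_amounts(row))
```

Because the whole digit run is captured, the amount is not truncated by a fixed window. It could be applied per row in the same way, for example with ddf['LongString'].apply(cash_amounts).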

Answer 2

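In general, if no dataframe method (or combination of methods) does what you need, you can write a function that works on a single example and then apply it to the dataframe with series.apply(some_func).

So, a function that does what you need:

def str_after_substr(s, substr='cash', offset=5):
    # index() finds the first occurrence only and raises ValueError if substr is absent
    i = s.index(substr)
    start = i+len(substr)
    return s[start:start+offset]
# test
str_after_substr('moneymoneycashmoneyhoney')  # returns 'money'

# create the new column values and add it to the df
df['new_column'] = df['old_column'].apply(str_after_substr)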

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html
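The function above looks only at the first occurrence of the substring. A minimal sketch of how it might be extended to every occurrence is shown here; the name all_str_after_substr is hypothetical, and it uses str.find so that a string without the substring yields an empty list instead of an exception.

```python
def all_str_after_substr(s, substr="cash", offset=5):
    """Return the `offset` characters following every occurrence of `substr`."""
    results = []
    start = s.find(substr)  # find() returns -1 when there is no match
    while start != -1:
        end = start + len(substr)
        results.append(s[end:end + offset])
        # continue searching after the occurrence that was just handled
        start = s.find(substr, end)
    return results

print(all_str_after_substr("moneymoneycashmoneyhoneycash12"))
```

It can be passed to apply in the same way as the original function, giving one list per row.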

Answer 3

Example

Build a minimal, reproducible example

df = pd.DataFrame(["abc cash : 1590 cde cash : 6910", "fgh cash : 1890 hij cash : 3410 cash : 4510"], columns=['col1'])

df

    col1
0   abc cash : 1590 cde cash : 6910
1   fgh cash : 1890 hij cash : 3410 cash : 4510

Code

s = df['col1'].str.extractall(r'(cash : \d+)')[0]

s

  match
0  0        cash : 1590
   1        cash : 6910
1  0        cash : 1890
   1        cash : 3410
   2        cash : 4510
Name: 0, dtype: object

s.groupby(level=0).agg(', '.join)

0                 cash : 1590, cash : 6910
1    cash : 1890, cash : 3410, cash : 4510
Name: 0, dtype: object

Output

df.assign(col2=s.groupby(level=0).agg(', '.join))

    col1                                            col2
0   abc cash : 1590 cde cash : 6910                 cash : 1590, cash : 6910
1   fgh cash : 1890 hij cash : 3410 cash : 4510     cash : 1890, cash : 3410, cash : 4510
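The question's own data does not always have a colon after "cash", so the pattern needs to be looser there. The following sketch assumes that each amount is written as "cash", optional whitespace, an optional colon, optional whitespace, and then digits; that pattern is an assumption about the data, not something the question confirms for every row.

```python
import pandas as pd

sample = pd.DataFrame({"LongString": [
    "I am trying to find out how much cash 15906810 \nand this needs to be consistent cash :  69105060",
    "other words that are wrong cash : 11234 and more words cash 1526\n",
]})

# Assumed shape of a match: "cash", optional spaces, optional colon, optional spaces, digits
pattern = r"cash\s*:?\s*\d+"

# str.findall gives one list of matches per row; str.join turns each list into a single string
sample["cash_string"] = sample["LongString"].str.findall(pattern).str.join(" ")
print(sample["cash_string"].tolist())
```

This keeps everything in one line of pandas string methods, which is close to what the question asked for. To keep a fixed window of n characters rather than just the digits, a pattern such as cash.{0,10} would be the analogous choice, again as an assumption about how long the amounts can be.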

3 solutions

Solution 1
1 2022-12-09 17:11:21

Solution 2
0 accepted 2022-12-09 15:57:00

Solution 3
0 2022-12-09 16:02:12
