Trying to find all occurrences of a substring within a string, and also keep n characters afterwards in Python Pandas Dataframe

For a DataFrame, I am trying to extract every occurrence of "cash" together with the n characters after it (which contain the cash amount). I have tried JSON parsing and regex, but neither worked because the data is quite inconsistent.

For example:

import pandas as pd

sample = pd.DataFrame({'LongString': [
    "I am trying to find out how much cash 15906810 and this needs to be consistent cash :  69105060",
    "other words that are wrong cash : 11234 and more words cash 1526",
]})

And then my DataFrame would look like this:

sample_resolved = pd.DataFrame({
    'LongString': [
        "I am trying to find out how much cash 15906810 and this needs to be consistent cash :  69105060",
        "other words that are wrong cash : 11234 and more words cash 1526",
    ],
    'cash_string': ["cash  15906810 cash : 69105060", "cash : 11234 cash 1526"],
})

Each row of the DataFrame is inconsistent. The goal is to create a new column containing every instance of "cash" followed by, say, the 8-10 characters after it.

Ideally this would be a single line such as

df['cash_string'] = df['LongString'].str.findall('cash') 

(but also including the n characters after each 'cash' instance).

Thank you!

To add to @JCThomas's answer, I'd change the str_after_substr function as follows:

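import re

def cash_finder(s, substr='cash', offset=10):
    # Every piece after the first one follows an occurrence of substr.
    ss = s.split(substr)
    cashlist = []
    for i in ss[1:]:
        # Keep only the digits found in the first `offset` characters.
        digits = ''.join(x for x in i[:offset] if re.match(r'\d', x))
        if digits:  # int('') would raise ValueError when no digit is present
            cashlist.append(int(digits))
    return cashlist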

This gives you every cash amount found in a single string, and the DataFrame operation (here the frame is named ddf) looks like this:

ddf['cashstring'] = ddf['LongString'].apply(lambda x: [{'cash':i} for i in cash_finder(x)])
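As a quick check against the question's sample data, the sketch below repeats a condensed version of cash_finder so that it runs on its own. The names amounts_10 and amounts_12 are illustrative only, and the wider offset of 12 is an assumption chosen for this sample. It shows that offset counts from immediately after "cash", so spaces and the colon use up part of the window.

```python
import re
import pandas as pd

def cash_finder(s, substr='cash', offset=10):
    """Collect the digits found within `offset` characters after each `substr`."""
    amounts = []
    for chunk in s.split(substr)[1:]:
        digits = re.sub(r'\D', '', chunk[:offset])  # drop everything but digits
        if digits:  # skip occurrences with no number in the window
            amounts.append(int(digits))
    return amounts

ddf = pd.DataFrame({'LongString': [
    "I am trying to find out how much cash 15906810 and this needs to be consistent cash :  69105060",
    "other words that are wrong cash : 11234 and more words cash 1526",
]})

# offset=10 is too short for "cash :  69105060": the separator uses 4 characters
amounts_10 = ddf['LongString'].apply(cash_finder).tolist()
# → [[15906810, 691050], [11234, 1526]]

# a wider window (tune it to your data) recovers the full number
amounts_12 = ddf['LongString'].apply(lambda s: cash_finder(s, offset=12)).tolist()
# → [[15906810, 69105060], [11234, 1526]]
```

A window that is too wide can also pick up digits from unrelated numbers that follow, so the offset is a trade-off rather than a safe default.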

In general, if no DataFrame method (or combination of methods) does what you're after, you can write a function that works on a single value and then apply it to the column with series.apply(some_func).

So, a function that does what you're looking for:

def str_after_substr(s, substr='cash', offset=5):
    # index() finds only the first occurrence and raises ValueError if substr is absent
    i = s.index(substr)
    start = i + len(substr)
    return s[start:start + offset]

# test
str_after_substr('moneymoneycashmoneyhoney')  # 'money'

# create the new column values and add them to the df
df['new_column'] = df['old_column'].apply(str_after_substr)

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html
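Since str_after_substr returns the text after the first occurrence only, here is a minimal sketch of a variant that collects every occurrence and keeps the substring itself, as the question asks. The function name all_after_substr and the default offset of 10 are assumptions for illustration, not part of the answer above.

```python
import pandas as pd

def all_after_substr(s, substr='cash', offset=10):
    """Return `substr` plus the next `offset` characters for every occurrence."""
    if not substr:  # an empty substring would never advance the search
        return []
    pieces = []
    i = s.find(substr)  # find() returns -1 instead of raising when absent
    while i != -1:
        end = i + len(substr)
        pieces.append(s[i:end + offset].strip())
        i = s.find(substr, end)  # continue searching after this occurrence
    return pieces

df = pd.DataFrame({'LongString': [
    "other words that are wrong cash : 11234 and more words cash 1526",
]})

# apply() yields one list per row; .str.join() flattens each list into a string
df['cash_string'] = df['LongString'].apply(all_after_substr).str.join(' ')
```

Because the window is a fixed number of characters, it can include trailing text that is not part of the amount (here the first piece ends in " a"), which is the price of not assuming any particular format.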

Example

A minimal, reproducible example:

df = pd.DataFrame(["abc cash : 1590 cde cash : 6910", "fgh cash : 1890 hij cash : 3410 cash : 4510"], columns=['col1'])

df

    col1
0   abc cash : 1590 cde cash : 6910
1   fgh cash : 1890 hij cash : 3410 cash : 4510



Code

s = df['col1'].str.extractall(r'(cash : \d+)')[0]

s

  match
0  0        cash : 1590
   1        cash : 6910
1  0        cash : 1890
   1        cash : 3410
   2        cash : 4510
Name: 0, dtype: object

s.groupby(level=0).agg(', '.join)

0                 cash : 1590, cash : 6910
1    cash : 1890, cash : 3410, cash : 4510
Name: 0, dtype: object



Output

df.assign(col2=s.groupby(level=0).agg(', '.join))

    col1                                            col2
0   abc cash : 1590 cde cash : 6910                 cash : 1590, cash : 6910
1   fgh cash : 1890 hij cash : 3410 cash : 4510     cash : 1890, cash : 3410, cash : 4510
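The pattern above expects exactly "cash : " before the digits, while the question's sample mixes "cash 15906810" and "cash :  69105060". A possible sketch for that case, assuming the amount is always a run of digits and the colon and whitespace are optional, uses Series.str.findall with a looser pattern. The pattern and the third, no-match row are assumptions added for illustration.

```python
import pandas as pd

df = pd.DataFrame({'LongString': [
    "I am trying to find out how much cash 15906810 and this needs to be consistent cash :  69105060",
    "other words that are wrong cash : 11234 and more words cash 1526",
    "no amount mentioned here",  # hypothetical row without any match
]})

# "cash", optional whitespace, optional colon, optional whitespace, then digits
pattern = r'cash\s*:?\s*\d+'

# findall returns one list per row (empty when nothing matches);
# .str.join() then flattens each list into a single string
df['cash_string'] = df['LongString'].str.findall(pattern).str.join(' ')
# → ['cash 15906810 cash :  69105060', 'cash : 11234 cash 1526', '']
```

One difference from the extractall and groupby route: rows without a match end up as empty strings here, whereas they would be missing from the grouped result and show up as NaN after assign.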

