Extract string in cell from entire Dataframe

I'm working on a PDF extraction tool. Say I have the following DataFrame. I don't know the column names or how many columns there are. All I know is that somewhere in this DataFrame there is a cell containing the string extract this: xxxx, and I need to extract that string.

import numpy as np
import pandas as pd
data = {'these':['Value1', 'padding'], 'are':['Value2', np.nan], 'random':[123, 'dont'], 'names':['extract this: 1236', 'find']}
df = pd.DataFrame(data)


+---------+--------+--------+--------------------+
|  these  |  are   | random |       names        |
+---------+--------+--------+--------------------+
| Value1  | Value2 | 123    | extract this: 1236 |
| padding | nan    | dont   | find               |
+---------+--------+--------+--------------------+

I can get the matching row into an array, as shown below, which I could then clean by removing all non-string elements before searching for the substring, but I don't like this approach. Is there a better way of doing this?

mask = np.column_stack([df[col].str.contains(r"extract this: ", na=False) for col in df])
inv_num_arr = df.loc[mask.any(axis=1)].values[0]

The output should be just the string extract this: 1236

You can use re.search after converting the DataFrame to a string, like this:

import re
# Render the whole DataFrame as one string, then search it with a raw-string regex
re.search(r'extract this:\s\d+', df.to_string()).group(0)

'extract this: 1236'
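Two caveats with the one-liner: re.search returns None when nothing matches, so calling .group(0) on it raises AttributeError, and df.to_string() also includes the column headers and index in the text being searched. As a rough alternative sketch, the cells can be scanned one by one instead. The helper name find_first_match below is hypothetical, not part of pandas, and the sketch assumes only the first match in row-major order is wanted.

```python
import re

import numpy as np
import pandas as pd

data = {
    'these': ['Value1', 'padding'],
    'are': ['Value2', np.nan],
    'random': [123, 'dont'],
    'names': ['extract this: 1236', 'find'],
}
df = pd.DataFrame(data)


def find_first_match(frame, pattern):
    """Return the first regex match found in any string cell, or None.

    Hypothetical helper for illustration; cells are visited row by row.
    """
    regex = re.compile(pattern)
    # Flatten the frame into a 1-D array of cell values (row-major order).
    for value in frame.to_numpy().ravel():
        # Skip non-string cells such as 123 or NaN.
        if isinstance(value, str):
            match = regex.search(value)
            if match:
                return match.group(0)
    return None


result = find_first_match(df, r'extract this:\s\d+')
print(result)
```

Because the helper returns None instead of raising, the caller can check the result before using it, which may be convenient when some extracted PDFs do not contain the string at all.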
