Find and replace semi-common strings in dataframe?
I am attempting to find a semi-common string and remove all other data in the column. pandas and re have been imported. For instance, I have this dataframe:
>>>df
COLUMN COUNT DATA
1 this row RA-123: data 8b43a
2 here RA-5372: data 94h63c
I need to keep just the 'RA-' plus the number that follows it and remove everything before and after. The numbers that follow are not always the same length, and the 'RA-' string does not always occur in the same position. There is a colon after every instance that can be used as a delimiter.
I tried this (a friend wrote the regex search piece for me because I am not familiar with it):
df.assign(DATA= df['DATA'].str.extract(re.search('RA[^:]+')))
But Python returned:
TypeError: search() missing 1 required positional argument: 'string'
What am I missing here? Thanks in advance!
You should use a capturing group with extract:
df['DATA'].str.extract(r'(RA-\d+)')
Here, (RA-\d+) is a capturing group matching RA, then a hyphen, and then one or more digits.
You may use your own pattern, but you still need to wrap it in capturing parentheses: r'(RA[^:]+)'.
As I mentioned earlier, there is no need for re here.
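As a minimal sketch of the capturing-group approach, the snippet below rebuilds a small frame and overwrites DATA with the extracted code. The sample frame is an assumption based on the rows shown in the question (only the DATA column is reproduced), and expand=False is used so that a Series rather than a one-column DataFrame comes back.

```python
import pandas as pd

# Assumed sample data, modeled on the rows shown in the question.
df = pd.DataFrame({
    "DATA": ["this row RA-123: data 8b43a", "here RA-5372: data 94h63c"],
})

# The parentheses form the capturing group that str.extract requires.
# expand=False returns a Series, which can be assigned back to the column.
df["DATA"] = df["DATA"].str.extract(r"(RA-\d+)", expand=False)

print(df["DATA"].tolist())  # ['RA-123', 'RA-5372']
```

Rows that contain no match would come back as missing values rather than raising an error.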
Other answers addressed well how to use extract directly. However, to answer your question specifically: if you really want to use re, the way to go is to use re.compile instead of re.search.
df.assign(DATA= df['DATA'].str.extract(re.compile(regex_str)))
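The original TypeError arises because re.search expects both a pattern and the string to search, whereas re.compile only needs the pattern. As a hedged illustration of using re explicitly, the sketch below compiles the pattern once and applies it to each value with Series.map. This is an alternative to str.extract, not the code from the answer above. The helper name first_match and the sample frame are hypothetical, and regex_str, which the answer leaves undefined, is assumed here to be r'RA-\d+'.

```python
import re

import pandas as pd

# Assumed pattern; the answer above does not define regex_str.
regex_str = r"RA-\d+"
pattern = re.compile(regex_str)


def first_match(text):
    """Return the first match of the compiled pattern in text, or None."""
    m = pattern.search(text)  # the compiled pattern only needs the string
    return m.group(0) if m else None


# Assumed sample data, modeled on the rows shown in the question.
df = pd.DataFrame({
    "DATA": ["this row RA-123: data 8b43a", "here RA-5372: data 94h63c"],
})

# assign returns a new frame, so the result has to be kept.
df = df.assign(DATA=df["DATA"].map(first_match))

print(df["DATA"].tolist())  # ['RA-123', 'RA-5372']
```

Note that assign does not modify the frame in place; without the reassignment, the original DATA column would be left unchanged.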