Need to extract data from a column: if a particular character exists, extract the substring before that character
I've got a column which I am trying to clean. The data looks like this:
Wherever the value follows the pattern "x-y year", I want to keep only the 'x' value in the string. Any other value should stay as it is.
Using str.extract('(.{,2}(-))') returns NaN for all the other rows.
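The NaN values are expected: str.extract returns NaN for every row where the pattern does not match. As a minimal sketch of one way around that, the extracted values can be back-filled from the original column. The sample values are borrowed from the answer further down, and the names s, extracted and cleaned are only illustrative.

```python
import numpy as np
import pandas as pd

# Illustrative data; the question's real column is not shown.
s = pd.Series(["3-5 year", np.nan, "h 3+ year"])

# With a single capture group, expand=False returns a Series.
# Rows that do not match the pattern come back as NaN.
extracted = s.str.extract(r"^(\d+)-\d+\s*year", expand=False)

# Fall back to the original value wherever nothing was extracted.
cleaned = extracted.fillna(s)
```

Rows that were already NaN stay NaN, because the fallback value for them is NaN as well.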
This solution compiles the regex once and then applies the compiled pattern to each row. The lambda relies on the walrus operator :=, so it needs Python 3.8 or later. It assumes that your second column is named col2.
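import re
# Capture x from strings that start with "x-y year".
pattern = re.compile(r"(\d+)-\d+ year")
# Non-string cells such as NaN are passed through; pattern.match() would raise TypeError on them.
df["col2"] = df["col2"].map(lambda x: m[1] if isinstance(x, str) and (m := pattern.match(x)) else x)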
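To try the row-wise approach on its own, here is a small self-contained sketch. The frame and the helper name keep_lower_bound are assumptions made for illustration; only the column name col2 and the pattern come from the answer.

```python
import re

import numpy as np
import pandas as pd

# Compiled once, reused for every row.
pattern = re.compile(r"(\d+)-\d+ year")


def keep_lower_bound(value):
    """Return x from an 'x-y year' string; return anything else unchanged."""
    # isinstance guards against NaN and other non-string cells.
    if isinstance(value, str) and (m := pattern.match(value)):
        return m[1]
    return value


# Hypothetical frame standing in for the asker's data.
df = pd.DataFrame({"col2": ["3-5 year", np.nan, "h 3+ year"]})
df["col2"] = df["col2"].map(keep_lower_bound)
```

Note that pattern.match only matches at the start of the string, so a value such as "h 3-5 year" would be left untouched; pattern.search would be the choice if the range may appear later in the string.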
I believe you want Series.str.replace().
Does this give you the desired output?
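import numpy as np
import pandas as pd
df = pd.DataFrame.from_records([[1778, '3-5 year'], [961, np.nan], [2141, 'h 3+ year']], columns=['a', 'b'])
# Replace each "x-y year" match with its first capture group (x); other rows are untouched.
repl = lambda m: m.group(1)
df['b'] = df['b'].str.replace(r'(\d+)-\d+\syear', repl, regex=True)
df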
This takes the original df:
      a          b
0  1778   3-5 year
1   961        NaN
2  2141  h 3+ year
and gives the output:
      a          b
0  1778          3
1   961        NaN
2  2141  h 3+ year