
Need to extract data from a column: if a particular character exists, extract the substring before that character

I have a column that I am trying to clean. The data looks like this:

[image: sample of the column data]

Wherever a value follows the pattern x-y year (for example, 3-5 year), I want to keep only the 'x' value in the string. Any other value should stay as it is.

Using str.extract('(.{,2}(-))') returns NaN for all the other rows.

This solution compiles the regex once and then applies the compiled pattern to each row. The lambda relies on the walrus operator := (Python 3.8 or later). It assumes that your second column is named col2.

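import re

# Raw string avoids invalid-escape warnings; group 1 captures the first number.
pattern = re.compile(r"(\d+)-\d+ year")
# Non-string values such as NaN are passed through unchanged.
df["col2"] = df["col2"].map(lambda x: m[1] if isinstance(x, str) and (m := pattern.match(x)) else x)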
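
To try this approach without the original data, here is a minimal self-contained sketch. The sample values and the helper name keep_lower_bound are assumptions made for illustration; only the column name col2 comes from the answer. It also swaps match for fullmatch, on the assumption that the whole cell should follow the pattern rather than just its beginning.

```python
import re

import numpy as np
import pandas as pd

# Hypothetical sample data; the real column was only shown as an image.
df = pd.DataFrame({"col2": ["3-5 year", np.nan, "h 3+ year", "10-12 year"]})

pattern = re.compile(r"(\d+)-\d+ year")


def keep_lower_bound(value):
    """Return the first number of an 'x-y year' string, else the value unchanged."""
    # NaN is a float, so only strings are handed to the regex.
    if isinstance(value, str) and (m := pattern.fullmatch(value)):
        return m[1]
    return value


df["col2"] = df["col2"].map(keep_lower_bound)
print(df["col2"].tolist())
```

A named function does the same job as the lambda but leaves room for a docstring and the explicit type check.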

I believe you want Series.str.replace().

Does this give you the desired output?

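import numpy as np
import pandas as pd

df = pd.DataFrame.from_records([[1778, '3-5 year'], [961, np.nan], [2141, 'h 3+ year']], columns=['a','b'])

# Replace each "x-y year" match with its first captured number.
repl = lambda m: m.group(1)
df.b = df.b.str.replace(r'(\d+)-\d+\syear', repl, regex=True)
df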

This takes the original df:

      a          b
0  1778   3-5 year
1   961        NaN
2  2141  h 3+ year

and gives the output:

      a          b
0  1778          3
1   961        NaN
2  2141  h 3+ year
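
For comparison, the str.extract attempt from the question can also be made to work. str.extract returns NaN for every row where the pattern does not match, which is why the other rows were lost; filling those gaps from the original column restores them. The following is a sketch on the same sample frame and is not part of the original answer; the variable name lower is made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame.from_records(
    [[1778, "3-5 year"], [961, np.nan], [2141, "h 3+ year"]],
    columns=["a", "b"],
)

# expand=False returns a Series holding capture group 1 (NaN where nothing matched).
lower = df["b"].str.extract(r"^(\d+)-\d+\syear", expand=False)

# Fall back to the original value wherever the pattern did not match.
df["b"] = lower.fillna(df["b"])
print(df)
```

Rows that were NaN to begin with stay NaN, because the fallback value for them is NaN as well.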
