Need to extract data from a column: if a specific character is present, extract the substring before it

Question

I've got a column which I am trying to clean. The data looks like this:

Wherever a value follows the pattern "x-y year", I want to extract only the "x" value and keep that as the string. Any other value should stay as it is.

Using str.extract('(.{,2}(-))') returns NaN for all the other rows.
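
For context, str.extract yields NaN for every row the pattern does not match, which is why the other rows are lost. One possible workaround, sketched here on hypothetical sample data (the Series `s` and the names `extracted` and `cleaned` are assumptions, not part of the question), is to fill those gaps back in from the original column:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data mirroring the values used in the answers.
s = pd.Series(["3-5 year", np.nan, "h 3+ year"])

# With a single capture group and expand=False, str.extract returns a Series.
# Rows where the pattern does not match come back as NaN.
extracted = s.str.extract(r"^(\d+)-\d+\syear", expand=False)

# Fall back to the original value wherever nothing was extracted.
cleaned = extracted.fillna(s)
```

Rows that were NaN to begin with stay NaN, since the fallback value is NaN as well.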

Answer 1

This solution compiles the regex first, then applies the compiled regex to each row. The lambda relies on the walrus operator := (available from Python 3.8). It assumes that your second column is named col2.

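import re

# Capture the leading number of an "x-y year" range, e.g. "3" in "3-5 year".
pattern = re.compile(r"(\d+)-\d+ year")
# Leave non-matching strings and non-string cells such as NaN unchanged.
df["col2"] = df["col2"].map(lambda x: m[1] if isinstance(x, str) and (m := pattern.match(x)) else x)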
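
If the walrus operator is not available or the lambda feels too dense, the same idea can be written as a named function. This is only a sketch: the helper name `first_of_range` and the sample frame are hypothetical, and the column name col2 is carried over from the assumption above.

```python
import re

import numpy as np
import pandas as pd

PATTERN = re.compile(r"(\d+)-\d+ year")


def first_of_range(value):
    """Return x for an 'x-y year' string; return anything else unchanged."""
    # NaN and other non-string cells cannot be matched, so pass them through.
    if not isinstance(value, str):
        return value
    match = PATTERN.match(value)
    return match[1] if match else value


# Hypothetical sample data.
df = pd.DataFrame({"col2": ["3-5 year", np.nan, "h 3+ year"]})
df["col2"] = df["col2"].map(first_of_range)
```

Because pattern.match anchors at the start of the string, a value such as "h 3+ year" is left alone.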

Answer 2

I believe you want Series.str.replace().

Does this give you the desired output?

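import numpy as np
import pandas as pd

df = pd.DataFrame.from_records([[1778, '3-5 year'], [961, np.nan], [2141, 'h 3+ year']], columns=['a','b'])

# Replace each "x-y year" match with its first captured group (x).
repl = lambda m: m.group(1)
df.b = df.b.str.replace(r'(\d+)-\d+\syear', repl, regex=True)
df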

which takes the original df:

      a          b
0  1778   3-5 year
1   961        NaN
2  2141  h 3+ year

and gives the output:

      a          b
0  1778          3
1   961        NaN
2  2141  h 3+ year
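
As a possible variation, the callable can be swapped for a backreference in the replacement string, which regex-based replacement in pandas passes through to re.sub. A minimal sketch on the same sample data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"a": [1778, 961, 2141], "b": ["3-5 year", np.nan, "h 3+ year"]}
)

# r"\1" refers to the first capture group, i.e. the x in "x-y year".
df["b"] = df["b"].str.replace(r"(\d+)-\d+\syear", r"\1", regex=True)
```

Non-matching strings and NaN cells are left untouched, the same as with the callable version.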

Solution 1
0 2022-11-17 11:34:39

Solution 2
0 2022-11-17 11:35:52