Pandas - Extracting numeric values from a string column using replace + regex

Question

I have a dataframe with a column that contains many value ranges. Example below:

dirty_col = pd.Series([5, 6, '1-2', '40-60', 10])

I am trying to clean up this column to produce a new column holding the average of each value range. Expected result:

clean_col = pd.Series([5, 6, 1.5, 50, 10])

I am trying to map this using regex in vectorized mapping functions, something like:

clean_col = pd.Series([5, 6, '1-2', '40-60', 10]).replace({'^[0-9]-[0-9]$':--average here--},regex=True)

But I am stuck here. How could I get the expected result above USING a mapping dictionary and regular expressions? I am aware I could work directly in the dataframe, splitting the text on '-' and then averaging, but I already have many other cleaning mappings in the dictionary above, so it would be more convenient and cleaner to keep using the same dictionary for all the cleaning.

I think the solution I am looking for probably uses lambdas, or an extra function that gets called from inside the dictionary, but I cannot figure out how to do this.
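For reference, the split-and-average route mentioned above might look roughly like the sketch below. It is only one possible reading of that idea: it uses `Series.str.extract` to pull out the two bounds, and the names `as_text` and `bounds` are introduced here for illustration.

```python
import pandas as pd

dirty_col = pd.Series([5, 6, '1-2', '40-60', 10])

# Work on a string view so the .str accessor applies to every element.
as_text = dirty_col.astype(str)

# Capture the two bounds of "low-high"; rows that are not ranges yield NaN.
bounds = as_text.str.extract(r'^(\d+)-(\d+)$').astype(float)

# Average the bounds for range rows, fall back to the plain number elsewhere.
clean_col = bounds.mean(axis=1).fillna(pd.to_numeric(as_text, errors='coerce'))
```

This works, but it lives outside the cleaning dictionary, which is exactly what the question wants to avoid.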

Answer 1

I don't think pandas.Series.replace supports a callable. One possible way uses pandas.eval:

dirty_col.replace({r'^(\d+)-(\d+)$': r'(\1+\2)/2'}, regex=True).apply(pd.eval)

Output:

0     5.0
1     6.0
2     1.5
3    50.0
4    10.0
dtype: float64
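To make the two stages of this answer easier to follow, here is a small sketch of the intermediate result, before `pd.eval` is applied. The name `expr_col` is introduced here for illustration only.

```python
import pandas as pd

dirty_col = pd.Series([5, 6, '1-2', '40-60', 10])

# Stage 1: the regex rewrites each "low-high" string into an arithmetic
# expression, using backreferences \1 and \2 for the captured bounds.
# Non-string entries (the plain integers) are left untouched.
expr_col = dirty_col.replace({r'^(\d+)-(\d+)$': r'(\1+\2)/2'}, regex=True)

print(expr_col.tolist())
```

Stage 2 in the answer then evaluates each entry with `pd.eval`. Since that evaluates strings as expressions, it seems best suited to data whose contents are trusted.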

Answer 2

You can try Series.str.replace with a callable as repl, then use fillna to restore the non-string entries, which .str.replace turns into NaN:

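s = dirty_col  # the Series from the question
f_repr = lambda m: str(sum(map(int, m[0].split('-')))/2)
# regex=True is required for a callable repl (the default is regex=False since pandas 2.0)
s_out = s.str.replace(r'^[0-9]+-[0-9]+$', f_repr, regex=True).fillna(s)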

Out[30]:
0       5
1       6
2     1.5
3    50.0
4      10
dtype: object
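Building on the callable idea, one way to keep a single cleaning dictionary is to loop over it and pass each entry to `Series.str.replace`, which accepts both plain strings and callables as the replacement. This is a sketch under assumptions: `average_range`, `cleaning_map`, `apply_cleaning`, and the `'unknown'` rule are hypothetical names and data made up for illustration, not part of the question.

```python
import pandas as pd

def average_range(match):
    # match is a re.Match for "low-high"; return the midpoint as text.
    low, high = int(match.group(1)), int(match.group(2))
    return str((low + high) / 2)

# Hypothetical cleaning dictionary: values may be plain strings or callables.
cleaning_map = {
    r'^(\d+)-(\d+)$': average_range,
    r'^unknown$': '0',  # illustrative extra rule, not from the question
}

def apply_cleaning(series, mapping):
    # Convert everything to text so .str.replace sees every element.
    text = series.astype(str)
    for pattern, repl in mapping.items():
        text = text.str.replace(pattern, repl, regex=True)
    # Convert the cleaned text back to numbers; leftovers become NaN.
    return pd.to_numeric(text, errors='coerce')

dirty_col = pd.Series([5, 6, '1-2', '40-60', 10, 'unknown'])
clean_col = apply_cleaning(dirty_col, cleaning_map)
```

Unlike the `fillna` variant above, this returns a float64 Series rather than object dtype, at the cost of one pass over the column per dictionary entry.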

Answer 1
3, accepted, 2020-12-11 08:21:49

Answer 2
2, 2020-12-11 08:42:28
