Getting the index of regex matches in a pandas DataFrame is not working

Question

I have an Excel worksheet that I am reading into pandas for parsing and later analysis. It has the following format. All values are strings. They will be converted to floats/ints later, but having them as strings helps with parsing.

column1  |  column2 | column3 |
-----------------------------
12345   |10         |20       |
txt     |25         |65       |
35615   |15         |20       |
txt     |35         |20       |

I need to get the index of all 5-digit numerical values in column1. They will always be 5 digits. I am using the following regex.

\b\d{5}\b

I am having problems getting pandas to properly match the 5 digits when using any of the built-in string methods.

I have tried the following.

df.column1.str.contains('\b\d{5}\b', regex=True).index.list()
df.column1.str.match('\b\d{5}\b').index.list()

I am expecting it to return

[0,2]

Both of these return an empty list. What am I doing wrong?

Answer 1

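Add r before the pattern so it is a raw string, filter the DataFrame by boolean indexing, and convert the index values to a list. Without the r, Python reads \b as a backspace character instead of a regex word boundary, so nothing matches. Also, calling .index directly on the result of str.contains gives the labels of every row, not only the matching ones, which is why the mask is used to filter first: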

i = df[df.column1.str.contains(r'\b\d{5}\b')].index.tolist()
print(i)
[0, 2]
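
To make the two points above concrete, here is a minimal sketch that rebuilds the sample table from the question and compares a non-raw pattern with a raw one. The variable names (non_raw_pattern, raw_mask, and so on) are illustrative only, and the non-raw pattern is written with an escaped \\d purely to avoid an invalid-escape warning; it evaluates to the same characters as the string in the question.

```python
import pandas as pd

# Sample data mirroring the table in the question (all values are strings).
df = pd.DataFrame({
    "column1": ["12345", "txt", "35615", "txt"],
    "column2": ["10", "25", "15", "35"],
    "column3": ["20", "65", "20", "20"],
})

# Without the r prefix, Python turns "\b" into a backspace character (\x08),
# so the regex looks for literal backspaces instead of word boundaries.
non_raw_pattern = "\b\\d{5}\b"
raw_pattern = r"\b\d{5}\b"

non_raw_mask = df["column1"].str.contains(non_raw_pattern, regex=True)
raw_mask = df["column1"].str.contains(raw_pattern, regex=True)

# .index on the mask itself lists every row label, matched or not.
all_labels = raw_mask.index.tolist()

# Boolean indexing keeps only the rows where the mask is True.
matching_labels = df[raw_mask].index.tolist()

print(non_raw_mask.tolist())  # [False, False, False, False]
print(all_labels)             # [0, 1, 2, 3]
print(matching_labels)        # [0, 2]
```

If only the labels are needed, df.index[raw_mask].tolist() should give the same result without building the intermediate filtered DataFrame.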

Or, if you want to match only values that consist of exactly 5 digits and nothing else, change the regex to use ^ and $ for the start and end of the string:

i = df[df.column1.str.contains(r'^\d{5}$')].index.tolist()
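
The difference between the two patterns only shows up when a cell contains more than the five digits. The sketch below uses made-up values (not from the question) to illustrate it, and also shows Series.str.fullmatch as a possible alternative to the explicit anchors; that method is assumed to be available, which requires pandas 1.1 or later.

```python
import pandas as pd

# Hypothetical values chosen to show where the patterns disagree.
s = pd.Series(["12345", "id 12345", "123456", "12345-67", "txt"])

# \b...\b finds a 5-digit run anywhere in the cell, as long as it is
# delimited by word boundaries.
word_boundary = s[s.str.contains(r"\b\d{5}\b")].index.tolist()

# ^...$ only accepts cells that consist of the 5 digits.
anchored = s[s.str.contains(r"^\d{5}$")].index.tolist()

# fullmatch requires the whole string to match, so no anchors are needed.
full = s[s.str.fullmatch(r"\d{5}")].index.tolist()

print(word_boundary)  # [0, 1, 3]
print(anchored)       # [0]
print(full)           # [0]
```

For the data in the question both patterns return [0, 2], so the choice matters only if column1 can hold mixed content.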
