
Get index of regex match in pandas dataframe not working

I have an Excel worksheet that I am reading into pandas for parsing and later analysis. It has the following format. All values are strings. They will be converted to floats/ints later, but having them as strings helps with parsing.

column1 | column2 | column3
--------|---------|--------
12345   | 10      | 20
txt     | 25      | 65
35615   | 15      | 20
txt     | 35      | 20

I need to get the index of every 5-digit numeric value in column1. These values are always exactly 5 digits long. I am using the following regex.

\b\d{5}\b

I am having problems getting pandas to properly match the 5 digits when using any of the built-in string methods.

I have tried the following.

df.column1.str.contains('\b\d{5}\b', regex=True).index.list()
df.column1.str.match('\b\d{5}\b').index.list()

I am expecting it to return

[0,2]

Both of these return an empty list. What am I doing wrong?

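Add the r prefix so the pattern is a raw string (in a normal string literal Python reads \b as a backspace character, not as a regex word boundary), then filter with boolean indexing and convert the resulting index to a list: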

i = df[df.column1.str.contains(r'\b\d{5}\b')].index.tolist()
print(i)
[0, 2]
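
To see why both the raw-string prefix and the boolean filter matter, here is a minimal self-contained sketch. The DataFrame is rebuilt by hand from the table in the question, which is an assumption about what the data looks like after reading the worksheet. The names `plain`, `raw`, `mask_plain`, `mask_raw`, `all_labels` and `matches` are only illustrative.

```python
import pandas as pd

# Sample data rebuilt from the question's table; all values kept as strings (assumption).
df = pd.DataFrame({
    "column1": ["12345", "txt", "35615", "txt"],
    "column2": ["10", "25", "15", "35"],
    "column3": ["20", "65", "20", "20"],
})

# In a normal string literal "\b" is the backspace character (0x08),
# so the pattern below looks for backspace, five digits, backspace.
plain = "\b\\d{5}\b"
# With the r prefix the backslashes reach the regex engine unchanged.
raw = r"\b\d{5}\b"

mask_plain = df["column1"].str.contains(plain)  # False for every row
mask_raw = df["column1"].str.contains(raw)      # True for the 5-digit rows

# .index on the boolean mask itself returns every row label,
# so the mask has to be used as a filter first.
all_labels = mask_raw.index.tolist()
matches = df[mask_raw].index.tolist()

print(all_labels)
print(matches)
```

The key point is that `str.contains` returns a boolean Series with the same index as the column, so taking `.index` directly from it never narrows anything down.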

Or, if you want to match only values that consist of exactly 5 digits and nothing else, anchor the regex with ^ and $ for the start and end of the string:

i = df[df.column1.str.contains(r'^\d{5}$')].index.tolist()
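
The difference between the two patterns only shows up when a cell contains more than the number itself. The sketch below uses hypothetical values that are not in the question to illustrate it. It also shows `str.fullmatch`, which should give the same result as the anchored pattern here, assuming pandas 1.1 or newer.

```python
import pandas as pd

# Hypothetical values, not from the question, chosen to show the difference.
s = pd.Series(["12345", "id 12345", "123456", "12345-A", "txt"])

# Word boundaries: a 5-digit run anywhere in the string counts.
word_boundary = s[s.str.contains(r"\b\d{5}\b")].index.tolist()

# Anchors: the whole string must be exactly five digits.
anchored = s[s.str.contains(r"^\d{5}$")].index.tolist()

# fullmatch requires the entire string to match, so no anchors are needed.
full = s[s.str.fullmatch(r"\d{5}")].index.tolist()

print(word_boundary)  # [0, 1, 3]
print(anchored)       # [0]
print(full)           # [0]
```

If column1 only ever holds either a bare 5-digit number or text, both patterns return the same rows. The anchored form is the stricter choice when stray text around the digits should be rejected.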
