Regex for capturing digits in a string (Python)

I'm new to Python and I'm processing a text file with regular expressions to extract IDs and append them to a list. I wrote the Python below intending to construct a list that looks like this:

["10073710","10074302","10079203","10082213"...and so on]

Instead, I'm seeing a list that includes a bunch of verbose tags. I assume this is normal behavior and that the finditer function adds these tags when it finds matches, but the result is a bit messy and I'm not sure how to turn off or delete them. See the screenshot below.

[Screenshot of the printed list]

Can anyone please help me modify the code below so I can get the intended structure for the list?

import re

#create a list of strings
company_id = []

#open file contents into a variable
company_data = open(r'C:\Users\etherealessence\Desktop\company_data_test.json', 'r', encoding="utf-8")

#read the line structure into a variable
line_list = company_data.readlines()

#stringify the contents so regex operations can be performed
line_list = str(line_list)

#close the file
company_data.close()

#assign the regex pattern to a variable
pattern = re.compile(r'"id":([^,]+)')

#find all instances of the pattern and append the list
#https://stackoverflow.com/questions/12870178/looping-through-python-regex-matches
for id in re.finditer(pattern, line_list): 
  #print(id)
  company_id.append(id)

#test view the list of company id strings
#print(line_list)
print(company_id)

re.finditer returns an iterator of re.Match objects.

If you want to extract the actual match (and more specifically the captured group, to get rid of the leading "id":), you can do something like this:

for match in re.finditer(pattern, line_list):
    company_id.append(match.group(1))
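
As a minimal sketch of the same idea, the loop can be written as a list comprehension, and re.findall can do the extraction in one call, because with exactly one capturing group it returns the text of that group for each match. The sample text below is hypothetical (the real file contents are not shown in the question) and assumes unquoted numeric IDs followed by a comma.

```python
import re

# Hypothetical sample text standing in for the stringified file contents.
line_list = '{"id":10073710,"name":"a"},{"id":10074302,"name":"b"}'

pattern = re.compile(r'"id":([^,]+)')

# Loop form: pull the captured group out of each Match object.
company_id = [match.group(1) for match in pattern.finditer(line_list)]

# With exactly one capturing group, findall returns that group's text directly.
company_id_alt = pattern.findall(line_list)

print(company_id)
print(company_id_alt)
```

Both lists hold plain strings rather than match objects, which is the structure the question asks for.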

To get the value, use id.group(1):

for id in re.finditer(pattern, line_list):
  company_id.append(id.group(1))

When you append just id, you are storing the whole match object rather than the matched text. Note that id.string does not help here: it holds the entire string that was searched, not the matched portion.
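
To see what the different attributes of a match object hold, here is a small sketch on a hypothetical one-record input (the sample text is an assumption, not the asker's data). The string attribute is the full input that was searched, group(0) is the whole match, and group(1) is only the captured part.

```python
import re

# Hypothetical sample input; the real file contents are not shown.
text = '{"id":10073710,"name":"a"}'

match = re.search(r'"id":([^,]+)', text)

print(match.string)    # the entire string that was searched
print(match.group(0))  # the whole match, including the "id": prefix
print(match.group(1))  # only the text captured by the first group
```

For building the list of IDs, group(1) is the value to append.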

If your data is JSON, you might simply want to parse it.
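
A rough sketch of the parsing approach, using the standard json module. The structure of company_data_test.json is not shown in the question, so the sample below assumes a top-level array of objects that each have an "id" key; for a real file, json.load on the open file object would take the place of json.loads.

```python
import json

# Hypothetical sample; the real structure of the file is an assumption.
raw = '[{"id": "10073710", "name": "a"}, {"id": "10074302", "name": "b"}]'

records = json.loads(raw)

# str() keeps the result a list of strings even if the IDs are stored as numbers.
company_id = [str(record["id"]) for record in records]

print(company_id)
```

If the IDs are nested deeper in the document, the comprehension would need to walk down to that level first.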


If you wish to use a regular expression, you can simplify your expression and use three capturing groups to get the desired ID much more easily. Put one capturing group on the left and one on the right of the ID; the middle capturing group then gives you the ID itself, maybe with something similar to this expression:

("id":")([0-9]+)(") 


RegEx Descriptive Graph

This link helps you visualize your expressions:

[Image: graph visualizing the expression]

Python Testing

# coding=utf8
# the above tag defines encoding for this document and is for Python 2.x compatibility

import re

regex = r"(\x22id\x22:\x22)([0-9]+)(\x22)"

test_str = "some other JSON data'\"id\":\"10480132\"'>some other JSON datasome other JSON data'\"id\":\"10480132\"'>some other JSON datasome other JSON data'\"id\":\"10480132\"'>some other JSON datasome other JSON data'\"id\":\"10480132\"'>some other JSON datasome other JSON data'\"id\":\"10480132\"'>some other JSON data"

matches = re.finditer(regex, test_str, re.MULTILINE)

for match_num, match in enumerate(matches, start=1):
    print("Match {} was found at {}-{}: {}".format(match_num, match.start(), match.end(), match.group()))

    # Groups are numbered from 1; group 0 is the whole match.
    for group_num in range(1, len(match.groups()) + 1):
        print("Group {} found at {}-{}: {}".format(group_num, match.start(group_num), match.end(group_num), match.group(group_num)))

# Note: for Python 2.7 compatibility, use ur"" to prefix the regex and u"" to prefix the test string and substitution.

Python Test

# -*- coding: UTF-8 -*-
import re

string = "some other JSON data'\"id\":\"10480132\"'>some other JSON datasome other JSON data'\"id\":\"10480132\"'>some other JSON datasome other JSON data'\"id\":\"10480132\"'>some other JSON datasome other JSON data'\"id\":\"10480132\"'>some other JSON datasome other JSON data'\"id\":\"10480132\"'>some other JSON data"
expression = r'(\x22id\x22:\x22)([0-9]+)(\x22)'
match = re.search(expression, string)
if match:
    print("YAAAY! \"" + match.group(2) + "\" is a match 💚💚💚 ")
else: 
    print('🙀 Sorry! No matches!')

Output:

YAAAY! "10480132" is a match 💚💚💚
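
Since re.search stops at the first match, it only ever reports one ID. To collect every ID into a list, as the question asks, the same expression could be used with re.finditer, taking the middle group of each match. This is a sketch on a hypothetical sample string with two distinct quoted IDs.

```python
import re

expression = r'(\x22id\x22:\x22)([0-9]+)(\x22)'

# Hypothetical sample with two different IDs.
string = 'some data "id":"10073710" more data "id":"10074302" end'

# Group 2 is the middle capturing group, which holds the digits only.
company_id = [match.group(2) for match in re.finditer(expression, string)]

print(company_id)
```

Only the middle group is needed for the result; the outer two groups just anchor the match around the digits.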
