Regex for capturing digits in a string (Python)
I'm new to Python and I'm processing a text file with regular expressions to extract IDs and append them to a list. I wrote the Python below intending to construct a list that looks like this:
["10073710","10074302","10079203","10082213"...and so on]
Instead I'm seeing a list structure that has a bunch of verbose tags included. I'm assuming this is normal behavior and that the finditer function adds these tags when it finds matches. But the response is a bit messy and I'm not sure how to turn off or delete these added tags. See the screenshot below.
Can anyone please help me modify the code below so I can achieve the intended structure for the list?
import re
#create a list of strings
company_id = []
#open file contents into a variable
company_data = open(r'C:\Users\etherealessence\Desktop\company_data_test.json', 'r', encoding="utf-8")
#read the line structure into a variable
line_list = company_data.readlines()
#stringify the contents so regex operations can be performed
line_list = str(line_list)
#close the file
company_data.close()
#assign the regex pattern to a variable
pattern = re.compile(r'"id":([^,]+)')
#find all instances of the pattern and append the list
#https://stackoverflow.com/questions/12870178/looping-through-python-regex-matches
for id in re.finditer(pattern, line_list):
    #print(id)
    company_id.append(id)
#test view the list of company id strings
#print(line_list)
print(company_id)
re.finditer returns an iterator of re.Match objects.
If you want to extract the actual match (and more specifically, the captured group, to get rid of the leading "id":), you can do something like this:
for match in re.finditer(pattern, line_list):
    company_id.append(match.group(1))
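The loop above can be tried on its own with a small stand-in for the file contents. This is a minimal sketch: the sample string is an assumption, since the real file's layout is not shown in the question.

```python
import re

# Hypothetical sample standing in for the stringified file contents.
sample = '{"id":10073710,"name":"a"},{"id":10074302,"name":"b"}'

pattern = re.compile(r'"id":([^,]+)')

# group(1) is the text captured by the parentheses, without the leading "id":
company_id = [match.group(1) for match in pattern.finditer(sample)]

print(company_id)  # ['10073710', '10074302']
```

Note that [^,]+ stops only at a comma, so if "id" were the last key in an object the closing brace would be captured as well.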
To get the value, use id.group(1) rather than id itself (id.string would not help here, since it holds the entire string that was searched, not the match):
for id in re.finditer(pattern, line_list):
    company_id.append(id.group(1))
When you append just id, you are storing the match object, not the actual value.
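To see what a match object exposes, here is a small sketch comparing its string attribute with group(0) and group(1). The sample text is made up for illustration.

```python
import re

# Hypothetical one-line sample; the real input will differ.
text = 'x "id":10073710, y'

m = re.search(r'"id":([^,]+)', text)

whole_input = m.string   # the entire string that was searched
full_match = m.group(0)  # everything the pattern matched
captured = m.group(1)    # only the parenthesized part

print(whole_input)  # x "id":10073710, y
print(full_match)   # "id":10073710
print(captured)     # 10073710
```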
If your data is in JSON, you might simply want to parse it.
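A minimal sketch of the parsing approach, assuming the file holds a top-level JSON array of objects that each carry an "id" key. That structure is a guess; the question does not show the file contents.

```python
import json

# Hypothetical content; with a real file you would use json.load(f) on the open file.
raw = '[{"id": 10073710, "name": "A"}, {"id": 10074302, "name": "B"}]'

records = json.loads(raw)

# Convert each id to a string to match the list format the question asks for.
company_id = [str(record["id"]) for record in records]

print(company_id)  # ['10073710', '10074302']
```

If the ids sit deeper in nested objects, the list comprehension would need to follow that nesting instead.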
If you wish to use a regular expression, you can simplify your expression and use three capturing groups to get the desired ID more easily. Set two capturing groups on the left and right sides of the ID; the middle capturing group then holds the ID itself. Maybe something similar to this expression:
("id":")([0-9]+)(")
You can test the expression with a script like this:
# coding=utf8
# the above tag defines encoding for this document and is for Python 2.x compatibility
import re
regex = r"(\x22id\x22:\x22)([0-9]+)(\x22)"
test_str = "some other JSON data'\"id\":\"10480132\"'>some other JSON datasome other JSON data'\"id\":\"10480132\"'>some other JSON datasome other JSON data'\"id\":\"10480132\"'>some other JSON datasome other JSON data'\"id\":\"10480132\"'>some other JSON datasome other JSON data'\"id\":\"10480132\"'>some other JSON data"
matches = re.finditer(regex, test_str, re.MULTILINE)
for matchNum, match in enumerate(matches, start=1):
    print("Match {matchNum} was found at {start}-{end}: {match}".format(matchNum = matchNum, start = match.start(), end = match.end(), match = match.group()))
    for groupNum in range(0, len(match.groups())):
        groupNum = groupNum + 1
        print("Group {groupNum} found at {start}-{end}: {group}".format(groupNum = groupNum, start = match.start(groupNum), end = match.end(groupNum), group = match.group(groupNum)))
# Note: for Python 2.7 compatibility, use ur"" to prefix the regex and u"" to prefix the test string and substitution.
# -*- coding: UTF-8 -*-
import re
string = "some other JSON data'\"id\":\"10480132\"'>some other JSON datasome other JSON data'\"id\":\"10480132\"'>some other JSON datasome other JSON data'\"id\":\"10480132\"'>some other JSON datasome other JSON data'\"id\":\"10480132\"'>some other JSON datasome other JSON data'\"id\":\"10480132\"'>some other JSON data"
expression = r'(\x22id\x22:\x22)([0-9]+)(\x22)'
match = re.search(expression, string)
if match:
    print("YAAAY! \"" + match.group(2) + "\" is a match 💚💚💚 ")
else:
    print('🙀 Sorry! No matches!')
Output:
YAAAY! "10480132" is a match 💚💚💚
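If only the ID is needed, a single capturing group with re.findall may be enough. This sketch assumes the ids are quoted strings, as in the test string above; the shortened sample and the second id are made up for illustration.

```python
import re

# Hypothetical shortened sample with quoted ids.
test_str = 'some other JSON data "id":"10480132" more data "id":"10480133"'

# With exactly one capturing group, findall returns the captured strings directly.
ids = re.findall(r'"id":"([0-9]+)"', test_str)

print(ids)  # ['10480132', '10480133']
```

With three capturing groups, findall would instead return a tuple per match, so keeping one group gives the flat list the question asks for.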