Python REGEX inside loop
I am trying to parse a list of documents with REGEX (I could not use BeautifulSoup). I am now able to loop over each txt document inside my folder, but now I have to parse them. I have been using Python for only a few days and I am a bit confused.
I want to generate a dictionary with <DOCNO> as the ID and <TEXT> as the value.
Example of a file:
<DOC>
<DOCNO> 443 </DOCNO>
<TEXT>Hello Word</TEXT>
</DOC>
<DOC>
<DOCNO> 3745 </DOCNO>
<TEXT> Hola amigo </TEXT>
</DOC>
My code so far:
import os
import re

path = "data"
for filename in os.listdir(path):
    print(filename)
    file = open(path + "/" + filename)
    page = file.read()
    page = page.replace(' ', ' ')
    # stuck here

doc_regex = re.compile("<DOC>.*?</DOC>", re.DOTALL)
docno_regex = re.compile("<DOCNO>.*?</DOCNO>")
text_regex = re.compile("<TEXT>.*?</TEXT>", re.DOTALL)
The best practice is to compile the RegEx once at module level (not inside a for loop). For instance, you can write:
import re
doc_regex = re.compile("<DOC>(.*?)</DOC>", re.DOTALL)
docno_regex = re.compile("<DOCNO>(.*?)</DOCNO>")
text_regex = re.compile("<TEXT>(.*?)</TEXT>", re.DOTALL)
The regexes are compiled once, the first time the module is loaded.
Note: in your RegEx, you need to use a group "(...)" to retrieve the content of each tag.
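To make the note about groups concrete, here is a minimal sketch comparing the two patterns on a one-line sample (the `sample` string and variable names are only for illustration). With no group, `re.findall` returns the whole match including the tags; with one group, it returns only what the group captured.

```python
import re

sample = "<DOCNO> 443 </DOCNO>"

# Without a group, findall returns the whole match, tags included.
without_group = re.findall(r"<DOCNO>.*?</DOCNO>", sample)

# With one group, findall returns only the captured content.
with_group = re.findall(r"<DOCNO>(.*?)</DOCNO>", sample)

print(without_group)  # ['<DOCNO> 443 </DOCNO>']
print(with_group)     # [' 443 ']
```

The captured value still carries the surrounding spaces, which is why the loops further down call `.strip()` on it.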
Several pitfalls:
You ought to use os.path.join to build the full path of a file (on Windows, the path separator is \, not /). os.path.join picks the right separator for you.
You ought to use a with statement to open a file, and specify the file encoding. In text mode, if the encoding is not specified, the encoding used is platform dependent.
So, your loop can be turned into:
import os

path = "data"
for filename in os.listdir(path):
    fullpath = os.path.join(path, filename)
    print(filename)
    # The with statement closes the file automatically.
    with open(fullpath, mode="r", encoding="utf-8") as fd:
        page = fd.read()
To parse your data, you can use re.findall. There are other ways, too.
You can use doc_regex to find each <DOC>...</DOC> block, and then docno_regex and text_regex to find the docno and the text inside it.
In your loop, you can do it this way:
for doc_content in doc_regex.findall(page):
    docno = docno_regex.findall(doc_content)[0].strip()
    text = text_regex.findall(doc_content)[0].strip()
    print(docno, text)
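As one of the "other ways", here is a sketch that uses a single pattern with named groups and `re.finditer` instead of three separate regexes. The name `entry_regex` is made up for this example, and the pattern assumes that <DOCNO> always comes before <TEXT> inside each <DOC>, as in the sample file; it would skip documents laid out differently.

```python
import re

# Hypothetical combined pattern: assumes <DOCNO> precedes <TEXT> in every <DOC>.
entry_regex = re.compile(
    r"<DOC>\s*<DOCNO>(?P<docno>.*?)</DOCNO>\s*<TEXT>(?P<text>.*?)</TEXT>\s*</DOC>",
    re.DOTALL,
)

page = """<DOC>
<DOCNO> 443 </DOCNO>
<TEXT>Hello Word</TEXT>
</DOC>
<DOC>
<DOCNO> 3745 </DOCNO>
<TEXT> Hola amigo </TEXT>
</DOC>
"""

# finditer yields match objects, so the named groups can be read directly.
entries = [
    (m.group("docno").strip(), m.group("text").strip())
    for m in entry_regex.finditer(page)
]
print(entries)  # [('443', 'Hello Word'), ('3745', 'Hola amigo')]
```

The three-regex approach is more tolerant of tag order, so this variant is probably only worth it when the file layout is known to be regular.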
To store each entry in a dictionary, you can define a dict, like this:
result = {}
for doc_content in doc_regex.findall(page):
    docno = docno_regex.findall(doc_content)[0].strip()
    text = text_regex.findall(doc_content)[0].strip()
    result[docno] = text
You get:
{'3745': 'Hola amigo', '443': 'Hello Word'}
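Putting the pieces together, a minimal sketch of the whole thing could look like the following. The function names `parse_page` and `parse_folder` are invented for this example, and two choices are assumptions rather than part of the answer above: a <DOC> missing either tag is skipped (the `[0]` indexing would raise IndexError instead), and entries from all files are merged into one dictionary, so a repeated DOCNO overwrites the earlier one.

```python
import os
import re

# Compiled once, at module level.
DOC_RE = re.compile(r"<DOC>(.*?)</DOC>", re.DOTALL)
DOCNO_RE = re.compile(r"<DOCNO>(.*?)</DOCNO>")
TEXT_RE = re.compile(r"<TEXT>(.*?)</TEXT>", re.DOTALL)


def parse_page(page):
    """Return {docno: text} for every <DOC> block found in one file's content."""
    result = {}
    for doc_content in DOC_RE.findall(page):
        docno = DOCNO_RE.search(doc_content)
        text = TEXT_RE.search(doc_content)
        if docno is None or text is None:
            continue  # assumption: silently skip incomplete documents
        result[docno.group(1).strip()] = text.group(1).strip()
    return result


def parse_folder(path):
    """Parse every regular file in `path` and merge the results."""
    result = {}
    for filename in sorted(os.listdir(path)):
        fullpath = os.path.join(path, filename)
        if not os.path.isfile(fullpath):
            continue  # ignore sub-directories
        with open(fullpath, mode="r", encoding="utf-8") as fd:
            result.update(parse_page(fd.read()))
    return result


sample = "<DOC>\n<DOCNO> 443 </DOCNO>\n<TEXT>Hello Word</TEXT>\n</DOC>\n"
print(parse_page(sample))  # {'443': 'Hello Word'}
```

With the folder from the question, the call would be `parse_folder("data")`.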
Why can't BeautifulSoup be used? How about this?
from simplified_scrapy.simplified_doc import SimplifiedDoc
html ='''
<DOC>
<DOCNO> 443 </DOCNO>
<TEXT>Hello Word</TEXT>
</DOC>
<DOC>
<DOCNO> 3745 </DOCNO>
<TEXT> Hola amigo </TEXT>
</DOC>
'''
xml = SimplifiedDoc(html)
docs = xml.selects('DOC')
dic = {}
for doc in docs:
    dic[doc.DOCNO.text] = doc.TEXT.text
print(dic)
Result:
{'443': 'Hello Word', '3745': 'Hola amigo'}