Python REGEX inside a loop

Question

I am trying to parse a list of documents with regular expressions (I cannot use BeautifulSoup). I can now loop over each txt document inside my folder, but I still have to parse them. I have been using Python for only a few days and I am a bit confused.

I want to generate a dictionary with <DOCNO> as the ID and <TEXT> as the value.

Example of a file:

<DOC>
<DOCNO> 443 </DOCNO>
<TEXT>Hello Word</TEXT>
</DOC>
<DOC>
<DOCNO> 3745 </DOCNO>
<TEXT> Hola amigo </TEXT>
</DOC>

My code so far:

    import os
    import re

    path = "data"

    for filename in os.listdir(path):
        print(filename)
        file = open(path + "/" + filename)
        page = file.read()
        page = page.replace('  ', ' ')

        # stuck here
        doc_regex = re.compile("<DOC>.*?</DOC>", re.DOTALL)
        docno_regex = re.compile("<DOCNO>.*?</DOCNO>")
        text_regex = re.compile("<TEXT>.*?</TEXT>", re.DOTALL)

Answer 1

The best practice is to compile each regex once at module level (not inside a for loop). For instance, you can write:

import re

doc_regex = re.compile("<DOC>(.*?)</DOC>", re.DOTALL)
docno_regex = re.compile("<DOCNO>(.*?)</DOCNO>")
text_regex = re.compile("<TEXT>(.*?)</TEXT>", re.DOTALL)

The regexes are compiled once, the first time the module is loaded.

Note: in your regex, you need to use a group "(...)" to retrieve the content of each tag.

Several pitfalls:
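To illustrate the note about groups, here is a minimal sketch comparing re.findall with and without a capturing group. The sample string and the variable names are made up for the illustration.

```python
import re

# Hypothetical one-line sample, taken from the file format in the question.
sample = "<DOCNO> 443 </DOCNO>"

# Without a group, findall returns the whole match, tags included.
without_group = re.findall(r"<DOCNO>.*?</DOCNO>", sample)

# With one capturing group, findall returns only what the group matched.
with_group = re.findall(r"<DOCNO>(.*?)</DOCNO>", sample)

print(without_group)  # ['<DOCNO> 443 </DOCNO>']
print(with_group)     # [' 443 ']
```

The surrounding spaces are still part of the captured text, which is why the content is usually passed through strip() afterwards.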

You ought to use os.path.join to build the full path of a file (on Windows, the path delimiter is \ , not / ). os.path.join handles that for you.
You ought to use a with statement to open a file, and specify the file encoding. In text mode, if the encoding is not specified, the encoding used is platform dependent.

So, your loop can be turned into:

import os

path = "data"

for filename in os.listdir(path):
    fullpath = os.path.join(path, filename)
    print(filename)

    with open(fullpath, mode="r", encoding="utf-8") as fd:
        page = fd.read()

To parse your data, you can use re.findall. There are other ways too…

You can use doc_regex to find each <DOC>...</DOC>, and then docno_regex and text_regex to find the docno and the text.

In your loop, you can do it this way:

    for doc_content in doc_regex.findall(page):
        docno = docno_regex.findall(doc_content)[0].strip()
        text = text_regex.findall(doc_content)[0].strip()
        print(docno, text)

To store each entry in a dictionary, you can define a dict, like this:

result = {}
for doc_content in doc_regex.findall(page):
    docno = docno_regex.findall(doc_content)[0].strip()
    text = text_regex.findall(doc_content)[0].strip()
    result[docno] = text

You get:

{'443': 'Hello Word', '3745': 'Hola amigo'}
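Putting the pieces of this answer together, here is a minimal sketch of the whole thing as two functions. The names parse_page and parse_folder are hypothetical helpers, not part of the original answer. It also swaps findall(...)[0] for search, so that a <DOC> block missing a <DOCNO> or <TEXT> tag is skipped instead of raising an IndexError; whether skipping is the right behavior for your data is an assumption.

```python
import os
import re

# Compiled once, at module level.
DOC_RE = re.compile(r"<DOC>(.*?)</DOC>", re.DOTALL)
DOCNO_RE = re.compile(r"<DOCNO>(.*?)</DOCNO>")
TEXT_RE = re.compile(r"<TEXT>(.*?)</TEXT>", re.DOTALL)


def parse_page(page):
    """Return a {docno: text} dict for the content of one file."""
    result = {}
    for doc_match in DOC_RE.finditer(page):
        doc_content = doc_match.group(1)
        docno_match = DOCNO_RE.search(doc_content)
        text_match = TEXT_RE.search(doc_content)
        # Assumption: incomplete <DOC> blocks are silently skipped.
        if docno_match is None or text_match is None:
            continue
        result[docno_match.group(1).strip()] = text_match.group(1).strip()
    return result


def parse_folder(path):
    """Merge the dictionaries of every regular file found in path."""
    result = {}
    for filename in os.listdir(path):
        fullpath = os.path.join(path, filename)
        if not os.path.isfile(fullpath):
            continue  # ignore sub-directories
        with open(fullpath, mode="r", encoding="utf-8") as fd:
            result.update(parse_page(fd.read()))
    return result


# Usage on the sample from the question, plus one incomplete block.
sample = """<DOC>
<DOCNO> 443 </DOCNO>
<TEXT>Hello Word</TEXT>
</DOC>
<DOC>
<DOCNO> 3745 </DOCNO>
<TEXT> Hola amigo </TEXT>
</DOC>
<DOC>
<TEXT>no id here</TEXT>
</DOC>
"""
parsed = parse_page(sample)
print(parsed)
```

Note that if the same DOCNO appears in several files, result.update keeps only the last one read.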

Answer 2

Why can't BeautifulSoup be used? How about this?

from simplified_scrapy.simplified_doc import SimplifiedDoc
html ='''
<DOC>
<DOCNO> 443 </DOCNO>
<TEXT>Hello Word</TEXT>
</DOC>
<DOC>
<DOCNO> 3745 </DOCNO>
<TEXT> Hola amigo </TEXT>
</DOC>
'''
xml = SimplifiedDoc(html)
docs = xml.selects('DOC')
dic = {}
for doc in docs:
    dic[doc.DOCNO.text] = doc.TEXT.text
print(dic)

Result:

{'443': 'Hello Word', '3745': 'Hola amigo'}

Answer 1
0 2020-01-29 22:37:49

Answer 2
0 2020-01-30 05:20:00
