
Python REGEX inside loop

I am trying to parse a list of documents with regex (I cannot use BeautifulSoup). I can now loop over each txt document inside my folder, but I still have to parse them. I have only been using Python for a few days and I am a bit confused.

I want to generate a dictionary with <DOCNO> as the ID and <TEXT> as the value.

Example of a file:

<DOC>
<DOCNO> 443 </DOCNO>
<TEXT>Hello Word</TEXT>
</DOC>
<DOC>
<DOCNO> 3745 </DOCNO>
<TEXT> Hola amigo </TEXT>
</DOC>

My code so far:

    import os
    import re

    path = "data"

    for filename in os.listdir(path):
        print(filename)
        file = open(path + "/" + filename)
        page = file.read()
        page = page.replace('  ', ' ')

        # stuck here
        doc_regex = re.compile("<DOC>.*?</DOC>", re.DOTALL)
        docno_regex = re.compile("<DOCNO>.*?</DOCNO>")
        text_regex = re.compile("<TEXT>.*?</TEXT>", re.DOTALL)

The best practice is to compile each RegEx once at module level (not inside the for loop). For instance, you can write:

import re

doc_regex = re.compile("<DOC>(.*?)</DOC>", re.DOTALL)
docno_regex = re.compile("<DOCNO>(.*?)</DOCNO>")
text_regex = re.compile("<TEXT>(.*?)</TEXT>", re.DOTALL)

The regexes are compiled once, the first time the module is loaded.

Note: in your RegEx, you need to use a group "(...)" to retrieve the content of each tag.
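To illustrate the note about groups, here is a minimal sketch of how re.findall behaves with and without a capturing group. The variable names (sample, without_group, with_group) are only illustrative and do not come from the question.

```python
import re

sample = "<DOCNO> 443 </DOCNO>"

# Without a group, findall returns the whole match, tags included.
without_group = re.findall(r"<DOCNO>.*?</DOCNO>", sample)

# With exactly one group, findall returns only what the group captured.
with_group = re.findall(r"<DOCNO>(.*?)</DOCNO>", sample)

print(without_group)  # ['<DOCNO> 443 </DOCNO>']
print(with_group)     # [' 443 ']
```

The captured value still carries the surrounding spaces, which is why calling .strip() on it is useful.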

Several pitfalls:

  • You ought to use os.path.join to build the full path of a file (on Windows, the path separator is \ , not / ). os.path.join handles that for you.

  • You ought to use a with statement to open a file, and specify the file encoding. In text mode, if the encoding is not specified, the encoding used is platform dependent.

So, your loop can be turned into:

import os

path = "data"

for filename in os.listdir(path):
    fullpath = os.path.join(path, filename)
    print(filename)

    with open(fullpath, mode="r", encoding="utf-8") as fd:
        page = fd.read()

To parse your data, you can use re.findall. There are other ways, too.

You can use doc_regex to find each <DOC>...</DOC> block, and then docno_regex and text_regex to find the docno and the text.

In your loop, you can do it this way:

    for doc_content in doc_regex.findall(page):
        docno = docno_regex.findall(doc_content)[0].strip()
        text = text_regex.findall(doc_content)[0].strip()
        print(docno, text)

To store each entry in a dictionary, you can define a dict, like this:

result = {}
for doc_content in doc_regex.findall(page):
    docno = docno_regex.findall(doc_content)[0].strip()
    text = text_regex.findall(doc_content)[0].strip()
    result[docno] = text

You get:

{'3745': 'Hola amigo', '443': 'Hello Word'}
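Putting the pieces above together, a possible end-to-end sketch could look like this. The function names parse_page and parse_folder are hypothetical, and two choices are assumptions on my part rather than part of the answer: incomplete <DOC> blocks are skipped silently (the [0] indexing shown above would raise IndexError instead), and every file is assumed to be UTF-8.

```python
import os
import re

# Compiled once, at module level.
DOC_RE = re.compile(r"<DOC>(.*?)</DOC>", re.DOTALL)
DOCNO_RE = re.compile(r"<DOCNO>(.*?)</DOCNO>")
TEXT_RE = re.compile(r"<TEXT>(.*?)</TEXT>", re.DOTALL)


def parse_page(page):
    """Return {docno: text} for every complete <DOC> block in page."""
    result = {}
    for doc_content in DOC_RE.findall(page):
        docno_match = DOCNO_RE.search(doc_content)
        text_match = TEXT_RE.search(doc_content)
        if docno_match is None or text_match is None:
            continue  # assumption: skip blocks missing a tag
        result[docno_match.group(1).strip()] = text_match.group(1).strip()
    return result


def parse_folder(path):
    """Merge the dictionaries of all regular files found in path."""
    result = {}
    for filename in sorted(os.listdir(path)):
        fullpath = os.path.join(path, filename)
        if not os.path.isfile(fullpath):
            continue  # ignore sub-directories
        # assumption: the files are UTF-8 encoded
        with open(fullpath, mode="r", encoding="utf-8") as fd:
            result.update(parse_page(fd.read()))
    return result


sample = """<DOC>
<DOCNO> 443 </DOCNO>
<TEXT>Hello Word</TEXT>
</DOC>
<DOC>
<DOCNO> 3745 </DOCNO>
<TEXT> Hola amigo </TEXT>
</DOC>
"""
print(parse_page(sample))  # {'443': 'Hello Word', '3745': 'Hola amigo'}
```

Calling parse_folder("data") would then build a single dictionary for the whole folder. If the same DOCNO appears in several files, the last file read wins.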

Why can't BeautifulSoup be used? How about this?

from simplified_scrapy.simplified_doc import SimplifiedDoc
html ='''
<DOC>
<DOCNO> 443 </DOCNO>
<TEXT>Hello Word</TEXT>
</DOC>
<DOC>
<DOCNO> 3745 </DOCNO>
<TEXT> Hola amigo </TEXT>
</DOC>
'''
xml = SimplifiedDoc(html)
docs = xml.selects('DOC')
dic = {}
for doc in docs:
  dic[doc.DOCNO.text]=doc.TEXT.text
print (dic)

Result:

{'443': 'Hello Word', '3745': 'Hola amigo'}

