Parsing text with regex for sentiment analysis

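I am parsing a text file that contains thousands of articles in the following format, all of which follow exactly the same pattern. The text is between the dashed lines.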

-------------------------------
 1 of 40 DOCUMENTS



                  July 22, 2016  9:42 



This is the title of the document.



Author 1 and Author 2 in London



This is the body of the text. This paragraph has four sentences. There are 25 words in total. The meaning of the words is not important.



July 23, 2016

 --------------------

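I would like to process these articles and keep only:

a) the first line, with the document number, b) the title, c) the body of the text

Since the body of the text may also contain dates that I want to keep, how can I express this with a regex? Any other suggestions are also welcome. Thank you for your help.

I would like each article to end up in the following format, with the text between the dashed lines.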

-------------------------------
  1 of 40 DOCUMENTS



This is the title of the document.



This is the body of the text. This paragraph has four sentences. There are 25 words in total. The meaning of the words is not important.

--------------------------------

I think a regex alone is probably not the best way to approach this.

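Here is a rough idea of how to approach it. The function transform expects to be passed an iterator that returns one input line at a time. This can simply be an open file. For testing purposes, I split a test string into a list of lines and passed an iterator over that list. The function (a generator) may need some fine-tuning, depending on how many of the lines you want removed from the input. For testing, I also added a second article to the input as if it were the last one in the file; how the file actually terminates is a guess on my part.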

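The generator function walks through all the lines passed to it in the variable lines, which is an iterator, by executing next(lines) and assigning the result to the variable line. If the current line is to be included in the output, it executes the statement yield line. I have implemented the solution in terms of what you want removed rather than what you want kept, because your limited example does not make clear all the possibilities for titles and text bodies. It seems that you want to remove the 6th and 14th lines, counting the '-----------etc.' line as line 1, and also the date that occurs two lines before the next '---------etc.' line. If the date on line 6 and the author list on line 14 are not always at those fixed positions, then all bets are off.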
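The check that decides whether a line is the trailing date can be tried on its own. The sketch below is only an illustration: the pattern is the one used in transform, while the names DATE_LINE and is_date_line are hypothetical, and the last sample line is made up to show a date inside a sentence.

```python
import re

# Same pattern as in transform: optional leading whitespace, a word
# (the month name), a one- or two-digit day, a comma, a four-digit year,
# and nothing else up to the end of the line.
DATE_LINE = re.compile(r'\s*[A-Za-z]+\s+\d\d?,\s+\d{4}\s*$')


def is_date_line(line):
    """Return True if the whole line looks like 'July 23, 2016'."""
    return DATE_LINE.match(line) is not None


samples = [
    "July 23, 2016",
    "                  July 22, 2016  9:42",
    "This is the body of the text. This paragraph has four sentences.",
    "The report was published on July 23, 2016",
]
for sample in samples:
    print(is_date_line(sample), repr(sample))
```

Only the first sample matches. The timestamp line fails because of the trailing time, which is why transform drops it by position instead, and a date inside a sentence is left alone because the pattern must cover the whole line. A body line consisting of nothing but a date would still match, which is why transform also looks ahead for the separator before discarding it.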

Can you describe the input format precisely?

import re


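def transform(lines):
    """Yield every input line except the timestamp, author and trailing date lines.

    lines must be an iterator (for example an open file or iter(some_list)),
    because the function advances it with next().
    """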
    try:
        line = None
        while True:
            if line is None:
                line = next(lines) # ---------------
            yield line
            line = next(lines) # 1 of 40 documents
            yield line
            line = next(lines) # blank line
            yield line
            line = next(lines) # blank line
            yield line
            line = next(lines) # blank line
            yield line
            line = next(lines) # July 22, 2016 9:42 - Do not yield this line
            line = next(lines) # blank line
            yield line
            line = next(lines) # blank line
            yield line
            line = next(lines) # blank line
            yield line
            line = next(lines) # This is the title of the document.
            yield line
            line = next(lines) # blank line
            yield line
            line = next(lines) # blank line
            yield line
            line = next(lines) # blank line
            yield line
            line = next(lines) # Author 1 and Author 2 in London - Do not yield this line
            while True:
                line = next(lines)
                if not re.match(r'\s*[A-Za-z]+\s+\d\d?,\s+\d{4}\s*$', line): # date?
                    yield line
                else:
                    line2 = next(lines) # blank ?
                    line3 = next(lines) # ------------------------------- ?
                    if line3.strip() != '-------------------------------':  # tolerate a trailing newline or padding
                        yield line
                        yield line2
                        yield line3
                    else:
                        line = line3
                        break # start of new document
    except StopIteration:
        pass


if __name__ == '__main__':

    text = """-------------------------------
 1 of 40 DOCUMENTS



                  July 22, 2016  9:42



This is the title of the document.



Author 1 and Author 2 in London



This is the body of the text. This paragraph has four sentences. There are 25 words in total. The meaning of the words is not important.



July 23, 2016

-------------------------------
  1 of 40 DOCUMENTS



                  July 22, 2016  9:42



This is the title of the document.



Author 1 and Author 2 in London



This is the body of the text. This paragraph has four sentences. There are 25 words in total. The meaning of the words is not important.



July 23, 2016
"""
    for line in transform(iter(text.split('\n'))):
        print(line)

Resulting output:

-------------------------------
 1 of 40 DOCUMENTS






This is the title of the document.






This is the body of the text. This paragraph has four sentences. There are 25 words in total. The meaning of the words is not important.



-------------------------------
  1 of 40 DOCUMENTS






This is the title of the document.






This is the body of the text. This paragraph has four sentences. There are 25 words in total. The meaning of the words is not important.
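
If transform is given an open file rather than the split test string, each line keeps its trailing newline, so print would emit an extra blank line after every line and any exact comparison against the separator string would fail. A minimal sketch of one way to normalize the input first; the helper name stripped_lines is hypothetical, and io.StringIO stands in for a real file here.

```python
import io


def stripped_lines(handle):
    """Yield each line of a file-like object without its line ending."""
    for line in handle:
        yield line.rstrip('\r\n')


# io.StringIO stands in for an open file in this sketch.
fake_file = io.StringIO("-------------------------------\n 1 of 40 DOCUMENTS\n\n")
for line in stripped_lines(fake_file):
    print(repr(line))
```

Because stripped_lines is itself a generator, its result is already an iterator and can be passed straight to transform, for example as transform(stripped_lines(f)) where f is the open file.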
