How do I extract the content between two strings in Python?

Question

I am fairly new to Python.

I have a .txt file containing nearly 500,000 lines of text. The general structure looks like this:

WARC-TREC-ID:
hello
my
name
is
WARC-TREC-ID:
example
text
WARC-TREC-ID:

I want to extract everything between the "WARC-TREC-ID:" keywords.

This is what I have tried so far:

content_list = []

with open('C://Users//HOME//Desktop//Document_S//corpus_test//00.txt', errors = 'ignore') as openfile2:
    for line in openfile2:
        for item in line.split("WARC-TREC-ID:"):
            if "WARC-TREC-ID:" in item:
                content = (item [ item.find("WARC-TREC-ID:")+len("WARC-TREC-ID:") : ])
                content_list.append(content)

This returns an empty list.

I have also tried:

import re

with open('C://Users//HOME//Desktop//Document_S//corpus_test//00.txt', 'r') as openfile3:

    m = re.search('WARC-TREC-ID:(.+?)WARC-TREC-ID:', openfile3)
    if m:
        found = m.group(1)

This raises TypeError: expected string or bytes-like object

Answer 1

Try:

filename = '00.txt'  # placeholder: adjust to the path of your file

content_list = []
with open(filename) as infile:
    for line in infile:                      # iterate over each line
        if 'WARC-TREC-ID:' in line:          # a marker line starts a new section
            content_list.append([])
        elif content_list:                   # ignore any lines before the first marker
            content_list[-1].append(line)    # add the line to the current section

print(content_list)
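To see what this grouping produces for the sample in the question, here is a minimal sketch that wraps the same loop in a function and feeds it an in-memory file. The name group_sections and the use of io.StringIO are my own choices for illustration, and unlike the loop above it also strips whitespace and skips blank lines.

```python
import io

MARKER = "WARC-TREC-ID:"

def group_sections(lines):
    """Group the lines that follow each marker line into their own list."""
    sections = []
    for line in lines:
        if MARKER in line:
            sections.append([])            # a marker line opens a new section
        elif sections and line.strip():    # skip blank lines and text before the first marker
            sections[-1].append(line.strip())
    return sections

# In-memory stand-in for the real file, using the sample from the question.
sample = io.StringIO(
    "WARC-TREC-ID:\nhello\nmy\nname\nis\n"
    "WARC-TREC-ID:\nexample\ntext\n"
    "WARC-TREC-ID:\n"
)
print(group_sections(sample))
# [['hello', 'my', 'name', 'is'], ['example', 'text'], []]
```

The trailing marker opens an empty section, so the last element is an empty list. If you do not want it, filter with [s for s in sections if s].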

Answer 2

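In your second approach, you should pass the file's contents as a string (for example, openfile3.read()), because re.search expects a string argument, not a file object. It will also return only the first match in that string. You probably want to use findall.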
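A minimal sketch of the findall approach, assuming the whole file fits in memory. Two details differ from the pattern in the question: re.DOTALL is needed so that the dot also matches newlines, and the closing marker is placed in a lookahead so it is not consumed and can open the next section. The function name extract_sections is hypothetical.

```python
import re

MARKER = "WARC-TREC-ID:"

def extract_sections(text):
    """Return the stripped text found between consecutive markers."""
    # (.*?) is lazy, and the lookahead (?=...) checks for the next marker
    # without consuming it, so that marker can also start the following match.
    pattern = re.escape(MARKER) + r"(.*?)(?=" + re.escape(MARKER) + r")"
    return [m.strip() for m in re.findall(pattern, text, flags=re.DOTALL)]

# Sample from the question; with a real file you would pass openfile3.read().
sample = (
    "WARC-TREC-ID:\nhello\nmy\nname\nis\n"
    "WARC-TREC-ID:\nexample\ntext\n"
    "WARC-TREC-ID:\n"
)
print(extract_sections(sample))
# ['hello\nmy\nname\nis', 'example\ntext']
```

Text after the last marker is not returned, because there is no closing marker for it.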

Answer 3

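For a file containing your data:

with open('data.txt', 'r') as infile:   # the with block closes the file afterwards
    raw_data = infile.read()
result = [x for x in raw_data.split() if x != 'WARC-TREC-ID:']

Output (a flat list of words; the boundaries between sections are not kept):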

['hello', 'my', 'name', 'is', 'example', 'text']
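If you need the words grouped per section rather than in one flat list, a possible variation is to split on the marker first and then split each chunk into words. This is only a sketch: split_sections is a hypothetical name, and any text before the first marker or after the last one is kept as a section of its own.

```python
MARKER = "WARC-TREC-ID:"

def split_sections(raw_data):
    """Split the text on the marker, then split each non-empty chunk into words."""
    chunks = raw_data.split(MARKER)
    # Chunks that contain only whitespace (e.g. before the first marker) are dropped.
    return [chunk.split() for chunk in chunks if chunk.strip()]

# Sample from the question; with a real file you would pass the result of read().
raw_data = (
    "WARC-TREC-ID:\nhello\nmy\nname\nis\n"
    "WARC-TREC-ID:\nexample\ntext\n"
    "WARC-TREC-ID:\n"
)
print(split_sections(raw_data))
# [['hello', 'my', 'name', 'is'], ['example', 'text']]
```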
