Decoding a text file using a list in Python

Question

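I encoded this sentence:

This is an amazing "abstract" AND this: is the end of this amazing abstract.

into this:

1 2 3 4 "5" 6 7: 2 8 9 10 7 4 5.

The corresponding index table (as a text file) is: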

word,index
This,1
is,2
an,3
amazing,4
abstract,5
AND,6
this,7
the,8
end,9
of,10

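Now I want to go from these numbers: '1 2 3 4 "5" 6 7: 2 8 9 10 7 4 5.' back to the corresponding words, using the index table.

I used this code to read the index table text file into a list of strings: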

index_file = open("decompress.txt", "r")

content_index = index_file.read().split()
print(content_index)

Output:

['word,index', 'This,1', 'is,2', 'an,3', 'amazing,4', 'abstract,5', 'AND,6', 'this,7', 'the,8', 'end,9', 'of,10']

Then I used the following code to split each element into a new list:

for line in content_index:
    fields = line.split(",")
    print(fields)

Output:

['word', 'index']
['This', '1']
['is', '2']
['an', '3']
['amazing', '4']
['abstract', '5']
['AND', '6']
['this', '7']
['the', '8']
['end', '9']
['of', '10']

I tried to decode the numbers using fields[0] and fields[1] in a for loop, but without success. Any help would be greatly appreciated!

Answer 1

First of all, it is better to use a dict, so replace your code:

for line in content_index:
    fields = line.split(",")

with:

fields = {}
for line in content_index:
    word, number = line.split(',')
    fields[number] = word

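Then you can use a regular expression to replace a specific pattern (numbers, in your case) with any other string. The regular expression for finding numbers is \d+, where \d means a digit and + means one or more. So: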

import re

original_string = ' 1 2 3 4 "5" 6 7: 2 8 9 10 7 4 5. '

def replacement(match):
    """
    This function accepts regular expression match and returns corresponding replacement if it's found in `fields`
    """
    return fields.get(match.group(0), '')  # Learn more about match groups at `re` documentation.

result = re.sub(r'\d+', replacement, original_string)  # This line will iterate through original string, calling `replacement` for each number in this string, substituting return value to string.
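As a quick check of what the pattern matches, the sketch below runs \d+ over the encoded string on its own. The variable name `encoded` is only an illustrative choice, and `re.findall` is used here just to list the matches; it is not part of the solution above.

```python
import re

encoded = '1 2 3 4 "5" 6 7: 2 8 9 10 7 4 5.'

# \d+ grabs each maximal run of digits, so "10" comes back as one match.
matches = re.findall(r'\d+', encoded)
print(matches)
# ['1', '2', '3', '4', '5', '6', '7', '2', '8', '9', '10', '7', '4', '5']

# Without the +, \d matches one digit at a time and "10" is split in two.
single_digits = re.findall(r'\d', '9 10')
print(single_digits)
# ['9', '1', '0']
```

This is why the + matters: with plain \d, the index 10 would be looked up as 1 and then 0.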

So the final code would be:

import re

fields = {}

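with open('decompress.txt') as f:
    for line in f:
        line = line.strip()  # drop the trailing newline so keys are '1', not '1\n'
        if not line:
            continue  # skip blank lines
        word, number = line.split(',')
        fields[number] = word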

original_string = ' 1 2 3 4 "5" 6 7: 2 8 9 10 7 4 5. '

def replacement(match):
    """
    This function accepts regular expression match and returns corresponding replacement if it's found in `fields`
    """
    return fields.get(match.group(0), '')

result = re.sub(r'\d+', replacement, original_string)
print(result)

You can learn more about regular expressions in the Python documentation for the re library. It is a very powerful tool for text processing and parsing.
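To try the approach without creating decompress.txt, the table can be fed from an in-memory string. The sketch below is a variant under stated assumptions, not the answer's exact code: `TABLE`, `load_table`, and `decode` are hypothetical names, `csv.DictReader` is swapped in to consume the header row, and numbers missing from the table are left unchanged rather than removed.

```python
import csv
import io
import re

# Hypothetical in-memory stand-in for decompress.txt (same rows as in the question).
TABLE = """word,index
This,1
is,2
an,3
amazing,4
abstract,5
AND,6
this,7
the,8
end,9
of,10
"""

def load_table(handle):
    # DictReader consumes the "word,index" header and yields one dict per row.
    return {row["index"]: row["word"] for row in csv.DictReader(handle)}

def decode(text, table):
    # Replace every run of digits with its word; keep unknown numbers as they are.
    return re.sub(r"\d+", lambda m: table.get(m.group(0), m.group(0)), text)

fields = load_table(io.StringIO(TABLE))
print(decode('1 2 3 4 "5" 6 7: 2 8 9 10 7 4 5.', fields))
# This is an amazing "abstract" AND this: is the end of this amazing abstract.
```

Swapping `io.StringIO(TABLE)` for `open('decompress.txt', newline='')` should give the same dictionary from the real file.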

Answer 2

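For this case, you can use regular expressions from the re module and a couple of comprehensions.

First, import re and read all the lines into a list: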

import re

with open('decompress.txt') as f:
    lines = f.readlines()
#>> lines
# ['word,index\n', 'This,1\n', 'is,2\n', 'an,3\n', 'amazing,4\n', 
#  'abstract,5\n', 'AND,6\n', 'this,7\n', 'the,8\n', 'end,9\n', 'of,10']

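After that, use re.search with a pattern made of (.*), which selects anything before the comma, and (\d+), which selects the digits after it. In this case, skip the first line of the document.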

parsed_lines = [re.search(r'(.*),(\d+)', line) for line in lines if 'index' not in line]
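To see what the two groups capture, the pattern can be tried on single lines. This is only an illustrative sketch; the names `pattern`, `match`, and `header_match` are not from the answer.

```python
import re

pattern = r'(.*),(\d+)'

# A data line: group 1 is the text before the comma, group 2 the digits after it.
match = re.search(pattern, 'amazing,4\n')
print(match.group(1))  # amazing
print(match.group(2))  # 4

# The header has no digits after the comma, so re.search finds nothing.
header_match = re.search(pattern, 'word,index\n')
print(header_match)  # None
```

Because the header yields None, calling .group() on it would raise an AttributeError, which is why that line is filtered out before the matches are used.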

Finally, create a dictionary in which the index is the key and the text is the value.

fields = {int(line_match.group(2)): line_match.group(1) for line_match in parsed_lines}
# {1: 'This', 2: 'is', 3: 'an', 4: 'amazing', 5: 'abstract', 
#  6: 'AND', 7: 'this', 8: 'the', 9: 'end', 10: 'of'}

UPD: or build a plain list of words in the second step:

parsed_lines = [re.search(r'(.*),\d+', line).group(1) for line in lines if 'index' not in line]
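The steps above stop at building the lookup table. A minimal sketch of the remaining substitution step, assuming the integer-keyed dictionary from the third step, might look like this; `encoded`, `lookup`, and `decoded` are hypothetical names.

```python
import re

# Dictionary in the shape built in the third step: integer index -> word.
fields = {1: 'This', 2: 'is', 3: 'an', 4: 'amazing', 5: 'abstract',
          6: 'AND', 7: 'this', 8: 'the', 9: 'end', 10: 'of'}

encoded = '1 2 3 4 "5" 6 7: 2 8 9 10 7 4 5.'

def lookup(match):
    # The keys are ints, so convert the matched digits before the lookup;
    # numbers missing from the table are left unchanged.
    return fields.get(int(match.group(0)), match.group(0))

decoded = re.sub(r'\d+', lookup, encoded)
print(decoded)
# This is an amazing "abstract" AND this: is the end of this amazing abstract.
```

With the list variant from the UPD, the same idea would need an offset, since list positions start at 0 while the table's indices start at 1.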
