How to remove multiple characters from multiple txt files

Question

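I am trying to write a script to automate the simple task of removing characters from txt files, saving each file under the same name but without the characters. I have multiple txt files, e.g. 1.txt, 2.txt ... 200.txt, stored in a directory (Documents). I also have a txt file containing the characters I want to remove. At first I wanted to compare my chars_to_remove.txt against all my different files (1.txt, 2.txt ...), but I could not find a way to do it. Instead, I created a string containing all the characters I want to remove.

Let's say I have the following string in the 1.txt file:

Mean concentrations α, maximum value ratio β and reductions in NO2 due to the lockdown Δ, March 2020, 2019 and 2018 in Madrid and Barcelona (Spain).

I want to remove the α, β and Δ characters from the string. This is my code.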

import glob 
import os 

chars_to_remove = '‘’“”|n.d.…•∈αβδΔεθϑφΣμτσχ€$∞http:www.←→≥≤<>▷×°±*⁃'

file_location = os.path.join('Desktop', 'Documents', '*.txt')
file_names = glob.glob(file_location)
print(file_names)

for f in file_names:
    outfile = open(f,'r',encoding='latin-1')
    data = outfile.read()
    if chars_to_remove in data:
        data.replace(chars_to_remove, '')
    outfile.close()

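The variable data stores all the content of the txt file on each iteration. I want to check whether chars_to_remove occurs in the string and remove it with the replace() function. I tried different approaches suggested here and here, without success.

I also tried comparing it as a list: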

chars_to_remove = ['‘','’','“','”','|','n.d.','…','•','∈','α','β','δ','Δ','ε','θ','ϑ','φ','Σ','μ','τ','σ','χ','€','$','∞','http:','www.','←','→','≥','≤','<','>','▷','×','°','±','*','⁃']

But I got a data type error when comparing.

Any further ideas would be much appreciated!

Answer 1

It might not be as fast, but why not use a regular expression to remove the characters/phrases?

import re

# '.', '$', '|' and '*' are regex metacharacters, so they are escaped to match literally.
pattern = re.compile(r"(‘|’|“|”|\||n\.d\.|…|•|∈|α|β|δ|Δ|ε|θ|ϑ|φ|Σ|μ|τ|σ|χ|€|\$|∞|http:|www\.|←|→|≥|≤|<|>|▷|×|°|±|\*|⁃)")
result = pattern.sub("", 'Mean concentrations α, maximum value ratio β and reductions in NO2 due to the lockdown Δ, March 2020, 2019 and 2018 in Madrid and Barcelona (Spain).')
print(result)

Output

Mean concentrations , maximum value ratio  and reductions in NO2 due to the lockdown , March 2020, 2019 and 2018 in Madrid and Barcelona (Spain).
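
The same kind of pattern can also be built from the list in the question, letting re.escape take care of the metacharacters. The following is a minimal sketch and not part of the original answer: the helper name build_pattern and the sample text are made up for illustration, and sorting by length is an extra precaution so that longer tokens are tried first.

```python
import re

# Tokens taken from the list in the question.
tokens = ['‘', '’', '“', '”', '|', 'n.d.', '…', '•', '∈', 'α', 'β', 'δ', 'Δ',
          'ε', 'θ', 'ϑ', 'φ', 'Σ', 'μ', 'τ', 'σ', 'χ', '€', '$', '∞', 'http:',
          'www.', '←', '→', '≥', '≤', '<', '>', '▷', '×', '°', '±', '*', '⁃']


def build_pattern(tokens):
    """Hypothetical helper: compile one alternation matching any token literally."""
    # Longest tokens first, so a longer token is never shadowed by a shorter one.
    ordered = sorted(tokens, key=len, reverse=True)
    return re.compile("|".join(re.escape(t) for t in ordered))


pattern = build_pattern(tokens)
# Invented sample text, used only to exercise the pattern.
text = 'Price: $5 (n.d.), see www.example.org for α and β.'
cleaned = pattern.sub('', text)
print(cleaned)
```

Because every token is escaped, 'n.d.' only matches those four literal characters and '$' matches a real dollar sign rather than the end of the string.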

Answer 2

The most efficient way is string.translate, which avoids looping over each invalid character. The outfile has to be defined somehow.

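import glob
import os

# Single characters to delete. The multi-character tokens are kept separate:
# translate works per character, so listing 'n.d.', 'http:' or 'www.' here
# would also delete every n, d, h, t, p, w, '.' and ':' in the text.
chars_to_remove = '‘’“”|…•∈αβδΔεθϑφΣμτσχ€$∞←→≥≤<>▷×°±*⁃'
tokens_to_remove = ['n.d.', 'http:', 'www.']
# Python 3: the third argument of str.maketrans lists the characters to delete.
translator = str.maketrans('', '', chars_to_remove)

file_location = os.path.join('Desktop', 'Documents', '*.txt')
file_names = glob.glob(file_location)
print(file_names)

for f in file_names:
    # Encoding kept from the question. Latin-1 has no Greek letters, so if the
    # files really contain α or β they are probably UTF-8 (encoding='utf-8').
    with open(f, 'r', encoding='latin-1') as infile:
        data = infile.read()
    # String methods return a new string, so the result must be assigned.
    data = data.translate(translator)
    for token in tokens_to_remove:
        data = data.replace(token, '')

    # Now data is translated.
    # You must write it to a new file.
    with open('...', 'wt') as outfile:
        outfile.write(data)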
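
In current Python 3 the string module has no maketrans function; the equivalent is the static method str.maketrans, whose optional third argument lists characters to delete. Below is a small sketch on the sample sentence, assuming only single characters need to be removed (the variable names are chosen here for illustration):

```python
# Characters to delete; every character in this string is removed individually.
chars_to_remove = 'αβΔ'

# With three arguments, str.maketrans maps each character of the third
# argument to None, which str.translate interprets as "delete".
translator = str.maketrans('', '', chars_to_remove)

sample = 'Mean concentrations α, maximum value ratio β and reductions in NO2 due to the lockdown Δ.'
cleaned = sample.translate(translator)
print(cleaned)

# Caveat: a multi-character token such as 'n.d.' placed in the deletion set
# removes every 'n', 'd' and '.' in the text, not just the token itself.
over_eager = 'no date (n.d.)'.translate(str.maketrans('', '', 'n.d.'))
print(over_eager)
```

The second call shows why tokens like 'n.d.' or 'www.' are better handled with replace() or a regular expression than with a translation table.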

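This code works, but it is not optimal because each file is loaded completely into memory. A better way is to loop over infile and write to outfile at the same time.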
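
The streaming idea could look like the sketch below: read each file line by line, write the cleaned lines to a temporary file, then replace the original so the result keeps the same name. This is not the answerer's code. The function names clean_file and clean_directory, the UTF-8 encoding, the short character set and the temporary-file approach are all assumptions made here for illustration. Processing line by line also assumes that nothing to be removed spans a line break.

```python
import glob
import os
import tempfile

# Single characters to delete (an assumed subset; extend as needed).
CHARS_TO_REMOVE = 'αβδΔ'
TRANSLATOR = str.maketrans('', '', CHARS_TO_REMOVE)


def clean_file(path, encoding='utf-8'):
    """Hypothetical helper: strip CHARS_TO_REMOVE from one file, in place."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file lives in the same directory so that os.replace
    # does not have to move data across filesystems.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix='.tmp')
    try:
        with os.fdopen(fd, 'w', encoding=encoding, newline='') as outfile, \
                open(path, 'r', encoding=encoding, newline='') as infile:
            for line in infile:  # only one line is held in memory at a time
                outfile.write(line.translate(TRANSLATOR))
        os.replace(tmp_path, path)  # the cleaned file takes the original name
    except BaseException:
        os.remove(tmp_path)  # do not leave a half-written temporary file behind
        raise


def clean_directory(pattern):
    """Hypothetical helper: clean every file matching a glob pattern."""
    for path in glob.glob(pattern):
        clean_file(path)
```

With the layout from the question, the call would be clean_directory(os.path.join('Desktop', 'Documents', '*.txt')).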

Answer 1: 1, 2021-01-07 21:41:35
Answer 2: 0, 2021-01-07 21:34:29