Python cannot read a file containing a specific string

Question

I've written a function to remove certain words and characters from a string. The string in question is read into the program from a file. The program works fine except when the file contains the following text anywhere in its body.

Security Update for Secure Boot (3177404) This security update resolves a vulnerability in Microsoft Windows. The vulnerability could allow Secure Boot security features to be bypassed if an attacker installs an affected policy on a target device. An attacker must have either administrative privileges or physical access to install a policy and bypass Secure Boot.

I've never experienced such weird behavior. Does anybody have any suggestions?

This is the function I've written.

def scrub(file_name):
    try:
        file = open(file_name,"r")
        unscrubbed_string = file.read()
        file.close()

        cms = open("common_misspellings.csv","r")
        for line in cms:
            replacement = line.strip('\n').split(',')
            while replacement[0] in unscrubbed_string:
                unscrubbed_string = unscrubbed_string.replace(replacement[0],replacement[1])

        cms.close()

        special_chars = ['.',',',';',"'","\""]

        for char in special_chars:
            while char in unscrubbed_string:
                unscrubbed_string = unscrubbed_string.replace(char,"")

        unscrubbed_list = unscrubbed_string.split()

        noise = open("noise.txt","r")
        noise_list = []

        for word in noise:
            noise_list.append(word.strip('\n'))

        noise.close()

        for noise in noise_list:
            while noise in unscrubbed_list:
                    unscrubbed_list.remove(noise)
        return unscrubbed_list

    except:
        print("""[*] File not found.""")

Answer 1

Your code may be hanging because your .replace() call is in a while loop. If, for any particular line of your .csv file, the replacement[0] string is a substring of its corresponding replacement[1], and if either of them appears in your critical text, then the while loop will never finish. In fact, you don't need the while loop at all: a single .replace() call will replace all occurrences.
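To make the hang concrete, here is a minimal sketch. The helper replace_until_gone, its max_passes cap, and the sample strings are hypothetical and are not taken from the question's CSV file; the cap exists only so the demonstration terminates.

```python
def replace_until_gone(text, old, new, max_passes=5):
    """Mimic the question's `while old in text` loop, but stop after max_passes."""
    passes = 0
    while old in text and passes < max_passes:
        text = text.replace(old, new)
        passes += 1
    return text, passes


# A single call already replaces every occurrence, so no loop is needed.
once = "teh cat and teh dog".replace("teh", "the")
print(once)

# When the target is a substring of its replacement, the target is still
# present after every pass, so an uncapped loop would never exit.
grown, passes = replace_until_gone("a lot", "a lot", "a lot more")
print(passes, repr(grown))
```

Without the cap, the second call would keep growing the string until memory ran out, which would look like the program hanging on one particular input file.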

But that's only one example of the problems you'll encounter with your current approach of using a blanket unscrubbed_string.replace(...). You'll either need to use regular expression substitution (from the re module), or break your string down into words yourself and work word-by-word instead. Why? Well, here's a simple example: 'Teh' needs to be corrected to 'The', but what if the document contains a reference to 'Tehran'? Your "Secure Boot" text will contain an example analogous to this.

If you go the regular-expression route, the symbol \b solves this by matching word boundaries of any kind (start or end of string, spaces, punctuation). Here's a simplified example:

import re

replacements = {
    'Teh':'The',
}
unscrubbed = 'Teh capital of Iran is Tehran. Teh capital of France is Paris.'

better = unscrubbed
naive = unscrubbed
for target, replacement in replacements.items():
    naive = naive.replace(target, replacement)

    # re.escape keeps any regex metacharacters in the target literal
    pattern = r'\b' + re.escape(target) + r'\b'
    better = re.sub(pattern, replacement, better)

print(unscrubbed)
print(naive)
print(better)

Output (the mistake is the naive version turning 'Tehran' into 'Theran'):

Teh capital of Iran is Tehran. Teh capital of France is Paris. (unscrubbed)

The capital of Iran is Theran. The capital of France is Paris. (naive)

The capital of Iran is Tehran. The capital of France is Paris. (better)
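The other route mentioned above, working word-by-word, could look roughly like the sketch below. The function scrub_words, the corrections dictionary, and the noise_words set are hypothetical stand-ins for the contents of common_misspellings.csv and noise.txt. The sketch also assumes it is acceptable to strip punctuation only from the edges of each word, using the broader string.punctuation set; the original function instead removes five specific characters from anywhere in the text.

```python
import string


def scrub_words(text, corrections, noise_words):
    """Correct and filter whole words only, so 'Teh' never matches inside 'Tehran'."""
    cleaned = []
    for raw in text.split():
        # Strip punctuation from the edges of the word only.
        word = raw.strip(string.punctuation)
        # Whole-word dictionary lookup instead of substring replacement.
        word = corrections.get(word, word)
        if word and word not in noise_words:
            cleaned.append(word)
    return cleaned


corrections = {"Teh": "The"}   # hypothetical misspelling table
noise_words = {"of", "is"}     # hypothetical noise list
result = scrub_words("Teh capital of Iran is Tehran.", corrections, noise_words)
print(result)
```

Because every lookup is on a complete word, there is no loop that can fail to terminate, and each word is handled exactly once.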

Solution 1
1 Accepted 2016-09-24 00:29:31
