
Python cannot read a file which contains a specific string

I've written a function to remove certain words and characters from a string. The string in question is read into the program from a file. The program works fine except when the file contains the following text anywhere in its body.

Security Update for Secure Boot (3177404) This security update resolves a vulnerability in Microsoft Windows. The vulnerability could allow Secure Boot security features to be bypassed if an attacker installs an affected policy on a target device. An attacker must have either administrative privileges or physical access to install a policy and bypass Secure Boot.

I've never experienced such weird behavior. Anybody have any suggestions?

This is the function I've written.

def scrub(file_name):
    try:
        file = open(file_name,"r")
        unscrubbed_string = file.read()
        file.close()

        cms = open("common_misspellings.csv","r")
        for line in cms:
            replacement = line.strip('\n').split(',')
            while replacement[0] in unscrubbed_string:
                unscrubbed_string = unscrubbed_string.replace(replacement[0],replacement[1])

        cms.close()

        special_chars = ['.',',',';',"'","\""]

        for char in special_chars:
            while char in unscrubbed_string:
                unscrubbed_string = unscrubbed_string.replace(char,"")

        unscrubbed_list = unscrubbed_string.split()

        noise = open("noise.txt","r")
        noise_list = []

        for word in noise:
            noise_list.append(word.strip('\n'))

        noise.close()

        for noise in noise_list:
            while noise in unscrubbed_list:
                    unscrubbed_list.remove(noise)
        return unscrubbed_list

    except:
        print("""[*] File not found.""")

Your code may be hanging because your .replace() call is in a while loop. If, for any particular line of your .csv file, the replacement[0] string is a substring of its corresponding replacement[1], and if either of them appears in your critical text, then the while loop will never finish. In fact, you don't need the while loop at all: a single .replace() call will replace all occurrences.
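To make the failure mode concrete, here is a minimal sketch. The misspelling pair and the sample text are made up for illustration and are not taken from the asker's common_misspellings.csv; they only reproduce the "target is a substring of its replacement" situation described above.

```python
# Hypothetical pair: the target is a substring of its own replacement.
target, replacement = "boot", "boots"

text = "secure boot and boot policy"

# A single call replaces every occurrence; no loop is needed.
fixed = text.replace(target, replacement)
print(fixed)

# The target is still present after replacing, so a loop of the form
# `while target in text: text = text.replace(...)` would never exit.
print(target in fixed)
```

Because "boots" still contains "boot", the loop condition stays true forever and the string just keeps growing.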

But that's only one example of the problems you'll encounter with your current approach of using a blanket unscrubbed_string.replace(...). You'll either need to use regular expression substitution (from the re module), or break your string down into words yourself and work word-by-word instead. Why? Well, here's a simple example: 'Teh' needs to be corrected to 'The', but what if the document contains a reference to 'Tehran'? Your "Secure Boot" text will contain an example analogous to this.
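The word-by-word alternative could look roughly like the sketch below. The corrections dictionary stands in for the contents of common_misspellings.csv, and correct_words is a hypothetical helper name, not something from the original code. Note that splitting and re-joining collapses runs of whitespace into single spaces.

```python
import string

# Hypothetical corrections table standing in for common_misspellings.csv.
corrections = {"Teh": "The"}

def correct_words(text, corrections):
    """Replace whole words only, keeping punctuation attached to each word."""
    fixed = []
    for word in text.split():
        # Compare the word without surrounding punctuation, e.g. "Tehran." -> "Tehran".
        core = word.strip(string.punctuation)
        if core in corrections:
            word = word.replace(core, corrections[core], 1)
        fixed.append(word)
    return " ".join(fixed)

result = correct_words("Teh capital of Iran is Tehran.", corrections)
print(result)
```

Here 'Teh' is corrected because it matches a whole word, while 'Tehran' is left alone.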

If you go the regular-expression route, the symbol \b solves this by matching word boundaries of any kind (start or end of string, spaces, punctuation). Here's a simplified example:

import re

replacements = {
    'Teh':'The',
}
unscrubbed = 'Teh capital of Iran is Tehran. Teh capital of France is Paris.'

better = unscrubbed
naive = unscrubbed
for target, replacement in replacements.items():
    naive = naive.replace(target, replacement)

    pattern = r'\b' + target + r'\b'
    better = re.sub(pattern, replacement, better)

print(unscrubbed)
print(naive)
print(better)

Output (the mistakes are 'Teh' in the unscrubbed line and 'Theran' in the naive line):

Teh capital of Iran is Tehran. Teh capital of France is Paris. (unscrubbed)

The capital of Iran is Theran. The capital of France is Paris. (naive)

The capital of Iran is Tehran. The capital of France is Paris. (better)
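Putting the pieces together, the scrubbing logic from the question might be restructured along these lines. This is only a sketch under assumptions: scrub_text is a hypothetical name, and it takes the corrections and noise words as in-memory values rather than reading common_misspellings.csv and noise.txt, so the file handling is left out.

```python
import re

def scrub_text(text, corrections, noise_words):
    """Fix misspellings on word boundaries, strip punctuation, drop noise words."""
    for wrong, right in corrections.items():
        # re.escape guards against regex metacharacters in the misspelling;
        # the function form keeps backslashes in `right` from being treated as escapes.
        pattern = r"\b" + re.escape(wrong) + r"\b"
        text = re.sub(pattern, lambda _match: right, text)

    # Remove the same special characters as the original function, in one pass.
    text = text.translate(str.maketrans("", "", ".,;'\""))

    # A set makes each membership test cheap; the comprehension replaces the
    # repeated list.remove() calls.
    noise = set(noise_words)
    return [word for word in text.split() if word not in noise]

words = scrub_text(
    "Teh capital of Iran is Tehran.",
    corrections={"Teh": "The"},
    noise_words=["of", "is"],
)
print(words)
```

None of the steps uses a while loop, so the function finishes regardless of what the corrections contain.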
