Removing all quote characters from text files

I am reading a UTF-8 file using Python's normal text decoding, and I need to get rid of all the quotes in the file. However, UTF-8 text can contain several kinds of quote characters, and I can't figure out how to get rid of all of them. The code below is an example of what I've been trying to do.

def change_things(string, remove):
    for thing in remove:
        string = string.replace(thing, remove[thing])
    return string

where

remove = {
'\'': '',
'\"': '',
}

Unfortunately, this code only removes straight quotes, not left- or right-facing (curly) quotes. Is there a way to remove all such quotes using a format similar to what I have here? (I recognize that there are other, more efficient ways of removing characters from strings, but given the overall context of the code, this approach makes more sense for my specific project.)

You can just type those sorts of quotes into your source file and replace them the same as any other character.

utf8_quotes = "“”‘’‹›«»"
mystr = 'Text with “quotes”'
# str.replace returns a new string, so the result has to be assigned
mystr = mystr.replace('“', '"').replace('”', '"')

There are a few different single-quote variants too.

There's a list of Unicode quote marks at https://gist.github.com/goodmami/98b0a6e2237ced0025dd. That should allow you to remove any type of quote.
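To keep the format from the question, the extra quote characters can be added to the `remove` dictionary. The following is a minimal sketch, assuming the characters in `QUOTE_CHARS` (a hypothetical name, and not an exhaustive list) cover the quotes that appear in your file:

```python
# Hypothetical, non-exhaustive set of quote characters to strip:
# straight quotes, curly double/single quotes, low-9 quotes,
# and single/double angle quotation marks.
QUOTE_CHARS = "'\"\u201c\u201d\u2018\u2019\u201a\u201e\u2039\u203a\u00ab\u00bb"

# Same shape as the question's dictionary: character -> replacement.
remove = {ch: '' for ch in QUOTE_CHARS}


def change_things(string, remove):
    # Replace each key of `remove` with its mapped value.
    for thing in remove:
        string = string.replace(thing, remove[thing])
    return string


cleaned = change_things('\u201cIt\u2019s\u201d a "test"', remove)
print(cleaned)
```

Because the dictionary is built from a single string, covering another quote variant only means appending one character to `QUOTE_CHARS`.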

There are multiple ways to do this; regex is one:

import re
newstr = re.sub(u'[\u201c\u201d\u2018\u2019]', '', oldstr)
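As a runnable sketch of the regex approach, the character class can be widened to include straight quotes and angle quotation marks as well. The name `strip_quotes` and the chosen set of characters are assumptions for illustration, not a complete list:

```python
import re

# Character class of quote marks to delete (illustrative, not exhaustive):
# straight quotes, curly quotes, and angle quotation marks.
QUOTE_RE = re.compile('[\'"\u201c\u201d\u2018\u2019\u2039\u203a\u00ab\u00bb]')


def strip_quotes(text):
    # Replace every matched quote character with the empty string.
    return QUOTE_RE.sub('', text)


print(strip_quotes('Text with \u201cquotes\u201d and \u2018more\u2019'))
```

Compiling the pattern once avoids rebuilding it on every call if the function is applied to many lines of a file.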

Another clean way to do it is to use the Unidecode package. This doesn't remove the quotes directly, but converts them to neutral quotes. It also converts any non-ASCII character to its closest ASCII equivalent:

from unidecode import unidecode
newstr = unidecode(oldstr)

Then, you can remove the quotes with your code.
