
Removing all quote characters from text files

I am reading a UTF-8 file using Python's normal text decoding, and I need to get rid of all the quotes in it. However, the file contains several types of quote characters and I can't figure out how to remove all of them. The code below is an example of what I've been trying.

def change_things(string, remove):
    for thing in remove:
        string = string.replace(thing, remove[thing])
    return string

where

remove = {
'\'': '',
'\"': '',
}

Unfortunately, this code only removes normal quotes, not left or right facing quotes. Is there any way to remove all such quotes using a similar format to what I have done (I recognize that there are other, more efficient ways of removing items from strings but given the overall context of the code this makes more sense for my specific project)?

You can just type those sorts of quotes into your source file and replace them the same as any other character.

utf8_quotes = "“”‘’‹›«»"
mystr = 'Text with “quotes”'
# str.replace returns a new string, so assign the result
mystr = mystr.replace('“', '"').replace('”', '"')

There are a few different single-quote variants too.
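
Applied to the dictionary approach from the question, this amounts to adding one entry per quote character. A minimal sketch, reusing the asker's `change_things` and `remove` names; the particular set of characters listed here is an assumption and can be trimmed or extended:

```python
def change_things(string, remove):
    # Replace every key of `remove` found in `string` with its mapped value.
    for thing in remove:
        string = string.replace(thing, remove[thing])
    return string

remove = {
    "'": "",        # straight single quote / apostrophe
    '"': "",        # straight double quote
    "\u2018": "",   # ‘ left single quotation mark
    "\u2019": "",   # ’ right single quotation mark
    "\u201c": "",   # “ left double quotation mark
    "\u201d": "",   # ” right double quotation mark
    "\u2039": "",   # ‹ single left-pointing angle quotation mark
    "\u203a": "",   # › single right-pointing angle quotation mark
    "\u00ab": "",   # « left-pointing double angle quotation mark
    "\u00bb": "",   # » right-pointing double angle quotation mark
}

cleaned = change_things("Text with \u201cquotes\u201d and \u2018more\u2019", remove)
print(cleaned)
```

Note that U+2019 is also commonly used as a typographic apostrophe, so removing it turns "don’t" into "dont".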

There's a list of Unicode quotation marks at https://gist.github.com/goodmami/98b0a6e2237ced0025dd. That should allow you to remove any type of quote.
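
With a longer list of characters, a translation table avoids one `replace` call per character while keeping the dictionary format from the question. A sketch using the built-in `str.maketrans` and `str.translate`; `QUOTE_CHARS` and `strip_quotes` are hypothetical names, and the characters listed are only a subset chosen for illustration, not the full list from the link:

```python
# Hypothetical subset of quote characters: ' " ‘ ’ ‚ ‛ “ ” „ ‟ ‹ › « »
QUOTE_CHARS = (
    "'\""
    "\u2018\u2019\u201a\u201b"
    "\u201c\u201d\u201e\u201f"
    "\u2039\u203a\u00ab\u00bb"
)

# Same shape as the question's dictionary: each quote maps to an empty string.
remove = dict.fromkeys(QUOTE_CHARS, "")

# str.maketrans accepts a dict whose keys are single characters.
table = str.maketrans(remove)

def strip_quotes(text):
    # Delete every character listed in QUOTE_CHARS in a single pass.
    return text.translate(table)

print(strip_quotes("\u201eHallo\u201c, \u00absalut\u00bb, 'hi'"))
```

The `remove` dictionary built here can still be passed to a loop-based function like the one in the question, so both styles can share one definition of the character set.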

There are multiple ways to do this; a regular expression is one:

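import re
oldstr = 'Text with “quotes”'  # sample input
# the character class matches “ ” ‘ ’; add more code points to cover other quote types
newstr = re.sub(u'[\u201c\u201d\u2018\u2019]', '', oldstr)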

Another clean way to do it is to use the Unidecode package. This doesn't remove the quotes directly, but converts them to straight ASCII quotes. Be aware that it also converts every other non-ASCII character to its closest ASCII equivalent:

from unidecode import unidecode
newstr = unidecode(oldstr)

Then, you can remove the quotes with your code.


 