
Removing various symbols from a text

I am trying to clean some texts that are very different from one another. I would like to remove the headlines, quotation marks, abbreviations, special symbols and full stops that don't actually end sentences.

Example input:

This is a headline

And inside the text there are 'abbreviations', e.g. "bzw." in German or some German dates, like 2. Dezember 2017. Sometimes there are even enumerations, that I might just eliminate completely.
• they have
◦ different bullet points
- or even equations and 
Sometimes there are special symbols. ✓

Example output:

And inside the text there are abbreviations, for example beziehungsweise in German or some German dates, like 2 Dezember 2017. Sometimes there are even enumerations, that I might just eliminate completely. Sometimes there are special symbols.

What I did:

with open(r'C:\Users\me\Desktop\ex.txt', 'r', encoding="utf8") as infile:
    data = infile.read()

data = data.replace("'", '')
data = data.replace("e.g.", 'for example')
# and so on

with open(r'C:\Users\me\Desktop\ex.txt', 'w', encoding="utf8") as outfile:
    outfile.write(data)

My problems (although number 2 is the most important):

  1. I just want a string with this input, but it obviously breaks because of the quotation marks. Is there any way to do this other than working with files like I did? In reality, I'm copy-pasting a text and want an app to clean it.

  2. The code seems very inefficient because I just manually write the things that I remember to delete/clean, but I don't know all the abbreviations by heart. How do I clean it in one go, so to speak?

  3. Is there any way to eliminate the headline and the enumeration, and the dot (.) that appears in that German date? My code doesn't do that.

Edit: I just remembered stuff like text = re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)|^rt|http.+?", "", text), but regex is inefficient for huge texts, isn't it?

To easily remove all non-standard symbols, you can use str.isalnum(), which returns True only if every character in a non-empty string is alphanumeric, or str.isascii(), which returns True if every character is ASCII. str.isprintable() seems viable too. A full list can be found in the Python documentation on string methods. Using these methods, you can iterate over the string and filter each character. So something like this:

filteredData = "".join(filter(str.isalnum, data))
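
For illustration, here is a minimal sketch of what a character-level filter of this kind produces on one line from the example input. Note that filter() yields an iterator of characters, so the result is joined back into a string. The variable names sample and only_alnum are assumptions made for the demo.

```python
# Sample line taken from the question's example input.
sample = "Sometimes there are special symbols. ✓"

# filter() yields the characters for which the predicate is True;
# "".join(...) turns them back into a single string.
only_alnum = "".join(filter(str.isalnum, sample))
print(only_alnum)  # Sometimestherearespecialsymbols

# Caveat: spaces and the full stop are not alphanumeric, so they are dropped too.
```

So a bare str.isalnum filter is probably too aggressive for running text; whitespace and sentence punctuation need to be allowed explicitly.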

You can also combine these by creating a function that checks several string methods, like this:

def FilterKey(char: str) -> bool:
    return char.isidentifier() and char.isalpha()

Which can be used in filter like this:

filteredData = "".join(filter(FilterKey, data))

If the function returns True, the character is included in the output; if it returns False, it is excluded.
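
As a rough sketch of how this idea could be extended toward the output shown in the question: the code below keeps whitespace and basic sentence punctuation, skips bullet lines, expands a small hand-written abbreviation table, and removes the dot after the day number in German dates. The names clean_text, keep_char, ABBREVIATIONS, BULLETS, GERMAN_MONTHS and DATE_DOT are all hypothetical, the abbreviation table is deliberately incomplete, and headline detection is not attempted.

```python
import re

# Hypothetical, deliberately incomplete lookup table; extend as needed.
ABBREVIATIONS = {"e.g.": "for example", "bzw.": "beziehungsweise"}
# Assumption: any line starting with one of these is an enumeration item.
# Note that "-" will also drop lines that merely begin with a hyphen.
BULLETS = ("•", "◦", "-")
GERMAN_MONTHS = ("Januar|Februar|März|April|Mai|Juni|Juli|August|"
                 "September|Oktober|November|Dezember")
# Matches the dot in "2. Dezember": a 1-2 digit day, a dot, then a month name.
DATE_DOT = re.compile(r"\b(\d{1,2})\.(?=\s+(?:" + GERMAN_MONTHS + r")\b)")


def keep_char(char: str) -> bool:
    """Keep letters, digits, whitespace and basic sentence punctuation."""
    return char.isalnum() or char.isspace() or char in ".,;:!?"


def clean_text(text: str) -> str:
    # Drop empty lines and enumeration lines (those starting with a bullet).
    lines = [ln for ln in text.splitlines()
             if ln.strip() and not ln.lstrip().startswith(BULLETS)]
    text = " ".join(lines)
    # Expand known abbreviations before any dots are touched.
    for short, full in ABBREVIATIONS.items():
        text = text.replace(short, full)
    # Remove the dot after the day number, keeping the number itself.
    text = DATE_DOT.sub(r"\1", text)
    # Character-level filter, then collapse runs of whitespace.
    text = "".join(filter(keep_char, text))
    return re.sub(r"\s+", " ", text).strip()


# A triple-quoted string can hold pasted text containing both ' and ".
raw = """There are 'abbreviations', e.g. "bzw." in German, like 2. Dezember 2017.
• they have
◦ different bullet points
Sometimes there are special symbols. ✓"""

print(clean_text(raw))
```

Headline removal is left out because the example gives no reliable textual marker for it; a heuristic such as dropping short lines without a final full stop would be a further assumption. The abbreviations still have to be listed by hand, but keeping them in one dictionary at least avoids a long chain of separate replace calls.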
