Removing various symbols from a text

Question

I am trying to clean some texts that are very different from one another. I would like to remove the headlines, quotation marks, abbreviations, special symbols and periods that don't actually end sentences.

Example input:

This is a headline

And inside the text there are 'abbreviations', e.g. "bzw." in German or some German dates, like 2. Dezember 2017. Sometimes there are even enumerations, that I might just eliminate completely.
• they have
◦ different bullet points
- or even equations and 
Sometimes there are special symbols. ✓

Example output:

And inside the text there are abbreviations, for example beziehungsweise in German or some German dates, like 2 Dezember 2017. Sometimes there are even enumerations, that I might just eliminate completely. Sometimes there are special symbols.

What I did:

with open(r'C:\Users\me\Desktop\ex.txt', 'r', encoding="utf8") as infile:
    data = infile.read()
    data = data.replace("'", '')
    data = data.replace("e.g.", 'for example')
    # and so on
with open(r'C:\Users\me\Desktop\ex.txt', 'w', encoding="utf8") as outfile:
    outfile.write(data)

My problems (although number 2 is the most important):

1. I just want a string with this input, but it obviously breaks because of the quotation marks. Is there any way to do this other than working with files like I did? In reality, I'm copy-pasting a text and want an app to clean it.
2. The code seems very inefficient because I manually write out the things that I remember to delete or clean, but I don't know all the abbreviations by heart. How do I clean it in one go, so to say?
3. Is there any way to eliminate the headline, the enumeration, and the period (.) that appears in that German date? My code doesn't do that.

Edit: I just remembered stuff like text = re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)|^rt|http.+?", "", text), but regex is inefficient for huge texts, isn't it?

Answer 1

To easily remove all non-standard symbols, you can use str.isalnum(), which returns True only for alphanumeric strings, or str.isascii(), which returns True only for ASCII strings. str.isprintable() seems viable too. A full list of string methods can be found in the Python documentation. Using those methods, you can iterate over the string and filter each character. So something like this:

# filter() returns a lazy iterator, so join the kept characters back into a string
filteredData = ''.join(filter(str.isidentifier, data))

You can also combine those by writing a function that checks several conditions at once, like this:

def FilterKey(char: str) -> bool:
    return char.isidentifier() and char.isalpha()

Which can be used in filter like this:

# join the characters that pass FilterKey back into a string
filteredData = ''.join(filter(FilterKey, data))

If the function returns True, the character is included in the output; if it returns False, it is excluded.
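
Filtering character by character takes care of stray symbols, but the headline, the bullet lines, the abbreviations and the dot in the German date need a few extra rules. The following is only a minimal sketch under stated assumptions, not a complete solution: the ABBREVIATIONS table, the helper names keep_char and clean_text, and the heuristics (a line starting with a bullet character is an enumeration item, a line that does not end with sentence punctuation is a headline) are all hypothetical and will need tuning for real texts.

```python
import re

# Hypothetical abbreviation table: extend it with the abbreviations in your texts.
ABBREVIATIONS = {
    "e.g.": "for example",
    "bzw.": "beziehungsweise",
}

# One compiled pattern for all abbreviations, not preceded by a word character.
ABBREV_RE = re.compile(
    r"(?<!\w)(?:" + "|".join(re.escape(a) for a in ABBREVIATIONS) + r")"
)

# Assumption: a line starting with one of these characters is an enumeration item.
BULLET_RE = re.compile(r"^[•◦*-]")

# The dot after a one- or two-digit number followed by whitespace, as in "2. Dezember".
ORDINAL_DOT_RE = re.compile(r"\b(\d{1,2})\.(?=\s)")

SENTENCE_END = (".", "!", "?")


def keep_char(char: str) -> bool:
    """Keep letters, digits, whitespace and basic punctuation."""
    return char.isalnum() or char.isspace() or char in ".,;:!?-"


def clean_text(text: str) -> str:
    kept = []
    for line in text.splitlines():
        line = line.strip()
        if not line or BULLET_RE.match(line):
            continue  # skip blank lines and enumeration items
        # Expand abbreviations first, because they contain dots.
        line = ABBREV_RE.sub(lambda m: ABBREVIATIONS[m.group(0)], line)
        # Remove the dot of ordinal numbers such as the day in a German date.
        line = ORDINAL_DOT_RE.sub(r"\1", line)
        # Drop quotation marks and special symbols character by character.
        line = "".join(filter(keep_char, line)).strip()
        # Assumption: a line that does not end a sentence is a headline.
        if line.endswith(SENTENCE_END):
            kept.append(line)
    return " ".join(kept)


# A triple-quoted string can hold pasted text containing both ' and ".
sample = """This is a headline

And inside the text there are 'abbreviations', e.g. "bzw." in German or some German dates, like 2. Dezember 2017. Sometimes there are even enumerations, that I might just eliminate completely.
• they have
◦ different bullet points
- or even equations and
Sometimes there are special symbols. ✓
"""

print(clean_text(sample))
```

On the sample from the question this prints the desired example output. The heuristics have known limits: a sentence that ends in a one- or two-digit number loses its final period, and a headline that happens to end with a period is kept. Compiling the patterns once and reusing them keeps the regex cost modest, though for very large inputs it is worth measuring rather than assuming.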

solution1
0 2022-01-04 01:05:56
