Text Pre-processing + Python + CSV : Removing special characters from a column of a CSV

Question

I am working on a text classification problem. My CSV file contains a column called 'description' which describes events. Unfortunately, that column contains many special characters in addition to English words. Sometimes an entire field consists of such characters; sometimes only a few words are garbled and the rest are English. Here are two sample fields from two different rows:

├á┬ñ┬╣├á┬ñ┬░ ├á┬ñ┬╡├á┬ñ┬░├á┬Ñ┬ì├á┬ñ┬╖ ├á┬ñΓÇó├á┬ÑΓé¼ ├á┬ñ┬ñ├á┬ñ┬░├á┬ñ┬╣ ├á┬ñΓÇí├á┬ñ┬╕ ├á┬ñ┬╡├á┬ñ┬░├á┬Ñ┬ì├á┬ñ┬╖ ├á┬ñ┬¡├á┬ÑΓé¼ ├á┬ñ┬╕├á┬ñ┬┐├á┬ñΓÇÜ├á┬ñ┬º├á┬Ñ┬ü ├á┬ñ┬╕├á┬ÑΓÇí├á┬ñ┬¿├á┬ñ┬╛ ├á┬ñ┬ª├á┬Ñ┬ì├á┬ñ┬╡├á┬ñ┬╛├á┬ñ┬░├á┬ñ┬╛ ├á┬ñΓÇá├á┬ñ┬»├á┬ÑΓÇ╣├á┬ñ┼ô├á┬ñ┬┐├á┬ñ┬ñ ├á┬ñ┬╕├á┬ñ┬┐├á┬ñΓÇÜ├á┬ñ┬º├á┬ÑΓé¼ ├á┬ñ┬¬├á┬Ñ┬ì├á┬ñ┬░├á┬ÑΓé¼├á┬ñ┬«├á┬ñ┬┐├á┬ñ┬»├á┬ñ┬░ ├á┬ñ┬▓├á┬ÑΓé¼├á┬ñΓÇö ├á┬ñ┬½├á┬Ñ┬ü├á┬ñ┼╕├á┬ñ┬¼├á┬ÑΓÇ░├á┬ñ┬▓ ├á┬ñ┼╕├á┬ÑΓÇÜ├á┬ñ┬░├á┬Ñ┬ì├á┬ñ┬¿├á┬ñ┬╛├á┬ñ┬«├á┬ÑΓÇí├á┬ñΓÇÜ├á┬ñ┼╕ ├á┬ñΓÇó├á┬ñ┬╛ ├á┬ñΓÇá├á┬ñ┬»├á┬ÑΓÇ╣├á┬ñ┼ô├á┬ñ┬¿ ├á┬ñ┬Å├á┬ñ┬«├á┬ñ┬¼├á┬ÑΓé¼├á┬ñ┬Å├á┬ñ┬« ├á┬ñΓÇö├á┬Ñ┬ì├á┬ñ┬░├á┬ñ┬╛├á┬ñΓÇ░├á┬ñΓÇÜ├á┬ñ┬í ├á┬ñ┬«├á┬ÑΓÇí├á┬ñΓÇÜ ├á┬ñΓÇó├á┬ñ┬░├á┬ñ┬¿├á┬ÑΓÇí ├á┬ñ┼ô├á┬ñ┬╛ ├á┬ñ┬░├á┬ñ┬╣├á┬ÑΓé¼ ├á┬ñ┬╣├á┬Ñ╦å ├á┬ñ┼ô├á┬ñ┬┐├á┬ñ┬╕├á┬ñ┬«├á┬ÑΓÇí├á┬ñΓÇÜ ├á┬ñΓÇª├á┬ñΓÇÜ├á┬ñ┬í├á┬ñ┬░-19 ├á┬ñ┼╕├á┬ÑΓé¼├á┬ñ┬«├á┬ÑΓÇí├á┬ñΓÇÜ ├á┬ñ┬¡├á┬ñ┬╛├á┬ñΓÇö ├á┬ñ┬▓├á┬ÑΓÇí├á┬ñ┬ñ├á┬ÑΓé¼ ├á┬ñ┬╣├á┬Ñ╦å ├á┬ñΓÇá├á┬ñ┬¬ ├á┬ñ┬╕├á┬ñ┬¡├á┬ÑΓé¼ ├á┬ñ┬╕├á┬ÑΓÇí ├á┬ñ┬¿├á┬ñ┬┐├á┬ñ┬╡├á┬ÑΓÇí├á┬ñ┬ª├á┬ñ┬¿ ├á┬ñ┬╣├á┬Ñ╦å ├á┬ñ┬╕├á┬ñ┬«├á┬ñ┬╛├á┬ñ┼ô ├á┬ñΓÇó├á┬ÑΓÇí ├á┬ñ┬»├á┬Ñ┬ü├á┬ñ┬╡├á┬ñ┬╛├á┬ñΓÇ£├á┬ñΓÇÜ ├á┬ñΓÇó├á┬ÑΓÇ╣ ├á┬ñ┬¬├á┬Ñ┬ì├á┬ñ┬░├á┬ÑΓÇ╣├á┬ñ┬ñ├á┬Ñ┬ì├á┬ñ┬╕├á┬ñ┬╛├á┬ñ┬╣├á┬ñ┬¿ ├á┬ñΓÇó├á┬ñ┬░├á┬ñ┬¿├á┬ÑΓÇí ├á┬ñΓÇª├á┬ñ┬╡├á┬ñ┬╢├á┬Ñ┬ì├á┬ñ┬» ├á┬ñ┬¬├á┬ñ┬º├á┬ñ┬╛├á┬ñ┬░├á┬ÑΓÇí├á┬ñΓÇÜ

Unwind on the strums of Guitar &  immerse your soul into the magical vibes of music! ├»┬╕┬Å? ├»┬╕┬Å?..Guitar Night By Ashmik Patil.July 19, 2018.Thursday.9 PM Onwards.*Cover charges applicable...#GuitarNight #MusicalNight #MagicalMusic #MusicLove #Party #Enjoy #TheBarTerminal #Mumbaikars #Mumbai

In the first, the entire field is unreadable; in the second, only a few such characters are present and the rest is English.

I want to remove only those special characters and keep the English words as they are, since I need them to build a bag of words at a later stage.

How can I do that in Python? (I am using a Jupyter notebook.)

Answer 1

You can do this with a regex. Assuming you have already extracted the text from the CSV file:

# Python 3
import re
text = "Something with special characters á┬ñ┬╡├á┬ñ┬░├á┬Ñ┬ì├á┬ñ┬╖"
# Delete every run of characters outside the ASCII range 0x00-0x7F
cleaned_text = re.sub(r'[^\x00-\x7f]+', '', text)
print(cleaned_text)

Output - Something with special characters

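To understand the regex used: [^\x00-\x7f]+ matches one or more consecutive characters outside the ASCII range (0x00 to 0x7F), and re.sub replaces each such run with an empty string.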
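Removing the garbled runs usually leaves stray double spaces behind, which matters little for a bag of words but makes the text harder to inspect. The following is a minimal sketch, assuming Python 3, that wraps the same pattern in a helper and collapses the leftover whitespace; the names NON_ASCII and clean_text and the sample string are illustrative, not part of the answer above.

```python
import re

# Any run of characters outside the ASCII range 0x00-0x7F.
NON_ASCII = re.compile(r'[^\x00-\x7f]+')

def clean_text(text):
    """Drop non-ASCII characters, then collapse leftover whitespace."""
    ascii_only = NON_ASCII.sub('', text)
    # split() with no arguments splits on any whitespace run and
    # discards empty pieces, so joining with ' ' normalises spacing.
    return ' '.join(ascii_only.split())

# Hypothetical sample mixing English words with garbled characters.
sample = "Guitar &  immerse ├á┬ñ┬╣ ├á┬ñ┬░ music"
print(clean_text(sample))  # Guitar & immerse music
```

A field made up entirely of garbled characters comes back as an empty string, so such rows may need to be dropped before building the vocabulary.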

Answer 2

You can encode your string to ascii and ignore the errors.

>>> text = 'Something with special characters á┬ñ┬╡├á┬ñ┬░├á┬Ñ┬ì├á┬ñ┬╖'
>>> text = text.encode('ascii', 'ignore')

This gives you a bytes object, which you can decode back to a string:

>>> text
b'Something with special characters '

>>> text = text.decode('utf-8')
>>> text
'Something with special characters '
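The two steps can be folded into one helper. This is a small sketch, assuming Python 3 (where str.encode returns bytes); the name strip_non_ascii is made up for illustration. It also shows a side effect worth knowing about: legitimate accented letters are dropped too, not only the garbled ones.

```python
def strip_non_ascii(text):
    """Encode to ASCII, dropping anything that does not fit, then decode back to str."""
    return text.encode('ascii', 'ignore').decode('ascii')

# Garbled characters are removed, the ASCII text is kept.
print(strip_non_ascii('Guitar Night ├»┬╕┬Å?'))  # Guitar Night ?

# Accented letters are non-ASCII as well, so they are removed too.
print(strip_non_ascii('café'))  # caf
```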

Answer 3

You could use pandas to read the CSV file into a DataFrame:

import pandas as pd
# converters maps a column (by index or name) to a function applied to each value
df = pd.read_csv(fileName, converters={COLUMN_NUMBER: func})

where func is a function that takes a single string and removes special characters. This can be done in several ways, including with a regex, but here is a simple one:

import string
def func(strg):
    return ''.join(c for c in strg if c in string.printable[:-5])

Alternatively, you can read the DataFrame first and then use apply to alter the description column, i.e.:

import pandas as pd 
df = pd.read_csv(fileName)
df['description'] = df['description'].apply(func)

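or using a regex (note that this pattern also removes digits and punctuation):

# regex=True is required on recent pandas versions, where the default is a literal match
df['description'] = df['description'].str.replace('[^A-Za-z _]', '', regex=True)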

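string.printable[:-5] is the set of digits, ASCII letters, punctuation and the space character: 0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~ followed by a single space. The slice drops the last five whitespace characters (tab, newline, carriage return, vertical tab, form feed).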
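To try the string.printable filter without a real file, here is a self-contained sketch that uses the standard-library csv module instead of pandas, purely so it runs with no extra dependencies. The in-memory CSV, its id column and the ALLOWED set are assumptions made for illustration; func is the same filter as above, with a set lookup in place of the string lookup.

```python
import csv
import io
import string

# Digits, ASCII letters, punctuation and the space character (95 characters).
ALLOWED = set(string.printable[:-5])

def func(strg):
    """Keep only the allowed printable ASCII characters."""
    return ''.join(c for c in strg if c in ALLOWED)

# Hypothetical in-memory CSV standing in for the real file.
raw = io.StringIO(
    "id,description\n"
    "1,Guitar Night ├»┬╕┬Å? 9 PM Onwards\n"
    "2,├á┬ñ┬╣├á┬ñ┬░\n"
)

rows = []
for row in csv.DictReader(raw):
    row['description'] = func(row['description'])
    rows.append(row)

for row in rows:
    print(row['id'], repr(row['description']))
```

The second row ends up with an empty description, which is the case to watch for when an entire field was garbled.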

3 answers

solution1
2 2018-09-24 12:38:46

solution2
1 2018-09-24 20:21:32

solution3
0 2018-09-24 12:36:27