
Text Pre-processing + Python + CSV : Removing special characters from a column of a CSV

I am working on a text classification problem. My CSV file contains a column called 'description' that describes events. Unfortunately, that column contains many special characters in addition to English words. Sometimes an entire field consists of such characters; sometimes only a few words do and the rest are English. Here are two sample fields from two different rows:

हर वर्ष की तरह इस वर्ष भी सिंधु सेना द्वारा आयोजित सिंधी प्रीमियर लीग फुटबॉल टूर्नामेंट का आयोजन एमबीएम ग्राउंड में करने जा रही है जिसमें अंडर-19 टीमें भाग लेती है आप सभी से निवेदन है समाज के युवाओं को प्रोत्साहन करने अवश्य पधारें

Unwind on the strums of Guitar &  immerse your soul into the magical vibes of music! ️? ️?..Guitar Night By Ashmik Patil.July 19, 2018.Thursday.9 PM Onwards.*Cover charges applicable...#GuitarNight #MusicalNight #MagicalMusic #MusicLove #Party #Enjoy #TheBarTerminal #Mumbaikars #Mumbai

In the first, the entire field consists of unreadable characters; in the second, only a few such characters are present and the rest are English words.

I want to remove only those special characters and keep the English words as they are, since I need them to build a bag of words at a later stage.

How can I implement that in Python? (I am using a Jupyter notebook.)

You can do this with a regex. Assuming you have already extracted the text from the CSV file:

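# Python 3 (the original answer targeted Python 2.7 and used the print statement)
import re
text = "Something with special characters á┬ñ┬╡├á┬ñ┬░├á┬Ñ┬ì├á┬ñ┬╖"
# Drop every run of characters outside the ASCII range \x00-\x7f
cleaned_text = re.sub(r'[^\x00-\x7f]+', '', text)
print(cleaned_text)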

Output - Something with special characters 

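The pattern [^\x00-\x7f]+ matches any run of one or more characters outside the ASCII range (code points 0 to 127), and re.sub replaces each such run with an empty string.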
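For the step this answer skips, getting the text out of the CSV, a minimal sketch using only the standard library might look like the following. The column name description comes from the question; the helper name clean_rows and the in-memory sample data are assumptions made for illustration.

```python
import csv
import io
import re

# One or more consecutive characters outside the 7-bit ASCII range.
NON_ASCII = re.compile(r'[^\x00-\x7f]+')

def clean_rows(csv_file, column='description'):
    """Yield each CSV row as a dict, with non-ASCII characters removed from `column`."""
    for row in csv.DictReader(csv_file):
        # Remove the non-ASCII runs, then trim leftover leading/trailing spaces.
        row[column] = NON_ASCII.sub('', row[column]).strip()
        yield row

# Hypothetical in-memory CSV standing in for the real file.
sample = io.StringIO(
    'id,description\n'
    '1,हर वर्ष की तरह\n'
    '2,Guitar Night \u266b by Ashmik Patil\n'
)
rows = list(clean_rows(sample))
for row in rows:
    print(row['id'], repr(row['description']))
```

With a real file, the StringIO object would be replaced by something like open(path, newline='', encoding='utf-8'), assuming the file is UTF-8 encoded. Note that a field made up entirely of non-ASCII text ends up as an empty string, so such rows may need to be dropped before building the bag of words.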

You can encode your string to ASCII and ignore the errors.

>>> text = 'Something with special characters á┬ñ┬╡├á┬ñ┬░├á┬Ñ┬ì├á┬ñ┬╖'
>>> text = text.encode('ascii', 'ignore')

This gives you a bytes object, which you can then decode back into a string:

>>> text
b'Something with special characters '

>>> text = text.decode('utf-8')
>>> text
'Something with special characters '
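The encode/decode round trip above can be wrapped in a small helper so it can be applied to every field. This is only a sketch: the name to_ascii and the whitespace clean-up step are additions for illustration, not part of the original answer.

```python
def to_ascii(text):
    """Return `text` with every non-ASCII character dropped."""
    # errors='ignore' skips characters that ASCII cannot represent;
    # decoding the resulting bytes turns them back into a str.
    ascii_text = text.encode('ascii', 'ignore').decode('ascii')
    # Collapse the runs of spaces left behind where characters were removed.
    return ' '.join(ascii_text.split())

hindi = 'हर वर्ष की तरह इस वर्ष भी'
mixed = 'Unwind on the strums of Guitar \u266b & immerse your soul'

print(repr(to_ascii(hindi)))
print(repr(to_ascii(mixed)))
```

A field written entirely in a non-Latin script comes back as an empty string, while mixed fields keep their English words, digits and ASCII punctuation.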

You could use pandas to read the CSV file into a DataFrame:

import pandas as pd
# the keyword is `converters`; keys may be column indices or column names
df = pd.read_csv(fileName, converters={COLUMN_NUMBER: func})

where func is a function that takes a single string and removes special characters. This can be done in different ways, including with a regex, but here is a simple one:

import string
def func(strg):
    return ''.join(c for c in strg if c in string.printable[:-5])

Alternatively, you can read the DataFrame first and then use apply to alter the description column, i.e.:

import pandas as pd 
df = pd.read_csv(fileName)
df['description'] = df['description'].apply(func)

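or using a regex (note that this pattern also removes digits and punctuation):

# regex=True is required on pandas 2.0+, where str.replace matches literally by default
df['description'] = df['description'].str.replace('[^A-Za-z _]', '', regex=True)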

string.printable[:-5] is the string '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~ ' (written as a Python literal), i.e. the digits, the ASCII letters, the punctuation characters and the space.
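The two filters in this answer do not keep the same characters, which matters when building a bag of words. The comparison below is a sketch in plain Python, without pandas. The helper names keep_printable and keep_letters and the sample string are assumptions for illustration.

```python
import re
import string

# Digits, ASCII letters, punctuation and the space character (95 characters).
ALLOWED = set(string.printable[:-5])

def keep_printable(text):
    """Keep only characters found in string.printable[:-5]."""
    return ''.join(c for c in text if c in ALLOWED)

def keep_letters(text):
    """Keep only ASCII letters, spaces and underscores."""
    return re.sub(r'[^A-Za-z _]', '', text)

sample = 'July 19, 2018. 9 PM Onwards #GuitarNight \u266b'

print(repr(keep_printable(sample)))  # digits and punctuation survive
print(keep_letters(sample).split())  # only the alphabetic words survive
```

The printable filter only removes the non-ASCII symbol, whereas the regex version also discards dates, numbers and the # of hashtags. Which behaviour is preferable depends on whether those tokens are useful features for the classifier.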
