
Text Pre-processing + Python + CSV: Removing special characters from a column of a CSV

I am working on a text classification problem. My CSV file contains a column called 'description' which describes events. Unfortunately, that column contains many special characters in addition to English words. Sometimes an entire field is made up of such characters; sometimes only a few words are, and the rest are English. Here are two specimen fields from two different rows:

हर वर्ष की तरह इस वर्ष भी सिंधु सेना द्वारा आयोजित सिंधी प्रीमियर लीग फुटबॉल टूर्नामेंट का आयोजन एमबीएम ग्राउंड में करने जा रही है जिसमें अंडर-19 टीमें भाग लेती है आप सभी से निवेदन है समाज के युवाओं को प्रोत्साहन करने अवश्य पधारें

Unwind on the strums of Guitar &  immerse your soul into the magical vibes of music! ️? ️?..Guitar Night By Ashmik Patil.July 19, 2018.Thursday.9 PM Onwards.*Cover charges applicable...#GuitarNight #MusicalNight #MagicalMusic #MusicLove #Party #Enjoy #TheBarTerminal #Mumbaikars #Mumbai

In the first, the entire field consists of such unreadable characters, whereas in the second only a few are present and the rest are English words.

I want to remove only those special characters and keep the English words as they are, because I need them to build a bag of words at a later stage.

How can I implement that in Python? (I am using a Jupyter notebook.)

You can do this with a regex. Assuming that you have already extracted the text from the CSV file:

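# Works on Python 3; print() with a single argument also runs on Python 2.7
import re
text = "Something with special characters á┬ñ┬╡├á┬ñ┬░├á┬Ñ┬ì├á┬ñ┬╖"
# Drop every run of characters outside the ASCII range
cleaned_text = re.sub(r'[^\x00-\x7f]+', '', text)
print(cleaned_text)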

Output - Something with special characters 

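In the regular expression, [^\x00-\x7f]+ matches any run of one or more characters outside the ASCII range (code points 0 through 127), and re.sub replaces each such run with an empty string.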
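Getting the text out of the CSV is left as an assumption in this answer. As a minimal sketch, the same pattern could be applied row by row with the standard csv module. The sample data, the `event` column, and the `strip_non_ascii` helper are made up for illustration (only the 'description' column name comes from the question), and collapsing the leftover whitespace is an extra step not in the original answer.

```python
import csv
import io
import re

NON_ASCII = re.compile(r'[^\x00-\x7f]+')

def strip_non_ascii(text):
    """Remove non-ASCII runs, then collapse the whitespace left behind."""
    cleaned = NON_ASCII.sub('', text)
    return ' '.join(cleaned.split())

# Hypothetical stand-in for the real file; with a file on disk you would
# use open(path, newline='', encoding='utf-8') instead of io.StringIO.
sample = io.StringIO(
    'event,description\n'
    'e1,हर वर्ष की तरह\n'
    'e2,Guitar Night ♫ in Mumbai\n'
)

rows = []
for row in csv.DictReader(sample):
    row['description'] = strip_non_ascii(row['description'])
    rows.append(row)

print([r['description'] for r in rows])
```

A field that was entirely non-ASCII, like the first specimen in the question, comes back as an empty string, so such rows may need to be filtered out before building the bag of words.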

You can encode your string to ASCII and ignore the errors:

>>> text = 'Something with special characters á┬ñ┬╡├á┬ñ┬░├á┬Ñ┬ì├á┬ñ┬╖'
>>> text = text.encode('ascii', 'ignore')

This gives you a bytes object, which you can then decode back to a string:

>>> text
b'Something with special characters '

>>> text = text.decode('utf')
>>> text
'Something with special characters '
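The two steps can be wrapped in one helper and applied to a list of descriptions. This is only a sketch assuming Python 3 strings; the `to_ascii` name and the sample descriptions are hypothetical.

```python
def to_ascii(text):
    """Drop every character that cannot be encoded as ASCII."""
    return text.encode('ascii', 'ignore').decode('ascii')

# Made-up sample values standing in for the 'description' column.
descriptions = [
    'हर वर्ष की तरह',
    'Guitar Night ♫ By Ashmik Patil',
]

cleaned = [to_ascii(d) for d in descriptions]

# Fields that were entirely non-ASCII are left with whitespace only,
# so they can be dropped before further processing.
non_empty = [c for c in cleaned if c.strip()]
print(non_empty)
```

Note that removing a character in the middle of a sentence leaves the spaces on both sides in place, so a later tokenizer should tolerate repeated spaces.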

You could use pandas to read the CSV file into a DataFrame, using:

import pandas as pd 
df = pd.read_csv(fileName, converters={COLUMN_NUMBER: func})

where func is a function that takes a single string and removes special characters. This can be done in different ways, for example with a regex, but here is a simple one:

import string
def func(strg):
    return ''.join(c for c in strg if c in string.printable[:-5])
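Putting the two pieces together might look like the sketch below. It assumes pandas is installed, uses an in-memory sample instead of a real file, and keys the converter by column name rather than column number; the `event` column and the sample rows are invented for illustration.

```python
import io
import string

import pandas as pd

# Same character set as the answer's func, stored as a set for fast lookup.
ALLOWED = set(string.printable[:-5])

def func(strg):
    """Keep only printable ASCII characters (space included)."""
    return ''.join(c for c in strg if c in ALLOWED)

# Hypothetical sample data standing in for the real CSV file.
sample = io.StringIO(
    'event,description\n'
    'e1,हर वर्ष की तरह\n'
    'e2,Guitar Night ♫ in Mumbai\n'
)

# The converter is called on each raw string of the 'description' column.
df = pd.read_csv(sample, converters={'description': func})
print(df['description'].tolist())
```

Because the converter runs while the file is parsed, the cleaned values are already in place when the DataFrame is returned.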

Alternatively, you can read the DataFrame first and then use apply to alter the description column, i.e.:

import pandas as pd 
df = pd.read_csv(fileName)
df['description'] = df['description'].apply(func)

or using a regex (note that this pattern also removes digits and punctuation):

df['description'] = df['description'].str.replace('[^A-Za-z _]', '', regex=True)

string.printable[:-5] is the set of characters 0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~ followed by a single space.
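One consequence of that slice is worth checking: the five characters it cuts off are the tab, newline, and other whitespace controls, so words separated only by a newline or tab get joined together. A small sketch, with a made-up input string, shows the effect:

```python
import string

allowed = string.printable[:-5]

def func(strg):
    """Keep only characters found in string.printable[:-5]."""
    return ''.join(c for c in strg if c in allowed)

# The five characters excluded by the slice: tab, newline, carriage
# return, vertical tab, and form feed.
print(repr(string.printable[-5:]))

# Hypothetical multi-line description: the newline and tab are removed
# without a replacement, so 'one' and 'Line' end up fused.
print(repr(func('Line one\nLine two\t#Party ♫')))
```

If descriptions can span several lines, replacing those whitespace characters with a space before filtering would keep the words separate.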
