Text data preprocessing in Python
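I am extracting positive, negative and neutral keywords in Python. I have 10,000 comments in my remarks.txt file (encoded as UTF-8). I want to import the text file, read each row of comments, extract the words (tokens) from the comment in column c2, and store them in the next adjacent column. I have written a small Python program that calls a get_keywords() function. I have created the function, but I am having trouble passing each row of the DataFrame as an argument, calling the function on every row, and storing the resulting keywords in the adjacent column.
The code does not produce the expected "tokens" column for all of the processed words in the df DataFrame.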
import nltk
import pandas as pd
import re
import string
from nltk import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
remarks = pd.read_csv('/Users/ZKDN0YU/Desktop/comments/New comments/ccomments.txt')
df = pd.DataFrame(remarks, columns= ['c2'])
df.head(50)
df.tail(50)
filename = 'ccomments.txt'
file = open(filename, 'rt', encoding="utf-8")
text = file.read()
file.close()
def get_keywords(row):
    # split into tokens by white space
    tokens = text.split(str(row))
    # prepare regex for char filtering
    re_punc = re.compile('[%s]' % re.escape(string.punctuation))
    # remove punctuation from each word
    tokens = [re_punc.sub('', w) for w in tokens]
    # remove remaining tokens that are not alphabetic
    tokens = [word for word in tokens if word.isalpha()]
    # filter out stop words
    stop_words = set(stopwords.words('english'))
    tokens = [w for w in tokens if not w in stop_words]
    # stemming of words
    porter = PorterStemmer()
    stemmed = [porter.stem(word) for word in tokens]
    # filter out short tokens
    tokens = [word for word in tokens if len(word) > 1]
    return tokens
df['tokens'] = df.c2.apply(lambda row: get_keywords(row['c2']),
                           axis=1)
for index, row in df.iterrows():
    print(index, row['c2'],"tokens : {}".format(row['tokens']))
Expected output: a Comment_modified file containing, for all rows of the DataFrame with the 10,000 comments, the columns 1) index, 2) c2 (comments) and 3) tokenized words.
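To make that target layout concrete, here is a minimal sketch of writing such a three-column file with pandas. The sample comments, the whitespace tokenizer and the output path Comment_modified.csv are assumptions made for illustration; they are not taken from the original post, and the tokenizer is only a stand-in for get_keywords().

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for the real comments; the real file has about 10,000 rows.
df = pd.DataFrame({"c2": ["Great service, very helpful", "Slow delivery"]})

# Placeholder tokenizer (lowercase + whitespace split), NOT the real get_keywords().
df["tokens"] = df["c2"].apply(lambda text: text.lower().split())

# A list column would be written as its Python repr, so join the tokens
# into one space-separated string before saving.
out = df.assign(tokens=df["tokens"].str.join(" "))

# Hypothetical output path; index_label names the index column in the file.
out_path = os.path.join(tempfile.gettempdir(), "Comment_modified.csv")
out.to_csv(out_path, index_label="index", encoding="utf-8")

# Read the file back to confirm the column layout.
check = pd.read_csv(out_path)
print(check.columns.tolist())
```

Joining the tokens is a design choice: it keeps the file easy to read back, at the cost of needing a split() when the tokens are loaded again.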
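Assuming the text file ccomments.txt has no header (i.e. the data starts on the first line) and each line holds a single column (i.e. the file contains only the comments), the code below returns the list of words for each comment.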
import re
import string
import pandas as pd
from nltk.corpus import stopwords  # needs a one-time nltk.download('stopwords')
from nltk.stem.porter import PorterStemmer

# build these once instead of on every call
re_punc = re.compile('[%s]' % re.escape(string.punctuation))
stop_words = set(stopwords.words('english'))
porter = PorterStemmer()

def get_keywords(row):
    # split the comment (a single string) into tokens by white space
    tokens = row.split()
    # remove punctuation from each word
    tokens = [re_punc.sub('', w) for w in tokens]
    # remove remaining tokens that are not alphabetic
    tokens = [word for word in tokens if word.isalpha()]
    # filter out stop words
    tokens = [w for w in tokens if w not in stop_words]
    # stemming of words (the stemmed list was previously computed but never used)
    tokens = [porter.stem(word) for word in tokens]
    # filter out short tokens
    tokens = [word for word in tokens if len(word) > 1]
    return tokens

# no header row: name the single column c2
df = pd.read_csv('ccomments.txt', header=None, names=['c2'])
# Series.apply passes each comment string straight to get_keywords
df['tokens'] = df['c2'].apply(get_keywords)
for index, row in df.iterrows():
    print(index, row['c2'], "tokens : {}".format(row['tokens']))
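The difference from the question's call comes down to what apply hands to the function. The sketch below illustrates this on a toy DataFrame; simple_tokens is a hypothetical placeholder for get_keywords(), and the two sample comments are made up.

```python
import pandas as pd

df = pd.DataFrame({"c2": ["Good product", "Bad support team"]})

def simple_tokens(text):
    # Placeholder for get_keywords(): lowercase and split on whitespace.
    return text.lower().split()

# Series.apply: the function receives each cell value (a str) directly.
df["tokens"] = df["c2"].apply(simple_tokens)

# DataFrame.apply with axis=1: the function receives a whole row (a Series),
# so the column has to be selected inside the lambda.
df["tokens_rowwise"] = df.apply(lambda row: simple_tokens(row["c2"]), axis=1)

# Mixing the two styles fails: Series.apply has no axis parameter, so the
# keyword is forwarded to the lambda, and a plain str cannot be indexed
# with "c2" in any case.
try:
    df["c2"].apply(lambda row: simple_tokens(row["c2"]), axis=1)
    failed = False
except TypeError:
    failed = True

print(df[["c2", "tokens"]])
```

Either working form gives the same tokens column; the Series form is shorter when only one column is needed.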