
Text data preprocessing in Python

I am working on extracting positive, negative and neutral keywords in Python. My comments file remarks.txt (UTF-8 encoded) contains 10,000 comments. I want to import the text file, read each row of comments, extract (tokenize) the words from the comments in column c2, and store them in the next adjacent column. I have written a small program that calls a get_keywords() function, but I am stuck on passing each row of the dataframe as an argument and calling the function over the rows so that the keywords are stored in the adjacent column.

The code below does not produce the expected "tokens" column with the processed words in the df dataframe.

    import nltk
    import pandas as pd
    import re
    import string
    from nltk import sent_tokenize, word_tokenize
    from nltk.corpus import stopwords
    from nltk.stem.porter import PorterStemmer
    remarks = pd.read_csv('/Users/ZKDN0YU/Desktop/comments/New comments/ccomments.txt')
    df = pd.DataFrame(remarks, columns= ['c2'])
    df.head(50)
    df.tail(50)

    filename = 'ccomments.txt'
    file = open(filename, 'rt', encoding="utf-8")
    text = file.read()
    file.close()

    def get_keywords(row):     
    # split into tokens by white space
      tokens = text.split(str(row))
    # prepare regex for char filtering
      re_punc = re.compile('[%s]' % re.escape(string.punctuation))
    # remove punctuation from each word
      tokens = [re_punc.sub('', w) for w in tokens]
    # remove remaining tokens that are not alphabetic
      tokens = [word for word in tokens if word.isalpha()]
    # filter out stop words
      stop_words = set(stopwords.words('english'))
      tokens = [w for w in tokens if not w in stop_words]
    # stemming of words
      porter = PorterStemmer()
      stemmed = [porter.stem(word) for word in tokens]
    # filter out short tokens
      tokens = [word for word in tokens if len(word) > 1]
      return tokens
      df['tokens'] = df.c2.apply(lambda row: get_keywords(row['c2']), axis=1)
      for index, row in df.iterrows():
          print(index, row['c2'], "tokens : {}".format(row['tokens']))

Expected output: a Comments_modified file containing columns 1) index, 2) c2 (the comments) and 3) the tokenized words, for all 10,000 rows of the dataframe.
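For reference, three separate problems keep the code above from producing the tokens column: text.split(str(row)) splits the whole file's text around the row's contents instead of tokenizing the row itself; df.c2 is a pandas Series, so its .apply takes no axis argument and passes each cell (a plain string) to the lambda, where the subscript row['c2'] then fails; and the column assignment and the print loop are indented inside get_keywords after its return, so they never run. A minimal sketch of the corrected call (the full version follows in the answer below):

    df['tokens'] = df['c2'].apply(get_keywords)  # each cell is passed to get_keywords as a string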

Assuming that your text file ccomments.txt has no header (i.e. the data starts from the first row) and only one column of data per row (i.e. the file contains only comments), the code below will return a list of words for each comment.
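One setup note: stopwords.words('english') raises a LookupError until the NLTK stopword corpus has been downloaded once, so if it is not already installed, run:

    import nltk
    nltk.download('stopwords')  # one-time download of the stopword list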

    import nltk
    import pandas as pd
    import re
    import string
    from nltk import sent_tokenize, word_tokenize
    from nltk.corpus import stopwords
    from nltk.stem.porter import PorterStemmer


    def get_keywords(row):
        # split into tokens by white space
        tokens = row.split()
        # prepare regex for char filtering
        re_punc = re.compile('[%s]' % re.escape(string.punctuation))
        # remove punctuation from each word
        tokens = [re_punc.sub('', w) for w in tokens]
        # remove remaining tokens that are not alphabetic
        tokens = [word for word in tokens if word.isalpha()]
        # filter out stop words
        stop_words = set(stopwords.words('english'))
        tokens = [w for w in tokens if w not in stop_words]
        # stemming of words
        porter = PorterStemmer()
        stemmed = [porter.stem(word) for word in tokens]
        # filter out short stemmed tokens
        tokens = [word for word in stemmed if len(word) > 1]
        return tokens


    df = pd.read_csv('ccomments.txt', header=None, names=['c2'])
    df['tokens'] = df.c2.apply(lambda row: get_keywords(row))
    for index, row in df.iterrows():
        print(index, row['c2'], "tokens : {}".format(row['tokens']))
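To produce the Comments_modified file described in the question, the resulting dataframe can then be written out with pandas. A minimal sketch, assuming CSV output and the file name from the question:

    # write the index, c2 and tokens columns to a file
    df.to_csv('Comments_modified.csv', index=True)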
