Create and append 2 columns to dataframe via custom function - Python 3.x

I have a dataframe called csv_table that looks like this:

      class                      ID                                               text
0         2  BIeDBg4MrEd1NwWRlFHLQQ  Decent but terribly inconsistent food. I've ha...
1         4  NJHPiW30SKhItD5E2jqpHw  Looks aren't everything.......  This little di...
2         2  nnS89FMpIHz7NPjkvYHmug  Being a creature of habit anytime I want good ...
3         2   FYxSugh9PGrX1PR0BHBIw  I recently told a friend that I cant figure ou...
4         4  ScViKtQ2xq6i5AyN4curYQ  Chevy's five years ago was crisp and fresh and...
5         2  vz8Q37FSlypZlgy5N7Ym0A  Every time I go to this Jack In The Box I get ...
6         4   OJuG2EvItSZXbu8KowI9A  I've been going to Cluckers for years. Every t...
7         4   k9ci6SfI5RZT3smNdnvSg  .                                             ...
8         4  qq6bQbrBZyd4lOBd8KSCoA  Well, after their remodel the place no longer ...
9         4     FldFfwfuk9T8kvkp8iw  Beer selection was good, but they were out of ...
10        4  63ufCUqbPcnl6abC1SBpvQ  Ihop is my favorite breakfast chain, and the s...
11        4   nDYCZDIAvdcx77EcmYz0Q  A very good Jewish deli tucked in and amongst ...
12        4  uoC1llZumwFKgXAMlDbZIg  Went here for lunch with Rand H. and this plac...
13        2   BBs1rbz75dDifvoQyVMDg  Picture the least attractive person you'd sett...
14        4    2t9znjapzhioLqb4Pf1Q  Really really really strong Margaritas!   The ...
15        4  GqLgixGcbWh51IzkwsiswA  I would not have known about this place had it...

[1999 rows x 3 columns]

I am trying to add 2 columns to csv_table: one that specifies the number of words in the text column (where a "word" is anything produced by splitting on spaces), and one that specifies the number of "clean" words as defined by a custom function.

I can count the total clean and dirty words for the whole table, but how can I apply these functions to each row of the dataframe and append the results as columns?

Code is below:

import nltk, re, pandas as pd
from nltk.corpus import stopwords
import sklearn, string
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn import preprocessing
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from itertools import islice

# This function removes numbers from an array
def remove_nums(arr): 
    # Declare a regular expression
    pattern = '[0-9]'  
    # Remove the pattern, which is a number
    arr = [re.sub(pattern, '', i) for i in arr]    
    # Return the array with numbers removed
    return arr

# This function cleans the passed in paragraph and parses it
def get_words(para):   
    # Create a set of stop words
    stop_words = set(stopwords.words('english'))
    # Split it into lower case    
    lower = para.lower().split()
    # Remove punctuation
    no_punctuation = (nopunc.translate(str.maketrans('', '', string.punctuation)) for nopunc in lower)
    # Remove integers
    no_integers = remove_nums(no_punctuation)
    # Remove stop words
    dirty_tokens = (data for data in no_integers if data not in stop_words)
    # Ensure it is not empty
    tokens = [data for data in dirty_tokens if data.strip()]
    # Ensure there is more than 1 character to make up the word
    tokens = [data for data in tokens if len(data) > 1]

    # Return the tokens
    return tokens 

def main():

    tsv_file = "filepath"
    csv_table=pd.read_csv(tsv_file, sep='\t')
    csv_table.columns = ['class', 'ID', 'text']

    print(csv_table)

    s = pd.Series(csv_table['text'])
    new = s.str.cat(sep=' ')
    clean_words = get_words(new)
    dirty_words = [word for word in new if word.split()]  # note: iterating the joined string visits characters, so this counts non-whitespace characters
    clean_length = len(clean_words)
    dirty_length = len(dirty_words)
    print("Clean Length: ", clean_length)
    print("Dirty Length: ", dirty_length)


main()

Which currently produces:

Clean Length:  125823
Dirty Length:  1091370

I did try csv_table['clean'] = csv_table['text'].map(get_words(csv_table['text'])) which yielded:

AttributeError: 'Series' object has no attribute 'lower'

How can I apply the dirty / clean logic to each row and append those two columns to the dataframe?

Use apply to run a function on each row. (The map attempt above failed because get_words(csv_table['text']) calls the function on the whole Series immediately, so para.lower() hit a Series instead of a single string; map and apply expect the function itself to be passed in.) For the dirty word count you can split the strings with pandas and then apply len to get the count; for the clean word count, apply the custom function directly:

csv_table['dirty'] = csv_table['text'].str.split().apply(len)
csv_table['clean'] = csv_table['text'].apply(lambda s: len(get_words(s)))
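
For reference, here is a minimal, self-contained sketch of that pattern on a toy dataframe. The clean_words helper below is only a simplified stand-in for the question's get_words (it strips punctuation and digits but skips the NLTK stopword filtering, so the snippet runs without any downloads):

import string
import pandas as pd

# Simplified stand-in for get_words(): lowercase, split on whitespace,
# strip punctuation and digits, keep tokens longer than one character.
def clean_words(para):
    table = str.maketrans('', '', string.punctuation + '0123456789')
    tokens = (tok.translate(table) for tok in para.lower().split())
    return [tok for tok in tokens if len(tok) > 1]

df = pd.DataFrame({'text': ["Decent but terribly inconsistent food.",
                            "Looks aren't everything....... 123"]})

df['dirty'] = df['text'].str.split().apply(len)                # whitespace-delimited word count
df['clean'] = df['text'].apply(lambda s: len(clean_words(s)))  # count of cleaned tokens
print(df)

This adds the two numeric columns row by row (dirty = 5 and 4, clean = 5 and 3 for the two toy rows); with the full get_words from the question, the clean column would additionally exclude stop words.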
