Create and append 2 columns to dataframe via custom function - Python 3.x
I have a dataframe called csv_table that looks like this:
class ID text
0 2 BIeDBg4MrEd1NwWRlFHLQQ Decent but terribly inconsistent food. I've ha...
1 4 NJHPiW30SKhItD5E2jqpHw Looks aren't everything....... This little di...
2 2 nnS89FMpIHz7NPjkvYHmug Being a creature of habit anytime I want good ...
3 2 FYxSugh9PGrX1PR0BHBIw I recently told a friend that I cant figure ou...
4 4 ScViKtQ2xq6i5AyN4curYQ Chevy's five years ago was crisp and fresh and...
5 2 vz8Q37FSlypZlgy5N7Ym0A Every time I go to this Jack In The Box I get ...
6 4 OJuG2EvItSZXbu8KowI9A I've been going to Cluckers for years. Every t...
7 4 k9ci6SfI5RZT3smNdnvSg . ...
8 4 qq6bQbrBZyd4lOBd8KSCoA Well, after their remodel the place no longer ...
9 4 FldFfwfuk9T8kvkp8iw Beer selection was good, but they were out of ...
10 4 63ufCUqbPcnl6abC1SBpvQ Ihop is my favorite breakfast chain, and the s...
11 4 nDYCZDIAvdcx77EcmYz0Q A very good Jewish deli tucked in and amongst ...
12 4 uoC1llZumwFKgXAMlDbZIg Went here for lunch with Rand H. and this plac...
13 2 BBs1rbz75dDifvoQyVMDg Picture the least attractive person you'd sett...
14 4 2t9znjapzhioLqb4Pf1Q Really really really strong Margaritas! The ...
15 4 GqLgixGcbWh51IzkwsiswA I would not have known about this place had it...
[1999 rows x 3 columns]
I am trying to add 2 columns to csv_table: one that gives the number of words in the text column (a "word" being anything produced by splitting on spaces), and one that gives the number of "clean" words as defined by a custom function.
I can already count the total clean and dirty words over the whole column, but how can I apply these functions to each row in the dataframe and append the results as columns?
Code is below:
import nltk, re, pandas as pd
from nltk.corpus import stopwords
import sklearn, string
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn import preprocessing
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from itertools import islice
# This function removes numbers from an array
def remove_nums(arr):
    # Declare a regular expression
    pattern = '[0-9]'
    # Remove the pattern, which is a number
    arr = [re.sub(pattern, '', i) for i in arr]
    # Return the array with numbers removed
    return arr
# This function cleans the passed-in paragraph and parses it
def get_words(para):
    # Create a set of stop words
    stop_words = set(stopwords.words('english'))
    # Lower-case the text and split it on whitespace
    lower = para.lower().split()
    # Remove punctuation
    no_punctuation = (nopunc.translate(str.maketrans('', '', string.punctuation)) for nopunc in lower)
    # Remove integers
    no_integers = remove_nums(no_punctuation)
    # Remove stop words
    dirty_tokens = (data for data in no_integers if data not in stop_words)
    # Ensure the token is not empty
    tokens = [data for data in dirty_tokens if data.strip()]
    # Ensure the word is more than 1 character long
    tokens = [data for data in tokens if len(data) > 1]
    # Return the tokens
    return tokens
def main():
    tsv_file = "filepath"
    csv_table = pd.read_csv(tsv_file, sep='\t')
    csv_table.columns = ['class', 'ID', 'text']
    print(csv_table)
    s = pd.Series(csv_table['text'])
    new = s.str.cat(sep=' ')
    clean_words = get_words(new)
    dirty_words = [word for word in new if word.split()]
    clean_length = len(clean_words)
    dirty_length = len(dirty_words)
    print("Clean Length: ", clean_length)
    print("Dirty Length: ", dirty_length)

main()
Which currently produces:
Clean Length: 125823
Dirty Length: 1091370
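(The cleaning pipeline itself can be checked in isolation. Below is a minimal sketch of the same steps, with a hard-coded stop list standing in for nltk's English stopwords so no corpus download is needed; `get_words_demo` and the stop set are stand-ins, not part of the original code.)

```python
import re
import string

# Stand-in for set(stopwords.words('english'))
stop_words = {"the", "a", "i", "to"}

def get_words_demo(para):
    # Lower-case and split on whitespace
    lower = para.lower().split()
    # Strip punctuation from each token
    no_punct = (w.translate(str.maketrans('', '', string.punctuation)) for w in lower)
    # Strip digits from each token
    no_digits = (re.sub('[0-9]', '', w) for w in no_punct)
    # Drop stop words
    no_stop = (w for w in no_digits if w not in stop_words)
    # Drop empty tokens and single characters
    return [w for w in no_stop if w.strip() and len(w) > 1]

result = get_words_demo("I went to the 2 cafes at 5pm!")
print(result)  # ['went', 'cafes', 'at', 'pm']
```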
I did try:
csv_table['clean'] = csv_table['text'].map(get_words(csv_table['text']))
which yielded:
AttributeError: 'Series' object has no attribute 'lower'
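(For context: the error arises because get_words(csv_table['text']) is evaluated once with the whole Series as its argument, so para.lower() looks for .lower on a Series object. A Series only exposes string methods element-wise, via the .str accessor or via apply. A minimal sketch of the distinction, using a made-up two-row Series:)

```python
import pandas as pd

s = pd.Series(["Hello World", "FOO bar"])

# s.lower()  # would raise AttributeError: 'Series' object has no attribute 'lower'

# Element-wise, via the vectorised .str accessor
lowered = s.str.lower()

# Element-wise, via apply: the lambda receives each string, not the Series
per_row = s.apply(lambda t: t.lower())

print(lowered.tolist())  # ['hello world', 'foo bar']
```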
How can I apply the dirty/clean logic to each row and append those two columns to the dataframe?
Use apply to apply a function to each row. For the dirty word count you can split the strings with pandas and then apply len to get the count. For the clean word count, apply the custom function directly:
csv_table['dirty'] = csv_table['text'].str.split().apply(len)
csv_table['clean'] = csv_table['text'].apply(lambda s: len(get_words(s)))
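(A self-contained sketch of the same pattern on a made-up two-row frame; the stand-in clean_count below only lower-cases, splits, and filters a tiny stop set, rather than running the full get_words pipeline:)

```python
import pandas as pd

# Hypothetical mini-frame standing in for csv_table
df = pd.DataFrame({"text": ["Decent but terribly inconsistent food",
                            "Looks aren't everything"]})

# Dirty count: number of whitespace-separated tokens per row
df["dirty"] = df["text"].str.split().apply(len)

# Stand-in cleaner: drop stop words and single characters
stop = {"but", "the", "a"}

def clean_count(s):
    return len([w for w in s.lower().split()
                if w not in stop and len(w) > 1])

# Clean count: apply the custom function to each row's string
df["clean"] = df["text"].apply(clean_count)

print(df[["dirty", "clean"]])
```

Both new columns are computed row by row, which is exactly what map/apply with a plain function argument gives you: the function receives one string at a time, never the whole Series.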