Pandas not dropping rows and columns that meet criteria
I am trying to build a regression model to predict ratings (1-5) based on the words that appear in reviews (the regression doesn't have to perform well per se; it's more about the methodology applied). I created a term frequency matrix with this code:
bow = df.Review2.str.split().apply(pd.Series.value_counts)
which looks like this (see the sample under EDIT below):
I now want to delete columns (words) that rarely appear throughout the reviews. Moreover, I want to iterate through only the reviews (rows) that have a Rating value which is not NaN.
Here is my attempt:
# Delete row if Rating less than 1
for index, row in df.iterrows():
    if (df.Rating[index] < 1):
        bow.drop(bow.index[index], axis=0, inplace = True)
# Delete column if word occurs less than 50 times
sum1 = bow.sum(axis=0)
cntr = 0
for i in sum1:
    if (i < 50):
        bow.drop(bow.index[cntr], axis=1, inplace = True)
    cntr += 1
This doesn't work: it leaves words that occur only once.
EDIT:
This is my sparse dataframe containing occurrences of words. Columns -> words; rows -> sentences (item reviews). I have 1.5k items, thus 1.5k rows.
    hi  this  are  just  some  random  words    I  don    t  ...  zing  zingy  zingzang
0  1.0   NaN  1.0   1.0   1.0     NaN    NaN  NaN  NaN  NaN  ...   NaN    NaN       NaN
1  NaN   NaN  NaN   NaN   NaN     NaN    NaN  NaN  NaN  NaN  ...   NaN    NaN       NaN
2  NaN   NaN  NaN   NaN   NaN     NaN    NaN  NaN  NaN  NaN  ...   NaN    NaN       NaN
3  NaN   NaN  NaN   NaN   NaN     NaN    NaN  NaN  NaN  NaN  ...   NaN    NaN       NaN
4  NaN   NaN  NaN   NaN   NaN     NaN    NaN  NaN  NaN  1.0  ...   NaN    NaN       NaN
Rating is a single column of my original dataframe containing integers in the [1,5] range or NaN.
You can use vectorised operations instead of manual iteration:
# filter out rows where Rating < 1
# (Rating is a column of df, which shares its index with bow)
bow = bow[~(df['Rating'] < 1)]
# filter out columns where sum < 50
bow = bow.loc[:, ~(bow.sum(0) < 50)]
Or simultaneously:
# filter rows and columns with Boolean series
# (Rating is a column of df, which shares its index with bow)
bow = bow.loc[~(df['Rating'] < 1), ~(bow.sum(0) < 50)]
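For reference, here is a minimal, self-contained sketch of the same vectorised filtering. It assumes, as the question states, that Rating lives in the original df and that bow shares df's index. The toy reviews, the `min_count` threshold, and the names `row_mask` and `col_mask` are made up for illustration. Note that a mask like `~(Rating < 1)` keeps rows whose Rating is NaN, because comparisons against NaN evaluate to False; the sketch uses `notna()` to drop them, which may or may not be what you want.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the real reviews and ratings.
df = pd.DataFrame({
    "Review2": ["good good phone", "bad phone", "good screen", "bad bad battery"],
    "Rating": [5, 1, np.nan, 2],
})

# Term-frequency matrix: one row per review, one column per word, NaN where absent.
bow = df.Review2.str.split().apply(lambda words: pd.Series(words).value_counts())

min_count = 2  # illustrative threshold; the question uses 50

# Rows: keep reviews whose Rating is present and at least 1.
row_mask = df["Rating"].notna() & (df["Rating"] >= 1)
# Columns: keep words whose total count reaches the threshold (sum skips NaN).
col_mask = bow.sum(axis=0) >= min_count

filtered = bow.loc[row_mask, col_mask]
print(filtered)
```

Both masks are computed from the unfiltered data, so the word totals still include reviews that are later dropped. If the totals should only count the kept reviews, filter the rows first and compute the column mask afterwards.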
I made this working toy example:
import pandas as pd
import numpy as np

# Create a toy dataframe
df = pd.DataFrame(np.arange(12).reshape(3, 4), columns=['A', 'B', 'C', 'D'])
print(df)
#    A  B   C   D
# 0  0  1   2   3
# 1  4  5   6   7
# 2  8  9  10  11

# Sum all the values for each column
column_sum = df.sum(axis=0)
print(column_sum)
# A    12
# B    15
# C    18
# D    21

# Iterate over column names and their sums, dropping columns whose sum is below 16
for key, value in zip(df.keys(), column_sum):
    if value < 16:
        df.drop(columns=key, inplace=True)
print(df)
#     C   D
# 0   2   3
# 1   6   7
# 2  10  11
So I guess that if you change your code to:
# sum1 = bow.sum(axis=0), as in the question
for key, value in zip(bow.keys(), sum1):
    if value < 50:
        bow.drop(columns=key, inplace=True)
it should get the job done.
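The key point of the loop above is that columns are dropped by label rather than by a position looked up in `bow.index` (which holds row labels). As a rough sketch of the same idea without dropping inside a loop, the rare column labels can be collected first and removed in one call. The small `bow` frame, `min_count`, and `rare_words` below are illustrative assumptions, not the asker's data.

```python
import numpy as np
import pandas as pd

# Hypothetical sparse count matrix: NaN means the word is absent from that review.
bow = pd.DataFrame({
    "hi": [1.0, np.nan, 2.0],
    "zing": [np.nan, 1.0, np.nan],
    "words": [3.0, 1.0, 1.0],
})

min_count = 3  # illustrative threshold; the question uses 50

sum1 = bow.sum(axis=0)                      # total occurrences per word, NaN skipped
rare_words = sum1[sum1 < min_count].index   # column labels, not row positions

# Drop all rare columns in a single call instead of once per loop iteration.
trimmed = bow.drop(columns=rare_words)
print(trimmed.columns.tolist())
```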