Pandas does not drop rows and columns that meet a condition

Question

I am trying to build a regression model to predict ratings (1-5) based on the words that appear in reviews (the regression doesn't have to perform well per se; it's more about the methodology applied). I created a term frequency matrix with this code:

bow = df.Review2.str.split().apply(pd.Series.value_counts)

which looks like this (a sample is shown under EDIT below):
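For readers who want to reproduce the shape of such a matrix, here is a minimal sketch. The reviews are made up for illustration, and the counting is spelled out with a list comprehension rather than the `apply` call above, so it does not depend on how `apply` expands its results:

```python
import pandas as pd

# Hypothetical reviews, invented for illustration only
df = pd.DataFrame({
    "Review2": ["good good phone", "bad battery", "good battery"],
    "Rating": [5, 1, None],
})

# Count the words of each review separately, then stack the counts as rows.
# Words that do not occur in a given review end up as NaN in that row.
counts = [pd.Series(words).value_counts() for words in df["Review2"].str.split()]
bow = pd.DataFrame(counts, index=df.index)

print(bow)
```

Each row is one review and each column one word, which is why the real matrix is mostly NaN.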

I am now interested in deleting columns (words) that rarely appear throughout the reviews. Moreover, I want to iterate through only the reviews (rows) that have a Rating value which is not NaN.

Here is my attempt:

# Delete row if Rating less than 1
for index, row in df.iterrows():
    if (df.Rating[index] < 1):
        bow.drop(bow.index[index], axis=0, inplace = True)

# Delete column if word occurs less than 50 times
sum1 = bow.sum(axis=0)       
cntr = 0
for i in sum1:
    if (i < 50):
        bow.drop(bow.index[cntr], axis=1, inplace = True)
    cntr += 1

This doesn't seem to work, as it leaves words that occur only once.

EDIT:

This is my sparse dataframe containing word occurrences. Columns -> words; rows -> sentences (item reviews). I have 1.5k items, thus 1.5k rows.

     hi this are just some random words  I  don t      ...  zing  zingy zingzang    
0   1.0 NaN  1.0 1.0  1.0   NaN   NaN   NaN NaN NaN    ...  NaN    NaN    NaN   
1   NaN NaN  NaN NaN  NaN   NaN   NaN   NaN NaN NaN    ...  NaN    NaN    NaN       
2   NaN NaN  NaN NaN  NaN   NaN   NaN   NaN NaN NaN    ...  NaN    NaN    NaN   
3   NaN NaN  NaN NaN  NaN   NaN   NaN   NaN NaN NaN    ...  NaN    NaN    NaN   
4   NaN NaN  NaN NaN  NaN   NaN   NaN   NaN NaN 1.0    ...  NaN    NaN    NaN

Rating is a single column of my original dataframe, containing integers in the [1, 5] range or NaN.

Answer 1

You can use vectorised operations instead of manual iteration:

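# filter out rows where Rating < 1
# (Rating is a column of df, not bow; df and bow share the same index)
# note: NaN < 1 is False, so rows with a NaN Rating are kept by this mask
bow = bow[~(df['Rating'] < 1)]

# filter out columns where sum < 50
bow = bow.loc[:, ~(bow.sum(0) < 50)]

Or simultaneously:

# filter rows and columns with Boolean series
bow = bow.loc[~(df['Rating'] < 1), ~(bow.sum(0) < 50)]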
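As a self-contained sketch of the same idea: the data below is invented, and the row rule (`notna()` plus `>= 1`) is an assumption based on the question's stated goal of keeping only rated reviews, not something taken from the original code. It assumes `df` and `bow` share the same index.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the question's df and bow, invented for illustration
df = pd.DataFrame({"Rating": [5, np.nan, 3, 0]})
bow = pd.DataFrame(
    {
        "good": [30, 25, np.nan, 1],
        "zing": [1, np.nan, np.nan, np.nan],
        "bad": [np.nan, 10, 45, 2],
    },
    index=df.index,
)

# NaN compares as False, so `Rating < 1` on its own never flags missing ratings
print((df["Rating"] < 1).tolist())  # → [False, False, False, True]

# Assumed rule: keep only reviews that have a rating of at least 1
keep_rows = df["Rating"].notna() & (df["Rating"] >= 1)
# Keep words whose total count reaches the threshold; sum() skips NaN by default
keep_cols = bow.sum(axis=0) >= 50

filtered = bow.loc[keep_rows, keep_cols]
print(filtered)
```

Note that the column totals here are computed before any rows are removed. If the threshold should apply only to the remaining reviews, filter the rows first and compute the sums on the result.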

Answer 2

I made this working toy example:

import pandas as pd
import numpy  as np

# Create a toy dataframe
df = pd.DataFrame(np.arange(12).reshape(3,4),columns=['A', 'B', 'C', 'D'])

print(df)
#    A  B   C   D
# 0  0  1   2   3
# 1  4  5   6   7
# 2  8  9  10  11

# Sum all the values for each column
column_sum = df.sum(axis=0)
print(column_sum)
# A    12
# B    15
# C    18
# D    21

# Iterate over column names and their sums, dropping columns below the threshold
for key, value in zip(df.keys(), column_sum):
    if value < 16:
        df.drop(columns=key, inplace=True)

print(df)

#     C   D
# 0   2   3
# 1   6   7
# 2  10  11

So I guess that if you change your code to:

# sum1 was computed from bow, so pair it with bow's column names (not df's)
for key, value in zip(bow.keys(), sum1):
    if value < 50:
        bow.drop(columns=key, inplace=True)

it should get the job done.
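The same column filter can also be written without a loop. This is only a sketch on invented counts; `rare_words` is a name introduced here for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical word counts, invented for illustration
bow = pd.DataFrame({
    "good": [40, 20, np.nan],
    "zing": [1, np.nan, np.nan],
    "bad": [np.nan, 30, 25],
})

sum1 = bow.sum(axis=0)               # per-word totals; NaN cells are skipped
rare_words = sum1[sum1 < 50].index   # labels of the columns below the threshold

bow = bow.drop(columns=rare_words)   # one call instead of one drop per column
print(list(bow.columns))
```

Dropping all rare columns in a single call avoids modifying the dataframe repeatedly inside a loop, which matters when the vocabulary has thousands of words.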

Answer 1: score 2, accepted, posted 2018-07-13 16:47:40

Answer 2: score 2, posted 2018-07-13 17:03:38
