Strange error during conversion from Pandas DataFrame to NumPy array

Question

I have a pandas DataFrame with two columns: "review" (text) and "sentiment" (1/0).

X_train = df.loc[0:25000, 'review'].values
y_train = df.loc[0:25000, 'sentiment'].values
X_test = df.loc[25000:, 'review'].values
y_test = df.loc[25000:, 'sentiment'].values

After converting to NumPy arrays with the .values attribute, I get arrays of the following shapes:

print(df.shape)   #(50000, 2)
print(X_train.shape) #(25001,)
print(y_train.shape) #(25001,)
print(X_test.shape) # (25000,)
print(y_test.shape) # (25000,)

As you can see, the conversion appears to have added one extra row to the training arrays. This is really strange, and I can't find the error.

Answer 1

df.loc is label-based, i.e. a slice includes its upper bound. Use iloc:

df.iloc[:25000, 1].values  # 1 stands for the integer position of the 'review' column, for example

if you want NumPy-like slicing.

With iloc, you need to supply both rows and columns as integer positions or integer slices.
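Applied to the split in the question, the iloc approach could look like the sketch below. The 10-row frame and the split value of 5 are hypothetical stand-ins for the 50000-row data and the 25000 boundary. Looking up column positions with df.columns.get_loc is one option for avoiding hard-coded column numbers, not something the answer prescribes.

```python
import pandas as pd

# Hypothetical miniature stand-in for the 50000-row frame in the question.
df = pd.DataFrame({
    "review": [f"review {i}" for i in range(10)],
    "sentiment": [i % 2 for i in range(10)],
})
split = 5  # plays the role of 25000 in the question

# Look up integer column positions instead of hard-coding them.
review_col = df.columns.get_loc("review")
sentiment_col = df.columns.get_loc("sentiment")

# iloc slices are position-based and exclude the upper bound.
X_train = df.iloc[:split, review_col].values
y_train = df.iloc[:split, sentiment_col].values
X_test = df.iloc[split:, review_col].values
y_test = df.iloc[split:, sentiment_col].values

print(X_train.shape, X_test.shape)  # (5,) (5,)
```

With this slicing, the train and test arrays together cover every row exactly once.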

Example

>>> import pandas as pd
>>> df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
>>> df
   a  b
0  1  4
1  2  5
2  3  6

This is label-based, i.e. the upper bound is inclusive:

>>> df.loc[:1, 'a']
0    1
1    2
Name: a, dtype: int64

This works like slicing in NumPy, i.e. the upper bound is exclusive:

>>> df.iloc[:2, 0]
0    1
1    2
Name: a, dtype: int64
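The inclusive upper bound also explains the shapes in the question: no row was added. The row labeled 25000 was selected by both df.loc[0:25000] and df.loc[25000:], which gives 25001 + 25000 rows. A minimal sketch of that overlap, using a hypothetical 6-row frame with the default integer index:

```python
import pandas as pd

# Hypothetical small frame with default labels 0..5.
df = pd.DataFrame({"a": range(6)})

# Label-based slices: the boundary label 3 is included in both.
head_loc = df.loc[0:3, "a"]   # labels 0, 1, 2, 3
tail_loc = df.loc[3:, "a"]    # labels 3, 4, 5

shared = head_loc.index.intersection(tail_loc.index)
print(len(head_loc), len(tail_loc), list(shared))  # 4 3 [3]

# Position-based slices: the boundary row lands only in the second part.
head_iloc = df.iloc[:3, 0]    # positions 0, 1, 2
tail_iloc = df.iloc[3:, 0]    # positions 3, 4, 5
print(len(head_iloc), len(tail_iloc))  # 3 3
```

For a train/test split this matters beyond the row count, since with loc the shared row would end up in both sets.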

Answer 1 (accepted, 2016-04-07 21:59:25)
