Transforming data frame into Series creates NA's

Question

I've loaded a DataFrame from a CSV file and tried to create a pd.Series from it:

data = pd.read_csv(filepath_or_buffer = "train.csv", index_col = 0)
data.columns

Index([u'qid1',u'qid2',u'question1',u'question2'], dtype = 'object')

These are the columns in the DataFrame: qid1 is the ID of question1 and qid2 is the ID of question2. Also, there are no NaN values in question1:

data.question1.isnull().sum()
0

I want to create a pandas.Series from the first questions, with qid1 as the index:

question1 = pd.Series(data.question1, index = data.qid1)
question1.isnull().sum()
68416

Now there are 68416 null values in my Series. Where is my mistake?

Answer 1

Pass the underlying values (a plain array that carries no index) so the Series constructor doesn't try to align:

question1 = pd.Series(data.question1.values, index = data.qid1)

The problem here is that the question1 column has its own index, so the constructor tries to align on that index during construction.
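To make the difference concrete, here is a minimal sketch using a tiny hypothetical frame in place of train.csv (the IDs and question texts are made up for illustration). It compares passing the column itself with passing its values, and also shows `set_index` as one alternative way to get the same result; that alternative is a suggestion, not something the question used.

```python
import pandas as pd

# Hypothetical stand-in for train.csv; the real data is not reproduced here.
data = pd.DataFrame(
    {"qid1": [11, 13, 15], "question1": ["a?", "b?", "c?"]},
    index=[0, 1, 2],
)

# Passing the Series itself: its labels (0, 1, 2) are aligned against the
# qid1 labels (11, 13, 15), and none of them match.
aligned = pd.Series(data.question1, index=data.qid1)

# Passing the raw array: there are no labels to align, so values are
# assigned to the new index by position.
by_values = pd.Series(data.question1.values, index=data.qid1)

# Alternative: move qid1 into the index first, then select the column.
by_set_index = data.set_index("qid1")["question1"]

print(aligned.isnull().sum())    # every entry is missing
print(by_values.isnull().sum())  # no entry is missing
```

Both `by_values` and `by_set_index` hold the same values under the same labels; the `set_index` form additionally keeps the column name on the resulting Series.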

Example:

In [12]:
df = pd.DataFrame({'a':np.arange(5), 'b':list('abcde')})
df

Out[12]:
   a  b
0  0  a
1  1  b
2  2  c
3  3  d
4  4  e

In [13]:
s = pd.Series(df['a'], index = df['b'])
s

Out[13]:
b
a   NaN
b   NaN
c   NaN
d   NaN
e   NaN
Name: a, dtype: float64

In [14]:
s = pd.Series(df['a'].values, index = df['b'])
s

Out[14]:
b
a    0
b    1
c    2
d    3
e    4
dtype: int32

Effectively, you're reindexing your existing column with the new index you passed in. Because none of the new index values match the existing ones, you get NaN everywhere.
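The reindexing behaviour can be checked directly. The sketch below reuses the `df` from the example above and compares the constructor call with an explicit `reindex`; the `partial` case (with labels 3, 4, 5 chosen purely for illustration) shows that labels which do exist in the original index keep their values, while unknown ones become NaN.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": np.arange(5), "b": list("abcde")})

# Constructing from a Series with a new index behaves like reindexing it.
via_ctor = pd.Series(df["a"], index=df["b"])
via_reindex = df["a"].reindex(df["b"])

# Partial overlap: labels 3 and 4 exist in df["a"], label 5 does not.
partial = pd.Series(df["a"], index=[3, 4, 5])

print(via_ctor.isna().all(), via_reindex.isna().all())
print(partial)
```

Note that the result is upcast to a float dtype whenever a missing value has to be inserted, which is why the all-NaN output above reports float64.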

1 answers

solution1
3 2017-04-03 14:15:53