Pandas converting all data to NaN after adding column values

Question

I'm trying to add column headers to the following set of data. As per specifications of the project, I cannot simply modify the file to add those headers manually.

Sample of the data that I'm working with:

38.049133   0.224026 0.05398  -19.11 -20.03
38.352526   0.212491 0.05378  -18.35 -19.19
38.363598   0.210654 0.05401  -20.11 -20.89
54.936819   0.216794 0.20114  -20.94 -21.88
54.534881   0.578615 0.12887  -19.75 -20.66
54.743075   0.508774 0.18331  -20.54 -21.53
54.867240   0.562636 0.13956  -19.95 -20.85
54.856908   0.544031 0.13938  -20.14 -21.03
54.977748   0.501912 0.13923  -20.27 -21.01
54.992762   0.460376 0.12723  -20.24 -20.83

I've created a list of 5 strings to act as the headers for the columns of this DataFrame. Selecting by header works as expected (i.e. print(df['z']) prints only that one column, supposedly). However, all of the data in the DataFrame, which displays just fine when I do not specify columns (i.e. it shows the sample lines above exactly and detects the columns properly), suddenly becomes "NaN" when I specify column titles from the list of strings.

Sample of my code:

... imports and whatnot not shown

dataColumns = ['RA', 'DEC', 'z', 'M(g)', 'M(r)']
dataFile = pd.read_csv('file_name', delim_whitespace = True)
df = pd.DataFrame(data = dataFile, columns = dataColumns)

print(df)

Sample output of the above code (it is supposed to display exactly the sample data above but with added column headers):

RA   DEC z  M(g) M(r)
NaN   NaN NaN  NaN NaN
NaN   NaN NaN  NaN NaN
NaN   NaN NaN  NaN NaN
NaN   NaN NaN  NaN NaN
NaN   NaN NaN  NaN NaN
NaN   NaN NaN  NaN NaN
NaN   NaN NaN  NaN NaN
NaN   NaN NaN  NaN NaN
NaN   NaN NaN  NaN NaN
NaN   NaN NaN  NaN NaN

Why is it that, without specifying the 'columns' parameter for DataFrame, the data prints properly, whereas after specifying the parameter, everything displays as NaN?

Any help would be appreciated!

-- paanvaannd

Answer 1

To fix your problem, use this line instead:

df = pd.read_csv('file_name', delim_whitespace=True, header=None, names=dataColumns)

pd.read_csv returns a DataFrame, so the line above handles the entire import (i.e. calling pd.DataFrame on the result of pd.read_csv is superfluous). header=None tells pandas not to interpret the first line of the file as headers, and names=... lets you specify the column names you'd like to use. Keep delim_whitespace=True, since the sample data is separated by whitespace rather than commas. The NaN values come from the pd.DataFrame(data = dataFile, columns = dataColumns) call: when data is already a DataFrame, the columns argument selects existing columns by label instead of renaming them, and since none of the new names match the existing labels, every column is filled with NaN.
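For anyone who wants to try this without the original file, here is a minimal sketch that reads the first three rows of the sample data from the question out of an in-memory string. The names sample and data_columns are only illustrative, and it assumes the file is whitespace-separated as in the sample. It uses sep=r"\s+" rather than delim_whitespace=True, since the latter is deprecated in recent pandas releases.

```python
import io

import pandas as pd

# First three rows of the sample data from the question, held in memory
# so the example runs without a file on disk (illustrative stand-in for 'file_name').
sample = io.StringIO(
    "38.049133   0.224026 0.05398  -19.11 -20.03\n"
    "38.352526   0.212491 0.05378  -18.35 -19.19\n"
    "38.363598   0.210654 0.05401  -20.11 -20.89\n"
)

data_columns = ['RA', 'DEC', 'z', 'M(g)', 'M(r)']

# sep=r"\s+" splits on runs of whitespace, header=None keeps the first
# row as data, and names supplies the column labels.
df = pd.read_csv(sample, sep=r"\s+", header=None, names=data_columns)

print(df)
print(df['z'])
```

With a real file, the io.StringIO object would simply be replaced by the file path.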

Answer 2

You are passing the DataFrame that .read_csv already created to the DataFrame constructor, pd.DataFrame. I am actually surprised it didn't throw an error.

Try this:

df = pd.read_csv('file_name', delim_whitespace=True, header=None)
df.columns = dataColumns
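To see the difference between the two approaches side by side, here is a small sketch under the assumption that the data is whitespace-separated as in the question. The variable names raw, parsed and reindexed are made up for illustration. It first reproduces the all-NaN result by handing an existing DataFrame plus columns= to the constructor, then shows that assigning to .columns keeps the values.

```python
import io

import pandas as pd

# Two rows of the sample data from the question (illustrative in-memory input).
raw = (
    "38.049133 0.224026 0.05398 -19.11 -20.03\n"
    "38.352526 0.212491 0.05378 -18.35 -19.19\n"
)
data_columns = ['RA', 'DEC', 'z', 'M(g)', 'M(r)']

# With header=None the columns are labelled 0..4.
parsed = pd.read_csv(io.StringIO(raw), sep=r"\s+", header=None)

# Passing a DataFrame together with columns= selects columns by label.
# None of the requested labels exist, so every column comes back as NaN.
reindexed = pd.DataFrame(data=parsed, columns=data_columns)
print(reindexed)

# Assigning to .columns relabels the existing columns and keeps the data.
parsed.columns = data_columns
print(parsed)
```

In the question's code the first data row had additionally been consumed as the header, but the mechanism behind the NaN values is the same label mismatch.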

2 answers

solution1
1 ACCEPTED 2017-02-11 04:14:28

solution2
0 2017-02-11 04:13:45
