
Reading CSV file with numpy.genfromtxt() - delimiter as a part of a row name

I've downloaded the dataset "Age at 1st marriage (women)" from http://www.gapminder.org/data in Excel/CSV format. The first row of the dataset is a header, and the first column contains the names of countries.

To read these data I am using the code below.

import numpy as np

source = open(r"D:\FirstMarriage.csv")

data = np.genfromtxt(source, dtype=None, delimiter=",", skip_header=1)
print data

After executing this code (in Spyder IDE) I receive this error:

ValueError: Some errors were detected !
Line #37 (got 118 columns instead of 117)
Line #38 (got 118 columns instead of 117)
Line #72 (got 118 columns instead of 117)
Line #87 (got 118 columns instead of 117)
Line #97 (got 118 columns instead of 117)
Line #98 (got 118 columns instead of 117)
Line #184 (got 118 columns instead of 117)

When I open the CSV file in Notepad++ and look at the indicated lines, I find that these rows contain country names with a comma in them. These are also the only names enclosed in quotation marks, probably to indicate that each one is a single full name. However, that doesn't help me. Please see the example below (I am showing only the first column):

China
Colombia
"Congo, Dem. Rep."
"Congo, Rep."
Costa Rica

Is there any easy way to clean this data and treat the name in quotation marks as a single string?

I use Python 2.7 (Anaconda) on Windows 10.

Thanks ahead!

The best way, in my opinion, to read a CSV or any other character-delimited file is pandas' read_csv function, which returns a DataFrame. You won't have to deal with the commas inside names, since read_csv follows the common CSV conventions and treats a double-quoted field as a single value.

import pandas as pd
data = pd.read_csv(source)
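As a rough sketch of how this handles the quoted names, the snippet below reads a small in-memory sample instead of the real file. The column labels and numbers are made up for illustration, `index_col=0` is an optional extra that is not part of the answer above, and the code is written for Python 3 with a recent pandas (on older versions `df.values` can stand in for `df.to_numpy()`).

```python
import io

import pandas as pd

# Hypothetical sample mimicking the layout described in the question:
# a header row, country names in the first column, some names quoted.
# The numbers are invented for illustration only.
sample = io.StringIO(
    'country,1990,2000\n'
    'China,22.1,23.3\n'
    '"Congo, Dem. Rep.",20.0,\n'
    'Costa Rica,22.2,23.0\n'
)

# index_col=0 turns the country names into row labels,
# so the remaining columns hold only numbers.
df = pd.read_csv(sample, index_col=0)

countries = df.index.tolist()  # each quoted name arrives as one string
values = df.to_numpy()         # 2-D float array; empty cells become NaN
```

If a plain NumPy array is all that is needed afterwards, `values` can be used directly and `countries` kept alongside it for labelling.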

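numpy.genfromtxt is not quote-aware: it splits each line on every delimiter, including commas that sit inside a quoted field.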

There are 2 solutions to this.

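  1. Add a pre-processor that changes the commas inside the quoted names to | and a post-processor that changes them back.
  2. Use the pandas library:

     import pandas
     # .values replaces as_matrix(), which was removed in pandas 1.0
     data = pandas.read_csv(filepath_or_buffer, quotechar='"').values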
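The first option could look roughly like the sketch below. It is only one possible reading of "pre and post processor": the standard csv module parses the quotes, commas left inside a field are swapped for |, and the swap is undone after genfromtxt has run. The sample data is invented, the code assumes Python 3 with NumPy 1.14 or later (for the `encoding` argument), and it assumes that | never occurs in the real file.

```python
import csv
import io

import numpy as np

# Hypothetical sample; the numbers are invented for illustration only.
raw = io.StringIO(
    'country,1990,2000\n'
    'China,22.1,23.3\n'
    '"Congo, Dem. Rep.",20.5,21.0\n'
    'Costa Rica,22.2,23.0\n'
)

# Pre-process: csv.reader understands the quotes, so each row comes back
# as a list of fields. Any comma still inside a field is replaced by "|",
# which gives every line the same number of comma-separated columns.
cleaned = [
    ",".join(field.replace(",", "|") for field in row)
    for row in csv.reader(raw)
]

# genfromtxt accepts a list of strings as well as a file.
# With dtype=None it returns a structured array with fields f0, f1, ...
data = np.genfromtxt(cleaned, delimiter=",", skip_header=1,
                     dtype=None, encoding="utf-8")

# Post-process: restore the commas in the name column.
names = [name.replace("|", ",") for name in data["f0"]]
numbers = np.column_stack([data["f1"], data["f2"]])
```

With the real file, `raw` would be the opened CSV and the numeric block would be built from all fields after `f0` rather than the two shown here.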

It can be done using two CSV files. First, create a second file from the original in which the fields are separated by a different delimiter, say ;, and the double quotes around the names are removed. Then use that generated file as the input to numpy.genfromtxt with delimiter=";". For more details see: https://docs.scipy.org/doc/numpy/reference/generated/numpy.genfromtxt.html. Note that the deletechars parameter described there only removes characters from field names, not from the data, so the quotes have to be dropped when the second file is written.
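A minimal sketch of the two-file approach, assuming Python 3 and using the csv module for the conversion step (the answer does not say which tool to use). The file names, the temporary directory, and the sample contents are placeholders made up for this example.

```python
import csv
import os
import tempfile

import numpy as np

with tempfile.TemporaryDirectory() as workdir:
    src_path = os.path.join(workdir, "FirstMarriage.csv")        # stand-in for the original
    dst_path = os.path.join(workdir, "FirstMarriage_semi.csv")   # hypothetical second file

    # Invented sample standing in for the downloaded file.
    with open(src_path, "w", newline="") as f:
        f.write('country,1990,2000\n'
                'Colombia,22.5,23.1\n'
                '"Congo, Rep.",21.9,22.4\n')

    # First file -> second file: csv.reader strips the quotes, and the
    # writer separates fields with ";" so commas in names are harmless.
    with open(src_path, newline="") as src, \
            open(dst_path, "w", newline="") as dst:
        writer = csv.writer(dst, delimiter=";", lineterminator="\n")
        for row in csv.reader(src):
            writer.writerow(row)

    # Second file -> NumPy: numeric columns and the name column are
    # read separately so each array has a single, simple dtype.
    numbers = np.genfromtxt(dst_path, delimiter=";", skip_header=1,
                            usecols=(1, 2))
    names = np.genfromtxt(dst_path, delimiter=";", skip_header=1,
                          usecols=0, dtype=str, encoding="utf-8")
```

This assumes no field contains a semicolon; if one did, csv.writer would quote it again and genfromtxt would hit the same problem as before.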
