Importing CSV into Python

I have a CSV dataset that looks like this:

FirstAge,SecondAge,FirstCountry,SecondCountry,Income,NAME
41,41,USA,UK,113764,John
53,43,USA,USA,145963,Fred
47,37,USA,UK,42857,Dan
47,44,UK,USA,95352,Mark  

I'm trying to load it into Python 3.6 with this code:

>>> from numpy import genfromtxt

>>> my_data = genfromtxt('first.csv', delimiter=',')
>>> print(my_data)

Output:

 [[             nan              nan              nan              nan
               nan              nan]
 [  4.10000000e+01   4.10000000e+01              nan              nan
    1.13764000e+05              nan]
 [  5.30000000e+01   4.30000000e+01              nan              nan
    1.45963000e+05              nan]
 ..., 
 [  2.10000000e+01   3.00000000e+01              nan              nan
    1.19929000e+05              nan]
 [  6.90000000e+01   6.40000000e+01              nan              nan
    1.52667000e+05              nan]
 [  2.00000000e+01   1.90000000e+01              nan              nan
    1.05077000e+05              nan]]

I've looked at the NumPy docs and I don't see anything about this.

Go with pandas; it will save you the trouble:

import pandas as pd

df = pd.read_csv('first.csv')
print(df)
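
If a NumPy array is still needed afterwards, the numeric columns can be pulled out of the DataFrame. This is only a minimal sketch: it assumes the sample rows from the question, inlined through io.StringIO so it runs without first.csv, and the variable names csv_text and numeric are made up for illustration.

```python
import io

import pandas as pd

# Inline copy of the sample rows from the question (stands in for first.csv).
csv_text = """FirstAge,SecondAge,FirstCountry,SecondCountry,Income,NAME
41,41,USA,UK,113764,John
53,43,USA,USA,145963,Fred
47,37,USA,UK,42857,Dan
47,44,UK,USA,95352,Mark
"""

df = pd.read_csv(io.StringIO(csv_text))

# read_csv infers a type per column: integers for the ages and income,
# strings for the country and name columns.
print(df.dtypes)

# Select only the numeric columns and convert them to a plain NumPy array.
numeric = df[["FirstAge", "SecondAge", "Income"]].to_numpy()
print(numeric)
```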

An alternative to pandas is the csv library:

import csv
import numpy as np

# newline='' lets the csv module handle line endings itself
with open('first.csv', 'r', newline='') as f:
    ls = list(csv.reader(f))
val_array = np.array(ls)[1:]  # exclude the first row (column names)

You could use the dtype argument:

import numpy as np

output = np.genfromtxt("main.csv", delimiter=',', skip_header=1, dtype='f, f, |S6, |S6, f, |S6')

print(output)

Output:

[( 41.,  41., b'USA', b'UK',  113764., b'John')
 ( 53.,  43., b'USA', b'USA',  145963., b'Fred')
 ( 47.,  37., b'USA', b'UK',   42857., b'Dan')
 ( 47.,  44., b'UK', b'USA',   95352., b'Mark')]
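
The b'...' prefixes appear because |S6 asks for byte strings. If ordinary str values are preferred, a unicode dtype such as U6 together with an explicit encoding should work. This is a sketch under assumptions: the sample rows are inlined through io.StringIO rather than read from main.csv, and the field names f0 to f5 are the defaults NumPy assigns when none are given.

```python
import io

import numpy as np

# Inline copy of the sample rows from the question (stands in for the CSV file).
csv_text = """FirstAge,SecondAge,FirstCountry,SecondCountry,Income,NAME
41,41,USA,UK,113764,John
53,43,USA,USA,145963,Fred
47,37,USA,UK,42857,Dan
47,44,UK,USA,95352,Mark
"""

# 'U6' requests str fields of up to 6 characters instead of bytes ('S6').
output = np.genfromtxt(
    io.StringIO(csv_text),
    delimiter=",",
    skip_header=1,
    dtype="f8,f8,U6,U6,f8,U6",
    encoding="utf-8",
)

# With no names supplied, the fields are called f0, f1, ..., f5.
print(output["f2"])
print(output["f5"])
```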

With a few general parameters, genfromtxt can read this file (in Python 3 here):

In [100]: data = np.genfromtxt('stack43444219.txt', delimiter=',', names=True, dtype=None)
In [101]: data
Out[101]: 
array([(41, 41, b'USA', b'UK', 113764, b'John'),
       (53, 43, b'USA', b'USA', 145963, b'Fred'),
       (47, 37, b'USA', b'UK',  42857, b'Dan'),
       (47, 44, b'UK', b'USA',  95352, b'Mark')], 
      dtype=[('FirstAge', '<i4'), ('SecondAge', '<i4'), ('FirstCountry', 'S3'), ('SecondCountry', 'S3'), ('Income', '<i4'), ('NAME', 'S4')])

This is a structured array: two integer fields, two string fields (byte strings by default), another integer, and another string.

By default, genfromtxt reads all lines as data. I use names=True to treat the first line as field names.

By default it also tries to read every value as a float (the default dtype), so the string columns load as nan.
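
The nan behaviour can be reproduced on the sample rows, and if only the numeric columns matter, skip_header and usecols avoid it entirely. This is a sketch assuming the question's data is inlined through io.StringIO; the column indices (0, 1, 4) are my reading of which columns are numeric.

```python
import io

import numpy as np

# Inline copy of the sample rows from the question (stands in for the CSV file).
csv_text = """FirstAge,SecondAge,FirstCountry,SecondCountry,Income,NAME
41,41,USA,UK,113764,John
53,43,USA,USA,145963,Fred
47,37,USA,UK,42857,Dan
47,44,UK,USA,95352,Mark
"""

# Default call: every field is parsed as float, so the header row and the
# text columns all become nan.
as_float = np.genfromtxt(io.StringIO(csv_text), delimiter=",")
print(np.isnan(as_float[0]).all())

# Skip the header line and keep only the numeric columns (assumed indices).
numeric = np.genfromtxt(
    io.StringIO(csv_text),
    delimiter=",",
    skip_header=1,
    usecols=(0, 1, 4),
)
print(numeric)
```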

All of this is in the genfromtxt docs. Admittedly they are long, but they aren't hard to find.

Access fields by name, e.g. data['FirstAge'].
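
As a rough illustration of field access on the structured array, the sketch below loads the sample rows from an inline string (an assumption, in place of the file) and passes encoding="utf-8" so the text fields come back as str rather than bytes. The names ages and from_usa are hypothetical.

```python
import io

import numpy as np

# Inline copy of the sample rows from the question (stands in for the CSV file).
csv_text = """FirstAge,SecondAge,FirstCountry,SecondCountry,Income,NAME
41,41,USA,UK,113764,John
53,43,USA,USA,145963,Fred
47,37,USA,UK,42857,Dan
47,44,UK,USA,95352,Mark
"""

data = np.genfromtxt(
    io.StringIO(csv_text),
    delimiter=",",
    names=True,        # first line supplies the field names
    dtype=None,        # infer a type per column
    encoding="utf-8",  # decode text fields to str instead of bytes
)

# A single field behaves like an ordinary 1-D array.
ages = data["FirstAge"]
print(ages.mean())

# Boolean masks built from one field select whole records.
from_usa = data[data["FirstCountry"] == "USA"]
print(from_usa["NAME"])
```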


Using the csv reader gives a 2D array of strings:

In [102]: ls =list(csv.reader(open('stack43444219.txt','r')))
In [103]: ls
Out[103]: 
[['FirstAge', 'SecondAge', 'FirstCountry', 'SecondCountry', 'Income', 'NAME'],
 ['41', '41', 'USA', 'UK', '113764', 'John'],
 ['53', '43', 'USA', 'USA', '145963', 'Fred'],
 ['47', '37', 'USA', 'UK', '42857', 'Dan'],
 ['47', '44', 'UK', 'USA', '95352', 'Mark']]
In [104]: arr=np.array(ls)
In [105]: arr
Out[105]: 
array([['FirstAge', 'SecondAge', 'FirstCountry', 'SecondCountry', 'Income',
        'NAME'],
       ['41', '41', 'USA', 'UK', '113764', 'John'],
       ['53', '43', 'USA', 'USA', '145963', 'Fred'],
       ['47', '37', 'USA', 'UK', '42857', 'Dan'],
       ['47', '44', 'UK', 'USA', '95352', 'Mark']], 
      dtype='<U13')
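
Because every element of that array is a string, the numeric columns still need converting before any arithmetic. One possible way, sketched here with the rows inlined and with column indices 0, 1 and 4 assumed to be the numeric ones, is to slice them out and call astype:

```python
import csv
import io

import numpy as np

# Inline copy of the sample rows from the question (stands in for the CSV file).
csv_text = """FirstAge,SecondAge,FirstCountry,SecondCountry,Income,NAME
41,41,USA,UK,113764,John
53,43,USA,USA,145963,Fred
47,37,USA,UK,42857,Dan
47,44,UK,USA,95352,Mark
"""

arr = np.array(list(csv.reader(io.StringIO(csv_text))))

header = arr[0]   # column names
body = arr[1:]    # data rows, still strings

# Convert only the columns that hold numbers (assumed indices 0, 1, 4).
numbers = body[:, [0, 1, 4]].astype(int)
print(numbers)
print(numbers[:, 2].max())
```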

I think an issue you could be running into is that the data you are trying to parse is not all numeric, which can cause unexpected behavior.

One way to handle this is to identify each value's type before it is added to your array. For example:

numeric, other = [], []
for obj in my_data:
    if isinstance(obj, int):
        numeric.append(obj)  # process or add your data to numpy
    else:
        other.append(obj)    # cast or discard the data
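
Values read from a CSV file arrive as strings, so a direct type check will not recognise "41" as an integer. A possible variation, not the answerer's code, is to attempt the conversion and fall back to the text. The helper name parse_cell is hypothetical.

```python
def parse_cell(text):
    """Return the cell as int if possible, else float, else the stripped string."""
    text = text.strip()
    for cast in (int, float):
        try:
            return cast(text)
        except ValueError:
            continue
    return text


# One data row from the question, as the csv module would deliver it.
row = ["41", "41", "USA", "UK", "113764", "John"]
parsed = [parse_cell(cell) for cell in row]
print(parsed)

# Split the row by detected type.
numeric = [value for value in parsed if isinstance(value, (int, float))]
other = [value for value in parsed if isinstance(value, str)]
```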
