
python loadtxt to read delimited file

I am just switching from Matlab to Python, and would like to learn how to read this file in Python using loadtxt from the numpy package (I use textscan in Matlab to read it):

"07220S006","14/01/12 01:59:50",10,"0"

"07220S006","14/01/12 02:00:00",10,"0"

"07220S006","14/01/12 02:00:10",10,"0"

I am able to read this file using the split function from Python's regular expression package; however, since my data contains a few hundred thousand lines like these, calling split on every single line adds significant analysis time. So I think loadtxt will do a better job here. I have found a number of solutions for reading similar files, but this file is more complicated than those examples and I have no idea how to read it.
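(Note for readers on newer NumPy: since version 1.23, loadtxt itself accepts a quotechar argument, so a quoted file like this can be read directly without any per-line splitting. A minimal sketch, assuming a recent NumPy:)

```python
import numpy as np
from io import StringIO

# stand-in for the file contents shown above
sample = (
    '"07220S006","14/01/12 01:59:50",10,"0"\n'
    '"07220S006","14/01/12 02:00:00",10,"0"\n'
)

# quotechar (NumPy >= 1.23) makes loadtxt strip the double quotes itself
arr = np.loadtxt(StringIO(sample), delimiter=',', quotechar='"', dtype=str)
print(arr[0])  # ['07220S006' '14/01/12 01:59:50' '10' '0']
```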

Any help or recommendation is appreciated.

You could do this easily with pandas, and if you need a numpy array you can access the DataFrame's values:

import pandas as pd
from io import StringIO

data = """
"07220S006","14/01/12 01:59:50",10,"0"
"07220S006","14/01/12 02:00:00",10,"0"
"07220S006","14/01/12 02:00:10",10,"0"
"""

df = pd.read_csv(StringIO(data), header=None)

print(df)
           0                  1   2  3
0  07220S006  14/01/12 01:59:50  10  0
1  07220S006  14/01/12 02:00:00  10  0
2  07220S006  14/01/12 02:00:10  10  0


df.values
array([['07220S006', '14/01/12 01:59:50', 10, 0],
       ['07220S006', '14/01/12 02:00:00', 10, 0],
       ['07220S006', '14/01/12 02:00:10', 10, 0]], dtype=object)

EDIT

IIUC, you want to split the date column into date and time (or into year, month, etc.). You could first convert that column to datetime objects with pd.to_datetime, then access the datetime fields through the dt accessor and write them to new columns:

date_col = pd.to_datetime(df[1], dayfirst=True)  # dayfirst=True: dates are DD/MM/YY
print(date_col.dt.year)
0    2012
1    2012
2    2012
Name: 1, dtype: int64

Or you could convert it to a string in any format you like with dt.strftime, e.g.:

print(date_col.dt.strftime("%Y/%m %H:%M"))
0    2012/01 01:59
1    2012/01 02:00
2    2012/01 02:00
Name: 1, dtype: object

Creating a new column from it is just as easy:

df['year'] = date_col.dt.year

print(df)
           0                  1   2  3  year
0  07220S006  14/01/12 01:59:50  10  0  2012
1  07220S006  14/01/12 02:00:00  10  0  2012
2  07220S006  14/01/12 02:00:10  10  0  2012
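Alternatively, the date column can be parsed directly at read time: read_csv's parse_dates and dayfirst parameters do the conversion in one step (a sketch using the same sample data):

```python
import pandas as pd
from io import StringIO

data = """
"07220S006","14/01/12 01:59:50",10,"0"
"07220S006","14/01/12 02:00:00",10,"0"
"07220S006","14/01/12 02:00:10",10,"0"
"""

# parse column 1 as datetime while reading; dayfirst=True for DD/MM/YY dates
df = pd.read_csv(StringIO(data), header=None, parse_dates=[1], dayfirst=True)
df['year'] = df[1].dt.year
print(df['year'].tolist())  # [2012, 2012, 2012]
```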

Treating any value in quotes as a string, and using numpy.genfromtxt instead (it is better at dealing with missing values):

import numpy as np
from io import StringIO

example_data = '"07220S006","14/01/12 01:59:50",10,"0"\n"07220S006","14/01/12 02:00:00",10,"0"\n"07220S006","14/01/12 02:00:10",10,"0"'
# approximation of your input data

data = np.genfromtxt(StringIO(example_data), delimiter=',', dtype='U16,U19,i4,U3')
# dtypes: Ux - x-character string, i4 - 32-bit integer
# more here: http://docs.scipy.org/doc/numpy/reference/arrays.dtypes.html

print(data)
[('"07220S006"', '"14/01/12 01:59:50"', 10, '"0"')
 ('"07220S006"', '"14/01/12 02:00:00"', 10, '"0"')
 ('"07220S006"', '"14/01/12 02:00:10"', 10, '"0"')]

I can't think of a simple way of removing the quote marks using numpy; as in the answer above, pandas (or Python's built-in csv reader) would probably be a better solution.
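For completeness, a minimal sketch of the csv approach just mentioned — the stdlib csv module strips the quote characters itself:

```python
import csv
from io import StringIO

# stand-in for the file contents from the question
example_data = (
    '"07220S006","14/01/12 01:59:50",10,"0"\n'
    '"07220S006","14/01/12 02:00:00",10,"0"\n'
    '"07220S006","14/01/12 02:00:10",10,"0"\n'
)

# csv handles the quoting, so fields come back without the quote marks
rows = list(csv.reader(StringIO(example_data), delimiter=',', quotechar='"'))
print(rows[0])  # ['07220S006', '14/01/12 01:59:50', '10', '0']
```

Note that all fields come back as strings; the integer column would still need an explicit int() conversion.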
