I am just switching from MATLAB to Python, and would like to learn how to read this file in Python using loadtxt from the numpy package. (I use textscan in MATLAB to read it.)
"07220S006","14/01/12 01:59:50",10,"0"
"07220S006","14/01/12 02:00:00",10,"0"
"07220S006","14/01/12 02:00:10",10,"0"
I am able to read this file using the split function from Python's regular expression package; however, since my data contains a few hundred thousand lines like these, calling split on every single line results in a significant analysis time. So I think loadtxt will do a better job in this case. I have found a number of solutions for reading similar files, but this file is more complicated than those examples and I have no idea how to read it.
Any help or recommendation is appreciated.
You could do this easily with pandas, and if you need a numpy array you can access the DataFrame's values attribute:
import pandas as pd
from io import StringIO
data = """
"07220S006","14/01/12 01:59:50",10,"0"
"07220S006","14/01/12 02:00:00",10,"0"
"07220S006","14/01/12 02:00:10",10,"0"
"""
df = pd.read_csv(StringIO(data), header=None)
print(df)
0 1 2 3
0 07220S006 14/01/12 01:59:50 10 0
1 07220S006 14/01/12 02:00:00 10 0
2 07220S006 14/01/12 02:00:10 10 0
print(df.values)
array([['07220S006', '14/01/12 01:59:50', 10, 0],
['07220S006', '14/01/12 02:00:00', 10, 0],
['07220S006', '14/01/12 02:00:10', 10, 0]], dtype=object)
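If your data is in a file you would pass the path instead of a StringIO, and read_csv can also parse the timestamps while reading via its parse_dates option. A minimal sketch using the sample rows (dayfirst=True because the dates are day/month/year):

```python
import pandas as pd
from io import StringIO

data = '''"07220S006","14/01/12 01:59:50",10,"0"
"07220S006","14/01/12 02:00:00",10,"0"
"07220S006","14/01/12 02:00:10",10,"0"'''

# parse_dates=[1] converts column 1 to datetime64 during the read,
# so no separate pd.to_datetime step is needed afterwards
df = pd.read_csv(StringIO(data), header=None, parse_dates=[1], dayfirst=True)
print(df.dtypes[1])  # datetime64[ns]
```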
EDIT
IIUC you want to split the date column into date and time (or into year, month, etc.). You could first convert that column to datetime objects with pd.to_datetime and then access the datetime fields through the dt accessor and write them to new columns:
date_col = pd.to_datetime(df[1])
date_col.dt.year
print(date_col.dt.year)
0 2012
1 2012
2 2012
Name: 1, dtype: int64
Or you could convert it to a string in whatever format you want with dt.strftime, e.g.:
print(date_col.dt.strftime("%Y/%m %H:%M"))
0 2012/01 01:59
1 2012/01 02:00
2 2012/01 02:00
Name: 1, dtype: object
You could create a new column very easily:
df['year'] = date_col.dt.year
print(df)
0 1 2 3 year
0 07220S006 14/01/12 01:59:50 10 0 2012
1 07220S006 14/01/12 02:00:00 10 0 2012
2 07220S006 14/01/12 02:00:10 10 0 2012
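Splitting the timestamp into separate date and time columns works the same way; a short sketch using the dt.date and dt.time accessors on the parsed column:

```python
import pandas as pd
from io import StringIO

data = '''"07220S006","14/01/12 01:59:50",10,"0"
"07220S006","14/01/12 02:00:00",10,"0"'''

df = pd.read_csv(StringIO(data), header=None)
date_col = pd.to_datetime(df[1], dayfirst=True)

# dt.date / dt.time give plain datetime.date and datetime.time objects per row
df['date'] = date_col.dt.date
df['time'] = date_col.dt.time
print(df[['date', 'time']])
```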
Treating any value in quotes as a string, and using numpy.genfromtxt instead (it is better at dealing with missing values):
import numpy as np
from io import StringIO
example_data = '"07220S006","14/01/12 01:59:50",10,"0"\n"07220S006","14/01/12 02:00:00",10,"0"\n"07220S006","14/01/12 02:00:10",10,"0"'
# approximation of your input data
data = np.genfromtxt(StringIO(example_data), delimiter=',', dtype='U16,U20,i4,U3')
# dtypes: Ux - x-character string, i4 - 32-bit integer
# more here: http://docs.scipy.org/doc/numpy/reference/arrays.dtypes.html
print(data)
[('"07220S006"', '"14/01/12 01:59:50"', 10, '"0"')
('"07220S006"', '"14/01/12 02:00:00"', 10, '"0"')
('"07220S006"', '"14/01/12 02:00:10"', 10, '"0"')]
I can't think of a simple way of removing the quote marks using numpy; as in the answer above, pandas would probably be a better solution, or Python's csv module.
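For completeness, the standard-library csv module strips the double quotes automatically; a minimal sketch, reading from a string here instead of an open file:

```python
import csv
from io import StringIO

data = '"07220S006","14/01/12 01:59:50",10,"0"\n"07220S006","14/01/12 02:00:00",10,"0"'

# csv.reader handles the quoting, so the fields come back without quote marks
rows = list(csv.reader(StringIO(data)))
print(rows[0])  # ['07220S006', '14/01/12 01:59:50', '10', '0']
```

Note that all fields remain strings (including the 10), so numeric columns would still need an explicit int() conversion.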