
How to read data from a text file into an array with Python

I'm having a bit of trouble with some data stored in a text file that I need for regression analysis in Python.

The data are stored in a format that looks like this:

2104,3,399900 1600,3,329900 2400,3,369000 ....

I need to do some analysis, such as finding the mean like this: (2104+1600+...)/number of data points

I think the appropriate step is to store the data in arrays, but I have no idea how to do it. I can think of two ways. The first is to set up 3 arrays that store the data by column, like

a=[2104 1600 2400 ...] b=[3 3 3 ...] c=[399900 329900 369000 ...]

The second way is to store the data by row, as in

a=[2104 3 399900], b=[1600 3 329900] and so on.

Which one is better?

Also, how do I write code that stores the data in an array? I was thinking of something like this:

with open("file.txt", "r") as ins:
array = []
elt.strip(',."\'?!*:') for line in ins:
array.append(line)

Is that correct?

Instead of having multiple arrays a, b, c, ... you could store your data as an array of arrays (a 2-dimensional array). For example:

[[2104,3,399900],
 [1600,3,329900],
 [2400,3,369000]...]

This way you don't have to deal with dynamically naming your arrays. How you store your data, i.e. as 3 arrays of length n or n arrays of length 3, is up to you. I would prefer the second way. To read the data into your array, use the string split() method, which splits a string into a list. So in your case:

with open("file.txt", "r") as ins:
    tmp = ins.read().split()  # split on any whitespace, so a trailing newline is not kept
    array = [i.split(",") for i in tmp]

>>> array
[['2104', '3', '399900'], ['1600', '3', '329900'], ['2400', '3', '369000']]

Edit: To find the mean, e.g. of the first element in each list, you could do the following:

arraymean = sum([int(i[0]) for i in array]) / len(array)

Here the 0 in i[0] selects the first element of each list. Note that this code uses a list comprehension.

Also, this code stores the values in the array as strings, hence the cast to int when computing the mean. If you want to store the data as int directly, change the file-reading section to:

array = [[int(j) for j in i.split(",")] for i in tmp]
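The two layouts from the question are easy to convert between, so the choice is not binding. A minimal sketch, assuming the file contents have already been read into a string (the names text, rows, columns and column_means are illustrative and not part of the answer above):

```python
# Sample contents copied from the question; in practice this would be ins.read().
text = "2104,3,399900 1600,3,329900 2400,3,369000"

# Row layout: one inner list per record (n lists of length 3).
rows = [[int(field) for field in record.split(",")] for record in text.split()]

# Column layout: zip(*rows) transposes the rows into 3 lists of length n.
columns = [list(col) for col in zip(*rows)]

# Mean of every column, computed from the column layout.
column_means = [sum(col) / len(col) for col in columns]

print(rows)
print(columns)
print(column_means)
```

With the row layout each record stays together, while the column layout makes per-column statistics such as the mean a one-liner.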

Using pandas and numpy you can get the data into an array as follows:

In [36]: import pandas as pd; from io import StringIO

In [37]: data = "2104,3,399900 1600,3,329900 2400,3,369000"

In [38]: d = pd.read_csv(StringIO(data), sep=',| ', header=None, index_col=None, engine="python")

In [39]: d.values.reshape(3, d.shape[1] // 3)
Out[39]: 
array([[  2104,      3, 399900],
       [  1600,      3, 329900],
       [  2400,      3, 369000]])
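The same split-then-reshape idea can be sketched with only the standard library, which may help show what the regex separator and the reshape are doing. This is an illustration rather than a replacement for the pandas call; the names flat, width and table are assumptions introduced here:

```python
import re

data = "2104,3,399900 1600,3,329900 2400,3,369000"

# Split on commas or whitespace, mirroring the sep=',| ' pattern above.
flat = [int(token) for token in re.split(r"[,\s]+", data.strip())]

# Regroup the flat list into rows of three values (the reshape step).
width = 3  # assumed number of fields per record
table = [flat[i:i + width] for i in range(0, len(flat), width)]

print(table)
```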

You could use:

with open('data.txt') as data:
    substrings = data.read().split()
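    # Convert every "a,b,c" substring into a list of three ints
    values = [[int(field) for field in substring.split(',')] for substring in substrings]
    average = sum(a for a, b, c in values) / float(len(values))
    print(round(average, 8))  # Python 3 print; rounded for display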

With this data.txt:

2104,3,399900 1600,3,329900 2400,3,369000
2105,3,399900 1601,3,329900 2401,3,369000

It outputs:

2035.16666667

This is a quick solution without error checking (it uses a list comprehension, see PEP 202). If your file has a consistent format, you can do the following:

import numpy as np

a = np.array([np.array(i.split(",")).astype("float") for i in open("example.txt").read().split(" ")])

Should you print it:

print(a)
print("Mean of column 0: ", np.mean(a[:, 0]))

You'll obtain the following:

[[  2.10400000e+03   3.00000000e+00   3.99900000e+05]
 [  1.60000000e+03   3.00000000e+00   3.29900000e+05]
 [  2.40000000e+03   3.00000000e+00   3.69000000e+05]]
Mean of column 0:  2034.66666667

Notice how the code snippet specifies "," as the separator inside each triplet and a space " " as the separator between triplets. These are the exact contents of the file I used as an example:

2104,3,399900 1600,3,329900 2400,3,369000
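If the file might not be perfectly consistent, some basic error checking can be added. The sketch below uses only the standard library and assumes every record should have exactly three integer fields; parse_records and its fields parameter are hypothetical names introduced for this example:

```python
def parse_records(text, fields=3):
    """Parse whitespace-separated records of comma-separated integers."""
    records = []
    for position, chunk in enumerate(text.split(), start=1):
        parts = chunk.split(",")
        if len(parts) != fields:
            raise ValueError(
                "record %d has %d fields, expected %d: %r"
                % (position, len(parts), fields, chunk)
            )
        # int() raises ValueError itself if a field is not a whole number.
        records.append([int(part) for part in parts])
    return records


# Records may be separated by spaces or newlines.
sample = "2104,3,399900 1600,3,329900\n2400,3,369000\n"
records = parse_records(sample)
first_column_mean = sum(r[0] for r in records) / len(records)
print(records)
print(first_column_mean)
```

A malformed record such as "1,2" then fails with a message naming the offending record instead of producing a ragged array.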
