
Inputting data from text as array

Hello everyone,

I have a text file which has the data in the following format:

[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,6,0,0,0,0,0,1,4,9,0,0,0,0,2,8,13,47,0,0,0,0,0,0,12,139,11,1,0,0,4,8,44,139,14,4,1,1,30,45,80,139,34,28,0,0,7,34,117,43,0,0,0,0,0,5,40,139,78,9,0,0,0,12,100,139,121,42,4,1,6,7,16,122,101,117,22,13,4,1,10,0,0,0,0,0,0,10,9,33,7,0,0,0,0,42,87,139,20,2,0,0,0,6,95,83,9,5,8,39,73,13,45]

That is, each line is one 128-dimensional sample, and the text file contains 50k such samples.

I am running K-Means clustering on data in this format. When I put the data directly in the code, it works fine:

from sklearn.cluster import MiniBatchKMeans
import numpy

data = [[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,6,0,0,0,0,0,1,4,9,0,0,0,0,2,8,13,47,0,0,0,0,0,0,12,139,11,1,0,0,4,8,44,139,14,4,1,1,30,45,80,139,34,28,0,0,7,34,117,43,0,0,0,0,0,5,40,139,78,9,0,0,0,12,100,139,121,42,4,1,6,7,16,122,101,117,22,13,4,1,10,0,0,0,0,0,0,10,9,33,7,0,0,0,0,42,87,139,20,2,0,0,0,6,95,83,9,5,8,39,73,13,45]]   



mbkm = MiniBatchKMeans(init='k-means++', n_clusters=8, batch_size=100, n_init=10, max_no_improvement=10, verbose=0)
mbkm.fit(data)
mbk_means_cluster_centers = mbkm.cluster_centers_

numpy.set_printoptions(threshold=numpy.nan)
print mbk_means_cluster_centers

But when I use the following code to read the contents of the text file into an array, it fails with the error "ValueError: could not convert string to float":

f = open("sample_input.txt", "r")
out = f.readlines()
for line in out:
    print line

I cannot work out where I am going wrong. Please suggest a better way to get the code running. Thanks in advance!

PS: I am using Python 2.7 on Ubuntu.

I must preface this by saying that storing data in a text file as a code representation of an array is a bad idea. If you can, store your data in a serializable format like CSV or JSON.

What's happening is that each line you read is a string, not an array. A string is still iterable, so nothing complains while you read and print it, but the clustering code expects numbers and fails when it cannot convert that string to a float.
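To see the failure in isolation, here is a minimal sketch. The sample string is a shortened, made-up stand-in for one line of the file, and print() is used so the snippet runs on both Python 2.7 and Python 3.

```python
# One line as readlines() would return it: text, trailing newline included.
# (Shortened, hypothetical sample; the real lines hold 128 values.)
line = "[0,0,6,139]\n"

print(type(line).__name__)  # the line is a string, not a list

try:
    float(line)  # a bracketed list of numbers is not a valid float literal
except ValueError as exc:
    error_message = str(exc)
    print(error_message)
```

The string has to be parsed into a list of numbers before it can be used as a sample.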

If you REALLY need to read that file in that format and you trust the origin of the file, try doing this.

with open("sample_input.txt", "r") as f:
    # eval() runs each line as Python code, so only use it on trusted files
    out = [eval(arr) for arr in f.readlines()]

Note that this will also execute code within that file, so make sure you trust the origin of the file.

My Python experience is limited, so there might be a safer way of doing this. Next time, use a CSV-formatted file for data.
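One possibly safer route, offered as a sketch rather than the definitive fix: a line such as [0,0,6] already happens to be valid JSON, so the standard json module can parse it without executing anything. The helper name parse_json_lines and the sample lines below are made up for illustration.

```python
import json

def parse_json_lines(lines):
    """Parse lines such as "[0,0,6]" into lists of numbers without running code."""
    samples = []
    for line in lines:
        line = line.strip()
        if not line:  # skip blank lines
            continue
        samples.append(json.loads(line))
    return samples

# In-memory stand-in for the lines of sample_input.txt (hypothetical values)
example_lines = ["[0,0,6,139]\n", "[1,4,9,0]\n", "\n"]
samples = parse_json_lines(example_lines)
print(samples)
```

With the real file, the same helper can be given the open file object, since iterating over a file yields its lines.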

To reiterate Moox's point, it would probably be a good idea to store this information as CSV. You can then use the csv module to parse the file.

Avoiding eval is also a good idea. You can do something like this to parse the data in its current format:

def line_to_list_of_ints(line):
    # Split each line on commas and convert to an int
    return [int(item) for item in line.split(',')]

with open("sample_input.txt", "r") as f:
    lines = [line.strip() for line in f] # Remove new lines / whitespace
lines = [line[1:-1] for line in lines] # Remove square brackets from each end
lines = [line_to_list_of_ints(line) for line in lines] # Convert the line to a list of integers

If you were using a CSV file, it could be simplified to something like this:

import csv

with open("sample_input.csv", "r") as f:
    reader = csv.reader(f)
    lines = []
    for line in reader:
        lines.append([int(item) for item in line])
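If you want to move to CSV, the existing file could be converted once. The following is a rough sketch, assuming Python 3 and the one-bracketed-list-per-line layout from the question; the function name brackets_to_csv is hypothetical, and an in-memory buffer stands in for the output file.

```python
import csv
import io

def brackets_to_csv(lines, out_file):
    """Rewrite lines such as "[0,0,6]" as CSV rows such as "0,0,6"."""
    writer = csv.writer(out_file)
    for line in lines:
        line = line.strip()
        if not line:  # skip blank lines
            continue
        # Drop the surrounding square brackets, then split on commas.
        writer.writerow([item.strip() for item in line[1:-1].split(",")])

# Hypothetical sample lines; the real file has 128 values per line.
example_lines = ["[0,0,6,139]\n", "[1,4,9,0]\n"]
buffer = io.StringIO()  # stands in for open("sample_input.csv", "w", newline="")
brackets_to_csv(example_lines, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

The resulting rows can then be read back with csv.reader as shown above.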

Use ast.literal_eval:

If you have a single array in the file:

from ast import literal_eval

with open("sample_input.txt") as f:
    out = literal_eval(f.read())
    for line in out:
        print line

0
0
0
0
0
0
0
0
...............

For multiple arrays:

from ast import literal_eval

with open("in.txt") as f:
    for line in f:
        # literal_eval only accepts Python literals, so it cannot run code
        print literal_eval(line)
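Since the clustering code needs all samples at once rather than printed one by one, the loop can collect them into a list. The sketch below also checks that every sample has the expected number of values (128 in the question); the function name load_samples, its expected_dim parameter, and the sample lines are assumptions made for illustration.

```python
from ast import literal_eval

def load_samples(lines, expected_dim):
    """Parse one list literal per line and check the length of every sample."""
    samples = []
    for number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:  # skip blank lines
            continue
        sample = literal_eval(line)
        if not isinstance(sample, list) or len(sample) != expected_dim:
            raise ValueError(
                "line %d: expected a list of %d values" % (number, expected_dim)
            )
        samples.append(sample)
    return samples

# Hypothetical 4-value samples; the real file would use expected_dim=128.
example_lines = ["[0,0,6,139]\n", "\n", "[1,4,9,0]\n"]
samples = load_samples(example_lines, expected_dim=4)
print(len(samples))
```

The returned list of lists has the same shape as the hard-coded data variable in the question, so it can be passed to mbkm.fit in its place.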

Say your data is in mydata.txt ...

% sed '1idata=[ 
s/$/,/;$a]' < mydata.txt > mydata.py

This creates a Python module, mydata.py, that you can import in your program (the one-line i and a commands used here are GNU sed syntax):

from mydata import data
