
Python: making a set from a text file

I have a text file. All of it looks like this (the question has been edited; this is not what the file initially looked like):

 (0, 16, 0)
 (0, 17, 0)
 (0, 18, 0)
 (0, 19, 0)
 (0, 20, 0)
 (0, 21, 0)
 (0, 22, 0)
 (0, 22, 1)
 (0, 22, 2)
 (0, 23, 0)
 (0, 23, 4)
 (0, 24, 0)
 (0, 25, 0)
 (0, 25, 1)
 (0, 26, 0)
 (0, 26, 3)
 (0, 26, 4)
 (0, 26, 5)
 (0, 26, 9)
 (0, 27, 0)
 (0, 27, 1)

Anyway, how do I put these values into a set in Python 2?

My most recent attempt was

om_set = set(open('Rye Grass.txt').read())

EDIT: This is the code I used to get my text file.

import cv2
import numpy as np
import time

om=cv2.imread('spectrum1.png')
om=om.reshape(1,-1,3)
om_list=om.tolist()
om_tuple={tuple(item) for item in om_list[0]}
om_set=set(om_tuple)

im=cv2.imread('1.jpg')        
im=cv2.resize(im,(100,100))         
im= im.reshape(1,-1,3)
im_list=im.tolist()
im_tuple={tuple(item) for item in im_list[0]}
ColourCount= om_set & set(im_tuple)
with open('Weedlist', 'a') as outputfile:
    output = ', '.join([str(tup) for tup in sorted(ColourCount)])
    outputfile.write(output)
print 'done'

im=cv2.imread('2.jpg')        
im=cv2.resize(im,(100,100))         
im= im.reshape(1,-1,3)
im_list=im.tolist()
im_tuple={tuple(item) for item in im_list[0]}
ColourCount= om_set & set(im_tuple)
with open('Weedlist', 'a') as outputfile:
    output = ', '.join([str(tup) for tup in sorted(ColourCount)])
    outputfile.write(output)
print 'done'

As @TimPietzcker suggested, and trusting the file to contain only these fixed representations of integers in comma-separated triplets surrounded by parentheses, here is a simple parser that does it in one go (the OP's attempt also did a greedy "read" of the whole file into memory):

#! /usr/bin/env python
from __future__ import print_function


infile = 'pixel_int_tuple_reps.txt'
split_pits = None
with open(infile, 'rt') as f_i:
    split_pits = [z.strip(' ()') for z in f_i.read().strip().split('),')]
if split_pits:
    on_set = set(tuple(int(z.strip())
                 for z in tup.split(', ')) for tup in split_pits)

    print(on_set)

transforms:

(0, 19, 0), (0, 20, 0), (0, 21, 1), (0, 22, 0), (0, 24, 3), (0, 27, 0), (0, 29, 2), (0, 35, 2), (0, 36, 1)

into:

set([(0, 27, 0), (0, 36, 1), (0, 21, 1), (0, 22, 0), (0, 24, 3), (0, 19, 0), (0, 35, 2), (0, 29, 2), (0, 20, 0)])

The small snippet:

  1. splits the pixel integer triplets into substrings such as 0, 19, 0, cleaning away the stray parentheses and spaces (and also taking care of the closing parenthesis at the end).

  2. if that "worked", converts each substring into a tuple of integers and feeds those RGB tuples into a set.
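
To make the two steps concrete, here is a minimal sketch that runs the same transformations on an in-memory string instead of a file. The sample string is an illustrative assumption; the variable names follow the snippet above.

```python
from __future__ import print_function

# Hypothetical sample in the comma-separated format this answer assumes.
sample = "(0, 19, 0), (0, 20, 0), (0, 21, 1)"

# Step 1: split on '),' and strip the leftover parentheses and spaces.
split_pits = [z.strip(' ()') for z in sample.strip().split('),')]
print(split_pits)

# Step 2: turn each 'r, g, b' substring into a tuple of ints, collected in a set.
on_set = set(tuple(int(z.strip()) for z in tup.split(', '))
             for tup in split_pits)
print(sorted(on_set))
```

The first print shows the cleaned substrings, the second the resulting tuples in sorted order (a set itself has no guaranteed order).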

I would really think twice before using eval/exec for that kind of deserialization task.
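
If an eval-style one-liner is still attractive, one possible alternative (a sketch, not part of the original answer) is ast.literal_eval from the standard library, which only accepts Python literals and refuses arbitrary expressions. The text variable below is a hypothetical stand-in for the file content, and the sketch assumes the comma-separated format.

```python
from __future__ import print_function
import ast

# Hypothetical stand-in for the stripped content of the text file.
text = "(0, 19, 0), (0, 20, 0), (0, 21, 1)"

# literal_eval parses the text as a tuple of tuples without executing code.
om_set = set(ast.literal_eval(text.strip()))
print(sorted(om_set))

# Anything that is not a plain literal is rejected with ValueError.
try:
    ast.literal_eval("__import__('os').getcwd()")
    rejected = False
except ValueError:
    rejected = True
print(rejected)
```

For a file with one tuple per line, the same call could be applied to each stripped line instead of to the whole text.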

Update as suggested by comments from OP (please update the question!):

  1. The file at the OP's site seems to be too big to print (or to keep in memory)?
  2. It is not written as advertised in the question ...

... so until we have further info from OP:

For a theoretically clean dump file of 3-int tuples this answer works (if the file is not too big to load at once and map into a set).

For the concrete task, I may update the answer if sufficient new info has been added to the question ;-)

If the triple "lines" are concatenated from previous stages, with or without a newline separating them but always missing the comma, one way is to change the file-reading part, either:

  1. into a line-based reader (when newlines separate the chunks), pulling the set generation into the loop and always taking the union of the newly harvested set with the existing (accumulating) one, like s = s | fresh, that is, tackling the chunks in "isolation"

or, if these "chunks" are appended like so: (0, 1, 230)(13, ... that is, with )( "hitting hard":

  1. modify the existing code inside the reader: instead of f_i.read().strip().split('),') write f_i.read().replace(')(', '), (').strip().split('),') ... that is, "fix" the )( part into a ), ( part so that parsing can continue as if the file had a homogeneous "structure".
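
The second variant can be sketched on an in-memory string as follows. The raw value is a made-up example of two appended chunks; the real file content is not known, so treat this as an illustration of the replace step only.

```python
from __future__ import print_function

# Hypothetical content: two chunks appended without a separating comma.
raw = "(0, 1, 230), (0, 2, 5)(13, 0, 7), (0, 1, 230)"

# Turn every ')(' seam into '), (' so the text is uniformly comma-separated.
fixed = raw.replace(')(', '), (')

# From here on, the parsing is the same as in the original snippet.
split_pits = [z.strip(' ()') for z in fixed.strip().split('),')]
on_set = set(tuple(int(z.strip()) for z in tup.split(', '))
             for tup in split_pits)
print(sorted(on_set))
```

The triplet that occurs in both chunks ends up in the set only once.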

Update: now parsing version 2 of the dataset (updated question):

File pixel_int_tuple_reps_v2.txt now has:

 (0, 16, 0)
 (0, 17, 0)
 (0, 18, 0)
 (0, 19, 0)
 (0, 20, 0)
 (0, 21, 0)
 (0, 22, 0)
 (0, 22, 1)
 (0, 22, 2)
 (0, 23, 0)
 (0, 23, 4)
 (0, 24, 0)
 (0, 25, 0)
 (0, 25, 1)
 (0, 26, 0)
 (0, 26, 3)
 (0, 26, 4)
 (0, 26, 5)
 (0, 26, 9)
 (0, 27, 0)
 (0, 27, 1)

The code:

#! /usr/bin/env python
from __future__ import print_function


infile = 'pixel_int_tuple_reps_v2.txt'
on_set = set()
with open(infile, 'rt') as f_i:
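    for line in f_i:  # iterate lazily instead of loading all lines at once
        # Drop surrounding whitespace and the enclosing parentheses.
        rgb_line = line.strip().lstrip('(').rstrip(')')
        try:
            # Convert "r, g, b" into a tuple of ints and add it to the set.
            on_set.add(tuple(int(z.strip()) for z in rgb_line.split(', ')))
        except ValueError:  # int() failed: blank or malformed line
            print("Ignored:" + rgb_line)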
print(len(on_set))
for rgb in sorted(on_set):
    print(rgb)

This now parses the file, first printing the size of the set and then its elements in sorted order:

21
(0, 16, 0)
(0, 17, 0)
(0, 18, 0)
(0, 19, 0)
(0, 20, 0)
(0, 21, 0)
(0, 22, 0)
(0, 22, 1)
(0, 22, 2)
(0, 23, 0)
(0, 23, 4)
(0, 24, 0)
(0, 25, 0)
(0, 25, 1)
(0, 26, 0)
(0, 26, 3)
(0, 26, 4)
(0, 26, 5)
(0, 26, 9)
(0, 27, 0)
(0, 27, 1)

HTH. Note that there are no duplicates in the provided sample input. After duplicating the last data line I still received 21 unique elements as output, so I guess it now works as designed ;-)
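
The duplicate check can be reproduced without a file by wrapping the line-based logic in a small function. The name parse_rgb_lines and the three sample lines are hypothetical, chosen only for this sketch.

```python
from __future__ import print_function


def parse_rgb_lines(lines):
    """Collect '(r, g, b)' lines into a set of int tuples, skipping bad lines."""
    result = set()
    for line in lines:
        rgb_line = line.strip().lstrip('(').rstrip(')')
        try:
            result.add(tuple(int(z.strip()) for z in rgb_line.split(',')))
        except ValueError:
            # Blank or malformed line: report it and carry on.
            print("Ignored:" + rgb_line)
    return result


# The last data line is deliberately doubled.
lines = [" (0, 27, 0)\n", " (0, 27, 1)\n", " (0, 27, 1)\n"]
unique = parse_rgb_lines(lines)
print(len(unique))
```

Because a set keeps each element only once, the doubled line does not change the count.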

Only a small modification is needed. You can try this:

om_set = set(eval(open('abc.txt').read()))

Result

{(0, 19, 0),
 (0, 20, 0),
 (0, 21, 1),
 (0, 22, 0),
 (0, 24, 3),
 (0, 27, 0),
 (0, 29, 2),
 (0, 35, 2)}

Edit: Here is the code working at an IPython prompt.

In [1]: file_ = open('abc.txt')
In [2]: text_read = file_.read()
In [3]: print eval(text_read)
((0, 19, 0), (0, 20, 0), (0, 21, 1), (0, 22, 0), (0, 24, 3), (0, 27, 0), (0, 29, 2), (0, 35, 2), (0, 36, 1))
In [4]: type(eval(text_read))
Out[4]: tuple
In [5]: print set(eval(text_read))
set([(0, 27, 0), (0, 36, 1), (0, 21, 1), (0, 22, 0), (0, 24, 3), (0, 19, 0), (0, 35, 2), (0, 29, 2), (0, 20, 0)])


 