
Is a dictionary a good data-structure for this information?

I'm not particularly good at Python and I'm having trouble solving a problem. What I am trying to do is the following:

I have a large text file with three key pieces of data on each line, covering ~1.2 million images and their related data. For example:

123.jpg     | (200 x 200)   | /dir/123.jpg
456.jpg     | (200 x 200)   | /dir/456.jpg
123_0.jpg   | (1080 x 1080)   | /dir/123_0.jpg
456_001.jpg | (2080 x 2080) | /dir/456_001.jpg
596.jpg     | (200 x 480)   | /dir/593.jpg

As you can see from the sample above, some images share the same name with an extra suffix tagged on. What I want to do is find the image id (e.g. 123), search the file, take only the file with the largest resolution, and write it to a new file. For image id 123, for example, the entry that would end up in the output file is the location of 123_0.jpg.

My approach to this was to create a dictionary data type.

with open('test.txt', 'r') as data:
    for line in data:
        fileValue = line.split(' | ')
        data = {'Image Name':fileValue[0],
                'Resolution':fileValue[1],
                'Location':fileValue[2]
                }

However, I cannot seem to access any values from the dict other than the last one. Clearly I am misunderstanding the data type and how to use it, but when I run something like print(data.values()) I only get the last line read from test.txt.

My question is: how do I access each value, or store multiple values in a dictionary, to do what I want? Am I misusing dictionaries here, i.e. should I be using a dictionary of dictionaries?

A dictionary would be a good overall data structure to use because it makes looking up the data by id very fast. You can also store the "bits" of information associated with each id in a dictionary.

import os
from pprint import pprint
img_dict = {}

with open('img_test_data.txt', 'r') as data_file:
    for line in data_file:
        filename, res, loc = [item.strip() for item in line.split(' | ')]
        id = os.path.splitext(filename)[0]  # remove extension
        img_dict[id] = {'Image Name': filename, 'Resolution': res, 'Location': loc}

pprint(img_dict)

Output:

{'123': {'Image Name': '123.jpg',
         'Location': '/dir/123.jpg',
         'Resolution': '(200 x 200)'},
 '123_0': {'Image Name': '123_0.jpg',
           'Location': '/dir/123_0.jpg',
           'Resolution': '(1080 x 1080)'},
 '456': {'Image Name': '456.jpg',
         'Location': '/dir/456.jpg',
         'Resolution': '(200 x 200)'},
 '456_001': {'Image Name': '456_001.jpg',
             'Location': '/dir/456_001.jpg',
             'Resolution': '(2080 x 2080)'},
 '596': {'Image Name': '596.jpg',
         'Location': '/dir/593.jpg',
         'Resolution': '(200 x 480)'}}

This will make accessing them fairly easy, although a bit verbose.

print(img_dict['456']['Image Name'])  # -> 456.jpg
print(img_dict['456']['Resolution'])  # -> (200 x 200)
print(img_dict['456']['Location'])    # -> /dir/456.jpg

There are ways to make accessing the information more concise. Instead of a sub-dictionary, you could create a collections.namedtuple. Another possibility would be an instance of a custom class. Either of these would reduce the above to something along these lines:

print(img_dict['456'].image_name)  # -> 456.jpg
print(img_dict['456'].resolution)  # -> (200 x 200)
print(img_dict['456'].location)    # -> /dir/456.jpg
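For the custom-class option mentioned above, a minimal sketch could use a dataclass. The class name ImageInfo is hypothetical and not part of the original answer, and the single entry is built by hand from the sample data rather than read from the file:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ImageInfo:
    """Hypothetical record type; field names mirror the attribute access shown above."""
    image_name: str
    resolution: str
    location: str


# One entry built by hand from the sample data, keyed the same way as img_dict.
img_dict = {'456': ImageInfo('456.jpg', '(200 x 200)', '/dir/456.jpg')}

print(img_dict['456'].image_name)  # -> 456.jpg
print(img_dict['456'].resolution)  # -> (200 x 200)
print(img_dict['456'].location)    # -> /dir/456.jpg
```

Compared with a namedtuple, a dataclass makes it easy to add methods later (for instance one that parses the resolution string), at the cost of a slightly longer definition.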

Here's what creating a dictionary that contained namedtuple instances instead of sub-dictionaries would look like:

import os
from collections import namedtuple

MovieInfo = namedtuple('MovieInfo', 'image_name, resolution, location')
img_dict = {}

with open('img_test_data.txt', 'r') as data_file:
    for line in data_file:
        filename, res, loc = [item.strip() for item in line.split(' | ')]
        id = os.path.splitext(filename)[0]  # remove extension
        img_dict[id] = MovieInfo(filename, res, loc)

This results in an img_dict filled in like this:

{'123': MovieInfo(image_name='123.jpg', resolution='(200 x 200)', location='/dir/123.jpg'),
 '123_0': MovieInfo(image_name='123_0.jpg', resolution='(1080 x 1080)', location='/dir/123_0.jpg'),
 '456': MovieInfo(image_name='456.jpg', resolution='(200 x 200)', location='/dir/456.jpg'),
 '456_001': MovieInfo(image_name='456_001.jpg', resolution='(2080 x 2080)', location='/dir/456_001.jpg'),
 '596': MovieInfo(image_name='596.jpg', resolution='(200 x 480)', location='/dir/593.jpg')}
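Note that the dictionaries above are keyed by the full file stem, so '123' and '123_0' are separate entries. To get the largest image per id, as the question asks, one possible sketch is to key the dictionary by the base id and keep only the best entry seen so far. This assumes the id is the part of the name before the first underscore and that "largest" means the greatest width times height; the helper names pixel_count and largest_per_id are made up for this example, and the sample lines stand in for the real file:

```python
import os
import re

# Sample lines copied from the question; a real run would iterate over the file instead.
sample_lines = [
    "123.jpg     | (200 x 200)   | /dir/123.jpg",
    "456.jpg     | (200 x 200)   | /dir/456.jpg",
    "123_0.jpg   | (1080 x 1080)   | /dir/123_0.jpg",
    "456_001.jpg | (2080 x 2080) | /dir/456_001.jpg",
    "596.jpg     | (200 x 480)   | /dir/593.jpg",
]


def pixel_count(resolution):
    """Turn a string such as '(200 x 480)' into width * height."""
    width, height = (int(n) for n in re.findall(r'\d+', resolution))
    return width * height


def largest_per_id(lines):
    """Map each base image id to the (pixel count, location) of its largest file."""
    best = {}
    for line in lines:
        filename, res, loc = [item.strip() for item in line.split('|')]
        stem = os.path.splitext(filename)[0]  # '123_0.jpg' -> '123_0'
        base_id = stem.split('_')[0]          # assumption: id precedes the first '_'
        pixels = pixel_count(res)
        # Keep this entry only if it is the first or the largest seen for this id.
        if base_id not in best or pixels > best[base_id][0]:
            best[base_id] = (pixels, loc)
    return best


best = largest_per_id(sample_lines)
for base_id, (pixels, loc) in best.items():
    print(base_id, loc)
```

Because only one entry per id is kept, memory use grows with the number of distinct ids rather than the number of lines, which may matter with ~1.2 million rows. Writing the locations to the output file would then be a loop over best.values().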

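I think what you need is a list of dicts:

data = []
# Use a different name for the file handle so it does not replace the list.
with open('test.txt', 'r') as data_file:
    for line in data_file:
        # Strip the padding and the trailing newline from each field.
        fileValue = [item.strip() for item in line.split('|')]
        data.append({'Image Name': fileValue[0],
                     'Resolution': fileValue[1],
                     'Location': fileValue[2]
                     })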

Now you can access the individual records extracted from the lines via an index:

record = data[index]

and then access the fields using your keys:

print(record['Image Name'])

One of the most glaring issues is that you already have a variable called data in scope, bound to the open file, and you are rebinding it to a new dictionary on every pass through the loop, so only the last line's dictionary survives.

Declaring a list outside of your with-as statement gives you a place to collect a dictionary for each line and keep them all for later.

fileData = []
with open('test.txt', 'r') as data:
    for line in data:
        components = list(map(lambda s: s.strip(), line.split('|')))
        fileData.append({'Image Name': components[0],
                         'Resolution': components[1],
                         'Location': components[2]
                        })

The line components = list(map(lambda s: s.strip(), line.split('|'))) is simply generating a list for each line in the file where the values are split by the | character and all whitespace is stripped.

This will generate a list as such:

[
  {'Location': '/dir/123.jpg', 'Image Name': '123.jpg', 'Resolution': '(200 x 200)'}, 
  {'Location': '/dir/456.jpg', 'Image Name': '456.jpg', 'Resolution': '(200 x 200)'}, 
  {'Location': '/dir/123_0.jpg', 'Image Name': '123_0.jpg', 'Resolution': '(1080 x 1080)'}, 
  {'Location': '/dir/456_001.jpg', 'Image Name': '456_001.jpg', 'Resolution': '(2080 x 2080)'}, 
  {'Location': '/dir/593.jpg', 'Image Name': '596.jpg', 'Resolution': '(200 x 480)'}
]
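As a quick check of the split-and-strip step described above, here is the same expression applied to a single sample line, next to an equivalent list comprehension. The variable names raw_parts and same_result are only for illustration:

```python
line = "123_0.jpg   | (1080 x 1080)   | /dir/123_0.jpg\n"

# Splitting alone leaves the column padding and the trailing newline in place.
raw_parts = line.split('|')
print(raw_parts)

# The expression from the answer: strip every piece after splitting.
components = list(map(lambda s: s.strip(), line.split('|')))
print(components)

# An equivalent list comprehension, which many readers find easier to scan.
same_result = [part.strip() for part in line.split('|')]
print(same_result == components)
```

Splitting on '|' and stripping afterwards is a little more forgiving than splitting on ' | ', since it does not depend on the exact number of spaces around the separator.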
