I'm not particularly good at Python and I'm having trouble solving the following problem.
I have a large text file with three key pieces of data on each line; it covers ~1.2 million images and their related data. For example:
123.jpg | (200 x 200) | /dir/123.jpg
456.jpg | (200 x 200) | /dir/456.jpg
123_0.jpg | (1080 x 1080) | /dir/123_0.jpg
456_001.jpg | (2080 x 2080) | /dir/456_001.jpg
596.jpg | (200 x 480) | /dir/593.jpg
As you can see from the sample above, some images share the same name with an extra suffix tagged on. What I want to do is find the image id (e.g. 123), search the file, keep only the file with the largest resolution, and write it to a new file. For image id 123, the entry that ends up in the output file would be the location of 123_0.jpg.
My approach to this was to create a dictionary data type.
with open('test.txt', 'r') as data:
    for line in data:
        fileValue = line.split(' | ')
        data = {'Image Name':fileValue[0],
                'Resolution':fileValue[1],
                'Location':fileValue[2]
                }
However, I cannot figure out how to access any values from the dict other than the last one. Clearly I am misunderstanding the data type and how to use it, but when I run something like print(data.values()) I only get the last line read from test.txt.
My question is: how do I access each value, or store multiple values in a dictionary, to do what I want? Am I misusing dictionaries here, i.e. should I be using a dictionary of dictionaries?
A dictionary would be a good overall data structure to use because it makes looking up the data by id very fast. You can also store the "bits" of information associated with each id in a dictionary, too.
import os
from pprint import pprint

img_dict = {}
with open('img_test_data.txt', 'r') as data_file:
    for line in data_file:
        filename, res, loc = [item.strip() for item in line.split(' | ')]
        id = os.path.splitext(filename)[0]  # remove extension
        img_dict[id] = {'Image Name': filename, 'Resolution': res, 'Location': loc}

pprint(img_dict)
Output:
{'123': {'Image Name': '123.jpg',
         'Location': '/dir/123.jpg',
         'Resolution': '(200 x 200)'},
 '123_0': {'Image Name': '123_0.jpg',
           'Location': '/dir/123_0.jpg',
           'Resolution': '(1080 x 1080)'},
 '456': {'Image Name': '456.jpg',
         'Location': '/dir/456.jpg',
         'Resolution': '(200 x 200)'},
 '456_001': {'Image Name': '456_001.jpg',
             'Location': '/dir/456_001.jpg',
             'Resolution': '(2080 x 2080)'},
 '596': {'Image Name': '596.jpg',
         'Location': '/dir/593.jpg',
         'Resolution': '(200 x 480)'}}
This will make accessing them fairly easy, although a bit verbose.
print(img_dict['456']['Image Name']) # -> 456.jpg
print(img_dict['456']['Resolution']) # -> (200 x 200)
print(img_dict['456']['Location']) # -> /dir/456.jpg
There are ways to make accessing the information more concise. Instead of a sub-dictionary, you could create a collections.namedtuple. Another possibility would be an instance of a custom class. Either of these would reduce the above to something along these lines:
print(img_dict['456'].image_name) # -> 456.jpg
print(img_dict['456'].resolution) # -> (200 x 200)
print(img_dict['456'].location) # -> /dir/456.jpg
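For the custom-class option mentioned above, here is a minimal sketch using a dataclass. The names ImageInfo and parse_line are hypothetical (they are not from the original answer), and the sketch assumes every line has the "name | (W x H) | path" layout shown in the question.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ImageInfo:
    """One parsed line of the image listing (hypothetical class name)."""
    image_name: str
    resolution: str
    location: str


def parse_line(line):
    """Split 'name | (W x H) | path' into an ImageInfo (hypothetical helper)."""
    filename, res, loc = [item.strip() for item in line.split(' | ')]
    return ImageInfo(filename, res, loc)


info = parse_line('456.jpg | (200 x 200) | /dir/456.jpg\n')
print(info.image_name)   # -> 456.jpg
print(info.resolution)   # -> (200 x 200)
print(info.location)     # -> /dir/456.jpg
```

Storing such instances as the values of img_dict gives the same attribute-style access as a namedtuple, with room to add methods later.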
Here's what creating a dictionary that contained namedtuple instances instead of sub-dictionaries would look like:
import os
from collections import namedtuple

MovieInfo = namedtuple('MovieInfo', 'image_name, resolution, location')

img_dict = {}
with open('img_test_data.txt', 'r') as data_file:
    for line in data_file:
        filename, res, loc = [item.strip() for item in line.split(' | ')]
        id = os.path.splitext(filename)[0]  # remove extension
        img_dict[id] = MovieInfo(filename, res, loc)
Resulting in an img_dict filled in like this:
{'123': MovieInfo(image_name='123.jpg', resolution='(200 x 200)', location='/dir/123.jpg'),
'123_0': MovieInfo(image_name='123_0.jpg', resolution='(1080 x 1080)', location='/dir/123_0.jpg'),
'456': MovieInfo(image_name='456.jpg', resolution='(200 x 200)', location='/dir/456.jpg'),
'456_001': MovieInfo(image_name='456_001.jpg', resolution='(2080 x 2080)', location='/dir/456_001.jpg'),
'596': MovieInfo(image_name='596.jpg', resolution='(200 x 480)', location='/dir/593.jpg')}
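The dictionaries above are keyed by the full file name without extension, so 123 and 123_0 end up as separate entries. To get at the original goal (one entry per image id, keeping the largest resolution), one possible approach is sketched below. It rests on two assumptions that the question does not confirm: the image id is everything before the first underscore or dot in the file name, and "largest" means the greatest pixel count (width times height). The helper names image_id, pixel_count and largest_per_id are hypothetical.

```python
import re

# Assumed line layout: "name.jpg | (W x H) | /path/name.jpg"
RES_PATTERN = re.compile(r'\((\d+)\s*x\s*(\d+)\)')


def image_id(filename):
    """Assumption: the id is everything before the first '_' or '.'."""
    return re.split(r'[_.]', filename, maxsplit=1)[0]


def pixel_count(resolution):
    """Turn a string such as '(1080 x 1080)' into 1080 * 1080."""
    match = RES_PATTERN.search(resolution)
    if match is None:
        raise ValueError('unrecognised resolution: %r' % resolution)
    width, height = map(int, match.groups())
    return width * height


def largest_per_id(lines):
    """Map each image id to (pixel_count, location) of its largest file."""
    best = {}
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        filename, res, loc = [item.strip() for item in line.split(' | ')]
        key = image_id(filename)
        pixels = pixel_count(res)
        # Keep the record only if it beats what we have stored for this id
        if key not in best or pixels > best[key][0]:
            best[key] = (pixels, loc)
    return best


sample = [
    '123.jpg | (200 x 200) | /dir/123.jpg',
    '456.jpg | (200 x 200) | /dir/456.jpg',
    '123_0.jpg | (1080 x 1080) | /dir/123_0.jpg',
    '456_001.jpg | (2080 x 2080) | /dir/456_001.jpg',
    '596.jpg | (200 x 480) | /dir/593.jpg',
]
best = largest_per_id(sample)
print(best['123'][1])  # -> /dir/123_0.jpg
```

Because an open file is itself an iterable of lines, largest_per_id can be passed the file object directly, so only one record per id is held in memory at a time. Writing the result out is then a loop over best.values() that writes each location to the output file.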
I think what you need is a list of dicts:
data = []
with open('test.txt', 'r') as data_file:  # separate name so the list is not shadowed
    for line in data_file:
        fileValue = [item.strip() for item in line.split(' | ')]
        data.append({'Image Name': fileValue[0],
                     'Resolution': fileValue[1],
                     'Location': fileValue[2]
                     })
Now you can access the individual records extracted from the lines via an index:
record = data[index]
and then access the fields using your keys:
print(record['Image Name'])
One of the most glaring issues is that you already have a variable called data in scope, holding the open file, and inside the loop you rebind it to a dictionary while it is still holding your file information.
Declaring a list outside of your with-as statement gives you a place to collect the dictionaries built from each line and keep them for later.
fileData = []
with open('test.txt', 'r') as data:
    for line in data:
        components = list(map(lambda s: s.strip(), line.split('|')))
        fileData.append({'Image Name': components[0],
                         'Resolution': components[1],
                         'Location': components[2]
                         })
The line components = list(map(lambda s: s.strip(), line.split('|'))) simply generates a list for each line in the file, where the values are split on the | character and leading and trailing whitespace is stripped from each one.
This will generate a list as such:
[
{'Location': '/dir/123.jpg', 'Image Name': '123.jpg', 'Resolution': '(200 x 200)'},
{'Location': '/dir/456.jpg', 'Image Name': '456.jpg', 'Resolution': '(200 x 200)'},
{'Location': '/dir/123_0.jpg', 'Image Name': '123_0.jpg', 'Resolution': '(1080 x 1080)'},
{'Location': '/dir/456_001.jpg', 'Image Name': '456_001.jpg', 'Resolution': '(2080 x 2080)'},
{'Location': '/dir/593.jpg', 'Image Name': '596.jpg', 'Resolution': '(200 x 480)'}
]
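As a side note on the map and lambda line used above, the same stripping step can be written as a list comprehension, which many Python readers find easier to scan. The sketch below checks the two forms against each other on one sample line; the variable names are illustrative only.

```python
line = '123_0.jpg | (1080 x 1080) | /dir/123_0.jpg\n'

# Form used in the answer: map + lambda
via_map = list(map(lambda s: s.strip(), line.split('|')))

# Equivalent list comprehension
via_comprehension = [part.strip() for part in line.split('|')]

# Pair the stripped values with the field names to build one record
keys = ('Image Name', 'Resolution', 'Location')
record = dict(zip(keys, via_comprehension))

print(via_comprehension)  # -> ['123_0.jpg', '(1080 x 1080)', '/dir/123_0.jpg']
print(record['Location'])  # -> /dir/123_0.jpg
```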