I have a list filled with lists like this one: ['L1045', 'u0', 'm0', 'BIANCA', 'They do not!'] and this one: ['L1981', 'u16', 'm1', 'COLUMBUS', "I haven't given you much of a life."], parsed from the Cornell Movie Dialog Corpus, where index 0 is the dialogue line ID, index 2 is the movie ID, and index 4 is the line itself. There are many lines from each movie, so many lists have identical items at index 2 (many 'm0's, for example). The corpus does not have every line in each movie, though, so the items at index 0 may fall into some patterns while other numbers are absent (for example, there might be an 'L99', 'L100', 'L102' for a particular movie, but then there may be a gap from 103 to 179).
Basically, I'm trying to create a separate list of strings (the index 4 items) for each run of sequential lines in each movie. So a separate list of lines for each separate "scene" in each movie.
I'm just having a very hard time getting there. I don't know if I should be creating a dictionary where each unique movie (index 2) has a unique key whose value is a list of tuples, each holding the line number and the line itself, and then using some kind of counter to check whether there is a gap in the line numbers. If I go this route, I'm struggling to even figure out how to do this for each specific movie...
Any help would be tremendously appreciated!
Below is some code I know doesn't work but shows some of my initial thought processes:
movie_lines = 'DIRECTORYPATH/movie_lines.txt'
with open(movie_lines, "r", encoding="ISO-8859-1") as fh:
    lines_chunks = [line.split(" +++$+++ ") for line in fh]
number = 0
counter = 'm' + str(number)
new_list = []
for i in range(616):
    number = 0
    counter = 'm' + str(number)
    for line in lines_chunks:
        if line[2] == counter:
            new_list.append([(line[2], line[0], line[4])])
    number += 1
Here's my approach:
I'd use a nested dictionary to store data:
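data = {'movie_id': {'character_id': [(line_number, character, actual_line), ...]}}
Note that the second field of each record (e.g. 'u0') identifies the speaking character rather than a scene, so it is used here as the inner key. This way, if you want to retrieve all lines for a particular character ID in a particular movie, you just need to call data['movie_id']['character_id'], and the return is a list of tuples.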
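As a quick illustration of that shape, here is a minimal sketch built by hand from the two sample records in the question. The name sample_data is made up for this example and is not part of the code that follows.

```python
# Hand-built example of the nested layout:
# movie ID -> inner key (second field of each record) -> list of tuples.
# The two records are the samples quoted in the question.
sample_data = {
    'm0': {'u0': [(1045, 'BIANCA', 'They do not!')]},
    'm1': {'u16': [(1981, 'COLUMBUS', "I haven't given you much of a life.")]},
}

# Indexing by movie ID and then by inner key returns a list of
# (line number, character name, text) tuples.
print(sample_data['m0']['u0'])
```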
Here's the code:
movie_lines = 'movie_lines.txt'
# Use the same encoding as in the question
with open(movie_lines, "r", encoding="ISO-8859-1") as f:
    lines = [line.split(' +++$+++ ') for line in f]
data = dict()
for line in lines:
    # line[0] --> line_id, e.g. 'L1045'
    # line[1] --> character_id, e.g. 'u0' (used as the inner key)
    # line[2] --> movie_id, e.g. 'm0'
    # line[3] --> character name
    # line[4] --> actual_line
    if line[2] not in data:
        data[line[2]] = {line[1]: [(int(line[0][1:]), line[3], line[4])]}
    elif line[1] not in data[line[2]]:
        data[line[2]][line[1]] = [(int(line[0][1:]), line[3], line[4])]
    else:
        data[line[2]][line[1]].append((int(line[0][1:]), line[3], line[4]))
# taking movie 'm0' and inner key 'u0' as an example
test = data['m0']['u0']
test.sort()  # by default sort is done by first element in tuple
print(test)
int(line[0][1:]) converts the line ID "Lxxx" to an integer for ease of sorting later.
Output:
[(49, 'BIANCA', 'Did you change your hair?\n'), (51, 'BIANCA', 'You might wanna think about it\n'), (165, 'BIANCA', 'Nowhere... Hi, Daddy.\n'), (179, 'BIANCA', "Now don't get upset. Daddy, but there's this boy... and I think he might ask...\n"), ..., (1021, 'BIANCA', 'Is that woman a complete fruit-loop or is it just me?\n'), (1045, 'BIANCA', 'They do not!\n'), (1051, 'BIANCA', 'Patrick -- is that- a.\n')]
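To get from a sorted list like this to the separate runs of sequential lines asked about in the question, one option is to start a new group whenever the line number jumps by more than some threshold. The following is only a sketch under that assumption: split_into_runs and max_gap are names made up for this example, and the sample tuples are invented for illustration rather than taken from the corpus.

```python
def split_into_runs(entries, max_gap=1):
    """Split (line_number, character, text) tuples into runs of nearby line numbers.

    A new run starts whenever the line number rises by more than max_gap
    compared with the previous entry.
    """
    runs = []
    previous = None
    for entry in sorted(entries):  # sorts by line number (first tuple element)
        number = entry[0]
        if previous is None or number - previous > max_gap:
            runs.append([])  # first entry or a gap: start a new run
        runs[-1].append(entry)
        previous = number
    return runs


# Invented entries for a single movie, deliberately out of order.
sample = [
    (100, 'A', 'second'),
    (99, 'B', 'first'),
    (101, 'A', 'third'),
    (180, 'B', 'fourth'),
    (181, 'A', 'fifth'),
]

for run in split_into_runs(sample):
    print([text for _, _, text in run])
```

With the default max_gap=1 this splits the sample into the runs 99-101 and 180-181. To apply it per movie, collect the tuples for every character of one movie ID into a single list first, since a run of consecutive lines usually involves more than one speaker. Raising max_gap (for example to 2) would keep a sequence like 99, 100, 102 together.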
Hope this could help you. Cheers.