
Parsing through a file from database and adding info to dictionary

I have a text file as follows:

HEADER INFO

Last1, First1       Movie1 (1991) random stuff
                        Movie2 (1992) random stuff
                        Movie3 (1995) random stuff
                        Movie4 (3455) random stuff

Last2, First2       Movie1 (1998) random stuff
                        Movie2 (4568) random stuff
                        Movie3 (2466) random stuff
                        Movie4 (4325) random stuff
                        Movie5 (4875) random stuff
                        Movie6 (3525) random stuff
                        Movie7 (4567) random stuff

FOOTER INFO

The file also contains header and footer information that I can skip. The number of spaces between the name and the movie is not constant. I want to load this data into a dictionary using while loops (no for loops anywhere in the process). The name will be the key and the list of movies that follow it will be the value (names and movies are both strings). So far I can get either the lines that contain the names or the lines that contain the movies. I tried using an if statement to get it to work, but to no avail.

My idea was to use an if statement to check, by some characteristic of the line, whether it contains a name; if it does, slice out the name and the movie and add them to the dictionary. If the line has no name, associate that movie with the same name as before (one name, multiple entries). I think this is where I'm lost: this part, and maybe how I'm iterating with the while loop.

I didn't use readline(). Instead I used readlines() and stepped through the resulting lines to pick out the information. I'm just wondering if anyone has any tips or hints they could offer.

If anyone wants the actual data I'm using, please PM me.

I'll rephrase it:

CRC: 0xDE308B96  File: actors.list  Date: Fri Aug 12 00:00:00 2011

Copyright 1990-2007 The Internet Movie Database, Inc.  All rights reserved.

COPYING POLICY: Internet Movie Database (IMDb)
==============================================

CUTTING COPYRIGHT NOTICE

THE ACTORS LIST
===============

Name                    Titles
----                    ------
ActA, A                 m1 (2011)
                            m2 (2011)

ActB, B                 m1 (2011)
                            m2 (2011)
                            m3 (2001)

ActC, C                 m1 (2011)

ActD, D                 m3 (2003)
                            m6 (2006)

ActE, E                 m6 (2006)

ActF, F                 m4 (2004)

ActG, G                 m4 (2004)

ActH, H                 m5 (2005)

Bacon, Kevin            m2 (2011)
                        m5 (2005)

-----------------------------------------------------------------------------
SUBMITTING UPDATES
==================

CUTTING UPDATES

For further info visit http://www.imdb.com/licensing/contact

Basically, I want the output to be a dictionary:

{'E Acte': ['m6 (2006)'],
'A Acta': ['m1 (2011)', 'm2 (2011)'],
'G Actg': ['m4 (2004)'],
'B Actb': ['m1 (2011)', 'm2 (2011)', 'm3 (2001)'],
'D Actd': ['m3 (2003)', 'm6 (2006)'],
'F Actf': ['m4 (2004)'],
'Kevin Bacon': ['m2 (2011)', 'm5 (2005)'],
'H Acth': ['m5 (2005)'],
'C Actc': ['m1 (2011)']}

It was suggested that I use while loops since that will make the process easier, but I'm not restricted to them.

Here is a solution with a for loop, which is much more natural in Python. It assumes the input file is formatted with spaces, like the sample posted in the question above. I have also posted an alternative answer for the case where the list is formatted with tabs instead of spaces.

Of course you could rewrite it as a while loop, but it would not make much sense. In newer Python versions you can also simplify it a bit by using a defaultdict(list) for the output.

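output = {}

pos = -1  # char position of title column (unknown until the dashed row is seen)
current_name = None

with open('actors.list') as f:
    for line in f:
        if pos < 0:
            # Still in the header: look for the "----    ------" row.
            if line.startswith('-'):
                pos = line.find(' ')
                if pos > 0:
                    # The title column starts at the second run of dashes.
                    pos = line.find('-', pos)
        else:
            # A line starting with '-' marks the beginning of the footer.
            if line.startswith('-'):
                break
            name = line[:pos].strip()
            title = line[pos:].strip()
            if name:
                if ',' in name:
                    # Turn "Last, First" into "First Last".
                    name = name.split(',', 1)
                    name[0] = name[0].rstrip()
                    name[1] = name[1].lstrip()
                    name.reverse()
                    name = ' '.join(name)
                current_name = name
            if title:
                # Continuation lines have no name and reuse current_name.
                output.setdefault(
                    current_name, []).append(title)

print(output)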
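For anyone who does need the while-loop form the question asks about, here is a minimal sketch of the same logic, also using defaultdict(list) for the output. It is only one possible rewrite, not a canonical one. The function name parse_with_while and the sample lines are made up for illustration, and the function takes a list of lines (such as the result of readlines()) rather than opening the file itself.

```python
from collections import defaultdict


def parse_with_while(lines):
    """Build {"First Last": [titles]} from space-aligned lines, while loops only.

    `lines` is a list of strings, e.g. the result of f.readlines().
    """
    output = defaultdict(list)
    n = len(lines)
    i = 0

    # Phase 1: skip the header until the "----    ------" row is found.
    pos = -1  # char position of the title column
    while i < n and pos < 0:
        line = lines[i]
        if line.startswith('-'):
            gap = line.find(' ')
            if gap > 0:
                # The title column starts at the second run of dashes.
                pos = line.find('-', gap)
        i += 1

    # Phase 2: read entries until a line starting with '-' (the footer).
    current_name = None
    while i < n and not lines[i].startswith('-'):
        line = lines[i]
        name = line[:pos].strip()
        title = line[pos:].strip()
        if name:
            if ',' in name:
                # Turn "Last, First" into "First Last".
                last, first = name.split(',', 1)
                name = first.strip() + ' ' + last.strip()
            current_name = name
        if title and current_name is not None:
            output[current_name].append(title)
        i += 1

    return dict(output)


if __name__ == '__main__':
    # Hypothetical sample in the same layout as the question's data.
    sample = [
        "THE ACTORS LIST\n",
        "Name".ljust(24) + "Titles\n",
        "----".ljust(24) + "------\n",
        "Bacon, Kevin".ljust(24) + "m2 (2011)\n",
        " " * 24 + "m5 (2005)\n",
        "\n",
        "-" * 40 + "\n",
    ]
    print(parse_with_while(sample))
```

The keys come out as 'A ActA' rather than the 'A Acta' shown in the question. If that capitalization is wanted, calling name.title() before storing the name would probably do it, but that is a guess about how the expected output was produced.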

Here is another solution for the case when the list is formatted with tab characters instead of spaces:

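output = {}

in_list = False  # becomes True once the dashed row under the column names is seen
current_name = None

with open('actors.list') as f:
    for line in f:
        if in_list:
            # A line starting with '-' marks the beginning of the footer.
            if line.startswith('-'):
                break
            # Blank lines and anything without a tab carry no entry.
            if '\t' not in line:
                continue
            name, title = line.split('\t', 1)
            name = name.strip()
            title = title.strip()
            if name:
                if ',' in name:
                    # Turn "Last, First" into "First Last".
                    name = name.split(',', 1)
                    name[0] = name[0].rstrip()
                    name[1] = name[1].lstrip()
                    name.reverse()
                    name = ' '.join(name)
                current_name = name
            if title:
                # Continuation lines have no name and reuse current_name.
                output.setdefault(
                    current_name, []).append(title)
        else:
            if line.startswith('-'):
                in_list = True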
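If it is unclear whether a given copy of the file uses tabs or spaces, the two variants can be folded into one by splitting each line at the first column gap with a regular expression. This is a sketch under an assumption that neither answer states: names never contain a tab or two consecutive spaces. The names COLUMN_GAP and parse_any_whitespace are made up for illustration.

```python
import re

# A column gap is a run of two or more spaces/tabs, or a single tab.
COLUMN_GAP = re.compile(r'[ \t]{2,}|\t')


def parse_any_whitespace(lines):
    """Build {"First Last": [titles]} from tab- or space-separated lines."""
    output = {}
    in_list = False
    current_name = None
    for line in lines:
        if not in_list:
            # The dashed row under the column names starts the list.
            in_list = line.startswith('-')
            continue
        if line.startswith('-'):
            break  # footer reached
        # Split only at the first gap; titles may contain more gaps.
        parts = COLUMN_GAP.split(line.rstrip('\r\n'), maxsplit=1)
        if len(parts) != 2:
            continue  # blank line or no column gap
        name, title = parts[0].strip(), parts[1].strip()
        if name:
            if ',' in name:
                # Turn "Last, First" into "First Last".
                last, first = name.split(',', 1)
                name = first.strip() + ' ' + last.strip()
            current_name = name
        if title and current_name is not None:
            output.setdefault(current_name, []).append(title)
    return output
```

It can be called as parse_any_whitespace(f) on an open file object or on the list returned by readlines(). Continuation lines start with whitespace, so the split gives an empty name and the title is attached to the most recent name.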
