How to parse a movie script into a dictionary

I have data that looks like this:

script = """
JOSH:
How do I know if this works?

MICHAEL:
You would know

JOSH:
But how? 

DAN:
How indeed? I don't really know. 


UNKNOWN: 
I am unknown
"""

I want to find the text spoken by each character in [Josh, Michael, Dan] and ignore UNKNOWN. Note that in this toy example each character has exactly one line per turn, but in the real data there are more.

Ultimately, I'd like to return a dictionary of the form:

lines = {}
lines["Josh"] = ["How do I know if this works?", "But how?"]
lines["Michael"] = ["You would know"]
lines["Dan"] = ["How indeed?", "I don't really know."]

Or perhaps another data structure would be better.

I added a few more lines for each name to get closer to the real task, and used regular expressions to parse it safely:

import re
import pprint

script = """
JOSH:
How do I know if this works?
And here is another line for JOSH

MICHAEL:
You would know
And another line for MICHAEL

JOSH:
But how? 
One more for JOSH

DAN:
How indeed? I don't really know. 
One more for DAN


UNKNOWN: 
I am unknown
"""

# split into paragraphs on 2 or more consecutive newlines
# (the third positional argument of re.split is maxsplit, not flags,
# and this pattern needs no flags anyway)
pars = re.split(r'\n\n+', script)
d = {}

for p in pars:  # for each paragraph
    # capture the name (anchored to beginning of line and all capitals)
    # and the rest of the paragraph - (.*)
    m = re.search(r'^([A-Z]+):(.*)', p, re.S | re.M)
    if m is None:  # skip paragraphs that do not start with "NAME:"
        continue
    name, txt = m.group(1, 2)

    # Each line of the turn as a list item
    d.setdefault(name, []).extend(txt.strip().split('\n'))

pprint.pprint(d)

Output:

{'DAN': ["How indeed? I don't really know. ", 'One more for DAN'],
 'JOSH': ['How do I know if this works?',
          'And here is another line for JOSH',
          'But how? ',
          'One more for JOSH'],
 'MICHAEL': ['You would know', 'And another line for MICHAEL'],
 'UNKNOWN': ['I am unknown']}
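
The dictionary above still contains UNKNOWN, which the question asks to ignore. A minimal sketch of one way to filter it afterwards, assuming `d` has the shape printed above. The names `wanted` and `lines` are illustrative and not part of the original answer:

```python
# Hypothetical follow-up step: keep only the wanted speakers.
# `d` is re-created here with the values printed above so the snippet runs on its own.
d = {
    'DAN': ["How indeed? I don't really know. ", 'One more for DAN'],
    'JOSH': ['How do I know if this works?',
             'And here is another line for JOSH',
             'But how? ',
             'One more for JOSH'],
    'MICHAEL': ['You would know', 'And another line for MICHAEL'],
    'UNKNOWN': ['I am unknown'],
}

wanted = ["Josh", "Michael", "Dan"]  # assumed list of speakers to keep

# Look each wanted name up by its upper-case script form and
# strip the trailing spaces that survived the split.
lines = {
    name: [s.strip() for s in d.get(name.upper(), [])]
    for name in wanted
}

print(lines["Josh"])
```

Using `d.get(..., [])` means a wanted speaker who never appears in the script ends up with an empty list rather than raising a KeyError.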

You can split the script into "blocks" on a double newline.

Each block starts with a line containing the speaker; the rest is the text.

Try this:

from collections import defaultdict

script = """\
JOSH:
How do I know if this works?

MICHAEL:
You would know

JOSH:
But how? 

DAN:
How indeed? I don't really know. 


UNKNOWN: 
I am unknown
"""

line_blocks = script.split("\n\n")

wanted_names = {name.upper() + ":": name for name in ["Josh", "Michael", "Dan"]}

result = defaultdict(list)

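for block in line_blocks:
    # three or more newlines in a row leave a stray leading "\n" on the next block;
    # partition (unlike a two-way unpack of split) never raises on a block without "\n"
    name, _, text = block.lstrip("\n").partition("\n")
    name = name.strip()  # tolerate trailing spaces after the colon
    if name in wanted_names:
        result[wanted_names[name]].append(text)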

print(result["Josh"])
print(result["Michael"])
print(result["Dan"])

Output:

['How do I know if this works?', 'But how? ']
['You would know']
["How indeed? I don't really know. "]
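
The question's desired output splits Dan's turn into separate sentences ("How indeed?", "I don't really know."), while the result above keeps each turn as one string. A rough sketch of a sentence splitter that could be applied to each turn. The helper name `split_sentences` is hypothetical, and the rule (split after ".", "?" or "!" followed by whitespace) is a naive assumption that will mis-split abbreviations such as "Mr.":

```python
import re

def split_sentences(text):
    """Split after '.', '?' or '!' when followed by whitespace (naive heuristic)."""
    parts = re.split(r'(?<=[.?!])\s+', text.strip())
    return [p for p in parts if p]  # drop the empty string produced by empty input

turns = ["How indeed? I don't really know. "]  # the value of result["Dan"] above

# Flatten all turns into one list of sentences.
sentences = [s for turn in turns for s in split_sentences(turn)]
print(sentences)  # ['How indeed?', "I don't really know."]
```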

I am not really sure about your final structure, but if the format is very consistent, you can use a regex.

Here's my code:

import re

script = """
JOSH:
How do I know if this works?

MICHAEL:
You would know

JOSH:
But how? 

DAN:
How indeed? I don't really know. 

UNKNOWN:
I am unknown

"""
# This regex extracts two groups.
# The first is one or more word characters before the ":" (the character's name).
# The second is the single line that follows the name (up to the next newline).
# A raw string keeps "\w" from being treated as an invalid string escape.
matcher = re.compile(r"(\w+):\n(.*)\n")
groups_extracted = matcher.findall(script)

result = {}

for element in groups_extracted:
    # A little verbosity to make understanding easier
    author = element[0]
    line = element[1]
    if author in result:
        # In case the author name is already in the result dict
        # we just append a new line on his / her name
        result[author].append(line)
    else:
        # Otherwise the author name needs to be added to the dict
        # from scratch with his / her 1st line
        result[author] = [line]

print(result)

print(result['JOSH'])

{'JOSH': ['How do I know if this works?', 'But how? '], 'MICHAEL': ['You would know'], 'DAN': ["How indeed? I don't really know. "], 'UNKNOWN': ['I am unknown']}

['How do I know if this works?', 'But how? ']
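
The pattern above captures only the first line after each name, while the question says real turns have several lines. A sketch of a variant that captures the whole turn, assuming a speaker header is a line made only of capital letters followed by ":". The names `turn_re` and `wanted` are illustrative and not from the original answer:

```python
import re
from collections import defaultdict

script = """
JOSH:
How do I know if this works?
And here is another line for JOSH

MICHAEL:
You would know

UNKNOWN:
I am unknown
"""

# Assumption: a header is a line of capital letters followed by ":".
# The body is everything up to the next header or the end of the string.
turn_re = re.compile(
    r'^([A-Z]+):[ \t]*\n'        # header line
    r'(.*?)'                     # body, non-greedy, may span lines (re.S)
    r'(?=^[A-Z]+:[ \t]*$|\Z)',   # stop before the next header or at the end
    re.M | re.S,
)

wanted = {"JOSH", "MICHAEL"}  # speakers to keep (assumed)
result = defaultdict(list)

for m in turn_re.finditer(script):
    name, body = m.group(1), m.group(2)
    if name in wanted:
        # one list item per non-empty line of the turn
        result[name].extend(
            line.strip() for line in body.splitlines() if line.strip()
        )

print(dict(result))
```

Because the lookahead only accepts a name that sits alone on its line, a spoken line such as "NOTE: something" is not mistaken for a new speaker, although a line consisting solely of "OK:" would be.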
