
How can I count the number of words between 2 predefined words?

<replace-add>that i dont know you know cause</replace-add> i could help you with <replace-del>that oh</replace-del> <replace-add>us</replace-add> thanks so i just set up a ride <replace-del>for</replace-del> <replace-add>from</replace-add> my daughter <replace-del>tenah dyer</replace-del> <replace-add>clear dire</replace-add>

How can I count the exact number of words between <replace-add> and </replace-add> in a text?

Without using any libraries:

def get_tag_indexes(text, tag, start_tag):
    """Return the index of every occurrence of tag in text.

    If start_tag is True, return the index just past the tag (where the
    enclosed text begins); otherwise return the index of the tag itself.
    """
    tag_indexes = []
    start_index = -1

    while True:
        start_index = text.find(tag, start_index + 1)

        if start_index != -1:
            if start_tag:
                tag_indexes.append(start_index + len(tag))
            else:
                tag_indexes.append(start_index)
        else:
            return tag_indexes

text = """<replace-add>that i dont know you know cause</replace-add> i could help you with <replace-del>that oh</replace-del> <replace-add>us</replace-add> thanks so i just set up a ride <replace-del>for</replace-del> <replace-add>from</replace-add> my daughter <replace-del>tenah dyer</replace-del> <replace-add>clear dire</replace-add>"""

tag_starts = get_tag_indexes(text, "<replace-add>", True)
tag_ends = get_tag_indexes(text, "</replace-add>", False)

for start, end in zip(tag_starts, tag_ends):
    words = text[start:end].split()
    print("{} words - {}".format(len(words), words))

Giving you:

7 words - ['that', 'i', 'dont', 'know', 'you', 'know', 'cause']
1 words - ['us']
1 words - ['from']
2 words - ['clear', 'dire']

This uses a function to return a list of the locations of any given text. That list can then be used to extract the text between two tags.
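If only the total count matters, the two index lists can be folded into a single running total, still without any imports. A minimal sketch of that idea (the `count_words_between` helper name is my own, not from the answer above):

```python
def count_words_between(text, open_tag, close_tag):
    """Count the words in every open_tag ... close_tag span, using only str.find."""
    total = 0
    start = 0
    while True:
        start = text.find(open_tag, start)
        if start == -1:
            return total
        start += len(open_tag)  # skip past the opening tag
        end = text.find(close_tag, start)
        if end == -1:
            return total  # unclosed tag: ignore the trailing fragment
        total += len(text[start:end].split())
        start = end + len(close_tag)

text = "<replace-add>that i dont know</replace-add> x <replace-add>us</replace-add>"
print(count_words_between(text, "<replace-add>", "</replace-add>"))  # → 5
```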


As an alternative approach, this could also be done using BeautifulSoup:

from bs4 import BeautifulSoup

text = """<replace-add>that i dont know you know cause</replace-add> i could help you with <replace-del>that oh</replace-del> <replace-add>us</replace-add> thanks so i just set up a ride <replace-del>for</replace-del> <replace-add>from</replace-add> my daughter <replace-del>tenah dyer</replace-del> <replace-add>clear dire</replace-add>"""
soup = BeautifulSoup(text, "lxml")

for block in soup.find_all('replace-add'):
    words = block.text.split()
    print("{} words - {}".format(len(words), words))

Depending on how trusted the source is, you could do one of two things. Given that

source = """<replace-add>that i dont know you know cause</replace-add> i could help you with <replace-del>that oh</replace-del> <replace-add>us</replace-add> thanks so i just set up a ride <replace-del>for</replace-del> <replace-add>from</replace-add> my daughter <replace-del>tenah dyer</replace-del> <replace-add>clear dire</replace-add>"""

You could use regular expressions like so:

import re
from itertools import chain

word_pattern = re.compile(r"(?<=<replace-add>).*?(?=</replace-add>)")
re_words = list(chain.from_iterable(map(str.split, word_pattern.findall(source))))

This will only work if the source matches those tags exactly, with no attributes etc.
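A quick demonstration of that limitation: the fixed-width lookbehind only matches the bare tag, so an opening tag carrying any attribute (a hypothetical `class` attribute here, for illustration) produces no match at all:

```python
import re

word_pattern = re.compile(r"(?<=<replace-add>).*?(?=</replace-add>)")

# The bare tag matches as expected...
print(word_pattern.findall("<replace-add>hello world</replace-add>"))  # ['hello world']

# ...but an attribute on the opening tag defeats the lookbehind
print(word_pattern.findall('<replace-add class="x">hello world</replace-add>'))  # []
```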

The other option in the standard library is HTML parsing:

from html.parser import HTMLParser

class MyParser(HTMLParser):
    def get_words(self, html):
        self.read_words = False
        self.words = []
        self.feed(html)
        return self.words

    def handle_starttag(self, tag, attrs):
        if tag == "replace-add":
            self.read_words = True

    def handle_data(self, data):
        if self.read_words:
            self.words.extend(data.split())

    def handle_endtag(self, tag):
        if tag == "replace-add":
            self.read_words = False


parser = MyParser()
html_words = parser.get_words(source)

This approach will be more reliable, and probably a little more efficient, as it uses tools entirely focused on this task.

Now, doing

print(re_words)
print(html_words)

We get

['that', 'i', 'dont', 'know', 'you', 'know', 'cause', 'us', 'from', 'clear', 'dire']
['that', 'i', 'dont', 'know', 'you', 'know', 'cause', 'us', 'from', 'clear', 'dire']

(Of course, the len of this list is the number of words.)

If you strictly just require the number of words, you could instead keep a running total, adding len(data.split()) to it for each chunk of data encountered.
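That counting-only variant might look like the sketch below, reusing the parser structure above (the `WordCounter` class name is my own, for illustration):

```python
from html.parser import HTMLParser

class WordCounter(HTMLParser):
    """Keep a running word total for text inside <replace-add> tags."""
    def count_words(self, html):
        self.in_block = False
        self.total = 0
        self.feed(html)
        return self.total

    def handle_starttag(self, tag, attrs):
        if tag == "replace-add":
            self.in_block = True

    def handle_data(self, data):
        if self.in_block:
            self.total += len(data.split())  # running total, no word list kept

    def handle_endtag(self, tag):
        if tag == "replace-add":
            self.in_block = False

source = "<replace-add>that i dont know you know cause</replace-add> plus <replace-add>us</replace-add>"
print(WordCounter().count_words(source))  # → 8
```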

If you really can't use any imports, you will either have to make some sacrifices or implement your own regex engine/HTML parser. If this is a requirement of a homework assignment, you really should have shown some prior effort before posting the question.
