
Find capitalized words in a text

How can I find the words in a text that start with a capital letter, together with the position of each such word in the text? If no word with this attribute is found, the output should be None. Words at the beginning of a sentence should not be considered. Numbers should not be considered either, and if a word ends with a semicolon, that semicolon should be omitted.

Like the following example:

Input = The University of Edinburgh is a public research university in Edinburgh, Scotland. The University of Texas was included in the Association of American Universities in 1929.

Output:

2:University
4:Edinburgh
11:Edinburgh
12:Scotland
14:University
16:Texas
21:Association
23:American
24:Universities

Try this code.

You just have to use the .istitle() method on each word to check whether it starts with a capital letter and the rest of its letters are lowercase.

Using a regex, you can then extract the word without the symbols at its end (assuming you don't want to include symbols, since you mentioned ignoring a semicolon at the end of a word).

import re

inp = 'The University; of Edinburgh is a public research university in Edinburgh, Scotland. The University of Texas was included in the Association of American Universities in 1929'

def capitalized_words_in_a_text(inp):
    # Drop the first word (it starts the text) and number the remaining words from 2.
    lst = inp.split(' ')[1:]
    # Keep title-cased words; the regex keeps only the leading letters,
    # which drops trailing symbols such as ';' or ','.
    res = [f"{i}: {re.match(r'^[A-Za-z]+', j).group()}" for i, j in enumerate(lst, start=2) if j.istitle()]

    if len(res) == 0:
        return None
    return '\n'.join(res)

print(capitalized_words_in_a_text(inp))
print(capitalized_words_in_a_text(inp.lower()))  # no capitalized words, so this prints None

Output:

2: University
4: Edinburgh
11: Edinburgh
12: Scotland
13: The
14: University
16: Texas
21: Association
23: American
24: Universities
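None # this is from the inp.lower() line, as there are no capital letters in the string

Tell me if it's not working. Note that only the very first word of the text is skipped, so a word that starts a later sentence (13: The) still appears in the output.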
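One thing worth checking before relying on .istitle(): it requires every letter after the first to be lowercase, so all-caps or mixed-case words are rejected even though they start with a capital letter. The short comparison below is only an illustration; the sample tokens and the helper name starts_with_capital are made up for this sketch and are not part of the answer above.

```python
# Compare str.istitle() with a first-character check on a few sample tokens.
samples = ["University", "Edinburgh,", "USA", "McGill", "1929", "university"]

def starts_with_capital(token):
    # True when the first character is an uppercase letter; an empty string gives False.
    return token[:1].isupper()

for token in samples:
    print(f"{token!r}: istitle={token.istitle()}, starts_with_capital={starts_with_capital(token)}")
```

If acronyms or names like "McGill" should count as capitalized words, a first-character check is probably the closer match to the question; if they should not, .istitle() is the stricter filter.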

Here's the code. You can add any other character to the strip call and it will be removed from the end of the word. You can also change the last print to anything you want.

s1 = "The University of Edinburgh is a public research university in Edinburgh, Scotland. The University of Texas was included in the Association of American Universities in 1929."

words = s1.split()
n = []

for index, word in enumerate(words):
    if word[0].isupper():
        # Skip the first word of the text and any word whose previous word ends in "." (sentence start).
        if index == 0 or words[index-1][-1] == ".":
            continue
        print(f"""{index+1}:{word.strip(",.;:")}""")  # Python indices start at 0, so add one to get the numbers you requested
        n.append(word)  # this is just to be able to print something if no words have capital letters
if len(n) == 0:
    print("None")

"The words at the beginning of the sentence should not be considered"

This makes the process harder, because you first have to determine how the sentences are separated. A sentence can end with a punctuation mark such as ., ! or ?. But you did not close the last sentence in your example with a full stop. Your corpus must first be preprocessed for this purpose!
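As a rough illustration of that preprocessing step, the sketch below splits a text into sentences at whitespace that follows ., ! or ?, and keeps any trailing text without closing punctuation as a final sentence. The function name split_sentences is hypothetical, and the approach is deliberately naive: abbreviations such as "Dr." would be treated as sentence ends.

```python
import re

def split_sentences(text):
    # Split at whitespace that follows ., ! or ? (the punctuation stays with its sentence).
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    # Drop empty pieces, e.g. when the input is an empty string.
    return [part for part in parts if part]

text = "The University of Edinburgh is in Scotland. Is it old? Yes! It was founded in 1583"
for sentence in split_sentences(text):
    print(sentence)
```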


Putting this issue aside, suppose this scenario:

import re

inp = "The University of Edinburgh is a public research university in Edinburgh, Scotland. The University of Texas was included in the Association of American Universities in 1929! The last Sentence."

# Each sentence is a run of word characters, whitespace and commas closed by ., ! or ?
sentences = re.findall(r"[\w\s,]*[.!?]", inp)
counter = 0  # number of words seen in earlier sentences
for sentence in sentences:
    sentence = re.sub(r"\W", " ", sentence)  # replace punctuation with spaces
    words = sentence.split()  # split on whitespace; empty strings are dropped
    for i, word in enumerate(words):
        # Skip the first word of each sentence; report words that start with a capital letter.
        if i != 0 and re.match(r"[A-Z]", word):
            print("%d:%s" % (counter+i+1, word))
    counter += len(words)

This code produces the output you asked for. It is not best practice, but it is compact and simple. Note that every sentence in the input must end with a punctuation mark (., ! or ?), otherwise the regex will not pick it up.


The output:

2:University
4:Edinburgh
11:Edinburgh
12:Scotland
14:University
16:Texas
21:Association
23:American
24:Universities
29:Sentence
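To compare any of these approaches against the expected output in the question, it can help to wrap the logic in a function that returns a string instead of printing. The following is a minimal sketch under a few assumptions: words are separated by whitespace, a sentence ends when a token ends in ., ! or ?, and the name capitalized_words and the set of stripped punctuation characters are choices made here, not something the question specifies.

```python
def capitalized_words(text):
    """Return 'position:word' lines for capitalized words that do not start a sentence."""
    results = []
    sentence_start = True  # the very first word opens a sentence
    for position, token in enumerate(text.split(), start=1):
        # Remove surrounding punctuation such as a trailing ';' or ','.
        word = token.strip(",.;:!?\"'()")
        # Numbers fail the isupper() check, so they are ignored automatically.
        if word and word[0].isupper() and not sentence_start:
            results.append(f"{position}:{word}")
        # The next token starts a sentence if this one ends with ., ! or ?
        sentence_start = token.endswith((".", "!", "?"))
    return "\n".join(results) if results else "None"

question_input = (
    "The University of Edinburgh is a public research university in Edinburgh, Scotland. "
    "The University of Texas was included in the Association of American Universities in 1929."
)
print(capitalized_words(question_input))
print(capitalized_words(question_input.lower()))
```

Returning a string keeps the "None" case in one place and makes the function easy to assert against in a test.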
