Split sentences, process words, and put sentence back together?

I have a function that scores words. I have lots of text, ranging from single sentences to documents several pages long. I'm stuck on how to score the words and return the text in something close to its original state.

Here's an example sentence:

"My body lies over the ocean, my body lies over the sea."

What I want to produce is the following:

"My body (2) lies over the ocean (3), my body (2) lies over the sea."

Below is a dummy version of my scoring algorithm. I've figured out how to take text, tear it apart and score it.

However, I'm stuck on how to put it back together into the format I need.

Here's a dummy version of my function:

from textblob import TextBlob
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()  # assumption: the post never shows how lemmatizer is created

def word_score(text):
    words_to_work_with = []
    words_to_return = []
    passed_text = TextBlob(text)
    for word in passed_text.words:
        word = word.singularize().lower()
        word = str(word)
        e_word_lemma = lemmatizer.lemmatize(word)
        words_to_work_with.append(e_word_lemma)
    for word in words_to_work_with:
        if word == 'body':
            score = 2
        elif word == 'ocean':
            score = 3
        else:
            score = None
        words_to_return.append((word, score))
    return words_to_return

I'm a relative newbie so I have two questions:

  1. How can I put the text back together, and
  2. Should that logic be put into the function or outside of it?

I'd really like to be able to feed entire segments (i.e. sentences, documents) into the function and have it return them.

Thank you for helping me!

So basically, you want to assign a score to each word. The function you give can be improved by using a dictionary instead of several if statements. Also, the function has to return all the scores, not just the score of the first word in words_to_work_with, which is what happens if it returns an integer on the first iteration. So the new function would be:

def word_score(text):
    words_to_work_with = []
    passed_text = TextBlob(text)
    for word in passed_text.words:
        word = word.singularize().lower()
        word = str(word) # Is this line really useful ?
        e_word_lemma = lemmatizer.lemmatize(word)
        words_to_work_with.append(e_word_lemma)

    dict_scores = {'body': 2, 'ocean': 3}  # etc ...
    # one score per word; if a word is not recognized, its score is None
    return [dict_scores.get(word, None) for word in words_to_work_with]

For the second part, which is reconstructing the string, I would actually do this in the same function (so this answers your second question):

def word_score_and_reconstruct(text):
    words_to_work_with = []
    passed_text = TextBlob(text)

    reconstructed_text = ''

    for word in passed_text.words:
        word = word.singularize().lower()
        word = str(word)  # Is this line really useful ?
        e_word_lemma = lemmatizer.lemmatize(word)
        words_to_work_with.append(e_word_lemma)

    dict_scores = {'body': 2, 'ocean': 3}
    dict_strings = {'body': ' (2)', 'ocean': ' (3)'}

    word_scores = []

    for word in words_to_work_with:
        word_scores.append(dict_scores.get(word, None)) # we still construct the scores list here

        # we add 'word'+'(word's score)', only if the word has a score
        # if not, we add the default value '' meaning we don't add anything
        # separate the words with a space so they do not run together
        reconstructed_text += word + dict_strings.get(word, '') + ' '

    return reconstructed_text.strip(), word_scores

I can't guarantee this code will work on the first try, since I can't test it, but it should give you the main idea.

Hope this helps. Based on your question, this worked for me.

Best regards!!

"""
Python 3.7.2

Input:
Saved text in the file named as "original_text.txt"
My body lies over the ocean, my body lies over the sea. 
"""
output_text = []

# Context managers close both files even if an error occurs.
# original_text.txt is read; processed_text.txt receives the output.
with open('original_text.txt', 'r') as input_file, \
        open('processed_text.txt', 'w') as output_file:
    for line in input_file:
        words = line.split()
        for word in words:
            if word == 'body':
                output_text.append('body (2)')
                output_file.write('body (2) ')
            elif word == 'body,':
                output_text.append('body (2),')
                output_file.write('body (2), ')
            elif word == 'ocean':
                output_text.append('ocean (3)')
                output_file.write('ocean (3) ')
            elif word == 'ocean,':
                output_text.append('ocean (3),')
                output_file.write('ocean (3), ')
            else:
                output_text.append(word)
                output_file.write(word + ' ')

print(output_text)
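
The branches above list each word twice, once bare and once with a trailing comma. As a rough sketch of the same idea, the trailing punctuation can be split off first so that a single lookup covers both cases. The names `SCORES` and `annotate_line` are made up for this illustration, and only the two scores come from the question. Note that runs of whitespace collapse to single spaces.

```python
import string

# Hypothetical score table; only these two values appear in the question.
SCORES = {'body': 2, 'ocean': 3}

def annotate_line(line):
    """Append ' (score)' to each scored word, keeping trailing punctuation."""
    annotated = []
    for token in line.split():
        core = token.rstrip(string.punctuation)  # word without trailing punctuation
        tail = token[len(core):]                 # the punctuation that was split off
        score = SCORES.get(core.lower())
        if score is not None:
            annotated.append(f'{core} ({score}){tail}')
        else:
            annotated.append(token)
    return ' '.join(annotated)

print(annotate_line('My body lies over the ocean, my body lies over the sea.'))
```

Adding another scored word then only means adding one entry to `SCORES`.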

Here's a working implementation. The function first parses the input text into a list, such that each list element is either a word or a run of punctuation characters (e.g. a comma followed by a space). Once the words in the list have been processed, it joins the list back into a string and returns it.

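import re

import inflection
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()  # assumption: the answer does not say which lemmatizer it uses

def word_score(text):
    words_to_work_with = re.findall(r"\b\w+|\b\W+",text)
    for i,word in enumerate(words_to_work_with):
        if word.isalpha():
            # note: the lemmatize call below overwrites this result
            words_to_work_with[i] = inflection.singularize(word).lower()
            words_to_work_with[i] = lemmatizer.lemmatize(word)
            if word == 'body':
                words_to_work_with[i] = 'body (2)'
            elif word == 'ocean':
                words_to_work_with[i] = 'ocean (3)'
    return ''.join(words_to_work_with)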

txt = "My body lies over the ocean, my body lies over the sea."
output = word_score(txt)
print(output)

Output:

My body (2) lie over the ocean (3), my body (2) lie over the sea.

If you have more than 2 words that you want to score, using a dictionary instead of if conditions is indeed a good idea.
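
A minimal sketch of that dictionary version follows, under a few assumptions. `SCORES`, `normalize`, and `annotate` are hypothetical names, and `normalize` only lower-cases; it stands in for the singularize/lemmatize step so the example runs without third-party libraries. It also uses a slightly different pattern, `\w+|\W+`, because the `\b`-anchored pattern appears to skip punctuation at the very start of the text. The score is looked up on the normalized form, while the original spelling goes back into the text.

```python
import re

# Hypothetical score table; only these two values appear in the question.
SCORES = {'body': 2, 'ocean': 3}

def normalize(word):
    # Placeholder for singularizing / lemmatizing; here it only lower-cases.
    return word.lower()

def annotate(text, scores=SCORES):
    # Split into alternating runs of word and non-word characters,
    # so joining the pieces reproduces every character of the input.
    tokens = re.findall(r"\w+|\W+", text)
    for i, token in enumerate(tokens):
        score = scores.get(normalize(token))
        if score is not None:
            tokens[i] = f"{token} ({score})"
    return ''.join(tokens)

print(annotate("My body lies over the ocean, my body lies over the sea."))
```

Because the words are never replaced by their normalized forms, "lies" stays "lies" in the returned text.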
