Extract keywords from sentences in a pandas text column using nltk and/or regex, and place the words in another column as groups per sentence

A pandas data frame of mostly structured data has 2 columns containing user-entered text narratives. Some narratives are poorly written. I'm looking to extract keywords that occur in the same sentence within each narrative. The keywords are sometimes adjacent bigrams (fractured implant), but usually many non-keywords sit between them (implant was really fractured). Keywords only form a group if they occur in the same sentence within the narrative, and a sentence can contain more than 2 keywords. Here's an example, plus my attempt.

import pandas as pd
import nltk

def get_keywords(x, y):
    tokens = nltk.tokenize.word_tokenize(x)
    keywords = [keyword for keyword in tokens if keyword in y]
    keywords_string = ', '.join(keywords)
    return keywords_string


text = ['after investigation it was found that plate was fractured.  It was a broken plate. '
        'patient had fractured his femur. ',
        'investigation took long.  upon xray the plate, which looked ok at first suffered '
        'breakage.',
        'it happend that the screws had all broken',
        'it was sad.   fractured was the implant.',
        'this sentance has nothing. as does this one.  and this one too.',
        'nothing happening here though a bone was fractured. bone was broke too as was screw.']

df = pd.DataFrame(text, columns = ['Text'])

## These are the key words.  The pairs belong to separate lists--(items, modes) in 
## either order.  These lists tend to grow as more keywords are discovered.
items = ['implant', 'implants', 'plate', 'plates', 'screw', 'screws']
modes = ['broke', 'broken', 'break', 'breaks', 'breakage' , 'fracture', 'fractured']
other = ['bone', 'femur', 'ulna']

# the apply(lambda) is slow but I don't mind it.
df['items'] = df['Text'].apply(lambda x: get_keywords(x, items))
df['F Modes'] = df['Text'].apply(lambda x: get_keywords(x, modes)) 
df['other'] = df['Text'].apply(lambda x: get_keywords(x, other)) 

### After using loc to isolate rows of interest, go back and grab whole 
## sentence for review. It's shorter than reading everything. But this
## is what I'm hoping to reduce.

xxx = df['Text'].str.extractall(r"([^.]*?fracture[^.]*\.)").unstack()


This takes a lot of effort and iteration. Pulling only the sentences that contain keywords means less reading than going through everything, but it's still a lot of work. QUESTION: is it possible to look within each sentence, grab only the words of interest, keep them in order, and place them as groups in a summary column? All words between the keywords of interest should be dropped. Indices have to be preserved because this text data will be merged with another df on the index.

The desired df would look like this:

text = [['after investigation it was found that plate was fractured.  It was a broken plate. '
         'patient had fractured his femur. ', 'plate fractured, broken plate, fractured femur'],
        ['investigation took long.  upon xray the plate, which looked ok at first suffered '
         'breakage.', 'plate breakage'],
        ['it happened that the screws had all broken', 'screws broken'],
        ['it was sad.   fractured was the implant.', 'fractured implant'],
        ['this sentence has nothing. as does this one.  and this one too.', ''],
        ['nothing happening here. though a bone was fractured. bone was broke too as was '
         'screw.', 'bone fractured, bone broke screw']]

df = pd.DataFrame(text, columns = ['Text', 'Summary'])
df


You could try tokenizing the text before extracting the keywords:

import pandas as pd
import nltk
import numpy as np
from more_itertools import split_after

nltk.download('punkt')  # tokenizer data for word_tokenize; recent NLTK releases may ask for 'punkt_tab' instead

text = ['after investigation it was found that plate was fractured.  It was a broken plate. patient had fractured his femur. ', 
    'investigation took long.  upon xray the plate, which looked ok at first suffered breakage.',
    'it happend that the screws had all broken', 'it was sad.   fractured was the implant.',
    'this sentance has nothing. as does this one.  and this one too.',
    'nothing happening here though a bone was fractured. bone was broke too as was screw.']

def tokenize(texts):
  return [nltk.tokenize.word_tokenize(t) for t in texts]

Afterwards, you can extract the keywords into a new column (here the keywords are grouped per sentence):

def key_word_intersection(df):
  summaries = []
  for x in tokenize(df['Text'].to_numpy()):
    # Keywords that actually occur in this narrative, across all three lists
    keywords = np.concatenate([
                                np.intersect1d(x, ['implant', 'implants', 'plate', 'plates', 'screw', 'screws']),
                                np.intersect1d(x, ['broke', 'broken', 'break', 'breaks', 'breakage' , 'fracture', 'fractured']), 
                                np.intersect1d(x, ['bone', 'femur', 'ulna' ])])

    # Split the token list into sentences, each one ending at a "." token
    dot_sep_sentences = list(split_after(x, lambda i: i == "."))
    # Per sentence, keep only the keywords, in their original order
    summary = [[token for token in s if token in keywords] for s in dot_sep_sentences]
    # Words within a sentence are joined by spaces, sentences by commas;
    # sentences without any keyword are skipped
    summaries.append(', '.join([' '.join(words) for words in summary if words]))
  return summaries

df = pd.DataFrame(text, columns = ['Text'])
df['Summary'] = key_word_intersection(df)
|    | Text                                                                                                                | Summary                                        |
|---:|:--------------------------------------------------------------------------------------------------------------------|:-----------------------------------------------|
|  0 | after investigation it was found that plate was fractured.  It was a broken plate. patient had fractured his femur. | plate fractured, broken plate, fractured femur |
|  1 | investigation took long.  upon xray the plate, which looked ok at first suffered breakage.                          | plate breakage                                 |
|  2 | it happend that the screws had all broken                                                                           | screws broken                                  |
|  3 | it was sad.   fractured was the implant.                                                                            | fractured implant                              |
|  4 | this sentance has nothing. as does this one.  and this one too.                                                     |                                                |
|  5 | nothing happening here though a bone was fractured. bone was broke too as was screw.                                | bone fractured, bone broke screw               |
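
If installing nltk and more_itertools is not an option, the same per-sentence grouping can be approximated with the standard `re` module alone. The following is only a sketch under some assumptions that are not part of the answer above: the helper name `summarize` is made up for illustration, sentences are assumed to end at `.`, `!` or `?`, and the text is lower-cased before matching, so capitalised keywords are also caught (the tokenizer-based version is case-sensitive).

```python
import re

# Keyword lists from the question, merged into one set for fast lookup.
ITEMS = {'implant', 'implants', 'plate', 'plates', 'screw', 'screws'}
MODES = {'broke', 'broken', 'break', 'breaks', 'breakage', 'fracture', 'fractured'}
OTHER = {'bone', 'femur', 'ulna'}
KEYWORDS = ITEMS | MODES | OTHER


def summarize(text, keywords=KEYWORDS):
    """Return the keywords of each sentence, in order, as 'a b, c d'."""
    groups = []
    # Simplification: every run of '.', '!' or '?' ends a sentence.
    for sentence in re.split(r'[.!?]+', text):
        # Lower-case the sentence and pull out its alphabetic words.
        words = re.findall(r"[a-z']+", sentence.lower())
        hits = [w for w in words if w in keywords]
        if hits:  # skip sentences without any keyword
            groups.append(' '.join(hits))
    return ', '.join(groups)


print(summarize('it was sad.   fractured was the implant.'))
```

Applied to the data frame, this would be `df['Summary'] = df['Text'].apply(summarize)`. Abbreviations such as "approx." would be treated as sentence ends here, which is one reason a real sentence tokenizer may still be the better choice for messy narratives.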

If you do not want the keywords separated by sentence, but still want to maintain their order, you could just do:

def key_word_intersection(df):
  summaries = []
  for x in tokenize(df['Text'].to_numpy()):
    keywords = np.concatenate([
                                np.intersect1d(x, ['implant', 'implants', 'plate', 'plates', 'screw', 'screws']),
                                np.intersect1d(x, ['broke', 'broken', 'break', 'breaks', 'breakage' , 'fracture', 'fractured']), 
                                np.intersect1d(x, ['bone', 'femur', 'ulna' ])])
    summaries.append(np.array(x)[[i for i, keyword in enumerate(x) if keyword in keywords]])
  return summaries

df = pd.DataFrame(text, columns = ['Text'])
df['Summary'] = key_word_intersection(df)
|    | Text                                                                                                                | Summary                                                    |
|---:|:--------------------------------------------------------------------------------------------------------------------|:-----------------------------------------------------------|
|  0 | after investigation it was found that plate was fractured.  It was a broken plate. patient had fractured his femur. | ['plate' 'fractured' 'broken' 'plate' 'fractured' 'femur'] |
|  1 | investigation took long.  upon xray the plate, which looked ok at first suffered breakage.                          | ['plate' 'breakage']                                       |
|  2 | it happend that the screws had all broken                                                                           | ['screws' 'broken']                                        |
|  3 | it was sad.   fractured was the implant.                                                                            | ['fractured' 'implant']                                    |
|  4 | this sentance has nothing. as does this one.  and this one too.                                                     | []                                                         |
|  5 | nothing happening here though a bone was fractured. bone was broke too as was screw.                                | ['bone' 'fractured' 'bone' 'broke' 'screw']                |
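
The question also requires that the index survives, because the text data is later merged with another data frame on it. One way to make that explicit is to build the summary with `Series.apply`, which returns a Series carrying the same index as its input. The sketch below is illustrative only: `ordered_keywords`, the shortened keyword set, the index labels and the `other` frame with its `Device` column are all invented for the example, and a simple regex stands in for the nltk tokenizer.

```python
import re

import pandas as pd

# Shortened keyword set, for illustration only.
KEYWORDS = {'implant', 'plate', 'screw', 'screws',
            'broke', 'broken', 'fractured', 'bone'}


def ordered_keywords(text):
    # Lower-case, pull out alphabetic words, keep only keywords in order.
    words = re.findall(r'[a-z]+', text.lower())
    return ' '.join(w for w in words if w in KEYWORDS)


# A non-default index, to show that it is carried through unchanged.
df = pd.DataFrame(
    {'Text': ['it happend that the screws had all broken',
              'it was sad.   fractured was the implant.']},
    index=[101, 205],
)

# Series.apply returns a Series with the same index as df['Text'].
df['Summary'] = df['Text'].apply(ordered_keywords)

# Hypothetical second frame that shares the index, in a different row order.
other = pd.DataFrame({'Device': ['hip implant', 'screw set']}, index=[205, 101])

# DataFrame.join aligns on the index by default (left join).
merged = df.join(other)
print(merged)
```

Assigning a plain list, as the functions above do, also works because the list is matched to the rows by position. Going through `apply` just avoids having to rely on the row order staying fixed.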

