Extract keywords from sentences in a pandas text column using nltk and/or regex, and place the words as per-sentence groups in another column

Question

A pandas DataFrame of mostly structured data has two columns containing user-entered text narratives. Some narratives are poorly written. I want to extract keywords that occur in the same sentence within each narrative. The keywords are sometimes bigrams (fractured implant), but usually many non-keywords sit between them (implant was really fractured). Keywords form a pair only if they occur in the same sentence of the narrative, and a sentence can contain more than two keywords. Here's an example, plus my attempt.

import pandas as pd
import nltk

def get_keywords(x, y):
    tokens = nltk.tokenize.word_tokenize(x)
    keywords = [keyword for keyword in tokens if keyword in y]
    keywords_string = ', '.join(keywords)
    return keywords_string


text = ['after investigation it was found that plate was fractured.  It was a broken plate. patient had fractured his femur. ',
    'investigation took long.  upon xray the plate, which looked ok at first suffered breakage.',
    'it happend that the screws had all broken', 'it was sad.   fractured was the implant.',
    'this sentance has nothing. as does this one.  and this one too.',
    'nothing happening here though a bone was fractured. bone was broke too as was screw.']

df = pd.DataFrame(text, columns = ['Text'])

## These are the key words.  The pairs belong to separate lists--(items, modes) in 
## either order.  These lists tend to grow as more keywords are discovered.
items = ['implant', 'implants', 'plate', 'plates', 'screw', 'screws']
modes = ['broke', 'broken', 'break', 'breaks', 'breakage' , 'fracture', 'fractured']
other = ['bone', 'femur', 'ulna']

# the apply(lambda) is slow but I don't mind it.
df['items'] = df['Text'].apply(lambda x: get_keywords(x, items))
df['F Modes'] = df['Text'].apply(lambda x: get_keywords(x, modes)) 
df['other'] = df['Text'].apply(lambda x: get_keywords(x, other)) 

### After using loc to isolate rows of interest, go back and grab whole 
## sentence for review. It's shorter than reading everything. But this
## is what I'm hoping to reduce.

xxx = df['Text'].str.extractall(r"([^.]*?fracture[^.]*\.)").unstack()

This takes a lot of effort and iteration. Pulling only the sentences that contain keywords means less reading than going through everything, but it's still a lot of work. QUESTION: is it possible to look within each sentence, grab only the words of interest, keep them in order, and place them as groups in a summary column, dropping all words in between the keywords of interest? Indices must be preserved because this text data will be merged with another df on the index.

The desired df would look like this:

text = [['after investigation it was found that plate was fractured.  It was a broken plate. patient had fractured his femur. ', 'plate fractured, broken plate, fractured femur'],
        ['investigation took long.  upon xray the plate, which looked ok at first suffered breakage.', 'plate breakage'],
        ['it happened that the screws had all broken', 'screws broken'],
        ['it was sad.   fractured was the implant.', 'fractured implant'],
        ['this sentence has nothing. as does this one.  and this one too.', ''],
        ['nothing happening here. though a bone was fractured. bone was broke too as was screw.', 'bone fractured, bone broke screw']]

df = pd.DataFrame(text, columns = ['Text', 'Summary'])
df

Answer 1

You could try tokenizing the text before extracting the keywords:

import pandas as pd
import nltk
import numpy as np
from more_itertools import split_after

nltk.download('punkt')  # tokenizer models for word_tokenize; newer NLTK releases may need 'punkt_tab' instead

text = ['after investigation it was found that plate was fractured.  It was a broken plate. patient had fractured his femur. ', 
    'investigation took long.  upon xray the plate, which looked ok at first suffered breakage.',
    'it happend that the screws had all broken', 'it was sad.   fractured was the implant.',
    'this sentance has nothing. as does this one.  and this one too.',
    'nothing happening here though a bone was fractured. bone was broke too as was screw.']

def tokenize(texts):
  return [nltk.tokenize.word_tokenize(t) for t in texts]

Afterwards, you can extract the keywords as a new column (here I am extracting the keywords from each sentence):

def key_word_intersection(df):
  summaries = []
  for x in tokenize(df['Text'].to_numpy()):
    keywords = np.concatenate([
                                np.intersect1d(x, ['implant', 'implants', 'plate', 'plates', 'screw', 'screws']),
                                np.intersect1d(x, ['broke', 'broken', 'break', 'breaks', 'breakage' , 'fracture', 'fractured']), 
                                np.intersect1d(x, ['bone', 'femur', 'ulna' ])])

    dot_sep_sentences = np.array(list(split_after(x, lambda i: i == ".")), dtype=object)
    summary = []
    for i, s in enumerate(dot_sep_sentences):
      summary.append([dot_sep_sentences[i][j] for j, keyword in enumerate(s) if keyword in keywords ])
    summaries.append(', '.join([' '.join(x) for x in summary if x]))
  return summaries

df = pd.DataFrame(text, columns = ['Text'])
df['Summary'] = key_word_intersection(df)

|    | Text                                                                                                                | Summary                                        |
|---:|:--------------------------------------------------------------------------------------------------------------------|:-----------------------------------------------|
|  0 | after investigation it was found that plate was fractured.  It was a broken plate. patient had fractured his femur. | plate fractured, broken plate, fractured femur |
|  1 | investigation took long.  upon xray the plate, which looked ok at first suffered breakage.                          | plate breakage                                 |
|  2 | it happend that the screws had all broken                                                                           | screws broken                                  |
|  3 | it was sad.   fractured was the implant.                                                                            | fractured implant                              |
|  4 | this sentance has nothing. as does this one.  and this one too.                                                     |                                                |
|  5 | nothing happening here though a bone was fractured. bone was broke too as was screw.                                | bone fractured, bone broke screw               |
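
For comparison, the same per-sentence grouping can be sketched with only the standard library. This is not the answer's method: it swaps `nltk.tokenize.word_tokenize` for a naive regex split, and the names `KEYWORDS` and `summarize` are hypothetical. It assumes sentences end in `.`, `!` or `?` and that lowercasing the text is acceptable.

```python
import re

# Keyword lists copied from the question ('femur' uses the answer's spelling).
ITEMS = ['implant', 'implants', 'plate', 'plates', 'screw', 'screws']
MODES = ['broke', 'broken', 'break', 'breaks', 'breakage', 'fracture', 'fractured']
OTHER = ['bone', 'femur', 'ulna']
KEYWORDS = set(ITEMS + MODES + OTHER)  # set membership tests are fast


def summarize(text, keywords=KEYWORDS):
    """Return the keywords of each sentence, in order, as 'a b, c d'."""
    groups = []
    # Naive sentence split; abbreviations such as "Dr." would also split here.
    for sentence in re.split(r'[.!?]+', text):
        words = re.findall(r'[a-z]+', sentence.lower())
        hits = [w for w in words if w in keywords]
        if hits:  # skip sentences without any keyword
            groups.append(' '.join(hits))
    return ', '.join(groups)
```

Because the function works on one string at a time, something like `df['Summary'] = df['Text'].apply(summarize)` should leave the DataFrame index untouched, which matters for the later merge.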

If you do not want sentence-separated keywords but still want to maintain their order, you could just do:

def key_word_intersection(df):
  summaries = []
  for x in tokenize(df['Text'].to_numpy()):
    keywords = np.concatenate([
                                np.intersect1d(x, ['implant', 'implants', 'plate', 'plates', 'screw', 'screws']),
                                np.intersect1d(x, ['broke', 'broken', 'break', 'breaks', 'breakage' , 'fracture', 'fractured']), 
                                np.intersect1d(x, ['bone', 'femur', 'ulna' ])])
    summaries.append(np.array(x)[[i for i, keyword in enumerate(x) if keyword in keywords]])
  return summaries

df = pd.DataFrame(text, columns = ['Text'])
df['Summary'] = key_word_intersection(df)

|    | Text                                                                                                                | Summary                                                    |
|---:|:--------------------------------------------------------------------------------------------------------------------|:-----------------------------------------------------------|
|  0 | after investigation it was found that plate was fractured.  It was a broken plate. patient had fractured his femur. | ['plate' 'fractured' 'broken' 'plate' 'fractured' 'femur'] |
|  1 | investigation took long.  upon xray the plate, which looked ok at first suffered breakage.                          | ['plate' 'breakage']                                       |
|  2 | it happend that the screws had all broken                                                                           | ['screws' 'broken']                                        |
|  3 | it was sad.   fractured was the implant.                                                                            | ['fractured' 'implant']                                    |
|  4 | this sentance has nothing. as does this one.  and this one too.                                                     | []                                                         |
|  5 | nothing happening here though a bone was fractured. bone was broke too as was screw.                                | ['bone' 'fractured' 'bone' 'broke' 'screw']                |
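
The question also says that keywords only count as a pair when they share a sentence, with one word from `items` and one from `modes`. Neither variant above enforces that, so here is a minimal sketch of such a filter. The function name `item_mode_groups` is hypothetical, the sentence split is a naive regex rather than nltk, and the `other` list is deliberately left out.

```python
import re

# Item and failure-mode lists copied from the question.
ITEMS = {'implant', 'implants', 'plate', 'plates', 'screw', 'screws'}
MODES = {'broke', 'broken', 'break', 'breaks', 'breakage', 'fracture', 'fractured'}


def item_mode_groups(text):
    """Return one keyword list per sentence that has both an item and a mode."""
    groups = []
    for sentence in re.split(r'[.!?]+', text):  # naive sentence split
        words = re.findall(r'[a-z]+', sentence.lower())
        hits = [w for w in words if w in ITEMS or w in MODES]
        has_item = any(w in ITEMS for w in hits)
        has_mode = any(w in MODES for w in hits)
        if has_item and has_mode:  # drop sentences with only one kind
            groups.append(hits)
    return groups
```

With this filter, a sentence such as "patient had fractured his femur" is dropped because it contains a mode but no item, which may or may not be what you want for review purposes.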

Answer 1: 2, accepted, 2022-02-05 12:29:25
