
Looping through Python regex 'finditer' within a DataFrame

I am trying to obtain blocks of text (250 characters either side) of each occurrence of a word within a dataset. When I call the same code logic on a toy example:

import re

list_one = ['as','the','word']

text = 'This is sample text to test if this pythonic '\
       'program can serve as an indexing platform for '\
       'finding words in a paragraph. It can give '\
       'values as to where the word is located with the '\
       'different examples as stated'

#  find all occurrences of each word in list_one in the text above
for i in list_one:
  find_the_word = re.finditer(i, text)

  for match in find_the_word:
      print('start {}, end {}, search string \'{}\''.
          format(match.start(), match.end(), match.group()))

the code is able to detect the position of each occurrence of every item in the list with no issues. However, when I try to apply the same logic to a DataFrame using the 'apply' method, it returns the error TypeError: unhashable type: 'list'

Code:

import re
import pandas as pd


def find_text_blocks(text, unique_items):
  '''
  This function doesn't work as intended.
  '''
  empty_list = []

  for i in unique_items:
    find_the_word = re.finditer(i, text)

    for match in find_the_word:
      pos_all = match.start()
      x = slice(pos_all-350, pos_all+350)
      text_slice = text[x]
      empty_list.append(text_slice)
  
  return empty_list

dataset['text_blocks'] = dataset['text'].apply(find_text_blocks, unique_items=dataset['unique_terms'])

Each row of the dataset['unique_terms'] column contains a list, whilst each row of the dataset['text'] column contains a string.

Any guidance on how to return a list of strings within each row of dataset['text_blocks'] is appreciated. Thanks in advance :)

In the last line of your code, use unique_items=dataset['unique_terms'][0] instead of unique_items=dataset['unique_terms'], and it will work.

Explanation:

First, let us construct the dataset:

dataset = pd.DataFrame({'text': [text], 'unique_terms': [['as', 'the', 'word']]})

If we list that column, unique_terms:

list(dataset['unique_terms'])
Out[3]: [['as', 'the', 'word']]

This tells us that, in the last line of your code, instead of using

unique_items = dataset['unique_terms']

we should use

unique_items=dataset['unique_terms'][0]

and it will work.
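
To see why the original call fails: iterating over the Series dataset['unique_terms'] yields each row's value, which is the whole list, and re.finditer cannot take a list as its pattern, so the re module raises TypeError: unhashable type: 'list'. A minimal sketch (the names here are illustrative):

import re
import pandas as pd

s = pd.Series([['as', 'the', 'word']])   # same shape as dataset['unique_terms']

for item in s:
    print(type(item))                    # <class 'list'> - the whole list, not a string
    try:
        re.finditer(item, 'some text')   # a list is not a valid regex pattern
    except TypeError as err:
        print(err)                       # unhashable type: 'list'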

Finally, the full code I tested, with the modification in the last line, is:

import re
import pandas as pd

list_one = ['as', 'the', 'word']

text = 'This is sample text to test if this pythonic '\
       'program can serve as an indexing platform for '\
       'finding words in a paragraph. It can give '\
       'values as to where the word is located with the '\
       'different examples as stated'

dataset = pd.DataFrame({'text': [text], 'unique_terms': [list_one]})

def find_text_blocks(text, unique_items):
    '''
    Return slices of text around each occurrence of every term in unique_items.
    '''
    print(text, unique_items)  # debug: show the arguments that apply passes in
    empty_list = []

    for i in unique_items:
        find_the_word = re.finditer(i, text)

        for match in find_the_word:
            pos_all = match.start()
            x = slice(pos_all - 350, pos_all + 350)
            text_slice = text[x]
            empty_list.append(text_slice)
        print(empty_list)  # debug: show the slices collected so far
    return empty_list


dataset['text_blocks'] = dataset['text'].apply(find_text_blocks, unique_items=dataset['unique_terms'][0])

It works now.
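
One note beyond the answer above: dataset['unique_terms'][0] takes the list from the first row only, so every row of 'text' would be searched with that single list. If the real DataFrame has several rows, each with its own list in 'unique_terms', a row-wise apply pairs each text with the list stored in the same row. A sketch under that assumption:

dataset['text_blocks'] = dataset.apply(
    lambda row: find_text_blocks(row['text'], row['unique_terms']),
    axis=1,
)

With axis=1 the function receives one row at a time, so the unique_items keyword argument is no longer needed.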
