Looping through Python regex 'finditer' within a DataFrame

Question

I am trying to extract blocks of text (250 characters either side) around each occurrence of a word within a dataset. When I run the same logic on a toy example:

import re

list_one = ['as','the','word']

text = 'This is sample text to test if this pythonic '\
       'program can serve as an indexing platform for '\
       'finding words in a paragraph. It can give '\
       'values as to where the word is located with the '\
       'different examples as stated'

# find all occurrences of each word in list_one in the above text
for i in list_one:
  find_the_word = re.finditer(i, text)

  for match in find_the_word:
      print('start {}, end {}, search string \'{}\''.
          format(match.start(), match.end(), match.group()))

The code detects the position of each occurrence of every item in the list with no issues. However, when I try to apply the same logic to a DataFrame using the 'apply' method, it raises TypeError: unhashable type: 'list'.

Code:

import re
import pandas as pd


def find_text_blocks(text, unique_items):
  '''
  This function doesn't work as intended.
  '''
  empty_list = []

  for i in unique_items:
    find_the_word = re.finditer(i, text)

    for match in find_the_word:
      pos_all = match.start()
      x = slice(pos_all-350, pos_all+350)
      text_slice = text[x]
      empty_list.append(text_slice)
  
  return empty_list

dataset['text_blocks'] = dataset['text'].apply(find_text_blocks, unique_items = dataset['unique_terms'])

Each row of the dataset['unique_terms'] column contains a list, whilst each row of the dataset['text'] column contains a string.

Any guidance on how to return a list of strings within each row of dataset['text_blocks'] is appreciated. Thanks in advance:)

Answer 1

In the last line of your code, use unique_items=dataset['unique_terms'][0] instead of unique_items=dataset['unique_terms'], and it will work.

Explanation:

First, let us construct the dataset:

dataset = pd.DataFrame({'text':[text] ,'unique_terms':[['as','the','word']]})

If we list the unique_terms column:

list(dataset['unique_terms'])
Out[3]: [['as', 'the', 'word']]

This shows that the column holds one list per row, so in the last line of your code, instead of using

unique_items = dataset['unique_terms']

we should use

unique_items=dataset['unique_terms'][0]

and it will work.
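
To see where the error itself comes from, here is a minimal sketch (the one-row frame and the sample string are made up for illustration). Looping over a Series yields its cell values, so inside the function `i` is an entire list rather than a single term, and `re.finditer` cannot use a list as a pattern:

```python
import re
import pandas as pd

# Hypothetical one-row frame, mirroring the structure described in the question.
dataset = pd.DataFrame({'text': ['the word as given'],
                        'unique_terms': [['as', 'the', 'word']]})

# Looping over the column yields one cell per row, and each cell is a list.
cells = [cell for cell in dataset['unique_terms']]

try:
    # Same call the function makes when unique_items is the whole column.
    list(re.finditer(cells[0], dataset['text'][0]))
    raised = None
except TypeError as exc:
    raised = exc

print(type(cells[0]).__name__)   # the "pattern" is a list
print(type(raised).__name__)     # the exception class that was caught
```

Indexing with `[0]` hands the function the inner list of strings, which is why the loop then receives one term at a time.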

Finally, the full code I tested, with the modification in the last line, is:

import re
import pandas as pd

list_one = ['as', 'the', 'word']

text = 'This is sample text to test if this pythonic '\
       'program can serve as an indexing platform for '\
       'finding words in a paragraph. It can give '\
       'values as to where the word is located with the '\
       'different examples as stated'

dataset = pd.DataFrame({'text': [text], 'unique_terms': [list_one]})

def find_text_blocks(text, unique_items):
    '''
    Return the text around each match of each term in unique_items.
    '''
    print(text, unique_items)
    empty_list = []

    for i in unique_items:
        find_the_word = re.finditer(i, text)

        for match in find_the_word:
            pos_all = match.start()
            x = slice(pos_all - 350, pos_all + 350)
            text_slice = text[x]
            empty_list.append(text_slice)
        print(empty_list)
    return empty_list


dataset['text_blocks'] = dataset['text'].apply(find_text_blocks, unique_items=dataset['unique_terms'][0])

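It now runs without the TypeError. Note that dataset['unique_terms'][0] passes the first row's list to every row, so this matches the question exactly only when the dataset has a single row or all rows share the same terms.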
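
If every row should be searched with its own list, one option is a row-wise `apply` with `axis=1`. The following is only a sketch under a few assumptions that are not in the original post: the `window` parameter, the two-row sample frame, the use of `re.escape` (treating terms as literal text), and slicing from `match.end()` rather than `match.start()` are all illustrative choices. It also clamps the slice start at 0, because a negative start such as `pos_all - 350` counts from the end of the string and can return an empty or wrong block for matches near the beginning of a long text.

```python
import re
import pandas as pd


def find_text_blocks(text, unique_items, window=350):
    """Return the text around every match of every term (window chars each side)."""
    blocks = []
    for term in unique_items:
        # Assumption: terms are literal words, so escape any regex metacharacters.
        for match in re.finditer(re.escape(term), text):
            # Clamp at 0: a negative slice start would count from the end of the string.
            start = max(match.start() - window, 0)
            blocks.append(text[start:match.end() + window])
    return blocks


# Hypothetical two-row frame where each row has its own term list.
dataset = pd.DataFrame({
    'text': ['alpha beta gamma', 'one two three two'],
    'unique_terms': [['beta'], ['two', 'one']],
})

# axis=1 hands each row to the lambda, so text and terms stay paired.
dataset['text_blocks'] = dataset.apply(
    lambda row: find_text_blocks(row['text'], row['unique_terms'], window=2),
    axis=1,
)
print(dataset['text_blocks'].tolist())
```

With the default `window=350` this mirrors the slice size used in the question's function; the small `window=2` here just keeps the sample output readable.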

1 answer

Solution 1: accepted, 2020-11-26 06:47:48
