Looping through Python regex 'finditer' within a DataFrame
I am trying to obtain blocks of text (250 characters on either side) around each occurrence of a word within dataset. When I run the same logic on a toy example:
import re

list_one = ['as', 'the', 'word']
text = 'This is sample text to test if this pythonic '\
       'program can serve as an indexing platform for '\
       'finding words in a paragraph. It can give '\
       'values as to where the word is located with the '\
       'different examples as stated'

# find all occurrences of each word in list_one in the above text
for i in list_one:
    find_the_word = re.finditer(i, text)
    for match in find_the_word:
        print('start {}, end {}, search string \'{}\''.
              format(match.start(), match.end(), match.group()))
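One thing worth noting about the toy example: `re.finditer(i, text)` treats each list item as a regular expression, so a short term such as 'as' also matches inside longer words like 'has'. The following is a minimal sketch, assuming whole-word matches are what is wanted. The helper name `find_whole_word` and the sample sentence are hypothetical and not part of the original code.

```python
import re

def find_whole_word(term, text):
    """Return (start, end) spans where `term` occurs as a whole word."""
    # re.escape keeps characters such as '.' or '+' in the term literal;
    # \b anchors the match at word boundaries.
    pattern = r'\b' + re.escape(term) + r'\b'
    return [(m.start(), m.end()) for m in re.finditer(pattern, text)]

sample = 'he has as many as she was given'

# Plain substring-style search also hits 'has' and 'was'.
plain_spans = [(m.start(), m.end()) for m in re.finditer('as', sample)]
# Whole-word search only hits the two standalone occurrences of 'as'.
whole_spans = find_whole_word('as', sample)

print(len(plain_spans), len(whole_spans))
```

Whether substring or whole-word matching is the right choice depends on the data, so this is an option rather than a required change.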
the code detects the position of each occurrence of every item in the list with no issues. However, when I try to apply the same logic to a DataFrame using the 'apply' method, it raises TypeError: unhashable type: 'list'
Code:
import re
import pandas as pd

def find_text_blocks(text, unique_items):
    '''
    This function doesn't work as intended.
    '''
    empty_list = []
    for i in unique_items:
        find_the_word = re.finditer(i, text)
        for match in find_the_word:
            pos_all = match.start()
            x = slice(pos_all-350, pos_all+350)
            text_slice = text[x]
            empty_list.append(text_slice)
    return empty_list

dataset['text_blocks'] = dataset['text'].apply(find_text_blocks, unique_items = dataset['unique_terms'])
Each row of the dataset['unique_terms'] column contains a list, whilst each row of the dataset['text'] column contains a string.
Any guidance on how to return a list of strings within each row of dataset['text_blocks'] is appreciated. Thanks in advance :)
In the last line of your code, use unique_items=dataset['unique_terms'][0] instead of unique_items=dataset['unique_terms'], and it will work.
Explanation:
First, let us construct the dataset:
dataset = pd.DataFrame({'text':[text] ,'unique_terms':[['as','the','word']]})
If we list the unique_terms column:
list(dataset['unique_terms'])
Out[3]: [['as', 'the', 'word']]
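This shows that the column holds one list per row. Passing the whole column as unique_items therefore makes the loop inside find_text_blocks hand an entire list to re.finditer as the pattern, which raises the TypeError. So, in the last line of your code, instead of using
unique_items = dataset['unique_terms']
we should use
unique_items=dataset['unique_terms'][0]
and it will work.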
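The cause of the error can be reproduced in isolation. The sketch below is illustrative only: the shortened `text` and the variable names `first_item` and `error_message` are assumptions made for the demonstration. It shows that iterating over the column yields a list, and that passing a list to `re.finditer` as the pattern raises the `TypeError`.

```python
import re
import pandas as pd

text = 'values as to where the word is located'
dataset = pd.DataFrame({'text': [text],
                        'unique_terms': [['as', 'the', 'word']]})

# Iterating over the column yields one element per row,
# and each element is itself a list of terms.
first_item = next(iter(dataset['unique_terms']))
print(type(first_item))

# Using that list as the regex pattern fails, because the re module
# tries to look the pattern up in its cache and a list cannot be hashed.
try:
    re.finditer(first_item, text)
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(error_message)
```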
Finally, the full code I tested, with the modification in the last line, is:
import re
import pandas as pd

list_one = ['as', 'the', 'word']
text = 'This is sample text to test if this pythonic '\
       'program can serve as an indexing platform for '\
       'finding words in a paragraph. It can give '\
       'values as to where the word is located with the '\
       'different examples as stated'
dataset = pd.DataFrame({'text': [text], 'unique_terms': [list_one]})

def find_text_blocks(text, unique_items):
    '''
    Collect a slice of text around each match of each item in unique_items.
    '''
    print(text, unique_items)
    empty_list = []
    for i in unique_items:
        find_the_word = re.finditer(i, text)
        for match in find_the_word:
            pos_all = match.start()
            # Clamp at 0: a negative slice start would count from the end of the string.
            x = slice(max(pos_all - 350, 0), pos_all + 350)
            text_slice = text[x]
            empty_list.append(text_slice)
    print(empty_list)
    return empty_list

dataset['text_blocks'] = dataset['text'].apply(find_text_blocks, unique_items=dataset['unique_terms'][0])
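It works now. Note that [0] selects the first row's list and passes it to every row, so this is only correct when all rows share the same terms.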
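If each row has its own list of terms, one option is to pair every text with the term list from the same row. This is a minimal sketch under stated assumptions, not the answerer's tested code: the two-row example data is made up, the `width` parameter is introduced for illustration (set to 3 here so the output stays short), and `re.escape` assumes the terms are literal words rather than regex patterns.

```python
import re
import pandas as pd

def find_text_blocks(text, terms, width=250):
    """Return the text around each occurrence of each term."""
    blocks = []
    for term in terms:
        for match in re.finditer(re.escape(term), text):
            # Clamp at 0: a negative slice start would count from the end.
            start = max(match.start() - width, 0)
            blocks.append(text[start:match.end() + width])
    return blocks

# Hypothetical two-row dataset with a different term list in each row.
dataset = pd.DataFrame({
    'text': ['alpha beta gamma', 'one two three two'],
    'unique_terms': [['beta'], ['two', 'one']],
})

# zip pairs each row's text with that row's own list of terms.
dataset['text_blocks'] = pd.Series(
    [find_text_blocks(text, terms, width=3)
     for text, terms in zip(dataset['text'], dataset['unique_terms'])],
    index=dataset.index,
)

print(dataset['text_blocks'].tolist())
```

A list comprehension over `zip` is used here instead of `Series.apply` because `apply` on a single column only sees the text, not the matching entry of the other column.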