Is there a way to get the values from a list that matched the values in a Pandas DataFrame?
I have word lists like these:
words1 = ['hi','my']
words2 = ['name','is']
I have a DataFrame df like this:
id Sentence
0 'my name was'
1 'hi i am'
2 'my phone is'
3 'what is this'
4 'her name was'
I am running the following code to get the index of the DataFrame rows whose values match.
matched_idx1 = df.loc[df.Sentence.str.contains('|'.join(words1)),:].index.array
matched_idx2 = df.loc[df.Sentence.str.contains('|'.join(words2)),:].index.array
So matched_idx1 gives the array:
[0,1,2]
and matched_idx2 gives the array:
[0,2,3,4]
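Now I want to get a list or array of the values that were matched inside the contains function.
So, for a new variable matched_idx1_values, the output should be:
['my','hi','my']
and for matched_idx2_values, the output should be:
['name','is','is','name']
Please let me know how to get these indices together with the values they matched. This example is trivial; my real lists contain many more words.
Thanks!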
Here is a complete example using spaCy (it reuses words1 and words2 from the question):
# Sample data
import pandas as pd
df = pd.DataFrame({'id': {0: 0, 1: 1, 2: 2, 3: 3, 4: 4}, 'Sentence': {0: 'my name was', 1: 'hi i am', 2: 'my phone is', 3: 'what is this', 4: 'her name was'}})
# Load spacy
import spacy
nlp = spacy.blank("en")
ruler = nlp.add_pipe('entity_ruler', config={"overwrite_ents": True}, last=True)
# add word patterns
lst_all_patterns = list()
for wrd in words1:
    lst_all_patterns += [{"label": "words1", "pattern": [{"lower": wrd}]}]
for wrd in words2:
    # label these patterns "words2" so matches can be told apart from words1
    lst_all_patterns += [{"label": "words2", "pattern": [{"lower": wrd}]}]
ruler.add_patterns(lst_all_patterns)
# EXAMPLE:
doc_string = nlp('my name was')
for e in doc_string.ents:
    print(e.label_, e, e.start, e.end)
# words1 my 0 1
# words2 name 1 2
# EXAMPLE dataframe
df['docs'] = df['Sentence'].map(nlp)
df['docs'].map(lambda x: [e.start for e in x.ents])
# 0 [0, 1]
# 1 [0]
# 2 [0, 2]
# 3 [1]
# 4 [1]
# Name: docs, dtype: object
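For comparison, the matched values can probably also be pulled out with pandas alone. The following is a minimal sketch, not the answer's method: the helper name first_match is hypothetical, and it assumes only the first listed word found in each sentence is wanted. It wraps the alternation in \b word boundaries, because a plain str.contains('hi|my') also matches row 3 ('what is this'), where 'hi' occurs inside 'this'.

```python
import re
import pandas as pd

words1 = ['hi', 'my']
words2 = ['name', 'is']

df = pd.DataFrame({
    'id': [0, 1, 2, 3, 4],
    'Sentence': ['my name was', 'hi i am', 'my phone is',
                 'what is this', 'her name was'],
})

def first_match(sentences, words):
    """Hypothetical helper: first whole-word match per sentence, non-matches dropped."""
    # re.escape guards against regex metacharacters in the word list;
    # \b restricts matches to whole words.
    pattern = r'\b(' + '|'.join(map(re.escape, words)) + r')\b'
    # With a single capture group and expand=False, str.extract returns a Series
    # holding the captured text, or NaN where nothing matched.
    return sentences.str.extract(pattern, expand=False).dropna()

matched1 = first_match(df['Sentence'], words1)
matched2 = first_match(df['Sentence'], words2)

matched_idx1 = matched1.index.tolist()
matched_idx1_values = matched1.tolist()
matched_idx2 = matched2.index.tolist()
matched_idx2_values = matched2.tolist()

print(matched_idx1, matched_idx1_values)
print(matched_idx2, matched_idx2_values)
```

If every match in a sentence is needed rather than only the first, Series.str.findall with the same pattern returns a list of matches per row.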