List comparisons for pandas column of lists
I have a pandas DataFrame representing a library. The columns hold metadata such as author, title, year and text. The text column contains lists with the book text, where each list element is one sentence of the book (see below).
Author Title Text
0 Smith ABC ["This is the first sentence", "This is the second sentence"]
1 Green XYZ ["Also a sentence", "And the second sentence"]
I want to carry out some NLP analysis on the sentences. For individual examples I would use list comparisons, but how can I apply list comparisons to the whole column in the most Pythonic way?
What I want to do is, for example, create a new column holding the list of sentences that contain the word "the", as in this example: How to test if a string contains one of the substrings in a list, in pandas?
However, that example uses a DataFrame with a string column, not a list column.
You can do this by calling apply on the Text column (Series.apply) together with a regular expression.
import re
import pandas as pd

data = {
    'Author': ['Smith', 'Green'],
    'Title' : ['ABC', 'XYZ'],
    'Text' : [
        ["This is the first sentence", "This is the second sentence"],
        ["Also a sentence", "And the second sentence"]
    ]
}
df = pd.DataFrame(data)

tokens = [
    'first',
    'second',
    'th'
]

def find_token(text_list, re_pattern):
    # Keep only the sentences in which the pattern is found (case-insensitive).
    result = [
        text
        for text in text_list
        if re.search(re_pattern, text.lower())
    ]
    if result:
        return result
    # No sentence matched.
    return None

# Add one new column per token.
for token in tokens:
    re_pattern = re.compile(fr'(^|\s){token}($|\s)')
    df[token] = df['Text'].apply(lambda x: find_token(x, re_pattern))
The regular expression matches the token only as a whole word, so the token must be preceded by whitespace or the start of the sentence and followed by whitespace or the end of the sentence.
In the pattern, (^|\s) matches the start of the string or a whitespace character, and ($|\s) matches the end of the string or a whitespace character.
If you use 'th' as a token, the result is None, because 'th' never appears as a standalone word in any of the sentences.
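To see the whole-word behaviour in isolation, here is a minimal sketch that applies the same pattern shape to one lowercased sentence. The variable names sentence, partial and whole are only for illustration.

```python
import re

# One sentence from the example data, lowercased as in find_token.
sentence = "this is the first sentence"

# Same pattern shape as in the answer: the token must sit between
# start-of-string/whitespace and end-of-string/whitespace.
partial = re.compile(r'(^|\s)th($|\s)')
whole = re.compile(r'(^|\s)the($|\s)')

partial_found = bool(partial.search(sentence))  # 'th' only occurs inside longer words
whole_found = bool(whole.search(sentence))      # 'the' occurs as a standalone word

print(partial_found, whole_found)
```

Note that this pattern only treats whitespace as a separator, so a word directly followed by punctuation (for example "the.") would not count as a match.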
With tokens set to ['first', 'second', 'th'], the result is as follows.
Author Title Text \
0 Smith ABC [This is the first sentence, This is the secon...
1 Green XYZ [Also a sentence, And the second sentence]
first second th
0 [This is the first sentence] [This is the second sentence] None
1 None [And the second sentence] None
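For the single-word case from the question (sentences containing "the"), a possible variant is sketched below. It is not the approach shown above: it assumes you prefer the regex word boundary \b (which also handles adjacent punctuation), re.escape to guard against special characters in the word, and an empty list instead of None when nothing matches. The helper name sentences_with_word is hypothetical.

```python
import re
import pandas as pd

df = pd.DataFrame({
    'Author': ['Smith', 'Green'],
    'Title': ['ABC', 'XYZ'],
    'Text': [
        ["This is the first sentence", "This is the second sentence"],
        ["Also a sentence", "And the second sentence"],
    ],
})

def sentences_with_word(sentences, word):
    # Hypothetical helper: return the sentences that contain `word`
    # as a whole word, ignoring case.
    pattern = re.compile(rf'\b{re.escape(word)}\b', re.IGNORECASE)
    return [s for s in sentences if pattern.search(s)]

# Apply the helper to every list in the Text column.
df['the'] = df['Text'].apply(lambda sentences: sentences_with_word(sentences, 'the'))

print(df['the'].tolist())
```

Returning an empty list keeps every cell of the new column the same type, which may make later steps such as taking len of each cell simpler.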