
List comparisons for pandas column of lists

I have a pandas DataFrame representing a library. The columns represent metadata, such as author, title, year and text. The text column contains lists with the book text, where each list element represents a sentence in the book (see below).

     Author  Title   Text
0    Smith   ABC    ["This is the first sentence", "This is the second sentence"]
1    Green   XYZ    ["Also a sentence", "And the second sentence"]

I want to carry out some NLP analysis on the sentences. For individual examples I would use list comparisons, but how can I use list comparisons for the column in the most Pythonic way?

What I want to do is, e.g., make a new column with a list of the sentences containing the word "the", such as in this example: How to test if a string contains one of the substrings in a list, in pandas?

However, they use a dataframe with a string column, not a list column.

You can do this by calling apply on the column (Series.apply) with a regular expression.

import re
import pandas as pd

data = {
    'Author': ['Smith', 'Green'],
    'Title' : ['ABC', 'XYZ'],
    'Text' : [
        ["This is the first sentence", "This is the second sentence"],
        ["Also a sentence", "And the second sentence"]
    ]
}

df = pd.DataFrame(data)

tokens = [
    'first',
    'second',
    'th'
]

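def find_token(text_list, re_pattern):
    # Keep only the sentences in which the pattern matches; each sentence
    # is lowercased first, so the search ignores case in the text.
    result = [
        text
        for text in text_list
        if re_pattern.search(text.lower())
    ]
    # Return None rather than an empty list when nothing matches.
    return result or None

for token in tokens:
    # re.escape guards against tokens that contain regex metacharacters.
    re_pattern = re.compile(fr'(^|\s){re.escape(token)}($|\s)')
    df[token] = df['Text'].apply(lambda x: find_token(x, re_pattern))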

The regular expression matches the token only as a whole word.
So the token must be preceded by whitespace or the start of the sentence, and followed by whitespace or the end of the sentence.
(^|\s) matches the start of the string or a whitespace character.
($|\s) matches the end of the string or a whitespace character.

If you use 'th' as a token, the result is None, because 'th' only occurs inside longer words such as "the" and "this", never as a standalone word.
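
The matching rules described above can be checked in isolation with a small sketch. The helper name is_whole_word is hypothetical and not part of the answer; it only wraps the same pattern shape so that single sentences can be tested.

```python
import re

def is_whole_word(token, sentence):
    # Hypothetical helper: same pattern shape as in the answer, where the
    # token must be delimited by whitespace or the start/end of the string.
    pattern = re.compile(fr'(^|\s){re.escape(token)}($|\s)')
    return pattern.search(sentence.lower()) is not None

print(is_whole_word('the', "This is the first sentence"))  # True
print(is_whole_word('th', "This is the first sentence"))   # False
print(is_whole_word('the', "Read the."))                    # False
```

The last call shows a limitation of this pattern: a token followed directly by punctuation is not matched, because "." is neither whitespace nor the end of the string.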

With tokens set to ['first', 'second', 'th'], the result is as follows.

  Author Title                                               Text  \
0  Smith   ABC  [This is the first sentence, This is the secon...   
1  Green   XYZ         [Also a sentence, And the second sentence]   

                          first                         second    th  
0  [This is the first sentence]  [This is the second sentence]  None  
1                          None      [And the second sentence]  None  
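
For the specific case in the question (a new column with the sentences containing the word "the"), a possible variant is sketched below. It is an alternative, not the answer's method: it swaps the whitespace-based pattern for \b word boundaries so that a word followed by punctuation still matches, and it returns an empty list instead of None when nothing matches. The function name sentences_with_word is an assumption made for illustration.

```python
import re
import pandas as pd

df = pd.DataFrame({
    'Author': ['Smith', 'Green'],
    'Title': ['ABC', 'XYZ'],
    'Text': [
        ["This is the first sentence", "This is the second sentence"],
        ["Also a sentence", "And the second sentence"],
    ],
})

def sentences_with_word(sentences, word):
    # Hypothetical helper: keep the sentences that contain `word` as a
    # whole word, ignoring case. \b marks a word boundary.
    pattern = re.compile(rf'\b{re.escape(word)}\b', flags=re.IGNORECASE)
    return [s for s in sentences if pattern.search(s)]

# One list comprehension per row, applied to the column of lists.
df['the'] = df['Text'].apply(lambda sents: sentences_with_word(sents, 'the'))

print(df['the'].tolist())
# [['This is the first sentence', 'This is the second sentence'],
#  ['And the second sentence']]
```

Returning an empty list keeps every cell the same type, which can make later steps such as counting matches with .str.len() simpler; return None instead if missing values are preferred.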
