How to Filter a Pandas Dataframe Column of Lists

Goal: To filter rows based on the values in a column of lists.

Given:

index    pos_order
3192304  ['VB', 'DT', 'NN', 'NN', 'NN', 'NN']
1579035  ['VB', 'PRP', 'VBP', 'NN', 'RB', 'IN', 'NNS', 'NN']
763020   ['VB', 'VBP', 'PRP', 'JJ', 'IN', 'NN']
1289986  ['VB', 'NN', 'IN', 'CD', 'CD']
69194    ['VB', 'DT', 'JJ', 'NN']
3068116  ['VB', 'JJ', 'IN', 'NN', 'NN']
1506722  ['VB', 'NN', 'NNS', 'NNP']
3438101  ['VB', 'VB', 'IN', 'DT', 'NNS', 'NNS', 'CC', 'NN', 'NN']
1376463  ['VB', 'DT', 'NN', 'NN']
1903231  ['VB', 'DT', 'PRP', 'VBP', 'JJ', 'IN', 'NNP', 'NNP']

I'd like to find a way to query this table to fetch the rows where a given pattern is present. For example, if the pattern is ['IN', 'NN'], I should get rows 763020 and 3068116, but not row 3438101. So to be clear, the order of the list elements also matters.

I tried going about it this way:

def target_phrase(pattern_tested, pattern_to_match):
    if ''.join(map(str, pattern_to_match)) in ''.join(map(str, pattern_tested)):
        print (pattern_tested)
        return True
    else:
        return False

I can run this code using lists outside of pandas, but when I try using something like:

target_phrase(df.loc[5]['pos_order'], ['IN', 'NN'])

the code fails.

Any clue?

First, let me provide a simplified view of target_phrase:

def target_phrase(pattern_tested, pattern_to_match):
    return ''.join(map(str, pattern_to_match)) in ''.join(map(str, pattern_tested))
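
One caveat about the joined-string check: concatenating the tags erases the boundaries between them, so 'IN' followed by 'NNS' becomes 'INNNS', which contains 'INNN'. Row 1579035 from the question would therefore match ['IN', 'NN'] even though 'IN' is not followed by 'NN'. A minimal sketch of an element-wise alternative follows; contains_sublist is a hypothetical helper name, not something from the thread.

```python
def contains_sublist(sequence, pattern):
    """Return True if `pattern` occurs as a contiguous run inside `sequence`."""
    n = len(pattern)
    # Compare every window of length n against the pattern, element by element.
    return any(list(sequence[i:i + n]) == list(pattern)
               for i in range(len(sequence) - n + 1))


# Row 1579035 from the question.
tags = ['VB', 'PRP', 'VBP', 'NN', 'RB', 'IN', 'NNS', 'NN']

# Joined-string check: 'IN' + 'NNS' produces 'INNNS', which contains 'INNN'.
print(''.join(['IN', 'NN']) in ''.join(tags))   # True (a false positive)

# Element-wise check: 'IN' is followed by 'NNS', not 'NN'.
print(contains_sublist(tags, ['IN', 'NN']))     # False
```

Comparing slices keeps each tag intact, at the cost of a few more comparisons per row.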

Why does the code not work? Because target_phrase expects its first argument to be a list, not a pandas dataframe. The correct syntax is as follows:

df['pattern_matched'] = df.apply(lambda x: target_phrase(x['pos_order'], 
                                                         ['IN', 'NN']), axis=1)

With axis=1, df.apply calls target_phrase once per row.
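
Since the goal is to fetch the matching rows rather than just flag them, the boolean result can also be used directly as a mask. The sketch below assumes the pos_order cells already hold real Python lists and uses three rows copied from the question; the variable names mask and matches are illustrative only.

```python
import pandas as pd


def target_phrase(pattern_tested, pattern_to_match):
    # Same joined-string check as in the simplified version above.
    return ''.join(map(str, pattern_to_match)) in ''.join(map(str, pattern_tested))


# Three rows copied from the question; the index values are the row labels.
df = pd.DataFrame(
    {'pos_order': [
        ['VB', 'VBP', 'PRP', 'JJ', 'IN', 'NN'],
        ['VB', 'JJ', 'IN', 'NN', 'NN'],
        ['VB', 'VB', 'IN', 'DT', 'NNS', 'NNS', 'CC', 'NN', 'NN'],
    ]},
    index=[763020, 3068116, 3438101],
)

# Series.apply calls the function once per cell and returns a boolean Series.
mask = df['pos_order'].apply(lambda tags: target_phrase(tags, ['IN', 'NN']))

# Boolean indexing keeps only the rows where the mask is True.
matches = df[mask]
print(matches.index.tolist())  # [763020, 3068116]
```

Applying the function to the single column avoids building a row object for every call, which is usually a little cheaper than df.apply with axis=1.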

As it turned out, it was a combination of things, which Kate and Serge together led me to figure out.

As I had things set up, the data types being compared did not match: I was comparing a string to a list. I had to add code to convert the string that looked like a list into an actual list (Serge's contribution). Once that was done, I was able to run the lambda successfully, thanks to Kate.
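
The thread does not show the conversion code itself. One common way to turn a string that looks like a list into a real list is ast.literal_eval from the standard library; the sketch below assumes every cell is a valid Python list literal, and the two-row frame is made up from rows in the question for illustration.

```python
import ast

import pandas as pd

# Hypothetical frame where each cell is a string that only looks like a list,
# as can happen after the data is written to and read back from a CSV file.
df = pd.DataFrame(
    {'pos_order': ["['VB', 'JJ', 'IN', 'NN', 'NN']", "['VB', 'DT', 'JJ', 'NN']"]},
    index=[3068116, 69194],
)
print(type(df.loc[3068116, 'pos_order']))  # <class 'str'>

# ast.literal_eval parses a Python literal without executing arbitrary code.
df['pos_order'] = df['pos_order'].apply(ast.literal_eval)
print(type(df.loc[3068116, 'pos_order']))  # <class 'list'>
```

While the cells are still strings, joining them character by character just reproduces the original text with its quotes and commas, so a pattern such as 'INNN' never appears in it and every row is reported as a non-match.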
