简体   繁体   English

Python pandas 检查单元格中列表的最后一个元素是否包含特定字符串

[英]Python pandas check if the last element of a list in a cell contains specific string

my dataframe df:

index                        url
1           [{'url': 'http://bhandarkarscollegekdp.org/'}]
2             [{'url': 'http://cateringinyourhome.com/'}]
3                                                     NaN
4                  [{'url': 'http://muddyjunction.com/'}]
5                       [{'url': 'http://ecskouhou.jp/'}]
6                     [{'url': 'http://andersrice.com/'}]
7       [{'url': 'http://durager.cz/'}, {'url': 'http:andersrice.com'}]
8            [{'url': 'http://milenijum-osiguranje.rs/'}]
9       [{'url': 'http://form-kind.org/'}, {'url': 'https://osiguranje'},{'url': 'http://beseka.com.tr'}]

I would like to select the rows if the last item in the list of the row of url column contains 'https', while skipping missing values.如果 url 列的行列表中的最后一项包含“https”,我想选择行,同时跳过缺失值。

My current script我现在的剧本

df[df['url'].str[-1].str.contains('https',na=False)]

returns False values for all the rows while some of them actually contains https.为所有行返回 False 值,而其中一些实际上包含 https。

Can anybody help with it?有人可以帮忙吗?

I think you can first replace NaN to empty url and then use apply :我认为您可以先将NaN替换为empty url ,然后使用apply

df = pd.DataFrame({'url':[[{'url': 'http://bhandarkarscollegekdp.org/'}],
                          np.nan,
                         [{'url': 'http://cateringinyourhome.com/'}],  
                         [{'url': 'http://durager.cz/'}, {'url': 'https:andersrice.com'}]]},
                  index=[1,2,3,4])

print (df)
                                                 url
1     [{'url': 'http://bhandarkarscollegekdp.org/'}]
2                                                NaN
3        [{'url': 'http://cateringinyourhome.com/'}]
4  [{'url': 'http://durager.cz/'}, {'url': 'https...

df.loc[df.url.isnull(), 'url'] = [[{'url':''}]]
print (df)
                                                 url
1     [{'url': 'http://bhandarkarscollegekdp.org/'}]
2                                      [{'url': ''}]
3        [{'url': 'http://cateringinyourhome.com/'}]
4  [{'url': 'http://durager.cz/'}, {'url': 'https...

print (df.url.apply(lambda x: 'https' in x[-1]['url']))
1    False
2    False
3    False
4     True
Name: url, dtype: bool

First solution:第一个解决方案:

df.loc[df.url.notnull(), 'a'] = 
df.loc[df.url.notnull(), 'url'].apply(lambda x: 'https' in x[-1]['url'])

df.a.fillna(False, inplace=True)
print (df)
                                                 url      a
1     [{'url': 'http://bhandarkarscollegekdp.org/'}]  False
2                                                NaN  False
3        [{'url': 'http://cateringinyourhome.com/'}]  False
4  [{'url': 'http://durager.cz/'}, {'url': 'https...   True

not sure url is str or other types不确定 url 是 str 还是其他类型

you can do like this:你可以这样做:

"https" in str(df.url[len(df)-1])

or或者

str(df.ix[len(df)-1].url).__contains__("https")

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM