Pandas: if a column contains a string, get the unique values from another column and drop those rows from the dataframe
I have a small problem. I have a dataframe with 7 columns; two of them are 'IP' and 'URL'.
It is a web log data set. I am trying to get the unique IPs of the rows where URL contains the string "robots.txt", and then drop all rows belonging to those IPs from the dataframe.
I had a hard time trying to solve this. I tried pandas groupby but still can't solve it. I am able to get the unique IPs where the URL contains the string "robots.txt" with this code:
robots = data2[data2.url.str.contains('robots.txt', regex=True)]
len(robots[['ip']].drop_duplicates())
But after that I don't know how to drop these rows from the dataframe. Does someone have some tips? Thanks.
Here is the sample: https://i.stack.imgur.com/t6q39.png
The dataframe has around 30k rows. The desired output is to drop all rows from the dataframe if the string "robots.txt" is in the url column. I can do that, but the trick is to remember the values from column 'ip' when column 'url' contains that particular string, and to drop every row that was accessed from those IP addresses.
Just negate your condition:
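# Boolean mask: True for rows whose URL contains "robots.txt"
robots_condition = data2.url.str.contains('robots.txt')
# Unique IPs that requested robots.txt at least once
no_crawl_ips = data2.loc[robots_condition, 'ip'].unique()
# Keep only the rows whose URL does not match the condition
data2 = data2[~robots_condition]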
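Note that the last line above removes only the rows whose URL itself contains "robots.txt"; no_crawl_ips is computed but not used. If the goal is to drop every row coming from those IP addresses, as the question describes, one option is to filter on the IP column with isin. The following is a minimal sketch under stated assumptions: the small dataframe is made-up sample data standing in for the real 7-column log, the name cleaned is hypothetical, and regex=False and na=False are additions that are not in the original snippets.

```python
import pandas as pd

# Hypothetical toy web log; the real data set has 7 columns and about 30k rows.
data2 = pd.DataFrame({
    "ip": ["1.1.1.1", "1.1.1.1", "2.2.2.2", "3.3.3.3", "3.3.3.3"],
    "url": ["/robots.txt", "/index.html", "/index.html", "/about.html", "/robots.txt"],
})

# regex=False matches the dot literally; na=False treats missing URLs as non-matches.
robots_condition = data2["url"].str.contains("robots.txt", regex=False, na=False)

# Unique IPs that requested robots.txt at least once.
no_crawl_ips = data2.loc[robots_condition, "ip"].unique()

# Drop every row whose IP appears in that set, not just the matching rows.
cleaned = data2[~data2["ip"].isin(no_crawl_ips)].reset_index(drop=True)
print(cleaned)
```

With this sample data only the row for 2.2.2.2 remains, because the other two addresses each requested robots.txt at least once.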