
Delete rows from data frame based on value of the columns

I have a dataframe (tag) with 9153 rows and 3 columns. Here are the first 5 rows.

    pk  tag     tweet
0   148 unknown 9491
1   149 ignore  9513
2   150 real    8461
3   151 fake    8639
4   152 unknown 8385

What I am trying to do is check whether a tweet gets two tags that differ from each other, like these ones,

          pk    tag     tweet
5287    5436    unknown 16600
8477    8626    real    16600

and then eliminate those tweets from the data frame. If a tweet gets two similar tags, it is accepted and will not be deleted. To solve this problem, I created a new data frame consisting of each tweet number and its number of tags:

x=pd.DataFrame(tag['tweet'].value_counts())
x.reset_index(inplace=True)

Here are the first 5 rows of the x data frame. Some tweets get 3 or even more (up to 15) tags, but I am only concerned with tweets that got two tags.

   index    tweet
0   8252    15
1   9200    15
2   8646    13
3   8774    13
4   8322    13

Then I create a list of the tweet numbers that have exactly two tags:

tweet_no=[]
for i in x.itertuples():
    if i.tweet==2:
        tweet_no.append(i.index)

But I am stuck on how to check whether these tweets have similar or different tags, so that they are deleted if the tags differ and accepted if the tags are similar.

Try getting the number of unique tags for each tweet, then eliminate the tweet if that count is greater than one.

import pandas as pd

# your original data frame (columns: pk, tag, tweet)
original_data = pd.read_csv("your tweets csv file")

# Create a temp data frame with only the required columns
temp_data = original_data[["tweet", "tag"]]
# Count the distinct tags for each tweet
temp_data = temp_data.groupby(["tweet"], as_index=False).agg({"tag": "nunique"})
# Tweets with only a single distinct tag
temp_data = temp_data[temp_data["tag"] == 1]["tweet"]

# Filter the original data frame for the desired tweets
original_data = original_data[original_data["tweet"].isin(temp_data)]

=====================================
Sample example
data = pd.DataFrame(data={"tweet": [1, 2, 3, 1, 2, 3], "tag": ["a", "b", "c", "d", "b", "c"]})

data = data.groupby(["tweet"], as_index=False).agg({"tag": "nunique"})

# Tweets with only a single distinct tag
data = data[data["tag"] == 1]
=====================================

Hope this will resolve your problem
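
As a variation on the idea above, the unique count can be broadcast back onto every row with transform, which avoids the temporary frame. This is only a sketch: the six rows below are made up for illustration and just reuse the question's column names (pk, tag, tweet).

```python
import pandas as pd

# Hypothetical sample using the question's column names.
tag = pd.DataFrame({
    "pk": [148, 149, 150, 151, 152, 153],
    "tag": ["unknown", "real", "real", "fake", "fake", "ignore"],
    "tweet": [16600, 16600, 8461, 8639, 8639, 8385],
})

# Number of distinct tags per tweet, repeated on every row of that tweet.
n_tags = tag.groupby("tweet")["tag"].transform("nunique")

# Keep only the rows whose tweet has a single distinct tag.
consistent = tag[n_tags == 1]
print(consistent)
```

Tweet 16600 has two different tags here, so both of its rows are dropped. Tweets tagged only once are kept as well, which may or may not be what you want.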

Assuming similar = the same, you can find an example below:

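import pandas as pd

df = pd.DataFrame({'tag': ['1', '1', '2', '3', '3'],
                   'tweet': ['a', 'a', 'b', 'b', 'c']})
# Per tweet: number of tags and number of distinct tags
df = df.groupby('tweet').agg(['count', 'nunique'])
# Drop the outer 'tag' level so the columns are just 'count' and 'nunique'
df.columns = df.columns.droplevel()
# Tweets tagged more than once whose tags are all the same
df[(df['count'] > 1) & (df['nunique'] == 1)]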

You might as well drop the count column and filter on nunique alone. Cheers!
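
A minimal sketch of the nunique-only version, on the same toy data. The names distinct, agreeing and kept are just illustrative.

```python
import pandas as pd

df = pd.DataFrame({"tag": ["1", "1", "2", "3", "3"],
                   "tweet": ["a", "a", "b", "b", "c"]})

# Distinct tags per tweet; the result is a Series indexed by tweet.
distinct = df.groupby("tweet")["tag"].nunique()

# Tweets whose tags all agree (this also keeps tweets tagged only once).
agreeing = distinct[distinct == 1].index

# Rows of the original frame that belong to those tweets.
kept = df[df["tweet"].isin(agreeing)]
print(kept)
```

Without the count filter, tweet c (tagged once) is kept alongside tweet a, while tweet b is removed because its two tags differ.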

What you could do is join the counter table with the original table on the tweet column, then sort the result by tweet. Here tb_counter stands for the counter table (presumably x from the question).

tb_counter.columns = ['tweet', 'c']
tag_2 = tag.merge(tb_counter, how='left', on='tweet')
tag_2 = tag_2.sort_values('tweet')
tag_2.head()
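
The counter table itself is not built in this answer. One possible way to get it, assuming the question's column names, is groupby(...).size(), which gives the tweet and c columns directly. The column names that value_counts().reset_index() produces differ between pandas versions, so this sidesteps the rename. The five rows are invented for the example.

```python
import pandas as pd

# Hypothetical sample using the question's column names.
tag = pd.DataFrame({
    "pk": [1, 2, 3, 4, 5],
    "tag": ["real", "unknown", "fake", "fake", "real"],
    "tweet": [16600, 16600, 8639, 8639, 8461],
})

# One row per tweet with the number of tags it received, in column 'c'.
tb_counter = tag.groupby("tweet").size().reset_index(name="c")

# Attach the count to every row of the original table and sort by tweet.
tag_2 = tag.merge(tb_counter, how="left", on="tweet").sort_values("tweet")
print(tag_2)
```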


Next, just find the tweets that appear only twice (column c) and compare each row with the previous row's tag or pk value using numpy.

import numpy as np
tag_2['same_and_2'] = np.where(((tag_2['c'] == 2) & (tag_2['pk'] != tag_2['pk'].shift())), 1, 0)
tag_2.head()
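
If the goal is specifically to drop two-tag tweets whose tags differ, a possible variation is to compare the tag column (not pk) with the previous row inside each tweet group, so that rows of neighbouring tweets are never compared with each other. This is a sketch under that assumption, not the code above; the data and the names prev_tag, has_match, to_drop and result are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical merged and sorted table with the count column 'c'.
tag_2 = pd.DataFrame({
    "pk": [5, 3, 4, 1, 2],
    "tag": ["real", "fake", "fake", "real", "unknown"],
    "tweet": [8461, 8639, 8639, 16600, 16600],
    "c": [1, 2, 2, 2, 2],
})

# Previous tag within the same tweet (missing on the first row of each tweet).
prev_tag = tag_2.groupby("tweet")["tag"].shift()

# 1 on the second row of a two-row tweet whose tag repeats the first row's tag.
tag_2["same_and_2"] = np.where((tag_2["c"] == 2) & (tag_2["tag"] == prev_tag), 1, 0)

# A two-row tweet with no matching pair has conflicting tags: drop its rows.
has_match = tag_2.groupby("tweet")["same_and_2"].transform("max") == 1
to_drop = (tag_2["c"] == 2) & ~has_match
result = tag_2[~to_drop]
print(result)
```

In this sample tweet 16600 (real vs. unknown) is removed, while 8639 (fake twice) and the singly tagged 8461 stay.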


