Delete rows from data frame based on value of the columns

Question

I have a dataframe (tag) with 9153 rows and 3 columns. Here are the first 5 rows.

    pk  tag     tweet
0   148 unknown 9491
1   149 ignore  9513
2   150 real    8461
3   151 fake    8639
4   152 unknown 8385

What I am trying to do is check whether a tweet gets two tags that are different from each other, like these ones,

          pk    tag     tweet
5287    5436    unknown 16600
8477    8626    real    16600

and then remove such tweets from the data frame. But if a tweet gets two identical tags, it should be accepted and not deleted. To solve this problem, I created a new data frame consisting of each tweet number and its number of tags:

x=pd.DataFrame(tag['tweet'].value_counts())
x.reset_index(inplace=True)

Here are the first 5 rows of the x data frame. Some tweets get 3 or even more (up to 15) tags, but I am only concerned with tweets that got two tags.

   index    tweet
0   8252    15
1   9200    15
2   8646    13
3   8774    13
4   8322    13

Then I create a list of the tweet numbers that have exactly two tags:

tweet_no=[]
for i in x.itertuples():
    if i.tweet==2:
        tweet_no.append(i.index)

But I am stuck on how to compare whether these tweets have identical or different tags, so that I can delete the ones with different tags and keep the ones with identical tags.

Answer 1

Try getting the count of unique tags for each tweet, then eliminate the tweets where that count is greater than one.

import pandas as pd

# your original data frame (columns: pk, tag, tweet)
original_data = pd.read_csv("your tweets csv file")

# Create temp data frame with only the required columns
temp_data = original_data[["tweet", "tag"]]
# Count the distinct tags per tweet
temp_data = temp_data.groupby(["tweet"], as_index=False).agg({"tag": "nunique"})
# Tweets with only a single distinct tag
temp_data = temp_data[temp_data["tag"] == 1]["tweet"]

# Filter original data frame for the desired tweets
original_data = original_data[original_data["tweet"].isin(temp_data)]

=====================================
Sample example
data = pd.DataFrame(data={"tweet": [1, 2, 3, 1, 2, 3], "tags": ["a", "b", "c", "d", "b", "c"]})

# Count the distinct tags per tweet
data = data.groupby(["tweet"], as_index=False).agg({"tags": "nunique"})

# Tweets with only a single distinct tag (leaves tweets 2 and 3)
data = data[data["tags"] == 1]
=====================================

Hope this will resolve your problem
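The same idea can also be written without the temporary frame. The following is only a sketch, assuming the question's column names (pk, tag, tweet); the sample rows are made up for illustration. It uses groupby(...).transform("nunique") so that the distinct-tag count lines up row by row with the original frame.

```python
import pandas as pd

# Hypothetical sample data using the question's column names.
tag = pd.DataFrame({
    "pk": [1, 2, 3, 4, 5, 6],
    "tag": ["real", "real", "unknown", "real", "fake", "ignore"],
    "tweet": [10, 10, 20, 20, 30, 40],
})

# Number of distinct tags per tweet, broadcast back to every row of that tweet.
distinct_tags = tag.groupby("tweet")["tag"].transform("nunique")

# Keep only the rows whose tweet carries a single distinct tag.
consistent = tag[distinct_tags == 1]
print(consistent)
```

Note that this drops every tweet with conflicting tags, not only those tagged exactly twice, which matches the filter used in the answer above.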

Answer 2

Assuming similar = the same, you can find an example below:

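import pandas as pd

df = pd.DataFrame({'tag': ['1', '1', '2', '3', '3'],
                   'tweet': ['a', 'a', 'b', 'b', 'c']})
# Per tweet: number of tags (count) and number of distinct tags (nunique)
df = df.groupby('tweet').agg(['count', 'nunique'])
# Drop the outer 'tag' level so the columns are just 'count' and 'nunique'
df.columns = df.columns.droplevel()
# Tweets tagged more than once where all tags agree
df[(df['count'] > 1) & (df['nunique'] == 1)]

You might as well drop the count column and filter on nunique alone. Cheers!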
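To apply this to the case described in the question (remove only tweets that have exactly two tags which differ, and leave everything else alone), one possible sketch is shown here. The sample rows are hypothetical, and the column names are taken from the question.

```python
import pandas as pd

# Hypothetical sample data using the question's column names.
tag = pd.DataFrame({
    "pk": [1, 2, 3, 4, 5, 6, 7, 8],
    "tag": ["real", "real", "unknown", "real", "fake", "fake", "real", "ignore"],
    "tweet": [10, 10, 20, 20, 30, 30, 30, 40],
})

per_tweet = tag.groupby("tweet")["tag"]
n_tags = per_tweet.transform("count")        # how many tags the tweet received
n_distinct = per_tweet.transform("nunique")  # how many different tags it received

# A tweet is rejected only when it has exactly two tags and they disagree.
conflicting_pair = (n_tags == 2) & (n_distinct == 2)
result = tag[~conflicting_pair]
print(result)
```

In this sample, tweet 20 is removed, while tweet 30 (three tags, two distinct) is kept because the question only covers tweets with two tags. How tweets with three or more tags should be treated is left open here.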

Answer 3

What you could do is join the counter table (x in the question, called tb_counter here) with the original table on the tweet column, and then sort the result by tweet.

tb_counter.columns = ['tweet', 'c']
tag_2 = tag.merge(tb_counter, how='left', on='tweet')
tag_2 = tag_2.sort_values('tweet')
tag_2.head()

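Next, find the tweets that appear exactly twice (column c) and compare each row's tag with the previous row's tag using numpy. Since pk differs on every row in the sample, the comparison is done on tag; the flag is set on the second row of each matching pair.

import numpy as np
# 1 when the tweet has exactly two tags and this row repeats the previous row's tweet and tag
prev_tweet = tag_2['tweet'].shift()
prev_tag = tag_2['tag'].shift()
tag_2['same_and_2'] = np.where((tag_2['c'] == 2) & (tag_2['tweet'] == prev_tweet) & (tag_2['tag'] == prev_tag), 1, 0)
tag_2.head()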
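To finish the job with this sort-and-shift approach, the rows still have to be dropped. The following is a rough sketch under a few assumptions: the sample rows are hypothetical, tb_counter is built here with groupby(...).size() rather than taken from the question's x, and the names mismatch, bad_tweets and cleaned are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data using the question's column names.
tag = pd.DataFrame({
    "pk": [148, 149, 150, 151, 152],
    "tag": ["unknown", "real", "real", "real", "fake"],
    "tweet": [16600, 16600, 8461, 8461, 8639],
})

# Counter table: one row per tweet with its number of tags in column c.
tb_counter = tag.groupby("tweet").size().reset_index(name="c")

# Attach the count to every row and put rows of the same tweet next to each other.
tag_2 = tag.merge(tb_counter, how="left", on="tweet").sort_values("tweet")

# Second row of a two-tag tweet whose tag differs from the first row's tag.
mismatch = (
    (tag_2["c"] == 2)
    & (tag_2["tweet"] == tag_2["tweet"].shift())
    & (tag_2["tag"] != tag_2["tag"].shift())
)

# Drop every row belonging to a tweet that has such a mismatch.
bad_tweets = tag_2.loc[mismatch, "tweet"].unique()
cleaned = tag_2[~tag_2["tweet"].isin(bad_tweets)].drop(columns="c")
print(cleaned)
```

Here tweet 16600 (unknown vs. real) is removed, while tweet 8461 (real twice) and the singly tagged tweet 8639 stay.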
