I have a DataFrame (tag) with 9153 rows and 3 columns. Here are the first 5 rows:
pk tag tweet
0 148 unknown 9491
1 149 ignore 9513
2 150 real 8461
3 151 fake 8639
4 152 unknown 8385
What I am trying to do is check whether a tweet gets two tags that are different from each other, like these:
pk tag tweet
5287 5436 unknown 16600
8477 8626 real 16600
If so, I want to eliminate these tweets from the data frame. But if a tweet gets two identical tags, it is accepted and should not be deleted. To solve this problem, I created a new data frame consisting of each tweet number and its number of tags:
x=pd.DataFrame(tag['tweet'].value_counts())
x.reset_index(inplace=True)
Here are the first 5 rows of the x data frame. Some tweets get 3 or even more (up to 15) tags, but I am only concerned with tweets that got two tags:
index tweet
0 8252 15
1 9200 15
2 8646 13
3 8774 13
4 8322 13
Then I create a list of the tweet numbers that have only two tags:
tweet_no = []
for i in x.itertuples():
    if i.tweet == 2:
        tweet_no.append(i.index)
But I am stuck on how to check whether these tweets have the same or different tags, so that I delete them if the tags differ and keep them if the tags match.
Try getting the unique tag count for each tweet, then eliminate the tweet if that count is greater than one.
import pandas as pd
# your original data frame
original_data = pd.read_csv("your tweets csv file")
# Create a temp data frame with only the required columns
temp_data = original_data[["tweet", "tag"]]
# Count the distinct tags for each tweet
temp_data = temp_data.groupby(["tweet"], as_index=False).agg({"tag": "nunique"})
# Tweets with only a single distinct tag
temp_data = temp_data[temp_data["tag"] == 1]["tweet"]
# Filter the original data frame for the desired tweets
original_data = original_data[original_data["tweet"].isin(temp_data)]
=====================================
Sample example
data = pd.DataFrame(data={"tweet": [1, 2, 3, 1, 2, 3], "tags": ["a", "b", "c", "d", "b", "c"]})
data = data.groupby(["tweet"], as_index=False).agg({"tags": "nunique"})
# Tweets with only a single distinct tag
data = data[data["tags"] == 1]
=====================================
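The same idea can also be sketched without the temporary frame, by broadcasting the per-tweet unique count back onto the original rows with transform. This is only a minimal sketch: the column names follow the question (pk, tag, tweet), and the rows are made-up toy values, not the asker's real data.

```python
import pandas as pd

# Toy data with the question's column names; values are illustrative only.
tag = pd.DataFrame({
    "pk": [1, 2, 3, 4, 5],
    "tag": ["unknown", "real", "fake", "fake", "real"],
    "tweet": [16600, 16600, 8639, 8639, 8461],
})

# Number of distinct tags per tweet, repeated on every row of that tweet.
distinct_tags = tag.groupby("tweet")["tag"].transform("nunique")

# Keep only the rows whose tweet carries a single distinct tag.
consistent = tag[distinct_tags == 1]
print(consistent)
```

Note that, like the code above, this drops every tweet with conflicting tags regardless of how many tags it has, not only the two-tag ones.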
Hope this will resolve your problem
Assuming similar = the same, you can find an example below:
df = pd.DataFrame({'tag': ['1', '1', '2', '3', '3'],
'tweet': ['a', 'a', 'b', 'b', 'c']})
df = df.groupby('tweet').agg(['count', 'nunique'])
df.columns = df.columns.droplevel()
df[(df['count'] > 1) & (df['nunique'] == 1)]
You might as well drop the count column and filter on nunique alone. Cheers!
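If the goal is to go back to the original rows and drop only the tweets that have exactly two tags which disagree, one possible sketch uses groupby with filter. The helper name is_conflicting_pair and the toy rows are assumptions made for illustration; the column names come from the question.

```python
import pandas as pd

# Toy data with the question's column names; values are illustrative only.
tag = pd.DataFrame({
    "pk": [1, 2, 3, 4, 5, 6, 7],
    "tag": ["unknown", "real", "fake", "fake", "real", "fake", "ignore"],
    "tweet": [16600, 16600, 8639, 8639, 8252, 8252, 8252],
})

def is_conflicting_pair(group):
    # Hypothetical helper: True only for tweets with exactly two tags that differ.
    return len(group) == 2 and group["tag"].nunique() == 2

# filter() keeps every row of the groups for which the function returns True.
cleaned = tag.groupby("tweet").filter(lambda g: not is_conflicting_pair(g))
print(cleaned)
```

Here tweet 16600 is removed, the matching pair for 8639 stays, and 8252 stays because it has three tags and so falls outside the two-tag rule.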
What you could do is join the counter table (the per-tweet counts, like x in the question) with the original table on the tweet column, then sort the result by tweet.
tb_counter.columns = ['tweet', 'c']
tag_2 = tag.merge(tb_counter, how='left', on='tweet')
tag_2 = tag_2.sort_values('tweet')
tag_2.head()
Next, just find the tweets that appear only twice (column c) and compare each row to the previous row's tag or pk column using numpy.
import numpy as np
tag_2['same_and_2'] = np.where(((tag_2['c'] == 2) & (tag_2['pk'] != tag_2['pk'].shift())), 1, 0)
tag_2.head()
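As a rough sketch of the same compare-with-the-previous-row idea applied to the tag column, the shift can be done per tweet so that a row is never compared against a different tweet. The way tb_counter is built here and the toy rows are assumptions; only the column names tweet, tag, pk and c come from the thread.

```python
import pandas as pd

# Toy data with the question's column names; values are illustrative only.
tag = pd.DataFrame({
    "pk": [1, 2, 3, 4, 5],
    "tag": ["unknown", "fake", "real", "fake", "real"],
    "tweet": [16600, 8639, 16600, 8639, 8461],
})

# Assumed construction of the counter table: one row per tweet with its tag count in "c".
tb_counter = tag["tweet"].value_counts().rename_axis("tweet").reset_index(name="c")
tag_2 = tag.merge(tb_counter, how="left", on="tweet").sort_values("tweet")

# Previous tag within the same tweet (missing on each tweet's first row).
prev_tag = tag_2.groupby("tweet")["tag"].shift()

# Rows that complete a two-tag pair whose tags disagree.
conflict = (tag_2["c"] == 2) & prev_tag.notna() & (tag_2["tag"] != prev_tag)

# Drop every row belonging to a conflicting tweet.
bad_tweets = tag_2.loc[conflict, "tweet"]
result = tag_2[~tag_2["tweet"].isin(bad_tweets)]
print(result)
```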