Filter one DataFrame by unique values in another DataFrame

I have two pandas DataFrames:

The first DataFrame, DFF, contains all of the imported review data, with columns such as "prodcode", "sentiment", "summaryText", and "reviewText".

DFF = DFF[['prodcode', 'summaryText', 'reviewText', 'overall', 'reviewerID', 'reviewerName', 'helpful','reviewTime', 'unixReviewTime', 'sentiment','textLength']]

which produces:


     prodcode                                 summaryText                                         reviewText  overall      reviewerID    ...       helpful   reviewTime unixReviewTime  sentiment textLength
0  B00002243X  Work Well - Should Have Bought Longer Ones  I needed a set of jumper cables for my new car...      5.0  A3F73SC1LY51OO    ...        [4, 4]  08 17, 2011     1313539200          2        516
1  B00002243X                            Okay long cables  These long cables work fine for my truck, but ...      4.0  A20S66SKYXULG2    ...        [1, 1]   09 4, 2011     1315094400          2        265
2  B00002243X                  Looks and feels heavy Duty  Can't comment much on these since they have no...      5.0  A2I8LFSN2IS5EO    ...        [0, 0]  07 25, 2013     1374710400          2       1142
3  B00002243X       Excellent choice for Jumper Cables!!!  I absolutley love Amazon!!!  For the price of ...      5.0  A3GT2EWQSO45ZG    ...      [19, 19]  12 21, 2010     1292889600          2       4739
4  B00002243X      Excellent, High Quality Starter Cables  I purchased the 12' feet long cable set and th...      5.0  A3ESWJPAVRPWB4    ...        [0, 0]   07 4, 2012     1341360000          2        415

The second DataFrame groups the reviews by prodcode and holds, for each product, the share of its reviews that received each sentiment score: the number of reviews with that score divided by the total number of reviews for that product, expressed as a percentage.

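# count reviews per (prodcode, sentiment) and per prodcode, side by side
df1 = (
    DFF.groupby(["prodcode", "sentiment"]).count()
    .join(DFF.groupby("prodcode").count(), "prodcode", rsuffix="_r"))[['reviewText', 'reviewText_r']]

# ratio of each sentiment score to all reviews of that product
df1['result'] = df1['reviewText']/df1['reviewText_r']
df1 = df1.reset_index()
# keyword arguments are required by pivot in recent pandas versions
df1 = df1.pivot(index="prodcode", columns='sentiment', values='result').fillna(0)
df1 = round(df1 * 100)
df1.astype('int')  # result is not assigned, so df1 keeps its float values

sorted_df2 = df1.sort_values(['0', '1', '2'], ascending=False)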

which produces the following DF:

sentiment      0     1     2
prodcode                        
B0024E6QOO  80.0   0.0  20.0
B000GPV2QA  67.0  17.0  17.0
B0067DNSUI  67.0   0.0  33.0
B00192JH4S  62.0  12.0  25.0
B0087FSA0C  60.0  20.0  20.0
B0002KM5L0  60.0   0.0  40.0
B000DZBP60  60.0   0.0  40.0
B000PJCBOE  60.0   0.0  40.0
B0033A5PPO  57.0  29.0  14.0
B003POL69C  57.0  14.0  29.0
B0002Z9L8K  56.0  31.0  12.0
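
For reference, the per-product percentage table described above can also be sketched with pd.crosstab and normalize="index". This is only an illustration on made-up rows: the toy frame and the name ratios are assumptions, not part of the original data.

```python
import pandas as pd

# Toy stand-in for DFF; these rows are invented for illustration only.
toy = pd.DataFrame({
    "prodcode": ["A", "A", "A", "A", "B", "B"],
    "sentiment": [0, 0, 0, 2, 1, 2],
})

# Share of each sentiment score within a product, as whole-number percentages.
ratios = (
    pd.crosstab(toy["prodcode"], toy["sentiment"], normalize="index")
    .mul(100)
    .round()
    .astype(int)
)
print(ratios)
```

The column labels take the dtype of the sentiment column, so in this sketch they are the integers 0, 1, 2 rather than the strings '0', '1', '2'.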

What I am now trying to do is filter my first DataFrame in two steps. First, by the results of the second DataFrame: I want to keep only the rows whose prodcode appears in the second DataFrame with a sentiment '0' percentage above 40 (df1['0'] > 40). Then, from those rows, I want to keep only the ones where 'sentiment' in the first DataFrame equals 0.

At a high level, I am trying to obtain the prodcode, summaryText and reviewText from the first DataFrame for products with a high ratio of low sentiment scores, restricted to the reviews whose sentiment is 0.

Something like this:

This assumes all the data you need is in df1 and no merges are needed.

m = list(DFF['prodcode'].loc[DFF['sentiment'] == 0])  # list of prodcodes matching your criteria
df.loc[(df['0'] > 40) & (df['prodcode'].isin(m))]  # filter according to your conditions

I figured it out:

DF3 = pd.merge(DFF, df1, left_on='prodcode', right_on='prodcode')
print(DF3.loc[(DF3['0'] > 50.0) & (DF3['2'] < 50.0) & (DF3['sentiment'].isin(['0']))].sort_values('0', ascending=False))
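
As a possible alternative to merging, the same two-step filter can be sketched with isin on the product codes taken from the ratio table. The data below is made up, and the names reviews, ratios, low_rated and result are hypothetical stand-ins for DFF, df1 and the outputs.

```python
import pandas as pd

# Made-up rows standing in for DFF (the reviews) ...
reviews = pd.DataFrame({
    "prodcode": ["A", "A", "B", "B", "C"],
    "sentiment": [0, 2, 0, 1, 0],
    "summaryText": ["bad", "good", "poor", "meh", "awful"],
    "reviewText": ["r1", "r2", "r3", "r4", "r5"],
})
# ... and for the pivoted percentage table, indexed by prodcode.
ratios = pd.DataFrame(
    {0: [80.0, 30.0, 60.0], 1: [0.0, 40.0, 0.0], 2: [20.0, 30.0, 40.0]},
    index=pd.Index(["A", "B", "C"], name="prodcode"),
)

# Step 1: products whose share of sentiment-0 reviews exceeds 40 percent.
low_rated = ratios.index[ratios[0] > 40]

# Step 2: keep only the sentiment-0 reviews of those products.
result = reviews.loc[
    reviews["prodcode"].isin(low_rated) & (reviews["sentiment"] == 0),
    ["prodcode", "summaryText", "reviewText"],
]
print(result)
```

This sketch uses integer labels (0) for both the sentiment values and the ratio columns; if they are strings in the real data, as the '0' lookups above suggest, the comparisons would need '0' instead.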
