簡體   English   中英

通過另一個DataFrame中的唯一值過濾一個DataFrame

[英]Filter one DataFrame by unique values in another DataFrame

我有2個Python數據框:

第一個數據框包含導入到該數據框的所有數據,其中包括“產品代碼”,“情感”,“ summaryText”,“ reviewText”等。 所有初始審核數據。

DFF = DFF[['prodcode', 'summaryText', 'reviewText', 'overall', 'reviewerID', 'reviewerName', 'helpful','reviewTime', 'unixReviewTime', 'sentiment','textLength']]

產生:


     prodcode                                 summaryText                                         reviewText  overall      reviewerID    ...       helpful   reviewTime unixReviewTime  sentiment textLength
0  B00002243X  Work Well - Should Have Bought Longer Ones  I needed a set of jumper cables for my new car...      5.0  A3F73SC1LY51OO    ...        [4, 4]  08 17, 2011     1313539200          2        516
1  B00002243X                            Okay long cables  These long cables work fine for my truck, but ...      4.0  A20S66SKYXULG2    ...        [1, 1]   09 4, 2011     1315094400          2        265
2  B00002243X                  Looks and feels heavy Duty  Can't comment much on these since they have no...      5.0  A2I8LFSN2IS5EO    ...        [0, 0]  07 25, 2013     1374710400          2       1142
3  B00002243X       Excellent choice for Jumper Cables!!!  I absolutley love Amazon!!!  For the price of ...      5.0  A3GT2EWQSO45ZG    ...      [19, 19]  12 21, 2010     1292889600          2       4739
4  B00002243X      Excellent, High Quality Starter Cables  I purchased the 12' feet long cable set and th...      5.0  A3ESWJPAVRPWB4    ...        [0, 0]   07 4, 2012     1341360000          2        415

第二個數據框是所有產品代碼和對該產品進行的所有評論/所有產品評論的比率的分組。 它是該評論分數與該特定產品做出的所有評論分數之比。

df1 = (
    DFF.groupby(["prodcode", "sentiment"]).count()
    .join(DFF.groupby("prodcode").count(), "prodcode", rsuffix="_r"))[['reviewText', 'reviewText_r']]

df1['result'] = df1['reviewText']/df1['reviewText_r']
df1 = df1.reset_index()
df1 = df1.pivot("prodcode", 'sentiment', 'result').fillna(0)
df1 = round(df1 * 100)
df1.astype('int')

sorted_df2 = df1.sort_values(['0', '1', '2'], ascending=False)

產生以下DF:

sentiment      0     1     2
prodcode                        
B0024E6QOO  80.0   0.0  20.0
B000GPV2QA  67.0  17.0  17.0
B0067DNSUI  67.0   0.0  33.0
B00192JH4S  62.0  12.0  25.0
B0087FSA0C  60.0  20.0  20.0
B0002KM5L0  60.0   0.0  40.0
B000DZBP60  60.0   0.0  40.0
B000PJCBOE  60.0   0.0  40.0
B0033A5PPO  57.0  29.0  14.0
B003POL69C  57.0  14.0  29.0
B0002Z9L8K  56.0  31.0  12.0

我現在嘗試用兩種方式過濾我的第一個數據幀。 第一個,由第二個數據幀的結果。 那樣的話,我的意思是我希望第一個數據幀由df1.sentiment ['0']> 40的第二個數據幀通過prodcode進行過濾。從該列表中,我想按那些'sentiment'的行過濾第一個數據幀。從第一個數據幀開始= 0

在較高的級別上,我正在嘗試在第一個數據框中獲取產品的prodcode,summaryText和reviewText,這些產品在較低的情感分數中具有較高的比率,並且其情感為0。

像這樣的東西:

假設您需要的所有數據都在df1中,則不需要合並。

m = list(DFF['prodcode'].loc[DFF['sentiment'] == 0] # create a list matching your criteria
df.loc[(df['0'] > 40) & (df['sentiment'].isin(m)] # filter according to your conditions 

我想到了:

DF3 = pd.merge(DFF, df1, left_on='prodcode', right_on='prodcode')
print(DF3.loc[(DF3['0'] > 50.0) & (DF3['2'] < 50.0) & (DF3['sentiment'].isin(['0']))].sort_values('0', ascending=False))

暫無
暫無

聲明:本站的技術帖子網頁,遵循CC BY-SA 4.0協議,如果您需要轉載,請注明本站網址或者原文地址。任何問題請咨詢:yoyou2525@163.com.

 
粵ICP備18138465號  © 2020-2024 STACKOOM.COM