
Subsetting and applying functions to an R dataframe using values from another dataframe?

I'm currently working with the two dataframes shown below.

df1 is a dataframe that contains individual online business reviews and the date each one was created. I also created features that flag each review as good or bad based on its star rating.

df2 is a dataframe that contains every individual visit or 'check-in' at each business and the date it occurred. business_id is the common feature between the two dataframes.

df1:

 review_id            business_id stars       date BadReview GoodReview  startdate    enddate
1: F2L5ZUhXGQV3eoXker3VhA ---kPU91CF4Lq2-WlRu9Lw     4 2020-06-04         0          1 2020-03-06 2020-09-02
2: rljdEt4_jgsqOgG-7myu-g ---kPU91CF4Lq2-WlRu9Lw     5 2020-03-18         0          1 2019-12-19 2020-06-16
3: fe-yp1-cpfsbL6BUuimP9Q --_9CAxgfXZmoFdNIRrhHA     3 2013-01-07         1          0 2012-10-09 2013-04-07
4: hCjfr9owNP4NfiDtXjgcQg --_9CAxgfXZmoFdNIRrhHA     4 2014-04-29         0          1 2014-01-29 2014-07-28
5: CvdQ_FAJlx8pXdau6p_TlA --_9CAxgfXZmoFdNIRrhHA     4 2010-07-13         0          1 2010-04-14 2010-10-11
6: EVfN8-qleIyBVbmLdV3tVg --_9CAxgfXZmoFdNIRrhHA     5 2013-02-14         0          1 2012-11-16 2013-05-15
   checkincount_before checkincount_after checkin_percentchange
1:              171189             169367             -1.064321
2:              171189             169367             -1.064321
3:              171189             169367             -1.064321
4:              171189             169367             -1.064321
5:              171189             169367             -1.064321
6:              171189             169367             -1.064321

df2:

  business_id              date      
  <chr>                    <date>    
1 --MbOh2O1pATkXa7xbU6LA   2013-04-21
2 --MbOh2O1pATkXa7xbU6LA_1 2013-05-02
3 --MbOh2O1pATkXa7xbU6LA_2 2013-05-04
4 --MbOh2O1pATkXa7xbU6LA_3 2013-05-18
5 --MbOh2O1pATkXa7xbU6LA_4 2013-05-20
6 --MbOh2O1pATkXa7xbU6LA_5 2013-05-22

For each review in df1, I'm trying to calculate the number of check-ins at that business in the 90 days before and the 90 days after the review, and then the percent change between the two. From that I want to determine the average percent change in business check-ins when a good review or a bad review is made.

Here is what I've tried so far:

library(lubridate)  # provides days(), used below

#convert to y-m-d date format
checkins$date <- as.Date(checkins$date)
influencer_reviews$date <- as.Date(influencer_reviews$date, format='%Y-%m-%d',tz= "UTC")

#create column for dates of 90 days before and after review date
influencer_reviews$startdate <- influencer_reviews$date - days(90)
influencer_reviews$enddate <- influencer_reviews$date + days(90)

#for each influencer_reviews$review_id, create new column 'checkincount_before' which sums the number of check-ins between influencer_reviews$startdate and influencer_reviews$date with matching influencer_reviews$business_id in checkins df
influencer_reviews$checkincount_before <- sum(influencer_reviews$business_id %in% checkins$business_id & checkins$date > influencer_reviews$startdate & checkins$date < influencer_reviews$date)

#for each influencer_reviews$review_id, create new column 'checkincount_after' which sums the number of check-ins between influencer_reviews$date and influencer_reviews$enddate with matching influencer_reviews$business_id in checkins df
influencer_reviews$checkincount_after <- sum(influencer_reviews$business_id %in% checkins$business_id & checkins$date > influencer_reviews$date & checkins$date < influencer_reviews$enddate)

#in influencer_reviews, create new column 'checkin_percentchange' = ((checkincount_after - checkincount_before) / checkincount_before) * 100
influencer_reviews$checkin_percentchange <- with(influencer_reviews, ((checkincount_after-checkincount_before)/(checkincount_before))*100)

#calculate mean of checkin_percentchange grouped by positive review and negative review
mean(influencer_reviews[influencer_reviews$GoodReview == '1', 'checkin_percentchange'])
mean(influencer_reviews[influencer_reviews$BadReview == '1', 'checkin_percentchange'])

When I run this chunk, I receive the following warnings for the lines where I try to sum the check-ins within the 90-day date ranges:

Warning messages:
1: In `>.default`(checkins$date, influencer_reviews$startdate) :
  longer object length is not a multiple of shorter object length
2: In influencer_reviews$business_id %in% checkins$business_id & checkins$date >  :
  longer object length is not a multiple of shorter object length
3: In `<.default`(checkins$date, influencer_reviews$date) :
  longer object length is not a multiple of shorter object length

It still executes and fills the check-in count columns, but with the same number for every row in df1, which is not correct.

Any idea on how I can accomplish what I'm trying to do and/or write this more efficiently? I'm relatively new at using R.

Thanks!

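vector_3 <- sum(vector_1 & vector_2) fills all cells in vector_3 with a single value. sum() collapses the element-wise result of vector_1 & vector_2 into one number (the count of positions where both are TRUE), and that one number is then recycled into every row.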
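
The collapse to a single value can be seen in a small sketch. The vectors and the data frame df here are made-up illustrations, not the data from the question:

```r
# Toy logical vectors; the values are made up for illustration.
vector_1 <- c(TRUE, TRUE, FALSE, TRUE)
vector_2 <- c(TRUE, FALSE, FALSE, TRUE)

# sum() reduces the element-wise AND to one number.
total <- sum(vector_1 & vector_2)
print(total)

# Assigning a length-1 value to a column repeats it in every row.
df <- data.frame(id = 1:4)
df$vector_3 <- total
print(df)
```

This is the same pattern as the checkincount_before and checkincount_after assignments in the question, which is why every row ends up with the same count.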

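Instead, you could use vector_3 <- vector_1 + vector_2, where each cell in vector_3 is the sum of its corresponding cells in vector_1 and vector_2. An element-wise operation like this needs both vectors to be the same length, which is not the case for the review and check-in columns in the question (hence the "longer object length" warnings).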
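
For the per-review counts the question asks for, one possible approach (a sketch, not necessarily the most efficient) is to evaluate the count once per review row, so that each row gets its own value. The data frames below are small made-up stand-ins for influencer_reviews and checkins, and count_between is a hypothetical helper name:

```r
# Toy stand-ins for the question's two data frames; every value is made up.
reviews <- data.frame(
  review_id   = c("r1", "r2", "r3"),
  business_id = c("A", "B", "A"),
  date        = as.Date(c("2020-06-01", "2020-06-01", "2020-08-01")),
  GoodReview  = c(1, 0, 1),
  BadReview   = c(0, 1, 0),
  stringsAsFactors = FALSE
)
checkins <- data.frame(
  business_id = c(rep("A", 5), rep("B", 6)),
  date = as.Date(c(
    "2020-04-15", "2020-05-20", "2020-06-10", "2020-07-01", "2020-08-15",
    "2020-03-10", "2020-04-01", "2020-05-05", "2020-05-25", "2020-06-20",
    "2020-07-15"
  )),
  stringsAsFactors = FALSE
)

# Adding an integer to a Date shifts it by that many days.
reviews$startdate <- reviews$date - 90
reviews$enddate   <- reviews$date + 90

# Hypothetical helper: check-ins for one business strictly between two dates.
count_between <- function(id, from, to) {
  sum(checkins$business_id == id & checkins$date > from & checkins$date < to)
}

# Evaluate the helper once per review row, so each row gets its own count.
rows <- seq_len(nrow(reviews))
reviews$checkincount_before <- vapply(rows, function(i) {
  count_between(reviews$business_id[i], reviews$startdate[i], reviews$date[i])
}, integer(1))
reviews$checkincount_after <- vapply(rows, function(i) {
  count_between(reviews$business_id[i], reviews$date[i], reviews$enddate[i])
}, integer(1))

# Element-wise arithmetic is fine here: both columns have one value per review.
reviews$checkin_percentchange <-
  (reviews$checkincount_after - reviews$checkincount_before) /
  reviews$checkincount_before * 100

# Average percent change for good and for bad reviews.
mean_good <- mean(reviews$checkin_percentchange[reviews$GoodReview == 1])
mean_bad  <- mean(reviews$checkin_percentchange[reviews$BadReview == 1])
print(reviews[, c("review_id", "checkincount_before", "checkincount_after")])
```

A few caveats. A review with no check-ins in the preceding 90 days divides by zero and yields Inf or NaN, so such rows may need filtering before taking the mean. The comparison on business_id is an exact match, so if the real check-in IDs carry suffixes like the _1, _2 shown in the df2 sample, they would have to be stripped first. This sketch also scans the whole check-in table once per review; for large tables a data.table non-equi join is likely to be faster.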


 