Flagging outliers of repeated measures within group (given there are thousands of groups)

I want to identify misreported values within individual school districts. There are 10,000 school districts, and I've collected 14 years' worth of data on the average amount each district spends per student per year. If a district's values for the preceding five years fall within a range of $6,000 to $9,000, but in the following year, 2013, that school district (and its corresponding city) reports per-student spending of $15,000, there's a good chance that, for whatever reason, the value is misreported. There are ways of tracking down the correct value, but that $15,000 is likely unrepresentative and shouldn't be used.

I have created a large dataset to look at education expenditure (how much a school district spends per student) and crime rate, so I have repeated values of city, county, and school district over a 14-year period. I've looked at the dataset-wide max/min of per-student spending (and at scatterplots) and have identified anomalies. This made me realize that there could be misreported school district expenditures that aren't extreme for the dataset as a whole, though they are extreme for that particular school district.

If I wanted to flag values as misreported based on standard deviation, I could use:

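library(dplyr)

flags <-
  dat %>%
    group_by(full_district_id) %>%
    # per-district mean and standard deviation of spending
    mutate(mean.district_id = mean(EXPENDITURE_PER_STUDENT, na.rm = TRUE),
           sd.district_id = sd(EXPENDITURE_PER_STUDENT, na.rm = TRUE),
           # compare each value with the district mean +/- 2 standard deviations
           flag = ifelse(EXPENDITURE_PER_STUDENT > mean.district_id + 2 * sd.district_id, "greater",
                  ifelse(EXPENDITURE_PER_STUDENT < mean.district_id - 2 * sd.district_id, "smaller", "nothing"))) %>%
    ungroup() %>%
    filter(flag == "greater" | flag == "smaller")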

but I think it would be better to look at the preceding year (or something like the five preceding years) to see whether that particular year is an anomaly. So if a value is more than $4,000 greater than any of the preceding five years, that value would be flagged. I'm uncertain how to write a conditional that says something like: if this value is more than $4,000 away from the previous x years of that school district's expenditure, flag it. I would then review the flagged values and look for their correct values.
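One way to express the "compare against the preceding x years" rule is a small per-district lookback helper. The following is only a sketch under stated assumptions: `prev_window()` is a hypothetical helper (not from any package), the toy `dat` is made up for illustration, spending is assumed to be recorded in thousands of dollars as the sample data appear to be (so $4,000 becomes a threshold of 4), and "more than any of the preceding five years" is read as "more than the maximum of the preceding five rows". Because some years are missing from the data, the window here counts preceding rows within a district, not calendar years.

```r
library(dplyr)

# Hypothetical helper: for each position i, apply f (e.g. max or min) to the
# up-to-n values that precede x[i]. Returns NA when nothing precedes it.
prev_window <- function(x, n = 5, f = max) {
  vapply(seq_along(x), function(i) {
    if (i == 1) return(NA_real_)
    w <- x[max(1, i - n):(i - 1)]
    w <- w[!is.na(w)]
    if (length(w) == 0) NA_real_ else as.numeric(f(w))
  }, numeric(1))
}

# Made-up toy data (thousands of dollars); district "A" has one suspicious value.
dat <- tibble(
  full_district_id = rep(c("A", "B"), each = 7),
  year = rep(2003:2009, times = 2),
  EXPENDITURE_PER_STUDENT = c(6.9, 6.8, 7.4, 7.8, 8.6, 15.0, 8.1,
                              5.9, 6.3, 7.2, 8.0, 8.6, 9.1, 8.9)
)

threshold <- 4  # assumed to correspond to $4,000

flags <- dat %>%
  arrange(full_district_id, year) %>%
  group_by(full_district_id) %>%
  mutate(
    prev_hi = prev_window(EXPENDITURE_PER_STUDENT, n = 5, f = max),
    prev_lo = prev_window(EXPENDITURE_PER_STUDENT, n = 5, f = min),
    # rows with no preceding values compare against NA and fall through to "nothing"
    flag = case_when(
      EXPENDITURE_PER_STUDENT > prev_hi + threshold ~ "greater",
      EXPENDITURE_PER_STUDENT < prev_lo - threshold ~ "smaller",
      TRUE ~ "nothing"
    )
  ) %>%
  ungroup() %>%
  filter(flag != "nothing")
```

On the real data, `dat` would be the full data frame and `n` and `threshold` would be tuned to taste. Note that once a misreported value enters the window it also raises `prev_hi` for the following rows, so reviewing flags in year order is probably safest.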

I googled various things but none were really what I was looking for.

Here is a small chunk of my dataset, so you can get a feel for what is going on; though for reference, it is over 100,000 values. Thanks much!

dput of data

structure(list(year = c(2003, 2005, 2006, 2007, 2008, 2009, 2010, 
2011, 2012, 2013, 2014, 2015, 2016, 2003, 2005, 2006, 2007, 2009, 
2010, 2011), PLACE_ID = c("0100124", "0100124", "0100124", "0100124", 
"0100124", "0100124", "0100124", "0100124", "0100124", "0100124", 
"0100124", "0100124", "0100124", "0100460", "0100460", "0100460", 
"0100460", "0100460", "0100460", "0100460"), CITY = c("abbeville", 
"abbeville", "abbeville", "abbeville", "abbeville", "abbeville", 
"abbeville", "abbeville", "abbeville", "abbeville", "abbeville", 
"abbeville", "abbeville", "adamsville", "adamsville", "adamsville", 
"adamsville", "adamsville", "adamsville", "adamsville"), COUNTY_ID = c("01067", 
"01067", "01067", "01067", "01067", "01067", "01067", "01067", 
"01067", "01067", "01067", "01067", "01067", "01073", "01073", 
"01073", "01073", "01073", "01073", "01073"), full_district_id = c("0101740", 
"0101740", "0101740", "0101740", "0101740", "0101740", "0101740", 
"0101740", "0101740", "0101740", "0101740", "0101740", "0101740", 
"0101920", "0101920", "0101920", "0101920", "0101920", "0101920", 
"0101920"), EXPENDITURE_PER_STUDENT = c(6.91392685629849, 6.80427570954663, 
7.42387732749179, 7.80973129992738, 8.57273726639795, 8.14466546112116, 
7.91766361717101, 7.57727272727273, 7.50594166366583, 7.91607343574372, 
8.26783670354826, 8.4435736677116, 8.48149606299213, 5.93085371942087, 
6.31827864279556, 7.21194954512474, 7.96307522733535, 8.61417039862885, 
9.07166232181485, 8.87169548243168)), class = c("tbl_df", "tbl", 
"data.frame"), row.names = c(NA, -20L))

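The following identifies an outlier within each full_district_id and keeps only the rows whose expenditure per student is not that outlier (determined separately for each district). Note that outlier() from the outliers package returns the single value farthest from the group mean, so this drops that value for every district whether or not it is truly anomalous: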

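library(dplyr)
library(outliers)

# df is the data frame from the dput() output above
View(df %>%
       group_by(full_district_id) %>%
       arrange(year) %>%
       # drop the value farthest from each district's mean
       filter(!EXPENDITURE_PER_STUDENT %in% outlier(EXPENDITURE_PER_STUDENT)))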
