How to aggregate a factor at each level of another factor, grouping by two other factors in disaggregated data
Say there is descriptive data on candidates across election years, districts (or states), and parties. The data are currently disaggregated at the sub-district level (say, voting precincts).
When I try to aggregate the data to the district level, the various methods I have tried return inaccurate counts. In other words, the aggregation does not take into account that each candidate appears in the data multiple times per year, per district. What I need is an aggregate count of the number of times a particular party appears in a particular district, regardless of the repeated (duplicated) information at the precinct level. That is, I need a result that shows the party count for each district-year dyad, counting each unique candidate-year dyad once. (Note: candidates may be repeated across election years and/or districts, but may have different parties; see Henry Clay in 1836 and 1840.)
My question is: How do I aggregate data to obtain a count of a factor (party) at each level of another factor (district) by grouping two other factors (year and candidate-name [ID])?
year<-rbind("1836", "1836", "1836", "1836",
"1840", "1840", "1840", "1840",
"1844", "1844", "1844", "1844",
"1848", "1848", "1848", "1848")
candidate<-rbind("Henry Clay", "Henry Clay",
"Daniel Webster",
"Daniel Webster", "Henry Clay",
"Henry Clay", "Daniel Webster",
"Daniel Webster",
"Millard Fillmore",
"Millard Fillmore",
"Martin Van Buren",
"Martin Van Buren",
"Millard Fillmore",
"Millard Fillmore",
"Martin Van Buren",
"Martin Van Buren")
party<-rbind("Democratic-Republican",
"Democratic-Republican", "Whig",
"Whig", "National Republican",
"National Republican", "Whig",
"Whig", "Know-Nothing",
"Know-Nothing", "Democrat",
"Democrat", "Know-Nothing",
"Know-Nothing", "Democrat",
"Democrat")
district<-rbind("Alaska", "Alaska", "Vermont",
"Vermont", "Alaska", "Alaska",
"Vermont", "Vermont", "Alaska",
"Alaska", "Vermont", "Vermont",
"Alaska", "Alaska", "Vermont",
"Vermont")
precinct<-rbind("Pre1", "Pre2", "Pre1", "Pre2",
"Pre1", "Pre2", "Pre1", "Pre2",
"Pre1", "Pre2", "Pre1", "Pre2",
"Pre1", "Pre2", "Pre1", "Pre2")
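# cbind() on these unnamed one-column matrices drops the variable names, so the
# columns come out as V1 = year, V2 = candidate, V3 = party, V4 = district, V5 = precinct
sample<-as.data.frame(cbind(year, candidate, party, district,
precinct))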
Examples of different methods of aggregating the data:
# Attempt 1: three-way table of party (V3) by year (V1) by district (V4)
party.counts1<-data.frame(table(sample$V3, sample$V1, sample$V4))
Attempt 2a is close to the final result needed, but it returns counts that do not identify the factor level (party), and it still over-counts the party-district data because the party-candidate pair appears once per precinct in a given year.
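# Attempt 2: number of rows per district (V4) and year (V1)
party.counts2<-aggregate(sample$V3, by=list(sample$V4, sample$V1), FUN=length)
# Attempt 2a: number of rows per year (V1), district (V4) and candidate (V2)
party.counts2a<-aggregate(sample$V3~sample$V1:sample$V4:sample$V2, data=sample, FUN=length)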
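To make the over-counting concrete, here is a minimal sketch in base R. It assumes a compact rebuild of the example data with the same V1 to V5 column names; the column names n_rows and n_candidates are introduced here purely for illustration. It contrasts counting rows with counting distinct candidates within each year-district-party group.

```r
# Compact rebuild of the example data (assumption: same values and the same
# V1..V5 column names as the `sample` object defined in the question).
sample <- data.frame(
  V1 = rep(c("1836", "1840", "1844", "1848"), each = 4),            # year
  V2 = rep(c("Henry Clay", "Daniel Webster", "Henry Clay", "Daniel Webster",
             "Millard Fillmore", "Martin Van Buren",
             "Millard Fillmore", "Martin Van Buren"), each = 2),     # candidate
  V3 = rep(c("Democratic-Republican", "Whig", "National Republican", "Whig",
             "Know-Nothing", "Democrat", "Know-Nothing", "Democrat"),
           each = 2),                                                # party
  V4 = rep(rep(c("Alaska", "Vermont"), each = 2), times = 4),        # district
  V5 = rep(c("Pre1", "Pre2"), times = 8),                            # precinct
  stringsAsFactors = FALSE
)

groups <- sample[c("V1", "V4", "V3")]  # year, district, party

# Counting rows: every precinct row adds to the count.
rows.per.group <- aggregate(sample["V2"], by = groups, FUN = length)
names(rows.per.group)[4] <- "n_rows"

# Counting distinct candidates: precinct-level repeats collapse to one.
distinct.per.group <- aggregate(sample["V2"], by = groups,
                                FUN = function(x) length(unique(x)))
names(distinct.per.group)[4] <- "n_candidates"

print(rows.per.group)
print(distinct.per.group)
```

In this data every group has two precinct rows for a single candidate, so the row count is inflated by the number of precincts while the distinct count is not.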
The reshape example shows the same problem as the aggregate attempt 2a above.
library(reshape2)
mdata <- melt(sample, id.vars=c("V1", "V2", "V4", "V5"), measure.vars=c("V3"))
party.counts3<-dcast(mdata, value~V1:V2:V4, length)
Again, my question is: How do I aggregate data to obtain a count of a factor (party) at each level of another factor (district) by grouping two other factors (year and candidate-name [ID])?
So far, the following is a solution, but it is not very tidy. For instance, the count variable it constructs is mislabeled in the final object with the name of the variable omitted from the aggregation formula (here, V2). Also, the result is held in a separate object (party.counts) rather than merged with the original data (the object named sample, above).
cross.tab<-unique(sample[c("V3", "V4", "V1", "V2")])
party.counts<-aggregate(. ~ V3:V4:V1, cross.tab, length)
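One possible way to tidy this up, sketched in base R under the same assumptions about the data (a compact rebuild with V1 to V5 column names): give the count its own explicitly named column before aggregating, then merge the result back onto the precinct-level rows. The names n_candidates and sample.counts are hypothetical and chosen only for this sketch.

```r
# Compact rebuild of the example data (assumption: same values and the same
# V1..V5 column names as the `sample` object defined in the question).
sample <- data.frame(
  V1 = rep(c("1836", "1840", "1844", "1848"), each = 4),            # year
  V2 = rep(c("Henry Clay", "Daniel Webster", "Henry Clay", "Daniel Webster",
             "Millard Fillmore", "Martin Van Buren",
             "Millard Fillmore", "Martin Van Buren"), each = 2),     # candidate
  V3 = rep(c("Democratic-Republican", "Whig", "National Republican", "Whig",
             "Know-Nothing", "Democrat", "Know-Nothing", "Democrat"),
           each = 2),                                                # party
  V4 = rep(rep(c("Alaska", "Vermont"), each = 2), times = 4),        # district
  V5 = rep(c("Pre1", "Pre2"), times = 8),                            # precinct
  stringsAsFactors = FALSE
)

# One row per unique year-candidate-party-district combination,
# which drops the precinct-level duplication.
cross.tab <- unique(sample[c("V1", "V2", "V3", "V4")])

# An explicit counter column, so the aggregated column has a meaningful
# name instead of inheriting the name of an omitted variable.
cross.tab$n_candidates <- 1L
party.counts <- aggregate(n_candidates ~ V3 + V4 + V1,
                          data = cross.tab, FUN = sum)

# Attach the counts to the original precinct-level rows.
sample.counts <- merge(sample, party.counts,
                       by = c("V1", "V3", "V4"), all.x = TRUE)

print(party.counts)
```

Summing a column of ones is just one way to get a named count; note that merge() may reorder rows and columns relative to the original data frame.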
Any assistance or advice on generalizability and/or vectorization, as well as on ease of incorporation into the original data structure, is appreciated.
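As a rough illustration of what a generalized, vectorized version might look like, the sketch below wraps the distinct count in a small helper built on base R's ave(), which writes the group-level count straight into the original data frame without a separate merge. The helper name add_distinct_count and its arguments are hypothetical, and the data are again a compact rebuild assumed to match the sample object above.

```r
# Hypothetical helper: for each combination of `group_cols`, count the
# distinct values of `count_col` and store the result on every row.
add_distinct_count <- function(data, count_col, group_cols,
                               new_col = "n_distinct") {
  # Single grouping factor built from all grouping columns.
  groups <- interaction(data[group_cols], drop = TRUE)
  # Integer codes for the values to count, so ave() returns integers.
  ids <- as.integer(factor(data[[count_col]]))
  data[[new_col]] <- ave(ids, groups,
                         FUN = function(x) length(unique(x)))
  data
}

# Compact rebuild of the example data (assumption: same values and the same
# V1..V5 column names as the `sample` object defined in the question).
sample <- data.frame(
  V1 = rep(c("1836", "1840", "1844", "1848"), each = 4),            # year
  V2 = rep(c("Henry Clay", "Daniel Webster", "Henry Clay", "Daniel Webster",
             "Millard Fillmore", "Martin Van Buren",
             "Millard Fillmore", "Martin Van Buren"), each = 2),     # candidate
  V3 = rep(c("Democratic-Republican", "Whig", "National Republican", "Whig",
             "Know-Nothing", "Democrat", "Know-Nothing", "Democrat"),
           each = 2),                                                # party
  V4 = rep(rep(c("Alaska", "Vermont"), each = 2), times = 4),        # district
  V5 = rep(c("Pre1", "Pre2"), times = 8),                            # precinct
  stringsAsFactors = FALSE
)

# Distinct candidates (V2) per party (V3), district (V4) and year (V1).
sample.counts <- add_distinct_count(sample, count_col = "V2",
                                    group_cols = c("V3", "V4", "V1"),
                                    new_col = "n_candidates")
print(sample.counts)
```

Because ave() returns a vector aligned with the input rows, the row order of the original data is preserved, and the same helper can be reused with other grouping columns.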