How to aggregate a factor at each level of another factor, grouping by two other factors in disaggregated data

Suppose there are descriptive data on candidates across election years, districts (or states), and parties. The data are currently disaggregated at the 'sub-district' level (say, voting precincts).

Currently, when I try to aggregate the data to the district level, the various methods return inaccurate counts. In other words, the aggregation does not account for the fact that candidates appear in the data multiple times per year, per district. What I need is an aggregate count of the number of times a particular party appears in a particular district, regardless of the repeated/duplicated information at the precinct level. In other words, I need a result that shows the party count for the district-year dyad for each unique candidate-year dyad. (Note: candidates may be repeated across election years and/or districts, but may have different parties; for example, Henry Clay in 1836 and 1840.)

My question is: How do I aggregate data to obtain a count of a factor (party) at each level of another factor (district), grouping by two other factors (year and candidate name [ID])?

Sample of Data Structure:

year<-rbind("1836", "1836", "1836", "1836", 
            "1840", "1840", "1840", "1840", 
            "1844", "1844", "1844", "1844", 
            "1848", "1848", "1848", "1848")

candidate<-rbind("Henry Clay", "Henry Clay", 
                 "Daniel Webster", 
                 "Daniel Webster", "Henry Clay", 
                 "Henry Clay", "Daniel Webster", 
                 "Daniel Webster", 
                 "Millard Fillmore", 
                 "Millard Fillmore", 
                 "Martin Van Buren", 
                 "Martin Van Buren", 
                 "Millard Fillmore", 
                 "Millard Fillmore", 
                 "Martin Van Buren", 
                 "Martin Van Buren")

party<-rbind("Democratic-Republican", 
             "Democratic-Republican", "Whig", 
             "Whig", "National Republican", 
             "National Republican", "Whig", 
             "Whig", "Know-Nothing", 
             "Know-Nothing", "Democrat", 
             "Democrat", "Know-Nothing", 
             "Know-Nothing", "Democrat", 
             "Democrat")

district<-rbind("Alaska", "Alaska", "Vermont", 
                "Vermont", "Alaska", "Alaska", 
                "Vermont", "Vermont", "Alaska", 
                "Alaska", "Vermont", "Vermont", 
                "Alaska", "Alaska", "Vermont", 
                "Vermont")

precinct<-rbind("Pre1", "Pre2", "Pre1", "Pre2", 
                "Pre1", "Pre2", "Pre1", "Pre2", 
                "Pre1", "Pre2", "Pre1", "Pre2", 
                "Pre1", "Pre2", "Pre1", "Pre2")

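# cbind() on these unnamed one-column matrices yields default column names:
# V1 = year, V2 = candidate, V3 = party, V4 = district, V5 = precinct
sample<-as.data.frame(cbind(year, candidate, party, district, 
              precinct))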

Examples of Different Methods of Aggregating Data:

table:

party.counts1<-data.frame(table(sample$V3, sample$V1, sample$V4))

aggregate:

Attempt 2a is close to the final result needed, but it returns counts that do not specify the factor level (party), and it still over-counts the party-district data because the party-candidate pair appears once per precinct in a given year.

party.counts2<-aggregate(sample$V3, by=list(sample$V4, sample$V1), FUN=length)

party.counts2a<-aggregate(sample$V3~sample$V1:sample$V4:sample$V2, data=sample, FUN=length)
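To make the over-counting concrete, here is a minimal sketch. It rebuilds the sample data compactly with the same default column names (V1 to V5) and checks that, for this sample, the raw row count per party-district-year is simply the number of precincts. The object names raw.counts and n.precincts are illustrative only.

```r
# Rebuild the sample data compactly, keeping the question's column names:
# V1 = year, V2 = candidate, V3 = party, V4 = district, V5 = precinct.
sample <- data.frame(
  V1 = rep(c("1836", "1840", "1844", "1848"), each = 4),
  V2 = c(rep(c("Henry Clay", "Daniel Webster"), each = 2, times = 2),
         rep(c("Millard Fillmore", "Martin Van Buren"), each = 2, times = 2)),
  V3 = c(rep(c("Democratic-Republican", "Whig"), each = 2),
         rep(c("National Republican", "Whig"), each = 2),
         rep(c("Know-Nothing", "Democrat"), each = 2, times = 2)),
  V4 = rep(rep(c("Alaska", "Vermont"), each = 2), times = 4),
  V5 = rep(c("Pre1", "Pre2"), times = 8),
  stringsAsFactors = FALSE
)

# Raw row count per party-district-year (what FUN = length measures).
raw.counts <- aggregate(V5 ~ V3 + V4 + V1, data = sample, FUN = length)

# Number of distinct precincts per party-district-year.
n.precincts <- aggregate(V5 ~ V3 + V4 + V1, data = sample,
                         FUN = function(p) length(unique(p)))

# In this sample the two agree, so the "party count" is really a precinct count.
identical(raw.counts$V5, n.precincts$V5)
```

In other words, length() counts rows, and each candidate contributes one row per precinct, which is why the counts above are inflated.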

reshape:

The reshape example displays a similar problem to the previous aggregate attempt (2a).

library(reshape2)
mdata <- melt(sample, id.vars=c("V1", "V2", "V4", "V5"), measure.vars=c("V3"))

party.counts3<-dcast(mdata, value~V1:V2:V4, length)

Again, my question is: How do I aggregate data to obtain a count of a factor (party) at each level of another factor (district), grouping by two other factors (year and candidate name [ID])?

So far, the following is a solution, but it is not very tidy. For instance, the count variable that is constructed is mislabeled in the final object: it takes the name of the variable omitted from the aggregation command (here, V2). Also, the result is contained in a separate object (party.counts) rather than merged with the original data (the object named sample, above).

cross.tab<-unique(sample[c("V3", "V4", "V1", "V2")])

party.counts<-aggregate(. ~ V3:V4:V1, cross.tab, length)
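One possible way to address both points, offered as a sketch rather than a definitive answer: keep the de-duplication step, rename the count column explicitly, and use merge() to attach the counts to the precinct-level rows. The column name party.count and the object sample.counts are hypothetical names chosen for illustration, and the data are rebuilt compactly so the block runs on its own.

```r
# Rebuild the sample data with the question's default column names:
# V1 = year, V2 = candidate, V3 = party, V4 = district, V5 = precinct.
sample <- data.frame(
  V1 = rep(c("1836", "1840", "1844", "1848"), each = 4),
  V2 = c(rep(c("Henry Clay", "Daniel Webster"), each = 2, times = 2),
         rep(c("Millard Fillmore", "Martin Van Buren"), each = 2, times = 2)),
  V3 = c(rep(c("Democratic-Republican", "Whig"), each = 2),
         rep(c("National Republican", "Whig"), each = 2),
         rep(c("Know-Nothing", "Democrat"), each = 2, times = 2)),
  V4 = rep(rep(c("Alaska", "Vermont"), each = 2), times = 4),
  V5 = rep(c("Pre1", "Pre2"), times = 8),
  stringsAsFactors = FALSE
)

# One row per unique party-district-year-candidate combination, so that
# precinct-level repeats no longer inflate the count.
cross.tab <- unique(sample[c("V3", "V4", "V1", "V2")])

# Count candidates per party-district-year, then give the count a clear name
# (party.count is a hypothetical name, not part of the original code).
party.counts <- aggregate(V2 ~ V3 + V4 + V1, data = cross.tab, FUN = length)
names(party.counts)[names(party.counts) == "V2"] <- "party.count"

# Attach the counts to every original precinct-level row.
sample.counts <- merge(sample, party.counts,
                       by = c("V3", "V4", "V1"), all.x = TRUE)
```

Note that merge() may reorder the rows and moves the key columns to the front, so sample.counts should not be assumed to be in the same row order as sample.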

Any assistance or advice on generalizing and/or vectorizing this, and on making the result easy to incorporate into the original data structure, is appreciated.
