Trying to loop through a dataframe and append a value to a list, but for loops aren't working

Here is a brief look at my data:

  X      name sex X1880 X1881
1 1      Mary   F  7065  6919
2 2      Anna   F  2604  2698
3 3      Emma   F  2003  2034
4 4 Elizabeth   F  1939  1852
5 5    Minnie   F  1746  1653

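Each "X----" column represents a year (through 2010), and the "name" column holds a unique name given to a child, so the number where a name and a year meet is the number of children born in year "X----" with that name (for example, 7065 Marys were born in 1880).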

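I want to loop through the columns covering 1931 to 2010, find the total number of children born in each year, and then find the total number of children born in that year whose name begins with each letter of the alphabet. Finally, I want to get the percentage of children born each year whose name begins with each letter and store it in a list, so that I can plot trend lines for all letters across all years on the same chart.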

Here is my code:

allnames <- read.csv("SSA-longtail-names.csv")
girls <- subset(allnames, allnames$sex=="F")
year_columns <- as.vector(names(girls)[54:134])


percs <- list()

years <- length(year_columns)
letters <- length(LETTERS)

for (i in range(1:years)){
  total = sum(girls[year_columns[i]])
  for (n in range(1:letters)){
    l <- toString(LETTERS[n])
    sub <- girls[(grep(l, girls$name)),year_columns[i]]
    sub_total <- sum(sub[year_columns[i]])
    percent <- (sub_total / total) * 100
    percs <- append(percs, percent)
  }
}

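But the for loops only go through 8 iterations, and the list percs (which should store the computed percentages) is full of NAs. Can anyone suggest a way to fix these loops, or a simpler way to accomplish this task?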
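For reference, a minimal sketch of how the loops could be repaired. The likely causes are that `range(1:years)` returns only the minimum and maximum (so each loop visits just two indices), and that `sub` is already a plain vector, so `sub[year_columns[i]]` looks up a name that does not exist and yields NA. The toy data frame below is an assumption standing in for the real CSV, and `startsWith()` replaces `grep()` so that only the first letter is matched.

```r
# Toy data standing in for the real CSV (values from the sample rows in the question)
girls <- data.frame(
  name  = c("Mary", "Anna", "Emma", "Elizabeth", "Minnie"),
  sex   = "F",
  X1880 = c(7065, 2604, 2003, 1939, 1746),
  X1881 = c(6919, 2698, 2034, 1852, 1653)
)
year_columns <- c("X1880", "X1881")

percs <- list()
for (yr in year_columns) {                 # iterate over the column names directly
  total <- sum(girls[[yr]], na.rm = TRUE)  # all births in that year
  for (l in LETTERS) {
    rows <- startsWith(as.character(girls$name), l)  # names beginning with this letter
    sub_total <- sum(girls[rows, yr], na.rm = TRUE)
    percs[[paste(yr, l)]] <- sub_total / total * 100
  }
}
```

Each element of `percs` is named by year and letter (for example `percs[["X1880 M"]]`), which makes the list easier to reshape for plotting later.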

Here is an approach using dplyr, tidyr, and stringr that pivots the year columns to make a long data table.

library(dplyr)
library(tidyr)
library(stringr)
# `data` is the wide data frame from the question (X, name, sex, X1880, ...)
data2 <- data %>%
  pivot_longer(cols = c(-X, -name, -sex), names_to = "year", values_to = "births") %>%
  drop_na(births) %>%  # remove rows with missing birth counts
  mutate(year = as.integer(str_remove(year, "X")),
         first_letter = str_sub(name, start = 1, end = 1)) %>%
  filter(year >= 1931 & year <= 2010)

Now you can do the following:

data3 <- data2 %>%
  group_by(first_letter, year) %>%
  summarize(total = sum(births))

This gives you a data.frame with three columns:

first_letter   year   total
A              1880   17972
A              1881   16426
# etc.

Now you can do some plotting, for example with ggplot2:

library(ggplot2)
# this only looks at the English vowels to make a manageable example
ggplot(data = data3 %>% filter(first_letter %in% c("A", "E", "I", "O", "U")),
       aes(x = year, y = total, color = first_letter)) +
  geom_line()
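
The plot above shows totals, while the question asks for percentages. A minimal sketch of one way to add a per-year percentage to a table shaped like `data3`; the small data frame here is an assumed stand-in built from the question's sample rows, and `data4` is a hypothetical name.

```r
library(dplyr)

# Assumed stand-in for data3, built from the sample rows in the question
data3 <- data.frame(
  first_letter = c("A", "E", "M", "A", "E", "M"),
  year  = c(1880L, 1880L, 1880L, 1881L, 1881L, 1881L),
  total = c(2604, 3942, 8811, 2698, 3886, 8572)
)

data4 <- data3 %>%
  group_by(year) %>%                                 # percentages are computed within each year
  mutate(percent = 100 * total / sum(total)) %>%
  ungroup()
```

Replacing `y = total` with `y = percent` in the ggplot call (and passing `data4`) should then draw percentage trend lines instead of raw counts.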

  

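I have split the solution into the three parts you described. If you only care about the percentages, you can skip the first part (totals) and combine the second and third parts: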

library(dplyr)
library(stringr)
library(tidyr)

data <- tibble(name = c('Mary', 'Anna', 'Emma', 'Elizabeth', 'Minnie'),
               sex = rep('F', 5),
               X1880 = c(7065, 2604, 2003, 1939, 1746),
               X1881 = c(6919, 2698, 2034, 1852, 1653))

total <- data %>%
  summarise(across(X1880:X1881, sum)) %>%
  pivot_longer(everything(), names_to = 'year', values_to = 'total')

total

#   year  total
#   <chr> <dbl>
# 1 X1880 15357
# 2 X1881 15156

totalPerLetter <- data %>%
  mutate(letter = str_extract(name, '^.')) %>%
  select(letter, starts_with('X')) %>%
  pivot_longer(-letter, names_to = 'year', values_to = 'count') %>%
  group_by(letter, year) %>%
  mutate(count = sum(count)) %>%
  distinct()

totalPerLetter

#   letter year  count
#   <chr>  <chr> <dbl>
# 1 M      X1880  8811
# 2 M      X1881  8572
# 3 A      X1880  2604
# 4 A      X1881  2698
# 5 E      X1880  3942
# 6 E      X1881  3886

pctPerLetter <- totalPerLetter %>%
  group_by(year) %>%
  mutate(total = sum(count)) %>%
  ungroup() %>%
  mutate(percent = count/(total/100))

pctPerLetter

#   letter year  count total percent
#   <chr>  <chr> <dbl> <dbl>   <dbl>
# 1 M      X1880  8811 15357    57.4
# 2 M      X1881  8572 15156    56.6
# 3 A      X1880  2604 15357    17.0
# 4 A      X1881  2698 15156    17.8
# 5 E      X1880  3942 15357    25.7
# 6 E      X1881  3886 15156    25.6
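
As suggested above, the second and third parts can be combined when only the percentages are needed. A minimal sketch of that combination, assuming the same sample `data`; `pct` is a hypothetical name, and `summarise()` is swapped in for the `mutate()` plus `distinct()` pair.

```r
library(dplyr)
library(stringr)
library(tidyr)

data <- tibble(name = c('Mary', 'Anna', 'Emma', 'Elizabeth', 'Minnie'),
               sex = rep('F', 5),
               X1880 = c(7065, 2604, 2003, 1939, 1746),
               X1881 = c(6919, 2698, 2034, 1852, 1653))

pct <- data %>%
  mutate(letter = str_extract(name, '^.')) %>%
  pivot_longer(starts_with('X'), names_to = 'year', values_to = 'count') %>%
  group_by(year, letter) %>%
  summarise(count = sum(count), .groups = 'drop_last') %>%  # result stays grouped by year
  mutate(percent = 100 * count / sum(count)) %>%            # share within each year
  ungroup()
```

Because the result is still grouped by year after `summarise()`, `sum(count)` in the following `mutate()` is the yearly total, so no separate totals table is needed.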

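As mentioned, consider reshaping the data to long format (a better format for merging, cleaning, aggregating, modeling, and plotting in data analysis).

Reshape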

girls_long <- reshape(girls, varying = names(girls)[4:ncol(girls)], times = names(girls)[4:ncol(girls)],
                      idvar = c("X", "name", "sex"),
                      v.names = "count", timevar = "year", ids=NULL,
                      new.row.names = 1:1E5, direction = "long")

girls_long$year <- as.integer(gsub("X", "", girls_long$year))
girls_long
#    X      name   sex  year count
# 1  1      Mary FALSE  1880  7065
# 2  2      Anna FALSE  1880  2604
# 3  3      Emma FALSE  1880  2003
# 4  4 Elizabeth FALSE  1880  1939
# 5  5    Minnie FALSE  1880  1746
# 6  1      Mary FALSE  1881  6919
# 7  2      Anna FALSE  1881  2698
# 8  3      Emma FALSE  1881  2034
# 9  4 Elizabeth FALSE  1881  1852
# 10 5    Minnie FALSE  1881  1653

Aggregate

# Total number of children born in that year
total_df <- aggregate(count ~ year, girls_long, FUN=sum)
total_df
#   year count
# 1 1880 15357
# 2 1881 15156

# Total number of children born in that year whose name begins with each letter of the alphabet
girls_long$name_letter <- substring(girls_long$name, 1, 1)
girls_agg <- aggregate(count ~ name_letter + year, girls_long, FUN=sum)
girls_agg
#   name_letter year count
# 1           A 1880  2604
# 2           E 1880  3942
# 3           M 1880  8811
# 4           A 1881  2698
# 5           E 1881  3886
# 6           M 1881  8572

# Share of children born in each year whose name begins with each letter
# (stored as a proportion; multiply by 100 for a percentage)
girls_agg$percent <- with(girls_agg, count / ave(count, year, FUN=sum))
girls_agg
#   name_letter year count   percent
# 1           A 1880  2604 0.1695644
# 2           E 1880  3942 0.2566908
# 3           M 1880  8811 0.5737449
# 4           A 1881  2698 0.1780153
# 5           E 1881  3886 0.2564001
# 6           M 1881  8572 0.5655846
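
For comparison, a more compact base R sketch that skips the reshape and builds a letter-by-year matrix of percentages directly from the wide data. This swaps in `rowsum()` and `prop.table()` for the `aggregate()` and `ave()` steps above; the toy `girls` data frame and the names `year_cols`, `by_letter`, and `pct` are assumptions for illustration.

```r
# Toy wide data standing in for the real file (sample rows from the question)
girls <- data.frame(
  name  = c("Mary", "Anna", "Emma", "Elizabeth", "Minnie"),
  X1880 = c(7065, 2604, 2003, 1939, 1746),
  X1881 = c(6919, 2698, 2034, 1852, 1653)
)

# Pick out the year columns by their "X" + four digits naming pattern
year_cols <- grep("^X[0-9]{4}$", names(girls), value = TRUE)

# Sum the counts of all names sharing a first letter (one row per letter)
by_letter <- rowsum(girls[year_cols], group = substr(girls$name, 1, 1))

# Convert each year's column to percentages of that year's total
pct <- 100 * prop.table(as.matrix(by_letter), margin = 2)
```

With one row per letter and one column per year, `matplot(t(pct), type = "l")` would be one way to draw all letters on a single chart.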
