
Trying to loop through dataframe and append a value to a list, but for loops aren't working

Here is a brief look at my data

  X      name sex X1880 X1881
1 1      Mary   F  7065  6919
2 2      Anna   F  2604  2698
3 3      Emma   F  2003  2034
4 4 Elizabeth   F  1939  1852
5 5    Minnie   F  1746  1653

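Each "X----" column represents a year (through 2010), and the "name" column holds a unique child's name, so the number at any name and year is the number of children born in year "X----" with that name (e.g., 7065 Marys were born in 1880).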

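I want to loop through the columns covering 1931 to 2010, find the total number of children born in each year, and then find the total number of children born in that year whose names begin with each letter of the alphabet. Finally, I want to get the percentage of children born each year whose names begin with each letter and store it in a list, so that I can plot trend lines for all letters and all years on the same graph.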

Here is my code

allnames <- read.csv("SSA-longtail-names.csv")
girls <- subset(allnames, allnames$sex=="F")
year_columns <- as.vector(names(girls)[54:134])


percs <- list()

years <- length(year_columns)
letters <- length(LETTERS)

for (i in range(1:years)){
  total = sum(girls[year_columns[i]])
  for (n in range(1:letters)){
    l <- toString(LETTERS[n])
    sub <- girls[(grep(l, girls$name)),year_columns[i]]
    sub_total <- sum(sub[year_columns[i]])
    percent <- (sub_total / total) * 100
    percs <- append(percs, percent)
  }
}

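But the for loops only go through 8 iterations, and the list percs (which should store the calculated percentages) is full of NAs. Can anyone suggest a way to fix these loops, or an easier way to accomplish this task?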
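One likely reading of the symptoms, offered as a sketch rather than a confirmed diagnosis: in R, range(1:years) returns only the minimum and maximum, so each loop visits two values instead of every index; girls[rows, year_columns[i]] already returns a plain vector, so indexing it again by column name yields NA; and grep(l, girls$name) matches the letter anywhere in the name, not just at the start. The small girls data frame below is a stand-in built from the sample rows above, and the key format used for percs is a hypothetical choice.

```r
# Stand-in data built from the five sample rows shown in the question
girls <- data.frame(
  name  = c("Mary", "Anna", "Emma", "Elizabeth", "Minnie"),
  sex   = "F",
  X1880 = c(7065, 2604, 2003, 1939, 1746),
  X1881 = c(6919, 2698, 2034, 1852, 1653)
)

# Assumption: on the full data this could be paste0("X", 1931:2010)
year_columns <- c("X1880", "X1881")

# First letter of every name, computed once
first <- substr(girls$name, 1, 1)

percs <- list()
for (yr in year_columns) {            # iterate over the column names directly
  total <- sum(girls[[yr]], na.rm = TRUE)
  for (l in LETTERS) {                # iterate over the letters directly
    sub_total <- sum(girls[[yr]][first == l], na.rm = TRUE)
    percs[[paste(yr, l)]] <- 100 * sub_total / total
  }
}

percs[["X1880 M"]]  # about 57.4 (Mary + Minnie out of all 1880 births)
```

Iterating over year_columns and LETTERS themselves avoids the index bookkeeping, and it also avoids overwriting R's built-in letters vector.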

Here is an approach using dplyr, tidyr, and stringr that pivots the year columns to make a long data table.

library(dplyr)
library(tidyr)
library(stringr)
data2 <- data %>% 
  pivot_longer(cols = c(-X, -name, -sex), names_to = "year", values_to = "births") %>%
  drop_na(births) %>%  # remove rows with missing birth counts
  mutate(year = as.integer(str_remove(year, "X")), 
         first_letter = str_sub(name, start = 1, end = 1)) %>%
  filter(year >= 1931 & year <= 2010)

Now you can do the following:

data3 <- data2 %>%
  group_by(first_letter, year) %>%
  summarize(total = sum(births))

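This gives you a data.frame with three columns (the rows below are illustrative; with the filter above, the years run from 1931 to 2010):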

first_letter   year   total
A              1880   17972
A              1881   16426
# etc.

Now you can do some plotting, for example with ggplot2

library(ggplot2)
# this only looks at the English vowels to make a manageable example
ggplot(data = data3 %>% filter(first_letter %in% c("A", "E", "I", "O", "U")), 
       aes(x = year, y = total, color = first_letter)) +
  geom_line()
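The question asks for percentages rather than raw totals. A minimal sketch of that extra step, assuming a data3 shaped like the one above; the stand-in values are derived from the question's five sample rows, and data4 is a hypothetical name:

```r
library(dplyr)

# Stand-in for data3, using totals derived from the question's sample rows
data3 <- tibble(
  first_letter = c("A", "A", "E", "E", "M", "M"),
  year         = c(1880L, 1881L, 1880L, 1881L, 1880L, 1881L),
  total        = c(2604, 2698, 3942, 3886, 8811, 8572)
)

# Share of each letter within its year, expressed as a percentage
data4 <- data3 %>%
  group_by(year) %>%
  mutate(percent = 100 * total / sum(total)) %>%
  ungroup()
```

Plotting data4 with aes(y = percent) instead of aes(y = total) would then draw the trend lines as shares of each year's births.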

  

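I have split the solution into the three parts you described. If you only care about the percentages, you can skip the first part (the totals) and combine the second and third parts: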

library(dplyr)
library(stringr)
library(tidyr)

data <- tibble(name = c('Mary', 'Anna', 'Emma', 'Elizabeth', 'Minnie'),
               sex = rep('F', 5),
               X1880 = c(7065, 2604, 2003, 1939, 1746),
               X1881 = c(6919, 2698, 2034, 1852, 1653))

total <- data %>%
  summarise(across(X1880:X1881, sum)) %>%
  pivot_longer(everything(), names_to = 'year', values_to = 'total')

total

#   year  total
#   <chr> <dbl>
# 1 X1880 15357
# 2 X1881 15156

totalPerLetter <- data %>%
  mutate(letter = str_extract(name, '^.')) %>%
  select(letter, starts_with('X')) %>%
  pivot_longer(-letter, names_to = 'year', values_to = 'count') %>%
  group_by(letter, year) %>%
  mutate(count = sum(count)) %>%
  distinct()

totalPerLetter

#   letter year  count
#   <chr>  <chr> <dbl>
# 1 M      X1880  8811
# 2 M      X1881  8572
# 3 A      X1880  2604
# 4 A      X1881  2698
# 5 E      X1880  3942
# 6 E      X1881  3886

pctPerLetter <- totalPerLetter %>%
  group_by(year) %>%
  mutate(total = sum(count)) %>%
  ungroup() %>%
  mutate(percent = count/(total/100))

pctPerLetter

#   letter year  count total percent
#   <chr>  <chr> <dbl> <dbl>   <dbl>
# 1 M      X1880  8811 15357    57.4
# 2 M      X1881  8572 15156    56.6
# 3 A      X1880  2604 15357    17.0
# 4 A      X1881  2698 15156    17.8
# 5 E      X1880  3942 15357    25.7
# 6 E      X1881  3886 15156    25.6
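Combining the second and third parts, as suggested above, might look like the following sketch. It swaps the mutate/distinct pair for summarise, and pctCombined is a hypothetical name:

```r
library(dplyr)
library(stringr)
library(tidyr)

data <- tibble(name = c("Mary", "Anna", "Emma", "Elizabeth", "Minnie"),
               sex = rep("F", 5),
               X1880 = c(7065, 2604, 2003, 1939, 1746),
               X1881 = c(6919, 2698, 2034, 1852, 1653))

pctCombined <- data %>%
  mutate(letter = str_extract(name, "^.")) %>%
  select(letter, starts_with("X")) %>%
  pivot_longer(-letter, names_to = "year", values_to = "count") %>%
  group_by(year, letter) %>%
  # one row per year and letter; the result stays grouped by year
  summarise(count = sum(count), .groups = "drop_last") %>%
  # within each year, divide by that year's total
  mutate(percent = 100 * count / sum(count)) %>%
  ungroup()
```

Because the result is still grouped by year after summarise, sum(count) in the final mutate is the yearly total, so no separate total column is needed.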

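As mentioned, consider reshaping the data to long format (a better format in data analytics for merging, cleaning, aggregating, modeling, and plotting).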

Reshape

girls_long <- reshape(girls, varying = names(girls)[4:ncol(girls)], times = names(girls)[4:ncol(girls)],
                      idvar = c("X", "name", "sex"),
                      v.names = "count", timevar = "year", ids=NULL,
                      new.row.names = 1:1E5, direction = "long")

girls_long$year <- as.integer(gsub("X", "", girls_long$year))
girls_long
#    X      name   sex  year count
# 1  1      Mary FALSE  1880  7065
# 2  2      Anna FALSE  1880  2604
# 3  3      Emma FALSE  1880  2003
# 4  4 Elizabeth FALSE  1880  1939
# 5  5    Minnie FALSE  1880  1746
# 6  1      Mary FALSE  1881  6919
# 7  2      Anna FALSE  1881  2698
# 8  3      Emma FALSE  1881  2034
# 9  4 Elizabeth FALSE  1881  1852
# 10 5    Minnie FALSE  1881  1653

Aggregate

# Total number of children born in that year
total_df <- aggregate(count ~ year, girls_long, FUN=sum)
total_df
#   year count
# 1 1880 15357
# 2 1881 15156

# Total number of children born in that year whose name begins with each letter of the alphabet
girls_long$name_letter <- substring(girls_long$name, 1, 1)
girls_agg <- aggregate(count ~ name_letter + year, girls_long, FUN=sum)
girls_agg
#   name_letter year count
# 1           A 1880  2604
# 2           E 1880  3942
# 3           M 1880  8811
# 4           A 1881  2698
# 5           E 1881  3886
# 6           M 1881  8572

# Share of children born in each year whose name begins with each letter (a proportion; multiply by 100 for percent)
girls_agg$percent <- with(girls_agg, count / ave(count, year, FUN=sum))
girls_agg
#   name_letter year count   percent
# 1           A 1880  2604 0.1695644
# 2           E 1880  3942 0.2566908
# 3           M 1880  8811 0.5737449
# 4           A 1881  2698 0.1780153
# 5           E 1881  3886 0.2564001
# 6           M 1881  8572 0.5655846
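To get from girls_agg to the list and the single chart of trend lines that the question asks for, one possible base R sketch follows. The stand-in girls_agg is copied from the output shown above, and by_letter is a hypothetical name:

```r
# Stand-in for girls_agg, copied from the output shown above
girls_agg <- data.frame(
  name_letter = c("A", "E", "M", "A", "E", "M"),
  year        = c(1880L, 1880L, 1880L, 1881L, 1881L, 1881L),
  count       = c(2604, 3942, 8811, 2698, 3886, 8572)
)
girls_agg$percent <- with(girls_agg, count / ave(count, year, FUN = sum))

# A list with one (year, percent) data frame per starting letter
by_letter <- split(girls_agg[c("year", "percent")], girls_agg$name_letter)

# Draw an empty frame, then add one line per letter
plot(NULL, xlim = range(girls_agg$year), ylim = c(0, max(girls_agg$percent)),
     xlab = "year", ylab = "share of births")
for (i in seq_along(by_letter)) {
  lines(by_letter[[i]]$year, by_letter[[i]]$percent, col = i)
}
legend("topright", legend = names(by_letter),
       col = seq_along(by_letter), lty = 1)
```

With all 26 letters the default palette repeats colors, so a longer palette or a faceted plot may be easier to read.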

