Trying to loop through dataframe and append a value to a list, but for loops aren't working
Here is a brief look at my data:
X name sex X1880 X1881
1 1 Mary F 7065 6919
2 2 Anna F 2604 2698
3 3 Emma F 2003 2034
4 4 Elizabeth F 1939 1852
5 5 Minnie F 1746 1653
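Each "X----" column represents a year (through 2010), and the "name" column holds a unique child's name, so the number at a given name and year is the number of children born in that year with that name (for example, 7065 girls named Mary were born in 1880).
I want to loop over the columns covering 1931 to 2010, find the total number of children born in each year, and then find the total number born in that year whose names begin with each letter of the alphabet. Finally, I want the percentage of children born each year whose names begin with each letter, stored in a list, so that I can plot trend lines for all letters and all years on the same chart.
Here is my code: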
allnames <- read.csv("SSA-longtail-names.csv")
girls <- subset(allnames, allnames$sex=="F")
year_columns <- as.vector(names(girls)[54:134])
percs <- list()
years <- length(year_columns)
letters <- length(LETTERS)
for (i in range(1:years)){
  total = sum(girls[year_columns[i]])
  for (n in range(1:letters)){
    l <- toString(LETTERS[n])
    sub <- girls[(grep(l, girls$name)),year_columns[i]]
    sub_total <- sum(sub[year_columns[i]])
    percent <- (sub_total / total) * 100
    percs <- append(percs, percent)
  }
}
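But the for loops only go through 8 iterations, and the list percs (which should store the computed percentages) is full of NAs. Can anyone suggest a fix for these loops, or a simpler way to accomplish this task?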
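For reference, two things in the loop above appear to explain the symptoms. `range(1:years)` returns only the minimum and maximum (`c(1, years)`), not the whole sequence, so each loop visits just two values. And `girls[rows, year_columns[i]]` with a single column already returns a plain vector, so indexing it again by column name yields NA. Below is a minimal base R sketch of a corrected loop, under some assumptions: the toy `girls` data frame is rebuilt from the sample rows, year columns are selected by name rather than by the positions 54:134, and results go into a preallocated matrix instead of a list grown with `append()`.

```r
# Toy data shaped like the question's table (values copied from the sample rows)
girls <- data.frame(
  name  = c("Mary", "Anna", "Emma", "Elizabeth", "Minnie"),
  sex   = "F",
  X1880 = c(7065, 2604, 2003, 1939, 1746),
  X1881 = c(6919, 2698, 2034, 1852, 1653)
)

# Select year columns by name; for the full file the question's range
# would be paste0("X", 1931:2010) (an assumption about the column names)
year_columns <- grep("^X[0-9]{4}$", names(girls), value = TRUE)

# One row per letter, one column per year, preallocated with NA
percs <- matrix(NA_real_, nrow = length(LETTERS), ncol = length(year_columns),
                dimnames = list(LETTERS, year_columns))

# First letter of each name, computed once outside the loops
first_letter <- toupper(substr(girls$name, 1, 1))

for (yr in year_columns) {            # iterate over the values directly
  total <- sum(girls[[yr]], na.rm = TRUE)
  for (l in LETTERS) {
    sub_total <- sum(girls[[yr]][first_letter == l], na.rm = TRUE)
    percs[l, yr] <- 100 * sub_total / total
  }
}

# Inspect the letters that actually occur in the toy data
round(percs[c("A", "E", "M"), ], 1)
```

Comparing first letters with `==` also avoids `grep(l, girls$name)` matching a capital letter anywhere in the name rather than only at the start. A matrix with letters as rows and years as columns can be passed to `matplot()` directly if trend lines are the goal.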
Here is one approach using dplyr, tidyr, and stringr that pivots the year columns to make a long data table.
library(dplyr)
library(tidyr)
library(stringr)

# `data` is the data frame read from the CSV (e.g. `girls` in the question)
data2 <- data %>%
  pivot_longer(cols = c(-X, -name, -sex), names_to = "year", values_to = "births") %>%
  drop_na(births) %>% # remove rows with a missing count
  mutate(year = as.integer(str_remove(year, "X")),
         first_letter = str_sub(name, start = 1, end = 1)) %>%
  filter(year >= 1931 & year <= 2010)
Now you can do the following:
data3 <- data2 %>%
  group_by(first_letter, year) %>%
  summarize(total = sum(births))
This gives you a data.frame with three columns:
first_letter year total
A 1880 17972
A 1881 16426
# etc.
Now you can do some plotting, for example with ggplot2:
library(ggplot2)
# this only looks at the English vowels to make a manageable example
ggplot(data = data3 %>% filter(first_letter %in% c("A", "E", "I", "O", "U")),
       aes(x = year, y = total, color = first_letter)) +
  geom_line()
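The steps above stop at per-letter totals, while the question asks for percentages. One possible way to get there is to divide each total by the sum of totals within its year. The sketch below is an illustration only: `data3` is a hypothetical stand-in built from the sums implied by the question's five sample rows, and `data4` is a name introduced here.

```r
library(dplyr)

# Hypothetical stand-in for data3: per-letter totals for two years,
# using the sums that follow from the sample rows in the question
data3 <- tibble(
  first_letter = rep(c("A", "E", "M"), each = 2),
  year         = rep(c(1880L, 1881L), times = 3),
  total        = c(2604, 2698, 3942, 3886, 8811, 8572)
)

data4 <- data3 %>%
  group_by(year) %>%                              # denominator is per year
  mutate(percent = 100 * total / sum(total)) %>%  # share of that year's births
  ungroup()

data4
```

Mapping `y = percent` instead of `y = total` in the plot then puts every letter on the same 0 to 100 scale.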
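I have split the solution into the three parts you described. If you only care about the percentages, you can skip the first part (the totals) and combine the second and third parts: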
library(dplyr)
library(stringr)
library(tidyr)
data <- tibble(name = c('Mary', 'Anna', 'Emma', 'Elizabeth', 'Minnie'),
               sex = rep('F', 5),
               X1880 = c(7065, 2604, 2003, 1939, 1746),
               X1881 = c(6919, 2698, 2034, 1852, 1653))
total <- data %>%
  summarise(across(X1880:X1881, sum)) %>%
  pivot_longer(everything(), names_to = 'year', values_to = 'total')
total
# year total
# <chr> <dbl>
# 1 X1880 15357
# 2 X1881 15156
totalPerLetter <- data %>%
  mutate(letter = str_extract(name, '^.')) %>%
  select(letter, starts_with('X')) %>%
  pivot_longer(-letter, names_to = 'year', values_to = 'count') %>%
  group_by(letter, year) %>%
  mutate(count = sum(count)) %>%
  distinct()
totalPerLetter
# letter year count
# <chr> <chr> <dbl>
# 1 M X1880 8811
# 2 M X1881 8572
# 3 A X1880 2604
# 4 A X1881 2698
# 5 E X1880 3942
# 6 E X1881 3886
pctPerLetter <- totalPerLetter %>%
  group_by(year) %>%
  mutate(total = sum(count)) %>%
  ungroup() %>%
  mutate(percent = count/(total/100))
pctPerLetter
# letter year count total percent
# <chr> <chr> <dbl> <dbl> <dbl>
# 1 M X1880 8811 15357 57.4
# 2 M X1881 8572 15156 56.6
# 3 A X1880 2604 15357 17.0
# 4 A X1881 2698 15156 17.8
# 5 E X1880 3942 15357 25.7
# 6 E X1881 3886 15156 25.6
As mentioned, consider reshaping the data to long format (the better format in data analysis for merging, cleaning, aggregating, modeling, and plotting).
Reshape
girls_long <- reshape(girls,
                      varying = names(girls)[4:ncol(girls)],
                      times = names(girls)[4:ncol(girls)],
                      idvar = c("X", "name", "sex"),
                      v.names = "count", timevar = "year", ids = NULL,
                      new.row.names = 1:1E5, direction = "long")
girls_long$year <- as.integer(gsub("X", "", girls_long$year))
girls_long
# X name sex year count
# 1 1 Mary FALSE 1880 7065
# 2 2 Anna FALSE 1880 2604
# 3 3 Emma FALSE 1880 2003
# 4 4 Elizabeth FALSE 1880 1939
# 5 5 Minnie FALSE 1880 1746
# 6 1 Mary FALSE 1881 6919
# 7 2 Anna FALSE 1881 2698
# 8 3 Emma FALSE 1881 2034
# 9 4 Elizabeth FALSE 1881 1852
# 10 5 Minnie FALSE 1881 1653
Aggregate
# Total number of children born in that year
total_df <- aggregate(count ~ year, girls_long, FUN=sum)
total_df
# year count
# 1 1880 15357
# 2 1881 15156
# Total number of children born in that year whose name begins with each letter of the alphabet
girls_long$name_letter <- substring(girls_long$name, 1, 1)
girls_agg <- aggregate(count ~ name_letter + year, girls_long, FUN=sum)
girls_agg
# name_letter year count
# 1 A 1880 2604
# 2 E 1880 3942
# 3 M 1880 8811
# 4 A 1881 2698
# 5 E 1881 3886
# 6 M 1881 8572
# Share of children born in each year whose name begins with each letter
# (stored as a proportion; multiply by 100 for a percentage)
girls_agg$percent <- with(girls_agg, count / ave(count, year, FUN=sum))
girls_agg
# name_letter year count percent
# 1 A 1880 2604 0.1695644
# 2 E 1880 3942 0.2566908
# 3 M 1880 8811 0.5737449
# 4 A 1881 2698 0.1780153
# 5 E 1881 3886 0.2564001
# 6 M 1881 8572 0.5655846
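To get from this table to trend lines for every letter on one chart, one option in base R is to spread the values into a year-by-letter matrix and hand it to `matplot()`. The following is a rough sketch rather than a finished plot: `girls_agg` is rebuilt here from the counts printed above so the block runs on its own, and `pct_wide` is a name introduced for illustration.

```r
# Stand-in for girls_agg, rebuilt from the counts printed above
girls_agg <- data.frame(
  name_letter = rep(c("A", "E", "M"), times = 2),
  year        = rep(c(1880L, 1881L), each = 3),
  count       = c(2604, 3942, 8811, 2698, 3886, 8572)
)

# Percentage of each year's births per letter (scaled by 100 this time)
girls_agg$percent <- with(girls_agg, 100 * count / ave(count, year, FUN = sum))

# Rows = years, columns = letters; each cell holds a single value
pct_wide <- with(girls_agg, tapply(percent, list(year, name_letter), sum))

# One line per letter across the years
matplot(as.integer(rownames(pct_wide)), pct_wide, type = "l", lty = 1,
        col = seq_len(ncol(pct_wide)),
        xlab = "Year", ylab = "Percent of births")
legend("topright", legend = colnames(pct_wide), lty = 1,
       col = seq_len(ncol(pct_wide)))
```

With the full 1931 to 2010 range and 26 letters the default colors repeat, so a longer palette or a faceted plot may be easier to read.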