I have ~250,000 rows of firm-specific annual data (2000-2019) with an industry SIC code for each firm. The aim is to sum the values in each variable column for every individual SIC code, by year. The first few rows of the data look like this:
> head(compustat)
gvkey datadate fyear indfmt consol popsrc datafmt curcd at capx ceq emp ni revt xrd costat sic
1 1004 20000531 1999 INDL C D STD USD 740.998 22.344 339.515 2.9 35.163 1024.333 NA A 5080
2 1004 20010531 2000 INDL C D STD USD 701.854 13.134 340.212 2.5 18.531 874.255 NA A 5080
3 1004 20020531 2001 INDL C D STD USD 710.199 12.112 310.235 2.2 -58.939 638.721 NA A 5080
4 1004 20030531 2002 INDL C D STD USD 686.621 9.930 294.988 2.1 -12.410 606.337 NA A 5080
For the columns "at", "capx", "ceq", "emp", "ni", "revt", "xrd" I want the total sum for all firms with identical SIC codes for each year. So my output would be the total value of all variables within the same industry SIC, for every year between 2000 and 2019.
Could someone help me achieve this?
Thanks,
Try this tidyverse solution. One strategy is to select the desired variables, group them with group_by(), and then use summarise_all() to compute the total sums. Your shared data is small, but the same code should work with your larger data. Here is the code:
library(tidyverse)
#Code
df %>%
#Filter years
filter(fyear>=2000 & fyear<=2019) %>%
#Select variables
select(sic,fyear,at,capx,ceq,emp,ni,revt,xrd) %>%
#Group by sic and year
group_by(sic,fyear) %>%
#Compute total
summarise_all(sum, na.rm = TRUE)
Output:
# A tibble: 3 x 9
# Groups: sic [1]
sic fyear at capx ceq emp ni revt xrd
<int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
1 5080 2000 702. 13.1 340. 2.5 18.5 874. 0
2 5080 2001 710. 12.1 310. 2.2 -58.9 639. 0
3 5080 2002 687. 9.93 295. 2.1 -12.4 606. 0
Some data used:
#Data
df <- structure(list(gvkey = c(1004L, 1004L, 1004L, 1004L), datadate = c(20000531L,
20010531L, 20020531L, 20030531L), fyear = 1999:2002, indfmt = c("INDL",
"INDL", "INDL", "INDL"), consol = c("C", "C", "C", "C"), popsrc = c("D",
"D", "D", "D"), datafmt = c("STD", "STD", "STD", "STD"), curcd = c("USD",
"USD", "USD", "USD"), at = c(740.998, 701.854, 710.199, 686.621
), capx = c(22.344, 13.134, 12.112, 9.93), ceq = c(339.515, 340.212,
310.235, 294.988), emp = c(2.9, 2.5, 2.2, 2.1), ni = c(35.163,
18.531, -58.939, -12.41), revt = c(1024.333, 874.255, 638.721,
606.337), xrd = c(NA, NA, NA, NA), costat = c("A", "A", "A",
"A"), sic = c(5080L, 5080L, 5080L, 5080L)), class = "data.frame", row.names = c("1",
"2", "3", "4"))
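In dplyr 1.0.0 and later, summarise_all() is superseded by across(). The following is a minimal sketch of the same group-and-sum strategy written with across(); the toy data frame, its extra firms (gvkey 2001 and 3003), and the reduced variable list in vars are made up for illustration and are not part of the original data.

```r
library(dplyr)

# Hypothetical toy data: two firms in SIC 5080 and one in SIC 7370
toy <- data.frame(
  gvkey = c(1004, 1004, 2001, 1004, 3003),
  fyear = c(1999, 2000, 2000, 2001, 2000),
  sic   = c(5080, 5080, 5080, 5080, 7370),
  at    = c(740.998, 701.854, 100, 710.199, 50),
  capx  = c(22.344, 13.134, NA, 12.112, 5),
  xrd   = c(NA, NA, 4, NA, 2.5)
)

# Columns to total (shortened here; extend with ceq, emp, ni, revt as needed)
vars <- c("at", "capx", "xrd")

totals <- toy %>%
  # Keep fiscal years 2000 through 2019
  filter(fyear >= 2000, fyear <= 2019) %>%
  # One group per industry and year
  group_by(sic, fyear) %>%
  # Sum each listed column, ignoring missing values
  summarise(across(all_of(vars), ~ sum(.x, na.rm = TRUE)), .groups = "drop")

print(totals)
```

Passing .groups = "drop" returns an ungrouped tibble, so later steps do not inherit the grouping by sic.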
You can use the dplyr library to achieve this. Suppose you have a data frame dw like this:
dw <- read.table(header=T, text='
gvkey datadate fyear indfmt consol popsrc datafmt curcd at capx ceq emp ni revt xrd costat sic
1004 20000531 1999 INDL C D STD USD 740.998 22.344 339.515 2.9 35.163 1024.333 NA A 5080
1004 20010531 2000 INDL C D STD USD 701.854 13.134 340.212 2.5 18.531 874.255 NA A 5080
1004 20020531 2001 INDL C D STD USD 710.199 12.112 310.235 2.2 -58.939 638.721 NA A 5080
1004 20010531 2000 INDL C D STD USD 701.854 13.134 340.212 2.5 18.531 874.255 NA A 5080
1004 20020531 2008 INDL C D STD USD 710.199 12.112 310.235 2.2 -58.939 638.721 NA A 5080
1004 20030531 2002 INDL C D STD USD 686.621 9.930 294.988 2.1 -12.410 606.337 NA A 5080
1004 20030531 2002 INDL C D STD USD 686.621 9.930 294.988 2.1 -12.410 606.337 NA A 5080
')
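The following code groups the data by sic and fyear, sums each variable within each group, and then keeps the rows where fyear is 2000 or later. Note that at is not included in the summarise() call here, no upper bound on fyear is applied, and sum() returns NA for any group that contains missing values unless na.rm = TRUE is passed.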
library(dplyr)
df <- dw %>%
  group_by(sic, fyear) %>%
  summarise(capx = sum(capx), ceq = sum(ceq), emp = sum(emp),
            ni = sum(ni), revt = sum(revt), xrd = sum(xrd)) %>%
  as.data.frame()
df <- df[df$fyear >= 2000, ]
print(df)
The final output looks like this:
sic fyear capx ceq emp ni revt xrd
5080 2000 26.268 680.424 5.0 37.062 1748.510 NA
5080 2001 12.112 310.235 2.2 -58.939 638.721 NA
5080 2002 19.860 589.976 4.2 -24.820 1212.674 NA
5080 2008 12.112 310.235 2.2 -58.939 638.721 NA
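The xrd column comes out as NA here but as 0 in the other answer because of how sum() treats missing values. The base R sketch below shows the three cases; the vectors and the sum_or_na() helper are hypothetical names introduced only for illustration.

```r
# xrd values for two hypothetical SIC-year groups
some_missing <- c(1.5, NA, 2.0)
all_missing  <- c(NA_real_, NA_real_)

sum(some_missing)                # NA: one missing value makes the total missing
sum(some_missing, na.rm = TRUE)  # 3.5: missing values are dropped first
sum(all_missing, na.rm = TRUE)   # 0: nothing is left to add, so the total is 0

# Hypothetical helper: keep NA when a group has no data at all,
# otherwise sum the values that are present
sum_or_na <- function(x) {
  if (all(is.na(x))) NA_real_ else sum(x, na.rm = TRUE)
}

sum_or_na(some_missing)  # 3.5
sum_or_na(all_missing)   # NA
```

Whether an all-missing group should be reported as 0 or as NA depends on the analysis; a helper along these lines could be passed to summarise() in place of sum if NA is preferred.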