Aggregating firm specific data on an industry level based on SIC codes

I have ~250,000 rows of firm-specific annual data (2000-2019) with an industry SIC code for each firm. The aim is to sum the values in each variable column for every individual SIC code, by year. The first few rows of the data look like this:

>head(compustat)
  gvkey datadate fyear indfmt consol popsrc datafmt curcd      at   capx     ceq emp      ni     revt xrd costat  sic
1  1004 20000531  1999   INDL      C      D     STD   USD 740.998 22.344 339.515 2.9  35.163 1024.333  NA      A 5080
2  1004 20010531  2000   INDL      C      D     STD   USD 701.854 13.134 340.212 2.5  18.531  874.255  NA      A 5080
3  1004 20020531  2001   INDL      C      D     STD   USD 710.199 12.112 310.235 2.2 -58.939  638.721  NA      A 5080
4  1004 20030531  2002   INDL      C      D     STD   USD 686.621  9.930 294.988 2.1 -12.410  606.337  NA      A 5080

For the columns "at", "capx", "ceq", "emp", "ni", "revt", "xrd" I want the total sum for all firms with identical SIC codes for each year. So my output would be the total value of all variables within the same industry SIC, for every year between 2000 and 2019.

Could someone help me achieve this?

Thanks,

Try this tidyverse solution. The strategy is to select the desired variables, group them with group_by(), and then use summarise_all() to compute the totals. The data you shared is small, but the same code should work on your larger data set. Here is the code:

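library(tidyverse)
#Code
df %>%
  #Keep fiscal years 2000 to 2019
  filter(fyear >= 2000 & fyear <= 2019) %>%
  #Select the grouping keys and the variables to total
  select(sic, fyear, at, capx, ceq, emp, ni, revt, xrd) %>%
  #Group by sic and year
  group_by(sic, fyear) %>%
  #Compute totals, ignoring missing values (TRUE rather than T, which can be reassigned)
  summarise_all(sum, na.rm = TRUE)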

Output:

# A tibble: 3 x 9
# Groups:   sic [1]
    sic fyear    at  capx   ceq   emp    ni  revt   xrd
  <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
1  5080  2000  702. 13.1   340.   2.5  18.5  874.     0
2  5080  2001  710. 12.1   310.   2.2 -58.9  639.     0
3  5080  2002  687.  9.93  295.   2.1 -12.4  606.     0
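
As a side note, summarise_all() is marked as superseded in dplyr 1.0.0 and later in favour of across(). The following is only a sketch of the same group-and-sum step written with across(); the data frame firms and its values are made up for illustration and are not part of the original data, and only two of the value columns are shown.

```r
library(dplyr)

# Hypothetical toy data standing in for the Compustat extract
firms <- data.frame(
  sic   = c(5080L, 5080L, 5080L, 2834L),
  fyear = c(2000L, 2000L, 2001L, 2000L),
  at    = c(100, 50, 120, 10),
  xrd   = c(NA, 5, NA, 2)
)

industry_totals <- firms %>%
  # Keep fiscal years 2000 to 2019
  filter(fyear >= 2000, fyear <= 2019) %>%
  # One group per industry and year
  group_by(sic, fyear) %>%
  # Sum every listed column, skipping missing values
  summarise(across(c(at, xrd), ~ sum(.x, na.rm = TRUE)), .groups = "drop")

print(industry_totals)
```

With na.rm = TRUE, a group whose values are all NA sums to 0 rather than NA, which is why xrd shows 0 in the output above. If you need to tell "no data reported" apart from a true zero, that case has to be handled separately.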

Some data used:

#Data
df <- structure(list(gvkey = c(1004L, 1004L, 1004L, 1004L), datadate = c(20000531L, 
20010531L, 20020531L, 20030531L), fyear = 1999:2002, indfmt = c("INDL", 
"INDL", "INDL", "INDL"), consol = c("C", "C", "C", "C"), popsrc = c("D", 
"D", "D", "D"), datafmt = c("STD", "STD", "STD", "STD"), curcd = c("USD", 
"USD", "USD", "USD"), at = c(740.998, 701.854, 710.199, 686.621
), capx = c(22.344, 13.134, 12.112, 9.93), ceq = c(339.515, 340.212, 
310.235, 294.988), emp = c(2.9, 2.5, 2.2, 2.1), ni = c(35.163, 
18.531, -58.939, -12.41), revt = c(1024.333, 874.255, 638.721, 
606.337), xrd = c(NA, NA, NA, NA), costat = c("A", "A", "A", 
"A"), sic = c(5080L, 5080L, 5080L, 5080L)), class = "data.frame", row.names = c("1", 
"2", "3", "4"))

You can use the dplyr library to achieve this. Suppose you have a data frame dw like this:

dw <- read.table(header=T, text='
gvkey datadate fyear indfmt consol popsrc datafmt curcd      at   capx     ceq emp      ni     revt xrd costat  sic
1004 20000531  1999   INDL      C      D     STD   USD 740.998 22.344 339.515 2.9  35.163 1024.333  NA      A 5080
1004 20010531  2000   INDL      C      D     STD   USD 701.854 13.134 340.212 2.5  18.531  874.255  NA      A 5080
1004 20020531  2001   INDL      C      D     STD   USD 710.199 12.112 310.235 2.2 -58.939  638.721  NA      A 5080
1004 20010531  2000   INDL      C      D     STD   USD 701.854 13.134 340.212 2.5  18.531  874.255  NA      A 5080
1004 20020531  2008   INDL      C      D     STD   USD 710.199 12.112 310.235 2.2 -58.939  638.721  NA      A 5080
1004 20030531  2002   INDL      C      D     STD   USD 686.621  9.930 294.988 2.1 -12.410  606.337  NA      A 5080
1004 20030531  2002   INDL      C      D     STD   USD 686.621  9.930 294.988 2.1 -12.410  606.337  NA      A 5080
')

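The following code groups it by sic and fyear, sums each variable, and then keeps the rows where fyear is 2000 or later. Note that this snippet leaves out at (add at=sum(at) to include it) and calls sum() without na.rm=TRUE, so a column containing missing values, such as xrd here, sums to NA.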

library(dplyr)
df <- as.data.frame(
  dw %>%
    group_by(sic, fyear) %>%
    summarise(capx = sum(capx), ceq = sum(ceq), emp = sum(emp),
              ni = sum(ni), revt = sum(revt), xrd = sum(xrd))
)
df <- df[df$fyear >= 2000, ]
print(df)

The final output looks like this:

   sic fyear   capx     ceq emp      ni     revt xrd
  5080  2000 26.268 680.424 5.0  37.062 1748.510  NA
  5080  2001 12.112 310.235 2.2 -58.939  638.721  NA
  5080  2002 19.860 589.976 4.2 -24.820 1212.674  NA
  5080  2008 12.112 310.235 2.2 -58.939  638.721  NA
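
If you would rather avoid the dplyr dependency, the same totals can be sketched in base R with aggregate(). This is only an illustration under stated assumptions: dw_small is a hypothetical three-row stand-in for the data, only three of the value columns are listed, and na.rm = TRUE is passed through to sum() so that missing values are skipped instead of producing NA.

```r
# Hypothetical toy data; the real data frame would have all value columns
dw_small <- data.frame(
  sic   = c(5080L, 5080L, 5080L),
  fyear = c(1999L, 2000L, 2000L),
  at    = c(740.998, 701.854, 701.854),
  capx  = c(22.344, 13.134, 13.134),
  xrd   = c(NA_real_, NA_real_, NA_real_)
)

# Columns to total; extend this vector with ceq, emp, ni, revt as needed
value_cols <- c("at", "capx", "xrd")

# Keep fiscal years 2000 to 2019 before aggregating
recent <- dw_small[dw_small$fyear >= 2000 & dw_small$fyear <= 2019, ]

# Sum each value column within every sic/fyear combination
totals <- aggregate(
  recent[value_cols],
  by  = list(sic = recent$sic, fyear = recent$fyear),
  FUN = sum,
  na.rm = TRUE
)

print(totals)
```

The data-frame method of aggregate() is used here rather than the formula interface because the formula interface drops any row with an NA in the listed variables by default, which would discard every row when a column such as xrd is entirely missing.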
