[英]How to convert one data frame to another in R?
我已經下載了txt。 可通過以下鏈接找到來自肯尼斯·R·法文庫的文件: http://mba.tuck.dartmouth.edu/pages/faculty/ken.french/Data_Library/det_48_ind_port.html 。
我需要使用這些所謂的SIC代碼將我的樣本根據行業因素分為不同的投資組合。 下載的文件如下所示:
1 Food
0100-0199 Agric production - crops
0200-0299 Agric production - livestock
0700-0799 Agricultural services
0900-0999 Fishing, hunting & trapping
2000-2009 Food and kindred products
2010-2019 Meat products
2020-2029 Dairy products
2030-2039 Canned-preserved fruits-vegs
2040-2046 Flour and other grain mill products
2047-2047 Dog and cat food
2048-2048 Prepared feeds for animals
2050-2059 Bakery products
2060-2063 Sugar and confectionery products
2064-2068 Candy and other confectionery
2070-2079 Fats and oils
2080-2080 Beverages
2082-2082 Malt beverages
2083-2083 Malt
2084-2084 Wine
2085-2085 Distilled and blended liquors
2086-2086 Bottled-canned soft drinks
2087-2087 Flavoring syrup
2090-2092 Misc food preps
2095-2095 Roasted coffee
2096-2096 Potato chips
2097-2097 Manufactured ice
2098-2099 Misc food preparations
5140-5149 Wholesale - groceries & related prods
5150-5159 Wholesale - farm products
5180-5182 Wholesale - beer, wine
5191-5191 Wholesale - farm supplies
2 Mines
1000-1009 Metal mining
1010-1019 Iron ores
1020-1029 Copper ores
1030-1039 Lead and zinc ores
1040-1049 Gold & silver ores
1060-1069 Ferroalloy ores
1080-1089 Mining services
1090-1099 Misc metal ores
1200-1299 Bituminous coal
1400-1499 Mining and quarrying non-metalic minerals
5050-5052 Wholesale - metals and minerals
3 Oil
1300-1300 Oil and gas extraction
1310-1319 Crude petroleum & natural gas
1320-1329 Natural gas liquids
1380-1380 Oil and gas field services
1381-1381 Drilling oil & gas wells
1382-1382 Oil-gas field exploration
1389-1389 Oil and gas field services
2900-2912 Petroleum refining
5170-5172 Wholesale - petroleum and petro prods
4 Clths
2200-2269 Textile mill products
2270-2279 Floor covering mills
2280-2284 Yarn and thread mills
2290-2295 Misc textile goods
2296-2296 Tire cord and fabric
2297-2297 Nonwoven fabrics
2298-2298 Cordage and twine
2299-2299 Misc textile products
2300-2390 Apparel and other finished products
2391-2392 Curtains, home furnishings
2393-2395 Textile bags, canvas products
2396-2396 Auto trim
2397-2399 Misc textile products
3020-3021 Rubber and plastics footwear
3100-3111 Leather tanning and finishing
3130-3131 Boot, shoe cut stock, findings
3140-3149 Footware except rubber
3150-3151 Leather gloves and mittens
3963-3965 Fasteners, buttons, needles, pins
5130-5139 Wholesale - apparel
我想做的是創建一個數據框,第一列給出行業名稱(例如,食品,采礦和礦產等),第二列給出與該行業相關的所有SIC代碼(標准工業代碼)(因為大多數SIC代碼都是通過5130-5139的方式給出的,這使其變得更難一點)。
此數據框將使我的分析更易於實施。
任何建議都是非常可取的。
這將產生一個2列數據幀df.new
,其中包含第2列中的逗號分隔代碼:
df <- read.fwf("Siccodes48.txt", widths=c(3, 7, 60), stringsAsFactors=FALSE)
df <- df[!is.na(df$V3), ]
library(zoo)
df$V1 <- na.locf(df$V1)
l <- split(df, df$V1)
l <- setNames(lapply(l, function(x) {
m <- regexec("([0-9]{4})-([0-9]{4}) .*", x$V3[-1]) # omit headline
r <- regmatches(x$V3[-1], m)
fromTo <- t(sapply(r, "[", 2:3))
paste(sprintf("%04d", unlist(mapply(":", fromTo[, 1], fromTo[, 2]))), collapse=", ")
}), sapply(l, "[", 1, 3))
df.new <- data.frame(name=names(l), sic=unlist(l))
這個怎么樣?
df<-readLines("Siccodes48.txt")
df<-data.frame(t=df[df!=""]) # delete blanks and make data frame
df$prefix<-c(substr(df$t,1,10)) # break out the prefix (first 10 char)
df$index<-cumsum(df$prefix!=" ") # make an index
ind<-df[df$prefix!=" ",] # make an index table
ind$desc<-substring(ind$t,11,100) # parse descriptions
final<-merge(ind[,c("index","desc")], # merge the index table
df[df$prefix==" ",c("index","t")], # with all non-title rows of the list
by="index") # by index
head(final,10)
index desc t
1 1 Agriculture 0100-0199 Agric production - crops
2 1 Agriculture 0200-0299 Agric production - livestock
3 1 Agriculture 0700-0799 Agricultural services
4 1 Agriculture 0910-0919 Commercial fishing
5 1 Agriculture 2048-2048 Prepared feeds for animals
6 2 Food Products 2000-2009 Food and kindred products
7 2 Food Products 2010-2019 Meat products
8 2 Food Products 2020-2029 Dairy products
9 2 Food Products 2030-2039 Canned-preserved fruits-vegs
10 2 Food Products 2040-2046 Flour and other grain mill products
您還可以添加此代碼,以將代碼分成單獨的列:
final$codes<-substr(gsub(" "," ",final$t),2,10)
聲明:本站的技術帖子網頁,遵循CC BY-SA 4.0協議,如果您需要轉載,請注明本站網址或者原文地址。任何問題請咨詢:yoyou2525@163.com.