Data cleaning of dollar values and percentage in R
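I have been searching through a number of R packages for something to help me convert dollar values into clean numeric values, and I can't seem to find one (in the plyr package, for example). The basic thing I'm looking for is simply to remove the $ sign and to translate "M" and "K" into millions and thousands respectively.
To replicate, I can use the code below: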
require(XML)
theurl <- "http://www.kickstarter.com/help/stats"
html <- htmlParse(theurl)
allProjects <- readHTMLTable(html)[[1]]
names(allProjects) <- c("Category","LaunchedProjects","TotalDollars","SuccessfulDollars","UnsuccessfulDollars","LiveDollars","LiveProjects","SuccessRate")
The data looks like this:
> tail(allProjects)
      Category LaunchedProjects TotalDollars SuccessfulDollars UnsuccessfulDollars LiveDollars
8         Food            3,069     $16.79 M          $13.18 M             $2.78 M   $822.64 K
9      Theater            4,155     $13.45 M          $12.01 M             $1.22 M   $217.86 K
10      Comics            2,242     $12.88 M          $11.07 M           $941.31 K   $862.18 K
11     Fashion            2,799      $9.62 M           $7.59 M             $1.44 M   $585.98 K
12 Photography            2,794      $6.76 M           $5.48 M             $1.06 M   $220.75 K
13       Dance            1,185      $3.43 M           $3.13 M           $225.82 K     $71,322
   LiveProjects SuccessRate
8           189      39.27%
9           111      64.09%
10          134      46.11%
11          204      27.24%
12           83      36.81%
13           40      70.22%
I ended up writing my own function:
dollarToNumber <- function(vectorInput) {
  result <- c()
  for (dollarValue in vectorInput) {
    if (is.factor(dollarValue)) {
      dollarValue = levels(dollarValue)
    }
    dollarValue <- gsub("(\\$|,)","",dollarValue)
    if(grepl(" K",dollarValue)) {
      dollarValue <- as.numeric(gsub(" K","",dollarValue)) * 1000
    } else if (grepl(" M",dollarValue)) {
      dollarValue <- as.numeric(gsub(" M","",dollarValue)) * 1000000
    }
    if (!is.numeric(dollarValue)) {
      dollarValue <- as.numeric(dollarValue)
    }
    result <- append(result,dollarValue)
  }
  result
}
Then I used it to get what I wanted:
allProjects <- transform(allProjects,
    LaunchedProjects = as.numeric(gsub(",","",levels(LaunchedProjects))),
    TotalDollars = dollarToNumber(TotalDollars),
    SuccessfulDollars = dollarToNumber(SuccessfulDollars),
    UnsuccessfulDollars = dollarToNumber(UnsuccessfulDollars),
    LiveDollars = dollarToNumber(LiveDollars),
    LiveProjects = as.numeric(LiveProjects),
    SuccessRate = as.numeric(gsub("%","",SuccessRate))/100)
This gives me the following result:
> str(allProjects)
'data.frame': 13 obs. of 8 variables:
$ Category : Factor w/ 13 levels "Art","Comics",..: 6 8 4 9 12 11 1 7 13 2 ...
$ LaunchedProjects : num 10006 1185 1860 20025 2242 ...
$ TotalDollars : num 1.11e+08 9.68e+07 6.89e+07 6.66e+07 4.31e+07 ...
$ SuccessfulDollars : num 90990000 84960000 59020000 59390000 34910000 ...
$ UnsuccessfulDollars: num 16640000 7900000 6830000 5480000 3700000 ...
$ LiveDollars : num 3090000 3970000 3010000 1750000 4470000 ...
$ LiveProjects : num 13 7 6 11 3 10 8 4 1 2 ...
$ SuccessRate : num 0.394 0.338 0.382 0.541 0.334 ...
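I'm new to R and I feel the code I wrote is too ugly. Surely there is a better way to do this without reinventing the wheel? I have played with the apply, aaply, and ddply functions, but without success (I also tried to avoid the for loop...). On top of that, I couldn't find anything like an as.percentage function in R for handling the SuccessRate column. What am I missing?
Any guidance would be much appreciated!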
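One detail in the transform() call above is worth a closer look. If the columns were read in as factors, calling as.numeric() on a factor returns its internal integer codes rather than the printed numbers, and levels() returns the sorted unique labels rather than the values in row order. The LiveProjects values in the str() output (13 7 6 11 ...) look consistent with that, though this is an inference from the output shown, not something confirmed here. A minimal sketch with made-up values (the variable name `live` is only for illustration):

```r
# A factor whose labels happen to look like numbers (illustrative data only)
live <- factor(c("189", "111", "134"))

as.numeric(live)                 # integer codes of the sorted levels: 3 1 2
levels(live)                     # sorted unique labels: "111" "134" "189"
as.numeric(as.character(live))   # the intended values: 189 111 134
```

Converting through as.character() first is the usual way to recover the printed values.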
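One thing that sets R apart from other languages you may be used to is that it is better to work in a "vectorised" way, operating on a whole vector at once rather than looping over each individual value. So you can rewrite your dollarToNumber function without the for loop: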
dollarToNumber_vectorised <- function(vector) {
  # Want the vector as character rather than factor while
  # we're doing text processing operations
  vector <- as.character(vector)
  vector <- gsub("(\\$|,)","", vector)
  # Create a numeric vector to store the results in, this will give you
  # warning messages about NA values being introduced because the " K" values
  # can't be converted directly to numeric
  result <- as.numeric(vector)
  # Find all the "$N K" values, and modify the result at those positions
  k_positions <- grep(" K", vector)
  result[k_positions] <- as.numeric(gsub(" K","", vector[k_positions])) * 1000
  # Same for the "$ M" value
  m_positions <- grep(" M", vector)
  result[m_positions] <- as.numeric(gsub(" M","", vector[m_positions])) * 1000000
  return(result)
}
It still gives the same output as your original function:
> dollarToNumber_vectorised(allProjects$LiveDollars)
[1] 3100000 3970000 3020000 1760000 4510000 762650 510860 823370 218590 865940
[11] 587670 221110 71934
# Don't worry too much about this warning
Warning message:
In dollarToNumber_vectorised(allProjects$LiveDollars) :
NAs introduced by coercion
> dollarToNumber(allProjects$LiveDollars)
[1] 3100000 3970000 3020000 1760000 4510000 762650 510860 823370 218590 865940
[11] 587670 221110 71934
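If the coercion warning is a nuisance, one possible variation is to work out the multiplier and strip the suffix before calling as.numeric(), so nothing non-numeric is ever coerced. The sketch below is only that, a sketch: the names dollar_to_number and percent_to_number are hypothetical, and it assumes the values always look like "$16.79 M", "$822.64 K", or "$71,322". Base R has no as.percentage function that I know of, so the percentage column gets a one-line helper in the same style.

```r
# Hypothetical helper: parse "$16.79 M", "$822.64 K", "$71,322" into numbers.
dollar_to_number <- function(x) {
  x <- as.character(x)          # factors become their labels
  x <- gsub("[$,]", "", x)      # drop "$" and thousands separators
  # Pick the multiplier from a trailing K or M; plain numbers get 1
  multiplier <- ifelse(grepl("M\\s*$", x), 1e6,
                ifelse(grepl("K\\s*$", x), 1e3, 1))
  # Remove the suffix first, so as.numeric() never sees letters
  as.numeric(gsub("\\s*[KM]\\s*$", "", x)) * multiplier
}

# Hypothetical helper: "39.27%" becomes 0.3927
percent_to_number <- function(x) {
  as.numeric(sub("%\\s*$", "", as.character(x))) / 100
}

dollar_to_number(c("$16.79 M", "$822.64 K", "$71,322"))
percent_to_number(c("39.27%", "70.22%"))
```

Both helpers are vectorised, so they can be used directly inside transform() on whole columns.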
A solution using parse and eval:
ToNumber <- function(X)
{
  A <- gsub("%","*1e-2",gsub("K","*1e+3",gsub("M","*1e+6",gsub("\\$|,","",as.character(X)),fixed=TRUE),fixed=TRUE),fixed=TRUE)
  B <- try(sapply(A,function(a){eval(parse(text=a))}),silent=TRUE)
  if (is.numeric(B)) return (as.numeric(B)) else return(X)
}
#----------------------------------------------------------------------
# Example:
X <-
read.table( header=TRUE,
text =
'Category LaunchedProjects TotalDollars SuccessfulDollars UnsuccessfulDollars LiveDollars LiveProjects SuccessRate
Food 3,069 "$16.79 M" "$13.18 M" "$2.78 M" "$822.64 K" 189 39.27%
Theater 4,155 "$13.45 M" "$12.01 M" "$1.22 M" "$217.86 K" 111 64.09%
Comics 2,242 "$12.88 M" "$11.07 M" "$941.31 K" "$862.18 K" 134 46.11%
Fashion 2,799 "$9.62 M" "$7.59 M" "$1.44 M" "$585.98 K" 204 27.24%
Photography 2,794 "$6.76 M" "$5.48 M" "$1.06 M" "$220.75 K" 83 36.81%
Dance 1,185 "$3.43 M" "$3.13 M" "$225.82 K" "$71,322" 40 70.22%' )
numX <- as.data.frame(lapply(as.list(X),ToNumber))
options(width=1000)
print(numX,row.names=FALSE)
# Category LaunchedProjects TotalDollars SuccessfulDollars UnsuccessfulDollars LiveDollars LiveProjects SuccessRate
# Food 3069 16790000 13180000 2780000 822640 189 0.3927
# Theater 4155 13450000 12010000 1220000 217860 111 0.6409
# Comics 2242 12880000 11070000 941310 862180 134 0.4611
# Fashion 2799 9620000 7590000 1440000 585980 204 0.2724
# Photography 2794 6760000 5480000 1060000 220750 83 0.3681
# Dance 1185 3430000 3130000 225820 71322 40 0.7022
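Since eval(parse()) runs whatever text it is given as R code, some people prefer to avoid it on scraped data. A possible alternative with the same "convert the column if it is numeric, otherwise leave it alone" behaviour is sketched below. The function name to_number_safe and the tiny data frame are hypothetical and used only for illustration; the pattern assumes non-negative values with an optional K, M, or % suffix.

```r
# Hypothetical helper: convert a column only if every non-missing entry
# looks like a number with an optional K, M, or % suffix.
to_number_safe <- function(x) {
  s <- trimws(gsub("[$,]", "", as.character(x)))
  looks_numeric <- is.na(s) | grepl("^[0-9]*\\.?[0-9]+\\s*[KM%]?$", s)
  if (!all(looks_numeric)) return(x)   # e.g. Category stays as it is

  scale <- c(K = 1e3, M = 1e6, "%" = 1e-2)
  suffix <- sub("^[0-9.]+\\s*", "", s)             # "", "K", "M" or "%"
  multiplier <- ifelse(suffix %in% names(scale), scale[suffix], 1)
  as.numeric(sub("\\s*[KM%]$", "", s)) * multiplier
}

# Small made-up data frame in the same shape as the example above
df <- data.frame(
  Category    = c("Food", "Dance"),
  LiveDollars = c("$822.64 K", "$71,322"),
  SuccessRate = c("39.27%", "70.22%"),
  stringsAsFactors = FALSE
)
df[] <- lapply(df, to_number_safe)   # df[] keeps the data frame structure
str(df)
```

Assigning into df[] rather than df keeps the result a data frame, so there is no need for a separate as.data.frame() step.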