簡體   English   中英

R中美元值和百分比的數據清除

[英]Data cleaning of dollar values and percentage in R

我一直在尋找R中的許多軟件包,以幫助我將美元值轉換為漂亮的數值。 我似乎找不到(例如在plyr包中)。 我要尋找的基本內容是簡單地刪除$符號,並分別為“百萬”和“數千”翻譯“ M”和“ K”。

要復制,我可以在下面使用以下代碼:

require(XML)
theurl <- "http://www.kickstarter.com/help/stats"
html <- htmlParse(theurl)

allProjects <- readHTMLTable(html)[[1]]
names(allProjects) <-  c("Category","LaunchedProjects","TotalDollars","SuccessfulDollars","UnsuccessfulDollars","LiveDollars","LiveProjects","SuccessRate")

數據如下所示:

> tail(allProjects)
      Category LaunchedProjects TotalDollars SuccessfulDollars UnsuccessfulDollars LiveDollars
8         Food            3,069     $16.79 M          $13.18 M             $2.78 M   $822.64 K
9      Theater            4,155     $13.45 M          $12.01 M             $1.22 M   $217.86 K
10      Comics            2,242     $12.88 M          $11.07 M           $941.31 K   $862.18 K
11     Fashion            2,799      $9.62 M           $7.59 M             $1.44 M   $585.98 K
12 Photography            2,794      $6.76 M           $5.48 M             $1.06 M   $220.75 K
13       Dance            1,185      $3.43 M           $3.13 M           $225.82 K     $71,322
   LiveProjects SuccessRate
8           189      39.27%
9           111      64.09%
10          134      46.11%
11          204      27.24%
12           83      36.81%
13           40      70.22%

我最終編寫了自己的函數:

dollarToNumber <- function(vectorInput) {
  result <- c()
  for (dollarValue in vectorInput) {
    if (is.factor(dollarValue)) {  
      dollarValue = levels(dollarValue)
    }
    dollarValue <- gsub("(\\$|,)","",dollarValue)
    if(grepl(" K",dollarValue)) {
      dollarValue <- as.numeric(gsub(" K","",dollarValue)) * 1000
    } else if (grepl(" M",dollarValue)) {
      dollarValue <- as.numeric(gsub(" M","",dollarValue)) * 1000000
    }  
    if (!is.numeric(dollarValue)) {
      dollarValue <- as.numeric(dollarValue)
    }
    result <- append(result,dollarValue)
  }
    result
}

然后我用它來得到我想要的東西:

 allProjects <- transform(allProjects,
                          LaunchedProjects = as.numeric(gsub(",","",levels(LaunchedProjects))),
                          TotalDollars = dollarToNumber(TotalDollars),
                          SuccessfulDollars = dollarToNumber(SuccessfulDollars),
                          UnsuccessfulDollars = dollarToNumber(UnsuccessfulDollars),
                          LiveDollars = dollarToNumber(LiveDollars),
                          LiveProjects = as.numeric(LiveProjects),
                          SuccessRate = as.numeric(gsub("%","",SuccessRate))/100)

這將給我以下結果:

> str(allProjects)
'data.frame':   13 obs. of  8 variables:
 $ Category           : Factor w/ 13 levels "Art","Comics",..: 6 8 4 9 12 11 1 7 13 2 ...
 $ LaunchedProjects   : num  10006 1185 1860 20025 2242 ...
 $ TotalDollars       : num  1.11e+08 9.68e+07 6.89e+07 6.66e+07 4.31e+07 ...
 $ SuccessfulDollars  : num  90990000 84960000 59020000 59390000 34910000 ...
 $ UnsuccessfulDollars: num  16640000 7900000 6830000 5480000 3700000 ...
 $ LiveDollars        : num  3090000 3970000 3010000 1750000 4470000 ...
 $ LiveProjects       : num  13 7 6 11 3 10 8 4 1 2 ...
 $ SuccessRate        : num  0.394 0.338 0.382 0.541 0.334 ...

我是R的新手,我覺得我編寫的代碼太丑陋了,肯定有更好的方法可以做到這一點,而無需重新發明輪子呢? 我曾經使用過apply,apaply,ddply函數,但均未成功(我也嘗試不使用for循環...)。 最重要的是,在處理SuccessRate列時,我找不到R中的as.percentage函數之類的東西。我缺少什么?

任何指導將不勝感激!

使R與您可能習慣的其他語言不同的一件事是,最好以“向量化”的方式執行操作,一次對整個向量進行操作,而不是遍歷每個單獨的值。 因此,您可以在沒有for循環的情況下重寫dollarToNumber函數:

dollarToNumber_vectorised <- function(vector) {
  # Want the vector as character rather than factor while
  # we're doing text processing operations
  vector <- as.character(vector)
  vector <- gsub("(\\$|,)","", vector)
  # Create a numeric vector to store the results in, this will give you
  # warning messages about NA values being introduced because the " K" values
  # can't be converted directly to numeric
  result <- as.numeric(vector)
  # Find all the "$N K" values, and modify the result at those positions
  k_positions <- grep(" K", vector)
  result[k_positions] <- as.numeric(gsub(" K","", vector[k_positions])) * 1000
  # Same for the "$ M" value
  m_positions <- grep(" M", vector)
  result[m_positions] <- as.numeric(gsub(" M","", vector[m_positions])) * 1000000
  return(result)
}

它仍然提供與原始功能相同的輸出:

> dollarToNumber_vectorised(allProjects$LiveDollars)
 [1] 3100000 3970000 3020000 1760000 4510000  762650  510860  823370  218590  865940
[11]  587670  221110   71934
# Don't worry too much about this warning
Warning message:
In dollarToNumber_vectorised(allProjects$LiveDollars) :
  NAs introduced by coercion
> dollarToNumber(allProjects$LiveDollars)
 [1] 3100000 3970000 3020000 1760000 4510000  762650  510860  823370  218590  865940
[11]  587670  221110   71934

使用parseeval解決方案:

ToNumber <- function(X)
{
  A <- gsub("%","*1e-2",gsub("K","*1e+3",gsub("M","*1e+6",gsub("\\$|,","",as.character(X)),fixed=TRUE),fixed=TRUE),fixed=TRUE)
  B <- try(sapply(A,function(a){eval(parse(text=a))}),silent=TRUE)
  if (is.numeric(B)) return (as.numeric(B)) else return(X)
}

#----------------------------------------------------------------------
# Example:
X <-
  read.table( header=TRUE,
              text = 
   'Category LaunchedProjects TotalDollars SuccessfulDollars UnsuccessfulDollars LiveDollars  LiveProjects SuccessRate
        Food            3,069    "$16.79 M"         "$13.18 M"            "$2.78 M"  "$822.64 K" 189      39.27%
     Theater            4,155    "$13.45 M"         "$12.01 M"            "$1.22 M"  "$217.86 K" 111      64.09%
      Comics            2,242    "$12.88 M"         "$11.07 M"          "$941.31 K"  "$862.18 K" 134      46.11%
     Fashion            2,799     "$9.62 M"          "$7.59 M"            "$1.44 M"  "$585.98 K" 204      27.24%
 Photography            2,794     "$6.76 M"          "$5.48 M"            "$1.06 M"  "$220.75 K"  83      36.81%
       Dance            1,185     "$3.43 M"          "$3.13 M"          "$225.82 K"    "$71,322"  40      70.22%' )

numX <- as.data.frame(lapply(as.list(X),ToNumber))

options(width=1000)
print(numX,row.names=FALSE)

#    Category LaunchedProjects TotalDollars SuccessfulDollars UnsuccessfulDollars LiveDollars LiveProjects SuccessRate
#        Food             3069     16790000          13180000             2780000      822640          189      0.3927
#     Theater             4155     13450000          12010000             1220000      217860          111      0.6409
#      Comics             2242     12880000          11070000              941310      862180          134      0.4611
#     Fashion             2799      9620000           7590000             1440000      585980          204      0.2724
# Photography             2794      6760000           5480000             1060000      220750           83      0.3681
#       Dance             1185      3430000           3130000              225820       71322           40      0.7022

暫無
暫無

聲明:本站的技術帖子網頁,遵循CC BY-SA 4.0協議,如果您需要轉載,請注明本站網址或者原文地址。任何問題請咨詢:yoyou2525@163.com.

 
粵ICP備18138465號  © 2020-2024 STACKOOM.COM