
Importing CSV data containing commas, thousand separators and trailing minus sign

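R 2.13.1 on Mac OS X. I am trying to import a data file that uses a period as the thousands separator, a comma as the decimal point, and a trailing minus sign for negative values.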

Basically, I am trying to convert from:

"A|324,80|1.324,80|35,80-"

to:

  V1    V2     V3    V4
1  A 324.80 1324.8 -35.80

Interactively, both of the following work:

gsub("\\.","","1.324,80")
[1] "1324,80"

gsub("(.+)-$","-\\1", "35,80-")
[1] "-35,80"

And combining them works too:

gsub("\\.", "", gsub("(.+)-$","-\\1","1.324,80-"))
[1] "-1324,80"

However, I cannot get the thousands separator removed when reading the data with read.table:

setClass("num.with.commas")

setAs("character", "num.with.commas", function(from) as.numeric(gsub("\\.", "", sub("(.+)-$","-\\1",from))) )
mydata <- "A|324,80|1.324,80|35,80-"

mytable <- read.table(textConnection(mydata), header=FALSE, quote="", comment.char="", sep="|", dec=",", skip=0, fill=FALSE,strip.white=TRUE, colClasses=c("character","num.with.commas", "num.with.commas", "num.with.commas"))

Warning messages:
1: In asMethod(object) : NAs introduced by coercion
2: In asMethod(object) : NAs introduced by coercion
3: In asMethod(object) : NAs introduced by coercion

mytable
  V1 V2 V3 V4
1  A NA NA NA

Note that if I change "\\." to "," in the function, the result looks somewhat different:

setAs("character", "num.with.commas", function(from) as.numeric(gsub(",", "", sub("(.+)-$","-\\1",from))) )

mytable <- read.table(textConnection(mydata), header=FALSE, quote="", comment.char="", sep="|", dec=",", skip=0, fill=FALSE,strip.white=TRUE, colClasses=c("character","num.with.commas", "num.with.commas", "num.with.commas"))

mytable
  V1    V2     V3    V4
1  A 32480 1.3248 -3580

I think the problem is that read.table with dec="," converts the incoming "," to "." before calling as(from, "num.with.commas"), so that the input string can be, for example, "1.324.80".

I would like as("1.123,80-","num.with.commas") to return -1123.80 and as("1.100.123,80", "num.with.commas") to return 1100123.80.

How can I make my num.with.commas remove every period in the input string except the last one, the decimal point?

Update: First, I added a negative lookahead and got as() working in the console:

setAs("character", "num.with.commas", function(from) as.numeric(gsub("(?!\\.\\d\\d$)\\.", "", gsub("(.+)-$","-\\1",from), perl=TRUE)) )
as("1.210.123.80-","num.with.commas")
[1] -1210124
as("10.123.80-","num.with.commas")
[1] -10123.8
as("10.123.80","num.with.commas")
[1] 10123.8

However, read.table still has the same problem. Adding some print() calls to my function showed that num.with.commas actually receives the comma, not the period.
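A minimal sketch of that kind of check, for anyone who wants to reproduce it. The class name debug.num and the variable seen are made up for this illustration; the conversion method only records what it is handed and returns NA.

```r
library(methods)

# Hypothetical throwaway class, used only to inspect the input
setClass("debug.num")

seen <- character(0)  # collects every string passed to the conversion method

setAs("character", "debug.num", function(from) {
  seen <<- c(seen, from)        # record the raw input
  rep(NA_real_, length(from))   # no real conversion in this sketch
})

mydata <- "A|324,80|1.324,80|35,80-"
con <- textConnection(mydata)
tab <- read.table(con, header=FALSE, sep="|", dec=",", quote="", comment.char="",
                  strip.white=TRUE,
                  colClasses=c("character", "debug.num", "debug.num", "debug.num"))
close(con)

print(seen)
```

If dec="," had rewritten the fields first, seen would contain strings such as "1.324.80". In line with the print() observation above, it should instead hold the fields exactly as they appear in the file, commas included.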

So my current solution is to replace "," with "." inside num.with.commas.

setAs("character", "num.with.commas", function(from) as.numeric(gsub(",","\\.",gsub("(?!\\.\\d\\d$)\\.", "", gsub("(.+)-$","-\\1",from), perl=TRUE))) )
mytable <- read.table(textConnection(mydata), header=FALSE, quote="", comment.char="", sep="|", dec=",", skip=0, fill=FALSE,strip.white=TRUE, colClasses=c("character","num.with.commas", "num.with.commas", "num.with.commas"))
mytable
  V1    V2      V3    V4
1  A 324.8 1101325 -35.8

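You should remove all the periods first, then change the commas to decimal points, and only then coerce with as.numeric(). You can later control how the decimal point is printed with options(OutDec=","). I don't think R uses commas as decimal separators internally, even in locales where they are conventional.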

> tst <- c("A","324,80","1.324,80","35,80-")
> 
> as.numeric( sub("\\,", ".", sub("(.+)-$","-\\1", gsub("\\.", "", tst)) ) )
[1]     NA  324.8 1324.8  -35.8
Warning message:
NAs introduced by coercion 
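The same order of operations can be wired into read.table through a conversion method. This is only a sketch under the assumptions of the question: it reuses the class name num.with.commas and the sample record from there, and it assumes each number has at most one decimal comma.

```r
library(methods)

setClass("num.with.commas")

setAs("character", "num.with.commas", function(from) {
  x <- gsub(".", "", from, fixed=TRUE)  # 1. drop every thousands separator
  x <- sub(",", ".", x, fixed=TRUE)     # 2. decimal comma -> decimal point
  x <- sub("^(.+)-$", "-\\1", x)        # 3. trailing minus -> leading minus
  as.numeric(x)
})

mydata <- "A|324,80|1.324,80|35,80-"
con <- textConnection(mydata)
mytable <- read.table(con, header=FALSE, sep="|", quote="", comment.char="",
                      strip.white=TRUE,
                      colClasses=c("character", rep("num.with.commas", 3)))
close(con)

str(mytable)
```

Because the method does all of the cleaning itself, dec="," is not needed in the read.table call here. The fields reach the method as plain character strings.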

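Here is a solution using regular expressions and substitution:

mydata <- "A|324,80|1.324,80|35,80-"
# Split the record on the pipe delimiter
mydata2 <- strsplit(mydata,"|",fixed=TRUE)[[1]]
# Remove the thousands separators (periods)
mydata3 <- gsub(".","",mydata2,fixed=TRUE)
# Change the decimal comma to a decimal point
mydata4 <- sub(",",".",mydata3,fixed=TRUE)
# Move a trailing minus sign to the front of the string
mydata5 <- sub("^(.+)-$","-\\1",mydata4)
# Convert to numeric (kept apart from the label mydata5[1], since c() would coerce everything back to character)
mydata.cleaned <- as.numeric(mydata5[2:4])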
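For a file with more than one record, one possible variation is to read every column as character and convert the numeric columns afterwards. The sketch below assumes the same pipe-delimited layout. The helper name parse_european and the second record are invented for the example.

```r
# Hypothetical helper: period = thousands separator, comma = decimal mark,
# trailing minus = negative value
parse_european <- function(x) {
  x <- gsub(".", "", x, fixed=TRUE)   # drop thousands separators
  x <- sub(",", ".", x, fixed=TRUE)   # decimal comma -> decimal point
  x <- sub("^(.+)-$", "-\\1", x)      # trailing minus -> leading minus
  as.numeric(x)
}

# The second record is made-up sample data
records <- c("A|324,80|1.324,80|35,80-",
             "B|1.100.123,80|0,50-|7")

con <- textConnection(records)
raw <- read.table(con, header=FALSE, sep="|", quote="", comment.char="",
                  strip.white=TRUE, colClasses="character")
close(con)

# Convert every column except the first (the label) in place
raw[-1] <- lapply(raw[-1], parse_european)
str(raw)
```

This keeps the label column as character and the remaining columns as numeric, without defining an S4 class.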
