简体   繁体   中英

Subset DataTable based on Numeric Values in Columns

this is my first stack overflow post so please bear with me...

I wish to subset a datatable in R so that only rows containing digits are included - and rows containing characters are excluded. For example the table

 Date   Temperature
    41941.6656  1921
    41941.6656  1921
    41941.66561 1921
    41941.66563 1921
    41941.66564 1921
    41941.pypito    1921
    41941.66566 xWRET
    41941.66567 1921

should become

Date    Temperature
41941.6656  1921
41941.6656  1921
41941.66561 1921
41941.66563 1921
41941.66564 1921
41941.66567 1921

where xWRET represents some random character string that my stupid instrument often

I have tried is.numeric() and grep() however I cant get either to work.

I feel this should be very straightforward!

Thanks!

I think the more general and efficient usage of data.table would be using .SD and lapply (in case you want to convert more than just two columns and you don't want to specify them manually)

Either

library(data.table)
na.omit(setDT(df)[, lapply(.SD, function(x) as.numeric(as.character(x)))])
#        Date Temperature
# 1: 41941.67        1921
# 2: 41941.67        1921
# 3: 41941.67        1921
# 4: 41941.67        1921
# 5: 41941.67        1921
# 6: 41941.67        1921

Or

setDT(df)[, names(df) := lapply(.SD, function(x) as.numeric(as.character(x)))][complete.cases(df)]
#        Date Temperature
# 1: 41941.67        1921
# 2: 41941.67        1921
# 3: 41941.67        1921
# 4: 41941.67        1921
# 5: 41941.67        1921
# 6: 41941.67        1921

If one is not yet familiar with data.table and the relative complexity its idiom, you could use base functions as well.

Assuming you are getting your data.frame columns as character.

df$Date <- as.numeric(df$Date)
df$Temperature <- as.numeric(df$Temperature)

or

df1 <- sapply(df, as.numeric)

and finally

df[complete.cases(df), ]
df1[complete.cases(df1), ]

to get

         Date Temperature
[1,] 41941.67        1921
[2,] 41941.67        1921
[3,] 41941.67        1921
[4,] 41941.67        1921
[5,] 41941.67        1921
[6,] 41941.67        1921

more on intro to data.table here

This is most certainly not the best way to do it, but here it goes:

library(data.table) # use this package, it is great for perfomance
lines="
41941.6656 1921
41941.6656 1921
41941.66561 1921
41941.66563 1921
41941.66564 1921
41941.pypito 1921
41941.66566 xWRET
41941.66567 1921"
con <- textConnection(lines)
d = data.table(read.table(con,stringsAsFactors = FALSE,
           sep=" ", 
           col.names=c("Date", "Temperature"), 
           fill=FALSE, 
           strip.white=TRUE))
close(con)
d<-d[!is.na(as.numeric(Temperature)) & !is.na(as.numeric(substr(Date,start=7,stop=7)))]

This to me, returns the following:

          Date Temperature
1:  41941.6656        1921
2:  41941.6656        1921
3: 41941.66561        1921
4: 41941.66563        1921
5: 41941.66564        1921
6: 41941.66567        1921

Is this what you want?

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM