Converting a character to a numeric value in R

I have a file that I read into R as a data frame (called CA1) with the following structure:

   Station_ID Guage_Type   Lat   Long     Date Time_Zone Time_Frame H0 H1 H2 H3 H4 H5  H6  H7  H8  H9 H10 H11 H12 H13 H14 H15 H16 H17 H18 H19 H20 H21 H22 H23
 1    4457700         HI 41.52 124.03 19480701         8        LST  0  0  0  0  0  0   0   0   0   0   0   0 MIS MIS MIS MIS MIS MIS MIS MIS MIS MIS MIS MIS
 2    4457700         HI 41.52 124.03 19480705         8        LST  0  1  1  1  1  1   2   2   2   4   5   5   4   7   1   1   0   0  10  13   5   1   1   3
 3    4457700         HI 41.52 124.03 19480706         8        LST  1  1  1  0  1  1   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
 4    4457700         HI 41.52 124.03 19480727         8        LST  3  0  0  0  0  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
 5    4457700         HI 41.52 124.03 19480801         8        LST  0  0  0  0  0  0   0   0   0   0   0   0 MIS MIS MIS MIS MIS MIS MIS MIS MIS MIS MIS MIS
 6    4457700         HI 41.52 124.03 19480817         8        LST  0  0  0  0  0  0 ACC ACC ACC ACC ACC ACC   6   1   0   0   0   0   0   0   0   0   0   0

H0 through H23 are read in as character() because some values are not numeric: they can be MIS, ACC, or DEL.

My question: is there a way to typecast the values in each column H0 through H23 to numeric, with the character values (MIS, ACC, DEL) becoming NA or NaN, which I can then test for (with is.na or is.nan) so that I can run some numeric models on the data? Or would it be better to change the character values to an identifier, such as -9999?

I have tried many ways. I found a few on this site, but none of them work. For example:

 for (i in 8:31)
 {
     CA1[6,i] <- as.numeric(as.character(CA1[6,i]))
 }

which of course gives warnings, but when I test whether two specific values (CA1[6,8] and CA1[6,19]) are numeric with is.numeric(), I get FALSE for both. I don't understand why for the first, but I do for the second, since it is a "". However, I can test that one with is.na(CA1[6,19]), which returns TRUE, and that is fine for me to know it is not numeric.

A second way I tried is:

 for (i in 8:31)
 {
     CA1[6,i] <- as.numeric(levels(CA1[6,i]))[CA1[6,i]]
 }

which gives the same results as before.

Is there an efficient way of doing what I am trying to do? Your help is greatly appreciated. Thank you.

When you read in the data, you can typically specify what the column types are. For example, read.table / read.csv have a colClasses argument.

# Something like this
read.table('foo.txt', header=TRUE, colClasses=c('integer', 'factor', 'numeric', 'numeric', 'Date'))

See ?read.table for more information.
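As a rough illustration of colClasses, here is a minimal sketch. The three-column input is hypothetical and is passed through the `text=` argument so that no file is needed; the column names only loosely follow the question.

```r
# Hypothetical three-column input, inlined via `text=` instead of a file
txt <- "Station_ID Lat H0
4457700 41.52 0
4457700 41.52 3"

# Declare the type of each column up front
dat <- read.table(text = txt, header = TRUE,
                  colClasses = c("character", "numeric", "integer"))

# Show the class each column ended up with
sapply(dat, class)
```

Note that a column declared as numeric or integer must not contain strings such as MIS, unless those strings are also declared as missing values; otherwise the read is likely to fail with an error.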

The immediate problem is that each column of a data frame can only contain values of one type. The 6 in CA1[6,i] in your code means that only a single value is converted in each column, so when it is inserted back after conversion, it has to be coerced back to a string to match the rest of the column.

You can solve this by converting the whole column in one go, so that the column is entirely replaced, i.e. remove the 6:

 for (i in 8:31)
 {
     CA1[,i] <- as.numeric(as.character(CA1[,i]))
 }
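The same whole-column conversion can also be written without an explicit loop. This is only a sketch on a toy data frame: the values and the `hour_cols` helper are made up for illustration, and in the real data the selection would be columns 8:31.

```r
# Toy data frame standing in for CA1 (hypothetical values, three hour columns only)
CA1 <- data.frame(
  Station_ID = c(4457700, 4457700),
  H0 = c("0", "3"),
  H1 = c("ACC", "1"),
  H2 = c("MIS", "0"),
  stringsAsFactors = FALSE
)

hour_cols <- c("H0", "H1", "H2")  # hypothetical helper; 8:31 in the real data

# Replace each whole column; strings such as "MIS" or "ACC" become NA.
# as.character() keeps this safe if a column happens to be a factor.
# suppressWarnings() hides the expected "NAs introduced by coercion" warning.
CA1[hour_cols] <- lapply(CA1[hour_cols], function(x) {
  suppressWarnings(as.numeric(as.character(x)))
})

str(CA1)
is.na(CA1$H1)  # flags the entry that used to be "ACC"
```

After this, is.numeric(CA1$H1) should be TRUE, and the former MIS/ACC entries can be found with is.na().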

Following on from Tommy's answer, you could potentially deal with this issue when reading in the data. If "MIS", "ACC" and "DEL" always denote missing values, you could use the na.strings argument in read.table.

read.table('foo.txt', header=TRUE, na.strings = c("MIS", "ACC", "DEL"))

If there are other character strings that always denote missing values, then you could add them to the above vector.

However, if, for example, "MIS" appears in the column Time_Frame and it has a meaning other than to denote a missing value, then DO NOT TAKE THIS APPROACH!!
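To see the effect of na.strings on a small case, here is a sketch with made-up rows in the spirit of the question's data, again inlined with `text=` rather than read from a file:

```r
# Hypothetical input with the special codes mixed into the hour columns
txt <- "Station_ID Time_Frame H0 H1 H2
4457700 LST 0 ACC MIS
4457700 LST 3 1 0"

# Treat the codes as missing while reading, so the hour columns
# are parsed as numbers instead of character strings
dat <- read.table(text = txt, header = TRUE,
                  na.strings = c("MIS", "ACC", "DEL"))

sapply(dat, class)    # class of each column after reading
colSums(is.na(dat))   # number of missing values per column
```

Because na.strings is applied to every column, a "MIS" in Time_Frame would become NA here as well, which is exactly the case the warning above is about.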
