
Creating a dummy variable according to data in a matrix in R

I have a dataframe with 1000 observations belonging to n different countries. Each country has more than one observation, and the number of observations differs by country. I need to create a column with numbers going from 1 to n-1, with each number corresponding to a different country. That is, I am creating a dummy variable, and I don't care which country gets which number. My data look something like this:

  Region     x
1    be1 71615
4  be211 54288
5  be112 51158
6  it213 69856
8  it221 71412
9  uk222 79537
10 de101 94827
11 de10a 98273
12 dea10 92827
..    ..    ..

Each country has its own "code" in the column Region: for instance, beXXXX corresponds to Belgium, ukXXX to the United Kingdom, and so on. Hence I suppose I could exploit the first two letters of Region to create my dummies. I know that grep() could do the job, but I need a script that automatically moves to the next number (from 1 up to n-1) whenever the initial letters of Region change.

The expected output should look like this:

 Region     x   Dummy
1    be1 71615      1
4  be211 54288      1
5  be112 51158      1
6  it213 69856      2
8  it221 71412      2
9  uk222 79537      3
10 de101 94827      4
11 de10a 98273      4
12 dea10 92827      4
..    ..    ..     ..

In this case 1 corresponds to "be" (Belgium), 2 to "it" (Italy), and so on for the n countries in my sample.

How about creating a factor variable? (You can show the underlying integer codes with as.integer.) We use regexec and regmatches to extract the letter codes that occur at the beginning of the Region variable (ignoring letters that occur later) and turn them into the factor:

#  Data with an extra row (row number 11)
df <- read.table( text = "  Region     x
1    be1 71615
4  be211 54288
5  be112 51158
6  it213 69856
8  it221 71412
9  uk222 79537
11  uk222a 79537
10 de101 94827" , header = TRUE , stringsAsFactors = FALSE )

#  Extract the leading lowercase letters of each Region as a character vector
levs <- unlist( regmatches( df$Region , regexec( "^[a-z]+" , df$Region ) ) )

#  Number the codes in order of first appearance
df$Country <- as.integer( factor( levs , levels = unique( levs ) ) )

   Region     x Country
1     be1 71615       1
4   be211 54288       1
5   be112 51158       1
6   it213 69856       2
8   it221 71412       2
9   uk222 79537       3
11 uk222a 79537       3
10  de101 94827       4

unlist( regmatches( df$Region , regexec( "^[a-z]+" , df$Region ) ) )
[1] "be" "be" "be" "it" "it" "uk" "uk" "de"
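
As a small illustration of why levels = unique(...) is passed to factor, here is a sketch comparing it with the default level ordering. The vector codes is simply the extracted output above typed in by hand, and the variable names are only illustrative. The match() line is an alternative not used in the answer; it should give the same order-of-appearance numbering without building a factor.

```r
# Country codes as extracted above (typed in by hand for this sketch)
codes <- c("be", "be", "be", "it", "it", "uk", "uk", "de")

# Default factor(): levels are sorted alphabetically (be, de, it, uk)
alphabetical <- as.integer(factor(codes))

# Explicit levels: countries are numbered in order of first appearance
by_appearance <- as.integer(factor(codes, levels = unique(codes)))

# match() against the unique codes gives the same numbering directly
via_match <- match(codes, unique(codes))

print(alphabetical)
print(by_appearance)
print(via_match)
```

Since the question does not care which country gets which number, either ordering would do; the explicit levels only matter if the numbering should follow the row order, as in the expected output.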

Another option is to use gsub:

gsub('^([a-z]{2}).*', '\\1', c('de111', 'de11a', 'dea11'))
[1] "de" "de" "de"

Then you use factor and as.integer as shown in the previous answer.
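
Putting the two steps together, a minimal sketch might look like the following. The data frame is rebuilt by hand from the rows shown in the question, and the column name Dummy is taken from the expected output; sub() is used instead of gsub() only because a single replacement per string is needed. Because exactly two letters are kept, a code such as dea10 is still grouped under "de".

```r
# Sample rows copied from the question (rebuilt by hand for this sketch)
df <- data.frame(
  Region = c("be1", "be211", "be112", "it213", "it221",
             "uk222", "de101", "de10a", "dea10"),
  x = c(71615, 54288, 51158, 69856, 71412, 79537, 94827, 98273, 92827),
  stringsAsFactors = FALSE
)

# Keep only the two leading lowercase letters of each region code
codes <- sub("^([a-z]{2}).*$", "\\1", df$Region)

# Number the countries in order of first appearance
df$Dummy <- as.integer(factor(codes, levels = unique(codes)))

print(df)
```

Note that a Region value that does not start with two lowercase letters would be left unchanged by sub() and so would become its own level; that case does not occur in the sample data.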
