Creating a dummy variable according to data in a matrix in R
I have a data frame with 1000 observations belonging to n different countries. Each country has more than one observation, and the number of observations differs between countries. I need to create a column with numbers going from 1 to n-1, with each number corresponding to a different country. That is, I am creating a dummy variable and I don't care which country gets which number; I just need to create such dummies. My data look something like this:
Region x
1 be1 71615
4 be211 54288
5 be112 51158
6 it213 69856
8 it221 71412
9 uk222 79537
10 de101 94827
11 de10a 98273
12 dea10 92827
.. .. ..
Each country has its own "code" in the column Region: for instance, beXXXX corresponds to Belgium, ukXXX to the United Kingdom, and so on. Hence I suppose I could exploit the first two letters of the column Region to create my dummies. I know from here that the command grep() could do the job, but I need a script that automatically moves on to the next number (from 1 to n-1) whenever the initial letters of Region change.
The expected output should be like this:
Region x Dummy
1 be1 71615 1
4 be211 54288 1
5 be112 51158 1
6 it213 69856 2
8 it221 71412 2
9 uk222 79537 3
10 de101 94827 4
11 de10a 98273 4
12 dea10 92827 4
.. .. .. ..
and in this case 1 corresponds to "be" (Belgium), 2 to "it" (Italy), and so on for the n countries in my sample.
How about creating a factor variable (you can show the underlying integer codes with as.integer)? We use regexec and regmatches to extract the letter codes that occur at the beginning of the Region variable (ignoring letters that occur later) and turn them into a factor:
# Data with an extra row (row number 11)
df <- read.table( text = " Region x
1 be1 71615
4 be211 54288
5 be112 51158
6 it213 69856
8 it221 71412
9 uk222 79537
11 uk222a 79537
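10 de101 94827" , header = TRUE , stringsAsFactors = FALSE )
# Extract the leading lowercase letters of each Region code as a character vector
levs <- unlist( regmatches( df$Region , regexec( "^[a-z]+" , df$Region ) ) )
# Number the countries in order of first appearance
df$Country <- as.integer( factor( levs , levels = unique( levs ) ) )
df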
Region x Country
1 be1 71615 1
4 be211 54288 1
5 be112 51158 1
6 it213 69856 2
8 it221 71412 2
9 uk222 79537 3
11 uk222a 79537 3
10 de101 94827 4
unlist( regmatches( df$Region , regexec( "^[a-z]+" , df$Region ) ) )
[1] "be" "be" "be" "it" "it" "uk" "uk" "de"
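The levels = unique(levs) argument is what makes the codes follow the order in which countries first appear; left to its default, factor() sorts the levels. A minimal sketch of the difference, where the vector codes is a hypothetical stand-in for the extracted prefixes shown above:

```r
# Hypothetical stand-in for the prefixes extracted from Region
codes <- c("be", "be", "be", "it", "it", "uk", "uk", "de")

# Default behaviour: levels are sorted, so "de" is numbered before "it" and "uk"
sorted_codes <- as.integer(factor(codes))

# Levels given in order of first appearance, as in the answer above
appearance_codes <- as.integer(factor(codes, levels = unique(codes)))

print(sorted_codes)
print(appearance_codes)
```

Since the question does not care which country gets which number, either numbering would serve as the dummy; the explicit levels only matter if the codes should match the row order of the data.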
Another option, using gsub, is:
gsub('.*(^[a-z]{2}).*','\\1',c('de111', 'de11a','dea11'))
[1] "de" "de" "de"
Then you use factor and as.integer as shown in the previous answer.
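Put together, the two steps might look like the sketch below. The vector region is made-up sample data in the style of the question, and the pattern is a simplified variant of the one above (anchored at the start, assuming every code begins with two lowercase letters), not the answerer's exact expression:

```r
# Made-up sample codes in the style of the question's Region column
region <- c("be1", "be211", "it213", "uk222", "de101", "de10a", "dea10")

# Keep only the first two lowercase letters of each code
# (assumption: every code starts with a two-letter country prefix)
prefix <- sub("^([a-z]{2}).*", "\\1", region)

# Turn the prefixes into integer codes, numbered in order of first appearance
dummy <- as.integer(factor(prefix, levels = unique(prefix)))

print(data.frame(Region = region, Dummy = dummy))
```

Because the prefix length is fixed at two, a trailing letter such as the one in "dea10" does not create a separate country.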