
R - Creating new columns based on searching vector elements in a df

I would like to add columns to a df, where each new column is based on searching for the values of a vector in an existing column of the df.

My original dataset contains web data in which each row represents a page visited by a customer; the pages visited are stored in df$URL. I have a separate vector of web page URLs. Each element of this vector needs to be added as a column whose value indicates whether the page visit in that row of the original df (df$URL) matches the column being added (= vector element).

Basically: I want to create a column for each element of the vector (where column name = vector element) with values (0/1) obtained by searching the rows of the URL column of the df: 1 on a match, 0 otherwise.

All of the vector elements in urlnames occur in df$URL (but not in every row), but df$URL contains more URLs than are in the vector (the vector contains only some of the most visited pages).

urlnames <- c("/home", "/login", "/contact")

df <- data.frame("URL" = c("/home", "/login", "/contact", "/chat", "/product-page"))

Manually I would do something like this (with dplyr):

library(dplyr)
library(magrittr)  # provides the %<>% assignment pipe
df %<>%
  mutate(home = ifelse(URL == "/home", 1, 0))

Basically, the variable name and the ifelse criterion should be replaced with the vector element. I don't know if there are more efficient or neater ways of doing this.

I really want to learn how to do such things automatically rather than having to write a manual mutate call for each of these variables.

(BTW, I would also appreciate input on potential issues the URL slashes could cause when creating column names, e.g. /home as a variable.)

I hope I've explained my issue clearly enough; apologies if not. It's my first post and I'm (obviously) new to R. Thank you!

Try table:

table(1:nrow(df),df$URL)

#    /chat /contact /home /login /product-page
#  1     0        0     1      0             0
#  2     0        0     0      1             0
#  3     0        1     0      0             0
#  4     1        0     0      0             0
#  5     0        0     0      0             1

You can drop the columns you don't want afterwards and coerce to a data.frame if needed.

There are tons of ways to remove the columns. One consists of replacing the values which are not in urlnames with NA and reapplying the above. Something like:

# factor() makes this work whether df$URL is a character vector or a factor
table(1:nrow(df), factor(replace(df$URL, which(!df$URL %in% urlnames), NA)))
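The "drop the columns and coerce to a data.frame" step mentioned above could look roughly like this. It is only a sketch, assuming the df and urlnames from the question; the names tab, flags and result are made up for illustration. unclass() is used because as.data.frame() on a table object returns the long (Var1, Var2, Freq) layout rather than the wide one.

```r
urlnames <- c("/home", "/login", "/contact")
df <- data.frame(URL = c("/home", "/login", "/contact", "/chat", "/product-page"))

# One row per row of df, one column per distinct URL
tab <- table(seq_len(nrow(df)), df$URL)

# Keep only the columns listed in urlnames; unclass() leaves a plain
# integer matrix, which as.data.frame() converts to a wide data frame
flags <- as.data.frame(unclass(tab[, urlnames]))

# Bind the 0/1 columns next to the original data
result <- cbind(df, flags)
```

Selecting by urlnames also puts the columns in the order of the vector rather than in alphabetical order.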

Something like this, using lapply?

setNames(as.data.frame(lapply(urlnames, function(x) +(x==df$URL))), urlnames)
#>   /home /login /contact
#> 1     1      0        0
#> 2     0      1        0
#> 3     0      0        1
#> 4     0      0        0
#> 5     0      0        0

What happens here is that we use lapply to create a list of vectors, one for each member of urlnames. Each vector is filled with 1s and 0s depending on whether that element of urlnames was found at each position in df$URL. We then turn the list into a data frame and set its column names to urlnames.
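On the side question about slashes: a name like /home is not a syntactic R name, so such a column works but has to be accessed with backticks or [[ ]]. One possible way around that, sketched below, is to clean the names before attaching the columns to df. Stripping the leading slash and passing the result through make.names() is an assumption made here, not something from the answers above, and flags and clean_names are illustrative names.

```r
urlnames <- c("/home", "/login", "/contact")
df <- data.frame(URL = c("/home", "/login", "/contact", "/chat", "/product-page"))

# Build one 0/1 integer vector per element of urlnames
flags <- lapply(urlnames, function(x) as.integer(df$URL == x))

# Assumption: drop the leading slash and let make.names() replace any
# remaining invalid characters, so columns can be used as df$home etc.
clean_names <- make.names(sub("^/", "", urlnames), unique = TRUE)

# Assigning a list to several column names at once adds them to df
df[clean_names] <- flags
```

If the original names are kept instead, the columns remain reachable as df[["/home"]] or df$`/home`.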

A longer answer (a bit late to the party), and not as succinct, eloquent or efficient as those above, but it can be used for partial matches with only a minor adjustment (removing the paste0 call wrapped around urlnames):

# grepl() is already vectorised over df$URL, so Vectorize() is not needed
setNames(as.data.frame(
  lapply(paste0("^", urlnames, "$"), function(x) +grepl(x, df$URL))
), urlnames)
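For the partial-match variant, a sketch might look like the following. Using fixed = TRUE is a choice made here, not part of the answer above: it treats each element of urlnames as a literal substring, so characters such as "." or "?" in a URL are not interpreted as regular-expression syntax. The sample URLs with extra path and query parts are invented for illustration.

```r
urlnames <- c("/home", "/login", "/contact")
# Hypothetical URLs with extra path/query parts, for illustration only
df <- data.frame(URL = c("/home?ref=mail", "/login", "/contact/form", "/chat"))

# fixed = TRUE matches each element of urlnames as a literal substring;
# the unary + turns the logical result into 0/1 integers
partial <- setNames(
  as.data.frame(lapply(urlnames, function(x) +grepl(x, df$URL, fixed = TRUE))),
  urlnames
)
```

Note that substring matching would also flag a URL such as /homepage for /home, so whether this is appropriate depends on how the URLs are structured.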
