Using gsub function with multiple criteria in R
Search for unique values in a data frame and create a table with them
This is what my data looks like:
UUID Source
1 Jane http//mywebsite.com44bb00?utm_source=ADW&utm_medium=banner&utm_campaign=Monk&gclid1234
2 Mike http//mywebsite.com44bb00?utm_source=Google&utm_medium=cpc&utm_campaign=DOG&gclid1234
3 John http//mywebsite.com44bb00?utm_source=Yahoo&utm_medium=banner&utm_campaign=DOG&gclid1234
4 Sarah http//mywebsite.com44bb00?utm_source=Facebookdw&utm_medium=cpc&utm_campaign=CAT&gclid1234
5 Michael http//mywebsite.com44bb00?utm_source=Twitter&utm_medium=GDNr&utm_campaign=CAT&gclid1234
6 Bob http//mywebsite.com44bb00?utm_source=ADW&utm_medium=GDN&utm_campaign=DOG&gclid1234
7 Mark http//mywebsite.com44bb00?utm_source=Twitter&utm_medium=banner&utm_campaign=MONK&gclid1234
8 Anna http//mywebsite.com44bb00?utm_source=Facebook&utm_medium=banner&utm_campaign=MONK&gclid1234
This is the output I want to achieve:
NAME UTM_SOURCE UTM_MEDIUM UTM_CAMPAIGN
1 Jane ADW banner Monk
2 Mike Google cpc DOG
3 John Yahoo banner DOG
4 Sarah Facebookdw cpc CAT
5 Michael Twitter GDNr CAT
6 Bob ADW GDN DOG
7 Mark Twitter banner MONK
8 Anna Facebook banner MONK
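In other words, I want to extract specific pieces of information based on criteria. For example: search the data frame for the value "utm_source=" and, once it is found, copy whatever lies between the "=" and the "&". For user no. 1 (Jane), the source URL in the original file contains "utm_source=ADW". In the output, the "ADW" part is extracted and inserted into a new column named "utm_source". The same principle applies to all other users and to the other dimensions (utm_medium and utm_campaign).
I understand that the gsub function can help me. This is what I have tried so far: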
> file1 <- read.csv("C:/Users/Dumitru Ostaciu/Desktop/Users.csv")
> file1 <- transform(file1, Source = as.character(Source))
> file2 <- gsub(".*\\?utm_source=", "", file1$Source)
This is the result I get:
UUID SOURCE
1 ADW&utm_medium=banner&utm_campaign=Monk&gclid1234
2 Google&utm_medium=cpc&utm_campaign=DOG&gclid1234
3 Yahoo&utm_medium=banner&utm_campaign=DOG&gclid1234
4 Facebookdw&utm_medium=cpc&utm_campaign=CAT&gclid1234
5 Twitter&utm_medium=GDNr&utm_campaign=CAT&gclid1234
6 ADW&utm_medium=GDN&utm_campaign=DOG&gclid1234
7 Twitter&utm_medium=banner&utm_campaign=MONK&gclid1234
8 Facebook&utm_medium=banner&utm_campaign=MONK&gclid1234
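I have two questions:
1) In the output I get, the function copies everything after "utm_source=". How can I add another criterion so that only the content between "=" and "&" is copied?
2) How can I keep the values that were originally in the first column (UUID): Jane, Mike, John, etc.?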
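You need to do two things:
1) Use gsub to remove the website name from Source
2) Use strsplit to split the remaining string at each occurrence of &
Read in the data: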
x <- read.table(text="
UUID Source
1 Jane http//mywebsite.com44bb00?utm_source=ADW&utm_medium=banner&utm_campaign=Monk&gclid1234
2 Mike http//mywebsite.com44bb00?utm_source=Google&utm_medium=cpc&utm_campaign=DOG&gclid1234
3 John http//mywebsite.com44bb00?utm_source=Yahoo&utm_medium=banner&utm_campaign=DOG&gclid1234
4 Sarah http//mywebsite.com44bb00?utm_source=Facebookdw&utm_medium=cpc&utm_campaign=CAT&gclid1234
5 Michael http//mywebsite.com44bb00?utm_source=Twitter&utm_medium=GDNr&utm_campaign=CAT&gclid1234
6 Bob http//mywebsite.com44bb00?utm_source=ADW&utm_medium=GDN&utm_campaign=DOG&gclid1234
7 Mark http//mywebsite.com44bb00?utm_source=Twitter&utm_medium=banner&utm_campaign=MONK&gclid1234
8 Anna http//mywebsite.com44bb00?utm_source=Facebook&utm_medium=banner&utm_campaign=MONK&gclid1234", header=TRUE, stringsAsFactors=FALSE)
Use strsplit to split each source string at &:
z <- matrix(
  unlist(strsplit(gsub(".*\\?", "", x$Source), "\\&")),
  ncol=4, byrow=TRUE)
cbind(x$UUID, gsub(".*=", "", z))
[,1] [,2] [,3] [,4] [,5]
[1,] "Jane" "ADW" "banner" "Monk" "gclid1234"
[2,] "Mike" "Google" "cpc" "DOG" "gclid1234"
[3,] "John" "Yahoo" "banner" "DOG" "gclid1234"
[4,] "Sarah" "Facebookdw" "cpc" "CAT" "gclid1234"
[5,] "Michael" "Twitter" "GDNr" "CAT" "gclid1234"
[6,] "Bob" "ADW" "GDN" "DOG" "gclid1234"
[7,] "Mark" "Twitter" "banner" "MONK" "gclid1234"
[8,] "Anna" "Facebook" "banner" "MONK" "gclid1234"
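The matrix() call above assumes that every URL splits into exactly four parts. Because unlist() flattens everything into one vector, a row with a missing or extra parameter would shift all later values. A minimal sketch of a check that could be run first; only two of the sample rows are rebuilt here so that the block runs on its own, and the variable names parts and n_parts are illustrative:

```r
# Two rows of the sample data, rebuilt so the block is self-contained
x <- data.frame(
  UUID = c("Jane", "Mike"),
  Source = c(
    "http//mywebsite.com44bb00?utm_source=ADW&utm_medium=banner&utm_campaign=Monk&gclid1234",
    "http//mywebsite.com44bb00?utm_source=Google&utm_medium=cpc&utm_campaign=DOG&gclid1234"
  ),
  stringsAsFactors = FALSE
)

# Drop everything up to "?" and split the remainder at each "&"
parts <- strsplit(sub(".*\\?", "", x$Source), "&", fixed = TRUE)

# Number of pieces per row; matrix(..., ncol = 4) is only safe if all are 4
n_parts <- lengths(parts)
stopifnot(all(n_parts == 4))

z <- matrix(unlist(parts), ncol = 4, byrow = TRUE)
```

If the check fails, which(n_parts != 4) gives the row numbers to inspect before building the matrix.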
Then convert to a data frame and add names:
z <- matrix(
  unlist(strsplit(gsub(".*\\?", "", x$Source), "\\&")),
  ncol=4, byrow=TRUE)
z <- cbind(x$UUID, gsub(".*=", "", z))
z <- as.data.frame(z[, -5])
names(z) <- c("UUID", "UTM_SOURCE", "UTM_MEDIUM", "UTM_CAMPAIGN")
z
UUID UTM_SOURCE UTM_MEDIUM UTM_CAMPAIGN
1 Jane ADW banner Monk
2 Mike Google cpc DOG
3 John Yahoo banner DOG
4 Sarah Facebookdw cpc CAT
5 Michael Twitter GDNr CAT
6 Bob ADW GDN DOG
7 Mark Twitter banner MONK
8 Anna Facebook banner MONK
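As an alternative to splitting by position, each parameter can be pulled out by name with a capture group. This copies only the text between "=" and the next "&" and does not depend on the order of the parameters. A minimal sketch, not the method used in the answer above: get_param is a hypothetical helper name, and only two of the sample rows are rebuilt here so that the block runs on its own.

```r
# Hypothetical helper: value of one query parameter, or NA if it is absent
get_param <- function(url, key) {
  # [?&] anchors the key to the start of a parameter; ([^&]*) captures its value
  pattern <- paste0(".*[?&]", key, "=([^&]*).*")
  # sub() returns its input unchanged when nothing matches, so test first
  ifelse(grepl(pattern, url), sub(pattern, "\\1", url), NA_character_)
}

# Two rows of the sample data
x <- data.frame(
  UUID = c("Jane", "Mike"),
  Source = c(
    "http//mywebsite.com44bb00?utm_source=ADW&utm_medium=banner&utm_campaign=Monk&gclid1234",
    "http//mywebsite.com44bb00?utm_source=Google&utm_medium=cpc&utm_campaign=DOG&gclid1234"
  ),
  stringsAsFactors = FALSE
)

# Building a data frame keeps UUID next to the extracted values
result <- data.frame(
  UUID = x$UUID,
  UTM_SOURCE = get_param(x$Source, "utm_source"),
  UTM_MEDIUM = get_param(x$Source, "utm_medium"),
  UTM_CAMPAIGN = get_param(x$Source, "utm_campaign"),
  stringsAsFactors = FALSE
)
```

Both grepl() and sub() are vectorized, so the helper is applied to the whole Source column in one call per parameter.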
This is what I did:
> file1 <- read.csv("C:/Users/Dumitru Ostaciu/Desktop/Users.csv")
> file1 <- transform(file1, Source = as.character(Source))
> z <- matrix(
unlist(strsplit(gsub(".*\\?", "", file1$Source), "\\&")),
ncol=4, byrow=TRUE)
> file2 <- cbind(file1$UUID, gsub(".*=", "", z))
This is the result I get:
V1 V2 V3 V4 V5
1 3 ADW banner Monk gclid1234
2 7 Google cpc DOG gclid1234
3 4 Yahoo banner DOG gclid1234
4 8 Facebookdw cpc CAT gclid1234
5 6 Twitter GDNr CAT gclid1234
6 2 ADW GDN DOG gclid1234
7 5 Twitter banner MONK gclid1234
8 1 Facebook banner MONK gclid1234
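I should point out that my real data will have 500,000 rows, each with a unique name in the first column.
How can I fix this so that the names appear in V1? What is my mistake?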
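The numbers in V1 match the alphabetical positions of the names (Anna = 1, Bob = 2, Jane = 3, and so on), which suggests that UUID was read in as a factor. read.csv() converts strings to factors by default in R versions before 4.0.0, the transform() call only converts Source, and cbind() replaces a factor by its internal integer codes. A minimal sketch of this likely cause and one possible fix; the factor and the one-column matrix are built by hand here as stand-ins for the CSV file, which is an assumption:

```r
# Stand-in for file1$UUID as read by read.csv() with stringsAsFactors = TRUE
uuid <- factor(c("Jane", "Mike", "John", "Sarah", "Michael", "Bob", "Mark", "Anna"))

# Stand-in for the matrix of extracted values (one column is enough here)
z <- matrix(c("ADW", "Google", "Yahoo", "Facebookdw",
              "Twitter", "ADW", "Twitter", "Facebook"), ncol = 1)

# cbind() discards the factor class and keeps the integer level codes
broken <- cbind(uuid, z)

# Converting to character first keeps the names
fixed <- cbind(as.character(uuid), z)
```

Applied to the code above, that would mean cbind(as.character(file1$UUID), gsub(".*=", "", z)), or reading the file with read.csv(..., stringsAsFactors = FALSE) so that no column becomes a factor in the first place.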