How can I assign unique IDs based on a record identifier in R?
My mission: compile budget and revenue figures from movie data.
I am reading the data from a text file that is basically in the following format:
MV,Movie 1 Name
BT,Budget for Movie 1
GR,Gross Revenue Movie 1
However, the data may or may not contain BT or GR records, and sometimes contains several of them, for example:
MV,Movie1
BT,1000000
GR,500000 (week1)
GR,500000 (week2)
GR,500000 (week3)
GR,500000 (week1)
MV,Movie2
BT,10000
GR,50000 (week1)
GR,500000 (week2)
MV,Movie3
MV,Movie4
BT,1000000
What I would like to create is a data frame like this:
mID recType recData
1 MV Movie1
1 BT 1000000
1 GR 500000 (week1)
1 GR 500000 (week2)
1 GR 500000 (week3)
1 GR 500000 (week1)
2 MV Movie2
2 BT 10000
2 GR 50000 (week1)
2 GR 500000 (week2)
3 MV Movie3
4 MV Movie4
4 BT 1000000
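My programmer says to just write a data-cleaning application in Java or .NET to clean the data before bringing it into R, but I wanted to see whether the collective wisdom of the internet could help me.
Writing a loop over more than 90K movies is painfully slow to process.
Note: the end goal is to use this data as the primary source for classifying movie profitability, cross-referenced with external files, actors, and other data.
(IMDB needs a better data setup.)
Thanks!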
Try
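# Each MV row starts a new movie, so the running count of MV rows is the movie ID
df1$mID <- cumsum(df1$recType == 'MV')
# Alternative that works only if every title starts with "Movie", as in the sample:
#df1$mID <- cumsum(grepl('^Movie', df1$recData))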
df1[,c(3,1:2)]
# mID recType recData
#1 1 MV Movie1
#2 1 BT 1000000
#3 1 GR 500000 (week1)
#4 1 GR 500000 (week2)
#5 1 GR 500000 (week3)
#6 1 GR 500000 (week1)
#7 2 MV Movie2
#8 2 BT 10000
#9 2 GR 50000 (week1)
#10 2 GR 500000 (week2)
#11 3 MV Movie3
#12 4 MV Movie4
#13 4 BT 1000000
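The answer assumes that `df1` already exists with the columns `recType` and `recData`. As a minimal sketch of how it could be built from the raw lines, the example below uses a small in-memory vector `raw` as a stand-in for the real file (with a real file, something like `readLines()` on its path would supply the same kind of vector; the path and the vector name are assumptions, not part of the question). It splits each line at the first comma only, so that commas inside a title are kept.

```r
# Stand-in for the lines of the real text file (hypothetical sample)
raw <- c("MV,Movie1", "BT,1000000", "GR,500000 (week1)", "GR,500000 (week2)",
         "MV,Movie2", "BT,10000", "GR,50000 (week1)",
         "MV,Movie3",
         "MV,Movie4", "BT,1000000")

# Split at the first comma only: the part before it is the record type,
# everything after it is the record data
df1 <- data.frame(
  recType = sub(",.*$", "", raw),
  recData = sub("^[^,]*,", "", raw),
  stringsAsFactors = FALSE
)

# The comparison is TRUE at every MV row; cumsum() turns that into a
# running count, which increases by one each time a new movie starts
df1$mID <- cumsum(df1$recType == "MV")

df1[, c("mID", "recType", "recData")]
```

Because the ID comes from a vectorised running sum, no explicit loop over the movies is needed.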
Or use data.table (likely faster on data of this size):
library(data.table)
setDT(df1)[, mID:= cumsum(recType=='MV')][]
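Once every row carries an `mID`, the budget and gross figures the question is after can be totalled per movie. The following is only a sketch in base R under a loud assumption: that each BT or GR value begins with plain digits, as in the sample (real data with currency symbols or thousands separators would need extra cleaning). The small `df1` and the `amount` column are hypothetical, built here just for illustration.

```r
# Small hypothetical data frame in the shape produced above
df1 <- data.frame(
  mID     = c(1, 1, 1, 1, 2, 2, 3),
  recType = c("MV", "BT", "GR", "GR", "MV", "GR", "MV"),
  recData = c("Movie1", "1000000", "500000 (week1)", "500000 (week2)",
              "Movie2", "50000 (week1)", "Movie3"),
  stringsAsFactors = FALSE
)

# Keep only the money rows, then extract the leading digits
# (assumption: every BT/GR value starts with an unformatted number)
money <- df1[df1$recType %in% c("BT", "GR"), ]
money$amount <- as.numeric(sub("^([0-9]+).*$", "\\1", money$recData))

# One row per movie and record type, with the amounts summed
totals <- aggregate(amount ~ mID + recType, data = money, FUN = sum)
totals
```

Movies that have neither a BT nor a GR row, such as the third one here, do not appear in `totals`; merging back against the MV rows would be one way to keep them.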