R split() function size increase issue

I have the following data set

> head(data)
  X    UserID NPS V3 V4 V5                                   Event              V7          Element                            ElementValue 
1 1 254727216  10  0 19 10 nps.agent.14b.no other attempt was made 10/4/2014 23:59 cea.element_name nps.agent.14b.no other attempt was made
2 2 298379949   0  0 28 11 nps.agent.14b.no other attempt was made 9/30/2014 23:59 cea.element_name nps.agent.14b.no other attempt was made
3 3 254710917   0  0 20 12 nps.agent.14b.no other attempt was made 9/15/2014 23:59 cea.element_name nps.agent.14b.no other attempt was made
4 4 238919392   7  0 17  9 nps.agent.14b.no other attempt was made 9/17/2014 23:59 cea.element_name nps.agent.14b.no other attempt was made
5 5 144693025  10  0 18 10 nps.agent.14b.no other attempt was made 9/17/2014 23:59 cea.element_name nps.agent.14b.no other attempt was made
6 6 249978568   5  0 21 12 nps.agent.14b.no other attempt was made 9/18/2014 23:59 cea.element_name nps.agent.14b.no other attempt was made

When I split the data set as:

data_splitted <- split(data,data$UserID)

The problem is a huge increase in size. When I try this with the whole data set instead of this sample, the result exceeds my RAM:

> format(object.size(data),units="Mb")
[1] "0.2 Mb"
> format(object.size(data_splitted),units="Mb")
[1] "45.7 Mb"

Any insight into why this is happening, and any way to tackle it, would be appreciated.

Try this:

data$UserID <- as.character(data$UserID)
data_splitted <- split(data,data$UserID)

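What happened in your case, as far as I can tell, is that because the ID was numeric, the number ended up being used as an index (position) in the created list, which is not what you want. Since the IDs are very large numbers, R fills the gaps with that many empty (NULL) elements, hence the huge object size. Making the ID a character variable avoids this, because the values are then used as list names rather than positions. Note that split() itself converts its grouping argument to a factor, so if the size does not shrink after this change, check whether the other columns are factors: every piece returned by split() keeps the full set of levels of each factor column, and that adds up quickly across thousands of small data frames.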
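A minimal sketch of the gap-filling behaviour described above, using a plain list rather than split(). The index 100000 is an arbitrary stand-in for a user ID, chosen small enough to run comfortably; the real IDs in the question are far larger.

```r
# Assigning by a numeric index treats the number as a position,
# so R pads everything before it with NULL elements.
by_position <- list()
by_position[[100000]] <- "row for user 100000"
length(by_position)   # 100000

# Assigning by a character string treats it as a name,
# so the list holds only the one element.
by_name <- list()
by_name[["100000"]] <- "row for user 100000"
length(by_name)       # 1
```

The same value therefore costs either one element or one hundred thousand, depending only on whether it is numeric or character when it is used with `[[`.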

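Another way, which leaves the ID variable intact inside the one-row data frames (and assumes each UserID occurs only once), would be:

data_splitted <- list()
for (i in seq_len(nrow(data))) {
  # Index by name (character), not by position, so no gaps are created
  data_splitted[[as.character(data$UserID[i])]] <- data[i, ]
}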

To access the elements in the newly created list, you'll need to quote the numbers if you use the $ operator:

data_splitted$"144693025"
data_splitted[["144693025"]]

Another option would be to add characters in front of the numerical id. For instance:

data$UserID <- paste0("id",data$UserID)
data_splitted <- split(data,data$UserID)

Which makes accessing list-items more convenient:

data_splitted$id144693025
data_splitted$id238919392
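If the split list is still much larger than the original data, one thing worth checking is whether the other columns are factors, since each piece produced by split() keeps every level of every factor column. The question does not show the column types, so this is an assumption. The sketch below uses made-up toy data (`toy`, `Stamp`) to show the effect and how droplevels() trims it.

```r
# Toy data (hypothetical): 1000 users with one row each, plus a factor
# column with 1000 distinct levels, standing in for a timestamp that
# was read in as a factor.
n <- 1000
toy <- data.frame(
  UserID = as.character(seq_len(n)),
  Stamp  = factor(paste0("t", seq_len(n))),
  stringsAsFactors = FALSE
)

pieces      <- split(toy, toy$UserID)      # every piece keeps all 1000 levels
pieces_trim <- lapply(pieces, droplevels)  # every piece keeps only its own level

nlevels(pieces[[1]]$Stamp)       # 1000
nlevels(pieces_trim[[1]]$Stamp)  # 1

# Compare the reported sizes before and after trimming.
format(object.size(pieces), units = "Mb")
format(object.size(pieces_trim), units = "Mb")
```

Reading the file with `stringsAsFactors = FALSE` would sidestep the issue in the same way, at the cost of storing repeated strings as character vectors.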

Use a factor instead of a string if you have lots of identical strings. And if you don't need to process their content, don't store them at all, or store only, e.g., the hostnames, again as factors. You could use a regex with capture groups for just the fields you need, e.g. hostname and error code, and throw away everything else.

Next, make splitting easy on yourself by changing or postprocessing your logfile from:
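As a rough sketch of that idea, the snippet below pulls one field out of the event string with a capture group and stores it as a factor. The pattern is hypothetical: it assumes the code is everything up to the third dot, which fits the sample rows but may not fit the real log format.

```r
events <- rep("nps.agent.14b.no other attempt was made", 10000)

# Hypothetical pattern: everything up to the third dot is the code,
# the remainder is the free-text message.
pattern <- "^([^.]+\\.[^.]+\\.[^.]+)\\.(.*)$"
m <- regmatches(events, regexec(pattern, events))

# Element 1 of each match is the full string, element 2 the first capture group.
code <- factor(vapply(m, function(x) x[2], character(1)))
levels(code)

# A factor stores integer codes plus one copy of each distinct string.
object.size(events)          # character vector
object.size(factor(events))  # same values as a factor
```

With only a handful of distinct strings, the factor version should report a noticeably smaller size than the character vector.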

nps.agent.14b.no other attempt was made

to:

nps.agent.14b:no other attempt was made

Now you simply split on ':' (or '|'). Look at some best practices for logfiles; a lot of good material has been written on that. If every line is guaranteed to have exactly one hostname and one error code, consider storing them as separate Hostname and ErrorCode fields.

So, your code should be as simple as:

> s <- 'nps.agent.14b:no other attempt was made'
> strsplit(s, ':', fixed = TRUE)[[1]]
[1] "nps.agent.14b"             "no other attempt was made"
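For a whole column rather than a single string, the same split can be turned into separate fields. This is only a sketch: it assumes every entry contains exactly one ':' and the column names `Code` and `Message` are made up for illustration.

```r
events <- c("nps.agent.14b:no other attempt was made",
            "nps.agent.14b:no other attempt was made")

# strsplit() returns one character vector per input string;
# rbind-ing them gives a two-column character matrix.
parts <- do.call(rbind, strsplit(events, ":", fixed = TRUE))

# Store each field as a factor, since the values repeat heavily.
log_fields <- data.frame(
  Code    = factor(parts[, 1]),
  Message = factor(parts[, 2])
)
str(log_fields)
```

If the message column carries no extra information, it can simply be left out of the data frame.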

Again, if you don't have any need to process 'no other attempt was made', then don't store it. Or your logfile message could compress that to 'NEA'. Or just throw it away if it doesn't convey any extra information.

I suggest you revisit your logfile format and aggressively make it as concise and informative as possible.
