
How to efficiently remove (or add) leading zeros on IP addresses in R?

Two dataframes in R each contain fields for IP addresses. In each dataframe, these fields are "factors". The user intends to merge the two dataframes based on these IP addresses as well as a few other fields. The problem is that each dataframe has different formats for the IPs:

Dataframe A examples: 123.456.789.123, 123.012.001.123, 987.001.010.100

The same IPs in Dataframe B would be formatted as:

Dataframe B examples: 123.456.789.123, 123.12.1.123, 987.1.10.100

What is the best (most efficient) way to either remove the leading zeros from A or add them to B so they can be used in a merge? The operation will be performed over millions of records so 'most efficient' is in consideration of compute time (needs to be relatively quick).

You can use sprintf to format each section (octet) of the address. For instance, for a given numeric value a, you could do the following:

b <- sprintf("%.3d", a) 

So, for an IP address, try this function:

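# Pad each section of a single IP address to three digits.
printPadded <- function(x) {
  # as.character() lets this accept a factor value; split on literal dots.
  octets <- as.integer(strsplit(as.character(x), ".", fixed = TRUE)[[1]])
  # "%.3d" prints at least three digits, zero-filled; rejoin with dots.
  paste(sprintf("%.3d", octets), collapse = ".")
}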

Here are two examples:

> printPadded("1.2.3.4")
[1] "001.002.003.004"

> lapply(c("1.2.3.4","5.67.100.9"), printPadded)
[[1]]
[1] "001.002.003.004"

[[2]]
[1] "005.067.100.009"

To go in the other direction, you can remove the leading zeros by applying gsub to the split values inside the printPadded function. For my money, I'd recommend not removing the leading zeros. Neither removing nor padding them is strictly necessary, but fixed-width formats are easier to read and to sort (i.e., with sorting functions that are lexicographic).
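For anyone who does want to strip the zeros, here is a minimal sketch of the gsub idea applied to whole addresses rather than to the split values. The helper name strip_leading_zeros and the sample data are made up for illustration, and it assumes plain dotted strings of digits. Since the question says the IP fields are factors, the sketch rewrites the factor levels, so each distinct address is processed only once.

```r
# Hypothetical helper: drop leading zeros from every section of each address.
strip_leading_zeros <- function(ips) {
  # Match zeros at the start of a section that are followed by another digit,
  # so a section made only of zeros still keeps one "0".
  gsub("(^|\\.)0+(?=[0-9])", "\\1", ips, perl = TRUE)
}

# Made-up sample data in Dataframe A's padded format.
a <- factor(c("123.012.001.123", "987.001.010.100", "123.012.001.123"))

# Rewriting the levels touches each distinct address once, not every row.
levels(a) <- strip_leading_zeros(levels(a))
as.character(a)
```

The same call works directly on a character vector if the column is not a factor.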


Update 1: A speed suggestion: if you are dealing with a lot of IP addresses and really want to speed this up, you might look at multicore methods, such as mclapply. The plyr package is also useful, with ddply() as one option; its functions support parallel backends via .parallel = TRUE. Still, a few million IP addresses shouldn't take very long even on a single core.
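Before reaching for parallelism, it may be worth noting that printPadded handles one address per call, while strsplit and sprintf are themselves vectorized. A rough sketch of a vectorized variant follows; pad_ips is a hypothetical name, and the code assumes every address has exactly four numeric sections. No timings are claimed here.

```r
# Hypothetical vectorized variant: pad a whole vector of addresses at once.
pad_ips <- function(ips) {
  # as.character() handles factor columns; split every address on literal dots.
  parts <- strsplit(as.character(ips), ".", fixed = TRUE)
  # Assumes four sections per address: one row per address, one column per section.
  octets <- matrix(as.integer(unlist(parts)), ncol = 4, byrow = TRUE)
  # sprintf is vectorized over its arguments, so this formats all rows in one call.
  sprintf("%03d.%03d.%03d.%03d",
          octets[, 1], octets[, 2], octets[, 3], octets[, 4])
}

pad_ips(c("1.2.3.4", "5.67.100.9"))
```

This returns a character vector, so it can be assigned straight back to a data frame column before the merge.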

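Another way, shown here in Perl rather than R, is to add 0 to each part so that it is treated as a number, which drops the leading zeros: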

my @ipparts = split(/\./, $ip);
for my $ii (0..$#ipparts)
{
    $ipparts[$ii] = $ipparts[$ii]+0;
}
$ip = join(".", @ipparts);

Nicer than a whole lot of divisions that sprintf requires.
