
String (const char*, size_t) to int?

What's the fastest way to convert a string represented by (const char*, size_t) to an int?

The string is not null-terminated. Both of the approaches below involve a string copy (and more), which I'd like to avoid.

And yes, this function is called a few million times a second. :p

int to_int0(const char* c, size_t sz)
{
    return atoi(std::string(c, sz).c_str());
}

int to_int1(const char* c, size_t sz)
{
    return boost::lexical_cast<int>(std::string(c, sz));
}

Given a counted string like this, you may be able to gain a little speed by doing the conversion yourself. Depending on how robust the code needs to be, this may be fairly difficult though. For the moment, let's assume the easiest case -- that we're sure the string is valid, contains only digits (no negative numbers for now), and the number it represents is always within the range of an int. For that case:

int to_int2(char const *c, size_t sz) { 
    int retval = 0;
    for (size_t i=0; i<sz; i++) {
        retval *= 10;
        retval += c[i] -'0';
    }
    return retval;
}

From there, you can get about as complex as you want -- handling leading/trailing whitespace, '-' (but doing so correctly for the maximally negative number in 2's complement isn't always trivial [edit: see Nawaz's answer for one solution to this]), digit grouping, etc.
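As a rough illustration of that extra work, here is a sketch that accepts an optional sign, rejects non-digits, and detects overflow. It accumulates in the negative range so that the most negative int still parses. The name to_int_checked and the bool plus out-parameter interface are invented for this example and don't come from any of the answers.

```cpp
#include <climits>
#include <cstddef>

// Hypothetical helper: parses [c, c + sz) as a decimal int.
// Returns false on empty input, a non-digit character, or overflow.
bool to_int_checked(const char* c, std::size_t sz, int& out)
{
    std::size_t i = 0;
    bool negative = false;
    if (sz > 0 && (c[0] == '+' || c[0] == '-')) {
        negative = (c[0] == '-');
        ++i;
    }
    if (i == sz)
        return false;                       // no digits at all

    int result = 0;                         // accumulated as a negative value
    for (; i < sz; ++i) {
        if (c[i] < '0' || c[i] > '9')
            return false;
        int digit = c[i] - '0';
        // Would result * 10 - digit fall below INT_MIN?
        if (result < (INT_MIN + digit) / 10)
            return false;
        result = result * 10 - digit;
    }
    if (!negative) {
        if (result == INT_MIN)
            return false;                   // the positive counterpart does not fit
        result = -result;
    }
    out = result;
    return true;
}
```

The overflow test runs before the multiplication, so the arithmetic itself never overflows. Whether the extra branches are affordable at a few million calls per second is something only a measurement can tell.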

Another slow version, for uint32:

void str2uint_aux(unsigned& number, unsigned& overflowCtrl, const char*& ch)
{
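    // reduce to 0..255 first; otherwise a char below '0' wraps around and passes the check below
    unsigned digit = static_cast<unsigned char>(*ch - '0');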
    ++ch;

    number = number * 10 + digit;

    unsigned overflow = (digit + (256 - 10)) >> 8;
    // if digit < 10 then overflow == 0
    overflowCtrl += overflow;
}

unsigned str2uint(const char* s, size_t n)
{
    unsigned number = 0;
    unsigned overflowCtrl = 0;

    // for VC++10 the Duff's device is faster than loop
    switch (n)
    {
    default:
        throw std::invalid_argument("str2uint : `n' must be in [1, 10]");

    case 10: str2uint_aux(number, overflowCtrl, s);
    case  9: str2uint_aux(number, overflowCtrl, s);
    case  8: str2uint_aux(number, overflowCtrl, s);
    case  7: str2uint_aux(number, overflowCtrl, s);
    case  6: str2uint_aux(number, overflowCtrl, s);
    case  5: str2uint_aux(number, overflowCtrl, s);
    case  4: str2uint_aux(number, overflowCtrl, s);
    case  3: str2uint_aux(number, overflowCtrl, s);
    case  2: str2uint_aux(number, overflowCtrl, s);
    case  1: str2uint_aux(number, overflowCtrl, s);
    }

    // here we can check that all chars were digits
    if (overflowCtrl != 0)
        throw std::invalid_argument("str2uint : `s' is not a number");

    return number;
}

Why is it slow? Because it processes chars one by one. If we had a guarantee that we can access bytes up to s+16, we could vectorize the *ch - '0' and digit + 246 computations.
Like in this code:

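    uint32_t digitsPack;
    memcpy(&digitsPack, s, sizeof digitsPack); // load 4 chars without alignment or aliasing problems
    digitsPack -= 0x30303030; // subtract '0' from each byte
    overflowCtrl |= digitsPack | (digitsPack + 0x06060606); // if one byte is not in range [0;10), high nibble will be non-zero
    // little-endian byte order assumed: s[0] is the lowest byte
    number = number * 10 + (digitsPack & 0xFF);
    number = number * 10 + ((digitsPack >> 8) & 0xFF);
    number = number * 10 + ((digitsPack >> 16) & 0xFF);
    number = number * 10 + ((digitsPack >> 24) & 0xFF);
    s += 4;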
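To make the packed idea easier to try out in isolation, here is a self-contained sketch that parses exactly four digits at once. The helper name parse4digits is made up for this example, and the code assumes a little-endian target (as on x86), so that s[0] ends up in the lowest byte of the loaded word.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper: parses exactly four ASCII digits starting at s.
// Assumes a little-endian target, so s[0] lands in the lowest byte.
// Returns false if any of the four bytes is not in '0'..'9'.
bool parse4digits(const char* s, unsigned& number)
{
    std::uint32_t pack;
    std::memcpy(&pack, s, sizeof pack);      // safe unaligned load
    pack -= 0x30303030u;                     // subtract '0' from each byte

    // A byte outside [0, 10) leaves a non-zero high nibble in one of the terms.
    std::uint32_t check = pack | (pack + 0x06060606u);
    if (check & 0xF0F0F0F0u)
        return false;

    number = (pack & 0xFF) * 1000
           + ((pack >> 8) & 0xFF) * 100
           + ((pack >> 16) & 0xFF) * 10
           + ((pack >> 24) & 0xFF);
    return true;
}
```

The caller still has to guarantee that four bytes are readable at s, which is the same precondition the snippet above relies on.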

Small update for range checking:
the first snippet has a redundant shift (or mov) on every iteration, so it should be

unsigned digit = static_cast<unsigned char>(*s - '0'); // 0..255, so the +246 below cannot wrap
overflowCtrl |= (digit + 256 - 10);
...
if (overflowCtrl >> 8 != 0) throw ...
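Put together as a plain loop, the deferred check might look like the sketch below. The function name str2uint_loop is hypothetical, and numeric overflow of the result is deliberately not checked here; only the "all characters are digits" test is shown.

```cpp
#include <cstddef>
#include <stdexcept>

// Hypothetical loop version: validates all characters with a single test
// after the loop instead of branching on every character.
unsigned str2uint_loop(const char* s, std::size_t n)
{
    unsigned number = 0;
    unsigned overflowCtrl = 0;
    for (std::size_t i = 0; i < n; ++i) {
        // Reduce to 0..255 so that adding 246 sets bit 8 exactly when digit >= 10.
        unsigned digit = static_cast<unsigned char>(s[i] - '0');
        number = number * 10 + digit;
        overflowCtrl |= digit + (256 - 10);
    }
    if ((overflowCtrl >> 8) != 0)
        throw std::invalid_argument("str2uint_loop: not a number");
    return number;
}
```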

Fastest:

int to_int(char const *s, size_t count)
{
     int result = 0;
     size_t i = 0 ;
     if ( count > 0 && ( s[0] == '+' || s[0] == '-' ) ) // don't read s[0] of an empty string
          ++i;
     while(i < count)
     {
          if ( s[i] >= '0' && s[i] <= '9' )
          {
              //see Jerry's comments for explanation why I do this
              int value = (s[0] == '-') ? ('0' - s[i] ) : (s[i]-'0');
              result = result * 10 + value;
          }
          else
              throw std::invalid_argument("invalid input string");
          i++;
     }
     return result;
} 

In the code above, the comparison (s[0] == '-') is done in every iteration. We can avoid this by calculating result as a negative number in the loop, and then returning result if s[0] is indeed '-', otherwise returning -result (which makes it a positive number, as it should be):

int to_int(char const *s, size_t count)
{
     size_t i = 0 ;
     if ( count > 0 && ( s[0] == '+' || s[0] == '-' ) ) // don't read s[0] of an empty string
          ++i;
     int result = 0;
     while(i < count)
     {
          if ( s[i] >= '0' && s[i] <= '9' )
          {
              result = result * 10  - (s[i] - '0');  //assume negative number
          }
          else
              throw std::invalid_argument("invalid input string");
          i++;
     }
     return (count > 0 && s[0] == '-') ? result : -result; //-result is positive!
} 

That is an improvement!


In C++11, you could however use any function from the std::stoi family. There is also the std::to_string family.
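Note that std::stoi takes a std::string, so a counted string would still have to be copied first. As an aside that is not part of the original answer: C++17 added std::from_chars in <charconv>, which works directly on a [first, last) character range. The wrapper name to_int_from_chars below is invented for this sketch.

```cpp
#include <charconv>
#include <cstddef>
#include <system_error>

// Hypothetical wrapper around std::from_chars (C++17).
// Returns true only if the whole range [c, c + sz) is a valid int.
bool to_int_from_chars(const char* c, std::size_t sz, int& out)
{
    const char* last = c + sz;
    std::from_chars_result r = std::from_chars(c, last, out);
    return r.ec == std::errc() && r.ptr == last;
}
```

std::from_chars does not skip whitespace and does not accept a leading '+', so those cases would need separate handling if they matter. When only a prefix of the range is numeric, out holds the parsed prefix even though the wrapper returns false.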

If you run the function that often, I bet you parse the same number many times. My suggestion is to BCD encode the string into a static char buffer (you know it's not going to be very long, since atoi can only handle +-2G) when there are fewer than X digits (X=8 for 32-bit lookup, X=16 for 64-bit lookup), then place a cache in a hash map.

When you're done with the first version, you can probably find nice optimizations, such as skipping the BCD encoding entirely and just using X characters in the string (when the length of the string <= X) for lookup in the hash table. If the string is longer, you fall back to atoi.

Edit: ... or fall back to Jerry Coffin's solution instead of atoi, which is as fast as they come.
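A minimal sketch of the simplified variant (raw characters as the key, no BCD step) could look like this. The names parse_digits and to_int_cached are hypothetical, the input is assumed to be valid non-negative digits, and the static cache is not thread-safe.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <unordered_map>

// Plain digit loop used on a cache miss (valid, non-negative input assumed).
static int parse_digits(const char* c, std::size_t sz)
{
    int result = 0;
    for (std::size_t i = 0; i < sz; ++i)
        result = result * 10 + (c[i] - '0');
    return result;
}

// Hypothetical cached front end: strings of up to 8 characters are packed
// into a 64-bit key and looked up; longer strings bypass the cache.
int to_int_cached(const char* c, std::size_t sz)
{
    static std::unordered_map<std::uint64_t, int> cache;  // not thread-safe
    if (sz == 0 || sz > 8)
        return parse_digits(c, sz);

    std::uint64_t key = 0;
    std::memcpy(&key, c, sz);              // unused key bytes stay zero

    auto it = cache.find(key);
    if (it != cache.end())
        return it->second;

    int value = parse_digits(c, sz);
    cache.emplace(key, value);
    return value;
}
```

A hash lookup is not obviously cheaper than a loop over at most eight digits, so it is worth benchmarking this against the direct conversion before adopting it.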

You'll have to either write a custom routine or use a third-party library if you're dead set on avoiding the string copy.

You probably don't want to write atoi from scratch (it is still possible to introduce a bug here), so I'd advise grabbing an existing atoi from public-domain or BSD-licensed code and modifying it. For example, you can get an existing atoi from the FreeBSD cvs tree.
