Hamming weight (number of 1 bits): mixing C with assembly

Question

I am trying to count the number of 1 bits in the numbers of an array.

First, I have code in C (which works fine):

int popcount2(int* array, int len){
    int i;
    unsigned x;
    int result=0;
    for (i=0; i<len; i++){
        x = array[i];
        do{
           result+= x & 0x1;
           x>>= 1;
       } while(x);
    }
return result;
}

Now I need to translate the do-while loop into assembly using 3-6 lines of code. I wrote some code, but the result is incorrect. (I am new to the assembly world.)

int popcount3(int* array, int len){
int  i;
unsigned x;
int result=0;   
for (i=0; i<len; i++){
    x = array[i];
    asm(
    "ini3:               \n"
        "adc $0,%[r]     \n"
        "shr %[x]        \n"
        "jnz ini3        \n"

        : [r]"+r" (result)
        : [x] "r" (x)       );
  }
}

I am using GCC (on Linux) with an Intel processor.

Answer 1

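You are starting out with a very inefficient algorithm. If you use a better algorithm, you may not need to spend time on assembler at all. See Hacker's Delight and/or Bit Twiddling Hacks for much more efficient methods.

Note also that newer x86 CPUs have a POPCNT instruction, which does all of the above in a single instruction (and you can call it via an intrinsic too, so no asm is needed).

Finally, gcc has a builtin, __builtin_popcount, which does everything you need: it will use POPCNT on newer CPUs (when the instruction is enabled for the target, for example with -mpopcnt) and equivalent code on older CPUs.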
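As a minimal sketch of the builtin applied to the array from the question: the function name popcount_array is made up for illustration, and the code assumes GCC or a compatible compiler such as Clang.

```c
#include <stddef.h>

/* Sum the set bits of every array element with GCC's builtin.
   popcount_array is a hypothetical name, not from the original post. */
int popcount_array(const int *array, size_t len)
{
    int result = 0;
    for (size_t i = 0; i < len; i++) {
        /* The cast makes negative values count by their bit pattern. */
        result += __builtin_popcount((unsigned)array[i]);
    }
    return result;
}
```

Calling popcount_array(array, len) would replace both the outer for loop and the inner do-while loop of popcount2 from the question.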

Answer 2

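When I needed to create a popcount, I ended up using the 5's and 3's method from the Bit Twiddling Hacks that @PaulR mentioned. But if I wanted to do this with a loop, it might look something like this: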

#include <stdio.h>
#include <stdlib.h>

int popcount2(int v) {
   int result = 0;
   int junk;

   asm (
        "shr $1, %[v]      \n\t"   // shift low bit into CF
        "jz done           \n"     // and skip the loop if that was the only set bit
     "start:               \n\t"
        "adc $0, %[result] \n\t"   // add CF (0 or 1) to result
        "shr $1, %[v]      \n\t"
        "jnz start         \n"     // leave the loop after shifting out the last bit
     "done:                \n\t"
        "adc $0, %[result] \n\t"   // and add that last bit

        : [result] "+r" (result), "=r" (junk)
        : [v] "1" (v)
        : "cc"
   );

   return result;
}

int main(int argc, char *argv[])
{
   for (int x=0; x < argc-1; x++)
   {
      int v = atoi(argv[x+1]);

      printf("%d %d\n", v, popcount2(v));
   }
}

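adc is almost always more efficient than branching on CF.

"=r" (junk) is a dummy output operand that lives in the same register as v (because of the "1" constraint). We use it to tell the compiler that the asm statement destroys the v input. We could have used [v] "+r"(v) to get a read-write operand, but we don't want the C variable v to be updated.

Note that the loop trip count for this implementation is the position of the highest set bit (bsr, or 32 - clz(v)). @rcgldr's implementation, which clears the lowest set bit on each iteration, will typically be faster when only a few bits are set but they are not all near the bottom of the integer.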
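To make the trip-count comparison concrete, here is a small C model of both loops. It only counts iterations and ignores the cost of each iteration; the function names are hypothetical.

```c
/* Iterations of the shift-right loop: one per bit position up to and
   including the highest set bit (0 for v == 0 in this model). */
unsigned shift_loop_trips(unsigned v)
{
    unsigned trips = 0;
    while (v) {
        v >>= 1;
        trips++;
    }
    return trips;
}

/* Iterations of the clear-lowest-set-bit loop: one per set bit. */
unsigned clear_lowest_trips(unsigned v)
{
    unsigned trips = 0;
    while (v) {
        v &= v - 1;   /* clears the lowest set bit */
        trips++;
    }
    return trips;
}
```

For 0x80000001 the shift loop needs 32 iterations while the clear-lowest loop needs only 2. For 0xFF both need 8.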

Answer 3

Regarding "using 3-6 lines of code" in assembly:

This example uses a 4-instruction loop:

popcntx proc    near
        mov     ecx,[esp+4]             ;ecx = value to popcnt
        xor     eax,eax                 ;will be popcnt
        test    ecx,ecx                 ;br if ecx == 0
        jz      popc1
popc0:  lea     edx,[ecx-1]             ;edx = ecx-1
        inc     eax                     ;eax += 1
        and     ecx,edx                 ;ecx &= (ecx-1)
        jnz     short popc0
popc1:  ret
popcntx endp

This example uses a 3-instruction loop, but on most processors it will be slower than the 4-instruction loop version.

popcntx proc    near
        mov     eax,[esp+4]             ;eax = value to popcnt
        mov     ecx,32                  ;ecx = max # 1 bits
        test    eax,eax                 ;br if eax == 0
        jz      popc1
popc0:  lea     edx,[eax-1]             ;eax &= (eax-1)
        and     eax,edx
        loopnz  popc0
popc1:  neg     ecx
        lea     eax,[ecx+32]
        ret
popcntx endp

Here is an alternative, non-looping example:

popcntx proc    near
        mov     ecx,[esp+4]             ;ecx = value to popcnt
        mov     edx,ecx                 ;edx = ecx
        shr     edx,1                   ;mov upr 2 bit field bits to lwr
        and     edx,055555555h          ; and mask them
        sub     ecx,edx                 ;ecx = 2 bit field counts
                                        ; 0->0, 1->1, 2->1, 3->2
        mov     eax,ecx
        shr     ecx,02h                 ;mov upr 2 bit field counts to lwr
        and     eax,033333333h          ;eax = lwr 2 bit field counts
        and     ecx,033333333h          ;ecx = upr 2 bit field counts
        add     ecx,eax                 ;ecx = 4 bit field counts
        mov     eax,ecx
        shr     eax,04h                 ;mov upr 4 bit field counts to lwr
        add     eax,ecx                 ;eax = 8 bit field counts
        and     eax,00f0f0f0fh          ; after the and
        imul    eax,eax,01010101h       ;eax bits 24->29 = bit count
        shr     eax,018h                ;eax bits 0->5 = bit count
        ret
popcntx endp
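For comparison, the same non-looping technique can be sketched in C. This follows the assembly step by step; the name popcount_swar is made up, and the code assumes a 32-bit input.

```c
#include <stdint.h>

/* Branch-free popcount of a 32-bit value, mirroring the assembly above.
   popcount_swar is a hypothetical name. */
uint32_t popcount_swar(uint32_t v)
{
    v = v - ((v >> 1) & 0x55555555u);                 /* 2-bit field counts */
    v = (v & 0x33333333u) + ((v >> 2) & 0x33333333u); /* 4-bit field counts */
    v = (v + (v >> 4)) & 0x0f0f0f0fu;                 /* 8-bit field counts */
    return (v * 0x01010101u) >> 24;                   /* sum the four bytes */
}
```

The multiply by 0x01010101 adds the four byte counts together in the top byte, which the final shift moves down to the low bits.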

Answer 4

The best thing you can do is use the built-in popcount function as Paul R suggested, but since you need to write it in assembly, this worked for me:

asm (
"start:                  \n"
        "and %1, %1      \n"   // set ZF if x == 0
        "jz end          \n"
        "shr $1, %1      \n"   // shift the low bit of x into CF
        "jnc start       \n"
        "inc %0          \n"   // CF was set: count the bit in result
        "jmp start       \n"
"end:                    \n"
        : "+r" (result),
          "+r" (x)
        :
        : "cc"
);

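In the first two lines you just check the contents of x (and jump to end if it is zero, Jump Zero). Then you shift x one bit to the right and:

At the end of the shift operation, the CF flag contains the last bit shifted out of the destination operand. *

If CF is not set, just go to start (Jump Not Carry); otherwise increment result and then go to start.

The beautiful thing about assembly is that you can do things in so many ways...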

asm (
"start:                  \n"
        "shr $1, %1      \n"
        "jnc loop_cond   \n"
        "inc %0          \n"
        "and %1, %1      \n"
"loop_cond:              \n"
        "jnz start       \n"

        : "+r" (result),       // register operand, so inc needs no size suffix
          "+r" (x)
        :
        : "cc"
);

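Here you again use the SHift Right instruction; if CF is not set, just go to the loop condition.

Otherwise, increment result and then perform a bitwise AND of x with itself to set ZF again (INC does modify ZF).

Using `LOOP` and `ECX`

I was curious how to do this in 3 instructions (I figured your teacher wouldn't give you a lower bound of 3 if it weren't possible), and I realized that x86 also has the LOOP instruction:

Each time the LOOP instruction is executed, the count register is decremented, then checked for 0. If the count is 0, the loop is terminated and program execution continues with the instruction following the LOOP instruction. If the count is not zero, a near jump is performed to the destination (target) operand, which is presumably the instruction at the beginning of the loop. *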

You can place the iteration count in that register using the GCC machine constraint:

c - the c register.

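long count = 8;           // Assuming 8b type (char)
asm (
"start:              \n"
    "shr $1, %1      \n"
    "adc $0, %0      \n"
    "loop start      \n"

    : "+r" (result),
      "+r" (x),           // shr modifies x, so it must be read-write
      "+c" (count)        // loop decrements the count register
    :
    : "cc"
);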

Just to make sure it compiles to the proper assembly:

0x000000000040051f <+25>:   mov    $0x8,%ecx
0x0000000000400524 <+30>:   mov    -0x8(%rbp),%eax
0x0000000000400527 <+33>:   shr    %edx
0x0000000000400529 <+35>:   adc    $0x0,%eax
0x000000000040052c <+38>:   loop   0x400527 <main+33>

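I think the first one should have better performance, especially if only 1 bit is set; this approach always does k*8 iterations for a k-byte type.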
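The fixed trip count is easier to see in a plain C equivalent of the LOOP version. This is only a sketch: the name popcount_fixed is hypothetical, and it covers the full width of unsigned rather than the 8 bits assumed above.

```c
#include <limits.h>

/* Popcount with a fixed trip count: every bit of the type is inspected,
   no matter how many of them are set. popcount_fixed is a hypothetical name. */
int popcount_fixed(unsigned x)
{
    int result = 0;
    for (unsigned i = 0; i < sizeof x * CHAR_BIT; i++) {
        result += x & 1u;   /* add the low bit, like adc after shr */
        x >>= 1;
    }
    return result;
}
```

With a 4-byte unsigned this runs 32 iterations even for x == 0, whereas the earlier loops stop as soon as x becomes zero.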

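SSE4 and a single instruction

I know you have to use a loop, but just for fun... With the SSE4 extension you could accomplish this with a single instruction, POPCNT:

This instruction calculates the number of bits set to 1 in the second operand (source) and returns the count in the first operand (a destination register). *

I think (I have a fairly old CPU in my notebook, so I can't test this for you) you should be able to do this with just one simple instruction: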

asm (
    "POPCNT %1, %0   \n"
    : "=r" (result)
    : "mr" (x)
    : "cc"
);

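(If you try this and you have the SSE4 extension, please let me know whether it works.)

Performance

I measured the time required to perform 100,000,000 popcounts, comparing my first and second methods with David Wohlferd's. [raw data]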

+--------------+------------+------------+------------+
|              | 0x00000000 | 0x80000001 | 0xffffffff |
+--------------+------------+------------+------------+
| 1st solution |  0.543     |  5.040     |  3.833     |
| LOOP         | 11.530     | 11.523     | 11.523     |
| Davids       |  0.750     |  4.893     |  4.890     |
+--------------+------------+------------+------------+

If anyone can compare these 3 with SSE4's POPCNT instruction, I'd be glad.
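For anyone who wants to repeat such a comparison, a minimal timing harness could look like the sketch below. It is not the harness used for the table above. The names popcount_fn, popcount_ref and time_popcount are made up, and clock() measures CPU time with limited resolution.

```c
#include <time.h>

typedef int (*popcount_fn)(unsigned);

/* Portable reference implementation used as an example function under test. */
int popcount_ref(unsigned x)
{
    int n = 0;
    while (x) {
        x &= x - 1;   /* clear the lowest set bit */
        n++;
    }
    return n;
}

/* Calls fn(value) `iterations` times and stores the elapsed CPU time in
   seconds in *seconds. Returns the sum of all results so that the calls
   cannot be optimized away. */
unsigned long time_popcount(popcount_fn fn, unsigned value,
                            unsigned long iterations, double *seconds)
{
    volatile unsigned input = value;  /* volatile: forces a fresh load per call */
    unsigned long total = 0;
    clock_t begin = clock();
    for (unsigned long i = 0; i < iterations; i++)
        total += (unsigned long)fn(input);
    *seconds = (double)(clock() - begin) / CLOCKS_PER_SEC;
    return total;
}
```

Passing each candidate function with the three inputs from the table (0x00000000, 0x80000001, 0xffffffff) and 100,000,000 iterations would give one comparable number per cell.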


4 answers

Answer 1: 4, 2014-11-20 22:18:04
Answer 2: 3, 2014-11-21 02:28:41
Answer 3: 2, 2014-11-21 13:02:33
Answer 4: 1, accepted, 2014-11-20 23:45:22
