
Find First Specific Byte in a Byte[] Array c#

I have a byte array and want to find the first occurrence (if any) of a specific byte.

Can you guys help me with a nice, elegant and efficient way to do it?

/// Summary
/// Finds the first occurrence of a specific byte in a byte array.
/// If not found, returns -1.
public int GetFirstOccurance(byte byteToFind, byte[] byteArray)
{

}

public static int GetFirstOccurance(byte byteToFind, byte[] byteArray)
{
   return Array.IndexOf(byteArray,byteToFind);
}

Array.IndexOf returns -1 if the value is not found.

Or as Sam pointed out, an extension method:

public static int GetFirstOccurance(this byte[] byteArray, byte byteToFind)
{
   return Array.IndexOf(byteArray,byteToFind);
}

Or to make it generic:

public static int GetFirstOccurance<T>(this T[] array, T element)
{
   return Array.IndexOf(array,element);
}

Then you can just say:

int firstIndex = byteArray.GetFirstOccurance(byteValue);
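For reference, here is a minimal sketch of how the generic version fits into a complete program. The class names ArrayExtensions and Demo are placeholders of mine, not part of the answer; the only real requirement is that an extension method lives in a non-nested static class.

```csharp
using System;

// Hypothetical container class: extension methods must be declared in a static class.
public static class ArrayExtensions
{
    // Thin generic wrapper over Array.IndexOf; returns -1 when the element is absent.
    public static int GetFirstOccurance<T>(this T[] array, T element)
    {
        return Array.IndexOf(array, element);
    }
}

public static class Demo
{
    public static void Main()
    {
        byte[] data = { 0x10, 0x20, 0x30, 0x20 };

        Console.WriteLine(data.GetFirstOccurance((byte)0x20)); // 1 (the first of the two matches)
        Console.WriteLine(data.GetFirstOccurance((byte)0x99)); // -1 (not found)
    }
}
```

The casts to byte are needed because a bare literal such as 0x20 is an int, which would stop T from being inferred as byte.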

Since you mentioned efficiency, here is some heavily optimized C# code I've written. It uses native addressing and qword-aligned reads to examine 8 bytes per memory access, which cuts the number of memory accesses by a factor of 8. I would be surprised if there is any faster way to scan for a byte in memory in .NET.

This returns the index of the first occurrence of byte v within the range of memory starting at offset i (relative to address src) and continuing for length c. The returned index is also relative to src. Returns -1 if byte v is not found.

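// fast IndexOf byte in memory. (To use this with managed byte[] array, see below)
public unsafe static int IndexOfByte(byte* src, byte v, int i, int c)
{
    ulong t;
    byte* p, pEnd;

    // Scan byte-by-byte until p is 8-byte aligned (or the range is exhausted).
    for (p = src + i; ((long)p & 7) != 0; c--, p++)
        if (c == 0)
            return -1;
        else if (*p == v)
            return (int)(p - src);

    // Broadcast v into all 8 bytes of r.
    ulong r = v; r |= r << 8; r |= r << 16; r |= r << 32;

    // Main loop: examine 8 bytes per aligned read.
    for (pEnd = p + (c & ~7); p < pEnd; p += 8)
    {
        t = *(ulong*)p ^ r;  // bytes equal to v become 0x00
        t = (t - 0x0101010101010101) & ~t & 0x8080808080808080;  // nonzero iff some byte was 0x00
        if (t != 0)
        {
            t &= (ulong)-(long)t;  // isolate the lowest set bit (the first match on little-endian)
            return (int)(p - src) + dbj8[t * 0x07EDD5E59A4E28C2 >> 58];
        }
    }

    // Tail: the remaining 0..7 bytes.
    for (pEnd += c & 7; p < pEnd; p++)
        if (*p == v)
            return (int)(p - src);

    return -1;
}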

Don't be alarmed by the one multiplication you see; it is executed at most once per call, to perform the final de Bruijn lookup. The read-only lookup table it uses is a simple shared array of 64 sbyte values that needs only one-time initialization:

// elsewhere in the static class...

readonly static sbyte[] dbj8 =
{
     7, -1, -1, -1, -1,  5, -1, -1, -1,  4, -1, -1, -1, -1, -1, -1,
    -1, -1, -1, -1, -1, -1, -1, -1,  6, -1, -1, -1, -1, -1, -1, -1,
    -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
    -1, -1, -1,  3, -1, -1, -1, -1, -1, -1,  1, -1,  2,  0, -1, -1,
};

The -1 entries are never accessed, so they can be left at zero instead. If you prefer, the following static constructor is an alternative to the table initializer above:

static MyStaticClass()
{
    dbj8 = new sbyte[64];  // initialize the lookup table (alternative to the above)
    dbj8[0x00] = 7;
    dbj8[0x18] = 6;
    dbj8[0x05] = 5;
    dbj8[0x09] = 4;
    dbj8[0x33] = 3;
    dbj8[0x3C] = 2;
    dbj8[0x3A] = 1;
 /* dbj8[0x3D] = 0; */
}

readonly static sbyte[] dbj8;  // replaces the initialized declaration above when using the constructor
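To see where the table positions come from, the following sketch recomputes them. It takes the 0x80 flag bit of each of the 8 byte positions, multiplies it by the same constant used in IndexOfByte, and keeps the top 6 bits. The names DeBruijnDemo and TableIndex are mine, for illustration only.

```csharp
using System;

public static class DeBruijnDemo
{
    // Same multiplier as in IndexOfByte above.
    const ulong Multiplier = 0x07EDD5E59A4E28C2;

    // Table slot reached when the isolated match flag belongs to byte 'byteIndex' (0..7).
    public static int TableIndex(int byteIndex)
    {
        ulong flag = 0x80UL << (8 * byteIndex);            // the only bit left after "t &= -t"
        return (int)(unchecked(flag * Multiplier) >> 58);  // top 6 bits select the slot
    }

    public static void Main()
    {
        for (int k = 0; k < 8; k++)
            Console.WriteLine($"byte {k} -> dbj8[0x{TableIndex(k):X2}]");
        // byte 0 -> dbj8[0x3D]
        // byte 1 -> dbj8[0x3A]
        // byte 2 -> dbj8[0x3C]
        // byte 3 -> dbj8[0x33]
        // byte 4 -> dbj8[0x09]
        // byte 5 -> dbj8[0x05]
        // byte 6 -> dbj8[0x18]
        // byte 7 -> dbj8[0x00]
    }
}
```

These are exactly the eight slots assigned in the table, which is why the other 56 entries are never read.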

For completeness, here is how to use the function with the method prototype provided by the OP in the original question.

/// Finds the first occurrence of a specific byte in a byte array.
/// If not found, returns -1.
public static unsafe int GetFirstOccurance(byte byteToFind, byte[] byteArray)
{
    fixed (byte* p = byteArray)
        return IndexOfByte(p, byteToFind, 0, byteArray.Length);
}

Discussion
My code is a bit intricate, so detailed examination is left as an exercise for the interested reader. You can study another take on the general approach of gang-wise memory searching in the .NET internal method Buffer.IndexOfByte, but that code has significant drawbacks compared to mine:

  • Most significantly, the .NET code scans only 4 bytes at a time instead of 8 as in mine.
  • It's a non-public method, so you'd need to use reflection to call it.
  • The .NET code has a "performance leak": its t1 != 0 check can give a false positive, in which case the four checks that follow are wasted. Note their "fall-through" case: because of this false positive, they need four final checks (thus allowing fall-through) to maintain correctness, instead of just three.
  • The .NET code's false positive is caused by an inherently inferior bitwise computation based on overflow of the carry bit from one byte to the next. This leads to two's complement asymmetries (evidenced by their use of the constants 0x7efefeff and 0x81010100) and the occasional "left-wise egress" (i.e., loss) of information regarding the most-significant byte, which is the real problem here. In contrast, I use an underflow computation whose result is nonzero exactly when some byte matches. A borrow can spill from a matching byte into the byte above it, but never downward, so the lowest flagged byte is always a true match, and that is the only flag I use. My method therefore gives a conclusive result in all cases with no false-positive or "fall-through" processing.
  • My code uses a branchless technique for the final lookup. A handful of non-branching logical operations (plus one multiplication in this case) is generally believed to perform better than extended if-else structures, since the latter can defeat CPU branch prediction. This issue is more important for my 8-byte scanner because without the lookup I'd have twice as many if-else conditions in the final check as a 4-byte gang-wise scanner.
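As a worked example of the underflow computation discussed in the list above, the sketch below applies the same three steps (broadcast, XOR, subtract-and-mask) to a single 64-bit word in safe code. MatchMaskDemo and MatchMask are illustrative names of mine, and the byte order in the comments assumes a little-endian machine.

```csharp
using System;

public static class MatchMaskDemo
{
    // Returns a mask with a 0x80 flag on matching bytes; it is zero when no byte equals v,
    // and its lowest set bit always belongs to the lowest-addressed match.
    public static ulong MatchMask(ulong word, byte v)
    {
        ulong r = v; r |= r << 8; r |= r << 16; r |= r << 32;  // broadcast v to all 8 bytes
        ulong t = word ^ r;                                     // matching bytes become 0x00
        return unchecked(t - 0x0101010101010101UL) & ~t & 0x8080808080808080UL;
    }

    public static void Main()
    {
        // In memory (little-endian): 11 22 33 44 55 33 77 88
        ulong word = 0x8877335544332211UL;

        Console.WriteLine(MatchMask(word, 0x33).ToString("X16")); // 0000800000800000 (bytes 2 and 5)
        Console.WriteLine(MatchMask(word, 0x99).ToString("X16")); // 0000000000000000 (no match)
    }
}
```

IndexOfByte then isolates the lowest set bit of this mask and turns it into a byte offset with the table lookup.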

Of course, if you're not concerned with all these minutiae you can just copy and use the code; I've unit-tested it quite exhaustively and verified correct behavior for all well-formed inputs. So while the core functionality is ready to use, you'll probably want to add argument checking.
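One possible shape for that argument checking is sketched below. IndexOfByteChecked is a hypothetical wrapper name, and so that the sketch compiles without unsafe code, the actual search is delegated to Array.IndexOf as a stand-in; in real use that line would be the fixed block calling IndexOfByte.

```csharp
using System;

public static class CheckedSearch
{
    // Hypothetical checked entry point: validates (array, start, count) before searching.
    public static int IndexOfByteChecked(byte[] array, byte value, int start, int count)
    {
        if (array == null)
            throw new ArgumentNullException(nameof(array));
        if (start < 0 || start > array.Length)
            throw new ArgumentOutOfRangeException(nameof(start));
        if (count < 0 || count > array.Length - start)
            throw new ArgumentOutOfRangeException(nameof(count));

        // Stand-in for: fixed (byte* p = array) return IndexOfByte(p, value, start, count);
        // Both return an index relative to the start of the array, or -1.
        return Array.IndexOf(array, value, start, count);
    }

    public static void Main()
    {
        byte[] data = { 1, 2, 3, 2 };

        Console.WriteLine(IndexOfByteChecked(data, 2, 2, 2)); // 3 (search covers indexes 2..3)
        Console.WriteLine(IndexOfByteChecked(data, 2, 0, 1)); // -1 (search covers index 0 only)
    }
}
```

Writing the count check as array.Length - start, rather than start + count, avoids integer overflow for very large arguments.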


[edit:]

String.IndexOf(String s, Char char, int ix_start, int count) ... fast!

Because the above method has worked so well in my projects, I extended it to cover 16-bit searching. Here is the same code adapted to search for a 16-bit short, ushort, or char primitive instead of byte. This adapted method was also verified independently, using unit tests adapted from those for the byte version.

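// If the class already has a static constructor, add these assignments to it
// (a class can have only one).
static MyStaticClass()
{
    dbj16 = new sbyte[64];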
 /* dbj16[0x3A] = 0; */
    dbj16[0x33] = 1;
    dbj16[0x05] = 2;
    dbj16[0x00] = 3;
}
readonly static sbyte[] dbj16;

public static unsafe int IndexOf(ushort* src, ushort v, int i, int c)
{
    ulong t;
    ushort* p, pEnd;

    for (p = src + i; ((long)p & 7) != 0; c--, p++)
        if (c == 0)
            return -1;
        else if (*p == v)
            return (int)(p - src);

    ulong r = ((ulong)v << 16) | v;
    r |= r << 32;

    for (pEnd = p + (c & ~7); p < pEnd; p += 4)
    {
        t = *(ulong*)p ^ r;
        t = (t - 0x0001000100010001) & ~t & 0x8000800080008000;
        if (t != 0)
        {
            i = dbj16[(t & (ulong)-(long)t) * 0x07EDD5E59A4E28C2 >> 58];
            return (int)(p - src) + i;
        }
    }

    for (pEnd += c & 7; p < pEnd; p++)
        if (*p == v)
            return (int)(p - src);

    return -1;
}

Below are the overloads that expose this for arrays of the 16-bit primitives (char, short, ushort), plus String (shown last):

public static int IndexOf(this char[] rg, char v) => IndexOf(rg, v, 0, rg.Length);
public static unsafe int IndexOf(this char[] rg, char v, int i, int c = -1)
{
    // A negative count means "search to the end of the array".
    if (rg != null && (c = c < 0 ? rg.Length - i : c) > 0)
        fixed (char* p = rg)
            return IndexOf((ushort*)p, v, i, c);
    return -1;
}

public static int IndexOf(this short[] rg, short v) => IndexOf(rg, v, 0, rg.Length);
public static unsafe int IndexOf(this short[] rg, short v, int i, int c = -1)
{
    // A negative count means "search to the end of the array".
    if (rg != null && (c = c < 0 ? rg.Length - i : c) > 0)
        fixed (short* p = rg)
            return IndexOf((ushort*)p, (ushort)v, i, c);
    return -1;
}

public static int IndexOf(this ushort[] rg, ushort v) => IndexOf(rg, v, 0, rg.Length);
public static unsafe int IndexOf(this ushort[] rg, ushort v, int i, int c = -1)
{
    // A negative count means "search to the end of the array".
    if (rg != null && (c = c < 0 ? rg.Length - i : c) > 0)
        fixed (ushort* p = rg)
            return IndexOf(p, v, i, c);
    return -1;
}

public static unsafe int IndexOf(String s, Char ch, int i = 0, int c = -1)
{
    // A negative count means "search to the end of the string".
    if (s != null && (c = c < 0 ? s.Length - i : c) > 0)
        fixed (char* p = s)
            return IndexOf((ushort*)p, ch, i, c);
    return -1;
}

Notice that the String overload is not marked as an extension method. A higher-performance replacement would never be called that way, because a built-in instance method with the same name always takes precedence over an extension method. To use it as an extension on String instances, change the method's name. For example, IndexOf__(this String s, ...) would appear next to the built-in method in IntelliSense listings, which may be a helpful reminder to opt in. Otherwise, if you don't need extension syntax, just call this optimized version directly as a static method of its own class whenever you want it instead of s.IndexOf(Char ch).
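The precedence rule is easy to confirm with a small experiment. In this sketch the extension methods return a sentinel value (-42) instead of doing a real search, purely so the output shows which method was bound; StringSearchExtensions and PrecedenceDemo are illustrative names.

```csharp
using System;

public static class StringSearchExtensions
{
    // Same name as the built-in instance method: s.IndexOf(ch) never binds to this.
    public static int IndexOf(this string s, char ch) => -42;   // sentinel, not a real search

    // Renamed variant: reachable with extension syntax.
    public static int IndexOf__(this string s, char ch) => -42; // sentinel, not a real search
}

public static class PrecedenceDemo
{
    public static void Main()
    {
        string s = "hello";

        Console.WriteLine(s.IndexOf('l'));                         // 2: built-in String.IndexOf wins
        Console.WriteLine(s.IndexOf__('l'));                       // -42: renamed extension is called
        Console.WriteLine(StringSearchExtensions.IndexOf(s, 'l')); // -42: explicit static call
    }
}
```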

