Is this a correct and portable way of checking if 2 c-strings overlap in memory?

Might not be the most efficient way, but is it correct and portable?

#include <string.h>

int are_overlapping(const char *a, const char *b) {
  return (a + strlen(a) == b + strlen(b));
}

To clarify: what I'm looking for is overlap in memory, not in the actual content. For example:

const char a[] = "string";
const char b[] = "another string";
are_overlapping(a, b); // should return 0
are_overlapping(a, a + 3); // should return 1

Yes, your code is correct. If two strings end at the same place, they overlap by definition: they share the same null terminator. Either both strings are identical, or one is a suffix of the other.

Everything about your program is perfectly well-defined behaviour, so assuming standards-compliant compilers, it should be perfectly portable.

The relevant bit in the standard is from 6.5.9 Equality operators (emphasis mine):

Two pointers compare equal if and only if both are null pointers, both are pointers to the same object (including a pointer to an object and a subobject at its beginning) or function, both are pointers to one past the last element of the same array object, or one is a pointer to one past the end of one array object and the other is a pointer to the start of a different array object that happens to immediately follow the first array object in the address space.
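As a quick sanity check of the claim above, here is a minimal sketch that runs the two examples from the question through the endpoint comparison. The helper name check_question_examples is made up for illustration; only are_overlapping comes from the question.

```c
#include <assert.h>
#include <string.h>

/* The check from the question: two C-strings overlap in memory
   exactly when they share the same null terminator. */
int are_overlapping(const char *a, const char *b) {
    return a + strlen(a) == b + strlen(b);
}

/* Hypothetical helper: exercises the two examples from the question. */
void check_question_examples(void) {
    const char a[] = "string";
    const char b[] = "another string";

    assert(are_overlapping(a, b) == 0);     /* two distinct arrays */
    assert(are_overlapping(a, a + 3) == 1); /* "ing" is a suffix of a */
}
```

Calling check_question_examples() from main should return silently if both assertions hold. Note that only == is applied to the pointers, which is what keeps the comparison well-defined even when a and b belong to different arrays.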

Thinking about zdan's comments on my previous post (which will probably shortly be deleted), I've come to the conclusion that checking endpoints is sufficient.

If there's any overlap, the null terminator will make the two strings not be distinct. Let's look at some possibilities.

If you start with

a 0x10000000 "Hello" and somehow add
b 0x10000004 "World",

you'll have a single word: HellWorld, since the W would overwrite the o and the rest of World would run over the original \0. They would end at the same endpoint.

If somehow you write to the same starting point:

a 0x10000000 "Hello" and
b 0x10000000 "Jupiter"

You'll have the word Jupiter, and have the same endpoint.

Is there a case where you can have the same endpoint and not have overlap? Kind of.

a = 0x1000000 "Four" and
b = 0x1000004 "".

That will give an overlap as well.

I can't think of any time you'll have overlap where you won't have matching endpoints, assuming that you're writing null-terminated strings into memory.

So, the short answer: Yes, your check is sufficient.
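The three cases above can be reproduced in a small sketch, assuming the strings are written into ordinary char buffers with strcpy. The buffer names and the helper check_shared_buffer_cases are hypothetical, chosen only for this illustration.

```c
#include <assert.h>
#include <string.h>

int are_overlapping(const char *a, const char *b) {
    return a + strlen(a) == b + strlen(b);
}

/* Hypothetical helper: replays the cases discussed above. */
void check_shared_buffer_cases(void) {
    char buf[16] = "Hello";

    /* Case 1: write "World" four bytes into "Hello". */
    strcpy(buf + 4, "World");
    assert(strcmp(buf, "HellWorld") == 0);
    assert(are_overlapping(buf, buf + 4) == 1);

    /* Case 2: write a second string at the same starting point. */
    strcpy(buf, "Jupiter");
    assert(are_overlapping(buf, buf) == 1);

    /* Case 3: the empty string that consists only of the terminator. */
    char four[] = "Four";
    assert(strlen(four + 4) == 0);
    assert(are_overlapping(four, four + 4) == 1);
}
```

In every case the two strings share one terminator, which is exactly what the endpoint comparison detects.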

It is probably not relevant to your use case, as your question is specifically about C-strings, but the code will not work in the case that the data has embedded NUL bytes in the strings.

char a[] = "abcd\0ABCD";
char *b = a + 5;
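To make the caveat concrete, here is a minimal sketch of that example. Both pointers refer to the same array, but as C-strings they end at different terminators, so the endpoint check reports no overlap. The helper name check_embedded_nul is an assumption made for illustration.

```c
#include <assert.h>
#include <string.h>

int are_overlapping(const char *a, const char *b) {
    return a + strlen(a) == b + strlen(b);
}

/* Hypothetical helper: one array holding two C-strings back to back. */
void check_embedded_nul(void) {
    char a[] = "abcd\0ABCD";
    char *b = a + 5;

    assert(strcmp(a, "abcd") == 0); /* a stops at the embedded NUL */
    assert(strcmp(b, "ABCD") == 0); /* b is the second string      */

    /* Same array object, different terminators: reported as 0. */
    assert(are_overlapping(a, b) == 0);
}
```

Whether 0 is the "right" answer here depends on whether you care about the underlying array or about the strings themselves; for the strings, as the question asks, it is.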

Other than that, your solution is straightforward and correct. It works because you are only using == for the pointer comparison, and according to the standard (from C11 6.5.9/6):

Two pointers compare equal if and only if both are null pointers, both are pointers to the same object (including a pointer to an object and a subobject at its beginning) or function, both are pointers to one past the last element of the same array object, or one is a pointer to one past the end of one array object and the other is a pointer to the start of a different array object that happens to immediately follow the first array object in the address space.

However, the relational operators are more strict (from C11 6.5.8/5):

When two pointers are compared, the result depends on the relative locations in the address space of the objects pointed to. If two pointers to object types both point to the same object, or both point one past the last element of the same array object, they compare equal. If the objects pointed to are members of the same aggregate object, pointers to structure members declared later compare greater than pointers to members declared earlier in the structure, and pointers to array elements with larger subscript values compare greater than pointers to elements of the same array with lower subscript values. All pointers to members of the same union object compare equal. If the expression P points to an element of an array object and the expression Q points to the last element of the same array object, the pointer expression Q+1 compares greater than P. In all other cases, the behavior is undefined.

The last sentence is the kicker.

Some have taken exception to the fact that your code may scan the overlapping part twice, and have tried to avoid that. However, any saving from skipping the second scan is either offset by an extra pointer comparison per iteration, or relies on undefined or implementation-defined behavior. Assuming you want a portable and compliant solution, the actual average gain is likely nil, and not worth the effort.

This solution has the same worst-case performance, but it is optimized for hits: you don't have to scan both strings.

int are_overlapping(const char *a, const char *b) {
    const char *temp_a = a;
    const char *temp_b = b;

    /* walk through a; a hit means b starts somewhere inside a */
    while (*temp_a != '\0') {
        if (temp_a++ == b)
            return 1;
    }

    /* check for b being an empty string at a's terminator */
    if (temp_a == b)
        return 1;

    /* but if b starts first, we aren't done, so try from b now */
    while (*temp_b != '\0') {
        if (temp_b++ == a)
            return 1;
    }

    /* check for a being an empty string at b's terminator */
    return temp_b == a;
}

Apparently, only pointer equality comparisons (== and !=), not relational ones (<, >, <=, >=), are portable in C for pointers into unrelated objects, so the following solutions aren't portable. Everything below is from before I knew that.

Your solution is valid, but why calculate strlen on the second string? You know the start and end point of one string, so just see if the other starts between them (inclusive). That saves you a pass through the second string, going from O(M+N) to O(M):

const char *lower_addr_string = a < b ? a : b;
const char *higher_addr_string = a > b ? a : b;
size_t length = strlen(lower_addr_string);
return higher_addr_string >= lower_addr_string && higher_addr_string <= lower_addr_string + length;

Alternatively, do the string parsing yourself:

const char *lower_addr_string = a < b ? a : b;
const char *higher_addr_string = a > b ? a : b;
while (*lower_addr_string != '\0') {
    if (lower_addr_string == higher_addr_string)
        return 1;
    ++lower_addr_string;
}
/* check the last character (the terminator) */
if (lower_addr_string == higher_addr_string)
    return 1;
return 0;

Yes, your check is correct, but it is certainly not the most efficient (if by "efficiency" you mean the computational efficiency). The obvious intuitive inefficiency in your implementation is based on the fact that when the strings actually overlap, the strlen calls will iterate over their common portion twice.

For the sake of formal efficiency, one might use a slightly different approach:

int are_overlapping(const char *a, const char *b) 
{
  if (a > b) /* or `(uintptr_t) a > (uintptr_t) b`, see note below! */
  {
    const char *t = a; 
    a = b; 
    b = t;
  }

  while (a != b && *a != '\0')
    ++a;

  return a == b;
}

An important note about this version is that it performs a relational comparison of two pointers that are not guaranteed to point into the same array, which formally leads to undefined behavior. It will work in practice on a system with a flat memory model, but might draw criticism from a pedantic code reviewer. To formally work around this issue, one might convert the pointers to uintptr_t before performing the relational comparison. That way the undefined behavior gets converted to implementation-defined behavior with proper semantics for our purposes in most (if not all) traditional implementations with a flat memory model.
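The uintptr_t workaround described in the note might look like the sketch below. It assumes the implementation provides uintptr_t (the type is optional in the standard) and that converting pointers to integers preserves address order, as it does on typical flat-memory platforms. The names are_overlapping_uintptr and check_uintptr_variant are made up here to avoid clashing with the version above.

```c
#include <assert.h>
#include <stdint.h>

/* Same scan as above, but the ordering test is done on integers,
   so no relational comparison of unrelated pointers takes place.
   Assumption: integer order matches address order on this platform. */
int are_overlapping_uintptr(const char *a, const char *b) {
    if ((uintptr_t)(const void *)a > (uintptr_t)(const void *)b) {
        const char *t = a;
        a = b;
        b = t;
    }

    /* Walk the lower string until we reach b or the terminator. */
    while (a != b && *a != '\0')
        ++a;

    return a == b;
}

/* Hypothetical helper: the examples from the question. */
void check_uintptr_variant(void) {
    const char a[] = "string";
    const char b[] = "another string";

    assert(are_overlapping_uintptr(a, b) == 0);
    assert(are_overlapping_uintptr(a, a + 3) == 1);
    assert(are_overlapping_uintptr(a + 3, a) == 1); /* argument order is irrelevant */
}
```

Inside the loop only == and != are applied to pointers, so the integer conversion is the single place where implementation-defined behavior enters.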

This approach is free from the "double counting" problem: it only analyzes the non-overlapping portion of the string that is located "earlier" in memory. Of course, in practice the benefits of this approach might prove to be non-existent. It will depend on both the quality of your strlen implementation and on the properties of the actual input.

For example, in this situation

const char *str = "Very very very long string, say 64K characters long......";

are_overlapping(str, str + 1);

my version will detect the overlap much faster than yours. My version will do it in 1 iteration of the loop, while your version will spend about 2 * 64K iterations (assuming a naive implementation of strlen).
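To see the difference in numbers, here is a rough sketch that counts loop iterations for both approaches on a short string. All names (counted_strlen, overlap_by_endpoints, overlap_by_scan, check_step_counts) are hypothetical, the strlen is deliberately naive, and the ordering test uses the uintptr_t conversion mentioned in the note above, so the same platform assumptions apply.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Naive strlen that adds one step per character examined. */
static size_t counted_strlen(const char *s, size_t *steps) {
    size_t n = 0;
    while (s[n] != '\0') {
        ++n;
        ++*steps;
    }
    return n;
}

/* Endpoint comparison from the question, instrumented. */
int overlap_by_endpoints(const char *a, const char *b, size_t *steps) {
    return a + counted_strlen(a, steps) == b + counted_strlen(b, steps);
}

/* Single scan from the lower address, instrumented. */
int overlap_by_scan(const char *a, const char *b, size_t *steps) {
    if ((uintptr_t)(const void *)a > (uintptr_t)(const void *)b) {
        const char *t = a;
        a = b;
        b = t;
    }
    while (a != b && *a != '\0') {
        ++a;
        ++*steps;
    }
    return a == b;
}

/* Hypothetical helper: an 8-character string against its own tail. */
void check_step_counts(void) {
    const char str[] = "abcdefgh";
    size_t endpoint_steps = 0, scan_steps = 0;

    assert(overlap_by_endpoints(str, str + 1, &endpoint_steps) == 1);
    assert(overlap_by_scan(str, str + 1, &scan_steps) == 1);

    assert(endpoint_steps == 15); /* 8 + 7 characters examined */
    assert(scan_steps == 1);      /* stops as soon as a reaches b */
}
```

A real strlen is usually vectorized, so these step counts overstate the practical gap; they only illustrate the asymptotic point being made.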

If you decide to dive into the realm of questionable pointer comparisons, the above idea can also be reimplemented as:

int are_overlapping(const char *a, const char *b) 
{
  if (a > b)
  {
    const char *t = a; 
    a = b; 
    b = t;
  }

  return b <= a + strlen(a);
}

This implementation does not perform an extra pointer comparison on each iteration. The price we pay for that is that it always iterates to the end of one of the strings instead of terminating early. Yet it is still more efficient than your implementation, since it calls strlen only once.
