
How to make char* kmers from an extremely long C++ string

I have a really long string object and I'd like to refer to windows in it, [0, 19], [1, 20], ..., [981, 1000], as char x[20].

Let's call our string foo. I've tried

x = &foo[i]

and iterating, but I get an incompatible type error, because &foo[i] is of type char*.

How can I refer to that 20-char block of the memory of our string foo, using a char x[20]?

More philosophically, what is the difference between char *x and char x[20] if the latter is not null terminated?

One goal is to avoid doubling the memory requirement by creating brand new memory blocks for every window.

char *p is a pointer to memory somewhere that should contain characters. Nothing about p implies where the data ends. char a[100] is a 100-character section of memory. The compiler knows the size of a, and uses it for sizeof, for indexing multi-dimensional arrays, and for some type checking. In most expressions a converts to &a[0] (the address of element 0 of a), which has type char*.
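A small sketch of that difference. The function names here are hypothetical and only exist to make each point checkable:

```cpp
#include <cassert>
#include <cstddef>

// The array's size is part of its type, so sizeof reports all 100 bytes.
std::size_t array_bytes() {
    char a[100] = {};
    return sizeof(a);
}

// A pointer only records an address; sizeof reports the pointer's own size.
std::size_t pointer_bytes() {
    char a[100] = {};
    char* p = a;  // the array converts to a pointer to its first element
    return sizeof(p);
}

// After that conversion, p and &a[0] are the same address.
bool decays_to_first_element() {
    char a[100] = {};
    char* p = a;
    return p == &a[0];
}
```

So a pointer can stand in for the start of an array, but the length information is lost in the conversion.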

The user of a or p must know the length somehow:

1) A length supplied alongside the pointer. Ex: sizeof(a) (which is in bytes). I also like to use numof(a), which is the count of elements instead of the byte size, by adding: #define numof(X) (sizeof(X)/sizeof(*X)). Both only work on a real array, not on a pointer. Instead of a length, you can also use another pointer to the end to stop at.

2) Some content or rule that tells the user of p when to stop. Ex: *p == 0 (the null character '\0').

This is a powerful source of flexibility in C/C++ (and also of danger if misused).
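The two conventions can be sketched side by side. Both helper names below (count_char_n and count_char_z) are made up for this illustration:

```cpp
#include <cassert>
#include <cstddef>

// Convention 1: the caller passes the length explicitly.
std::size_t count_char_n(const char* p, std::size_t len, char target) {
    std::size_t hits = 0;
    for (std::size_t i = 0; i < len; ++i) {
        if (p[i] == target) ++hits;
    }
    return hits;
}

// Convention 2: the data ends at the first '\0'.
std::size_t count_char_z(const char* p, char target) {
    assert(p != nullptr);
    std::size_t hits = 0;
    for (; *p != '\0'; ++p) {
        if (*p == target) ++hits;
    }
    return hits;
}
```

Only the first form can look at a 20-char window in the middle of a longer string without touching the string, because the window has no '\0' of its own.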


a) Change the code that uses the array so it also takes a length limit or a pointer to the end to stop at. You may also need null termination checking in case the last block is undersized.

b) Process the data only one block at a time. Then you only need one additional 20-char array. Or, if you can ensure no other threads are using the array at the same time, you can temporarily change the null termination:

// array is assumed to be a writable char[] (not a pointer, so sizeof works)
// whose length is a multiple of 20 plus 1 more for null
char * ptr = array;
while ( ptr < array + sizeof(array)-1 )
{
  char * end = ptr + 20; // we will stop here
  char save_char = *end; // save the character there
  *end = 0; // put in temporary null
  ProcessBlock( ptr ); // now null terminated !
  *end = save_char; // restore the array
  ptr = end; // end of this block is start of next
}
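The first half of option (b), copying each block into one reusable buffer, might look roughly like this. It is only a sketch: kBlock and for_each_block are hypothetical names, the source is assumed to be a std::string, and a trailing partial block is simply skipped.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

constexpr std::size_t kBlock = 20;  // hypothetical block size (the 20-char window)

// Calls process(block) once per full block, reusing one null-terminated buffer,
// so the extra memory is kBlock + 1 bytes no matter how long source is.
// Returns the number of blocks processed.
template <typename Fn>
std::size_t for_each_block(const std::string& source, Fn process) {
    char buf[kBlock + 1];  // +1 for the terminating '\0'
    std::size_t blocks = 0;
    for (std::size_t pos = 0; pos + kBlock <= source.size(); pos += kBlock) {
        std::memcpy(buf, source.data() + pos, kBlock);
        buf[kBlock] = '\0';
        const char* block = buf;
        process(block);
        ++blocks;
    }
    return blocks;
}
```

Unlike the temporary-null trick, this never writes to the original data, so it also works on a const string and while other threads are reading it.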

Take a look at the StringRef class from LLVM. Essentially, it just holds a pointer to the first character and a length, and does not own or copy the data. You can do something like this, for example:

std::string source = "... something really long ...";
const char * b = source.c_str();
llvm::StringRef window(b + 100, 20);

window is now an entity that refers to a portion of source. You can call begin() and end() on it to get iterators. You can print it just like a normal string, like this:

std::cout << window;

It comes with a variety of other common string operations, as you can see in the docs.
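If pulling in LLVM is not an option, the standard library's std::string_view (C++17) is a similar non-owning view. This is a different class from StringRef, shown here only as a sketch; window_at is a hypothetical helper name:

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Returns a non-owning view of len characters of source starting at pos.
// No characters are copied; the view is valid only while source is alive
// and unmodified.
std::string_view window_at(const std::string& source, std::size_t pos, std::size_t len) {
    assert(pos <= source.size() && len <= source.size() - pos);
    return std::string_view(source).substr(pos, len);
}
```

The returned view carries its own length, so it does not need a null terminator at the end of the window.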

Just do

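const size_t window_size = 20;
// i + window_size <= foo.size() visits every full window, including the last one,
// and cannot underflow when foo is shorter than window_size.
for (size_t i = 0; i + window_size <= foo.size(); ++i)
{
    const char* x = foo.data() + i;
    // Do something with x[0] to x[window_size - 1]; x is not null terminated there
}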
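The same loop can be wrapped in a reusable function. This is a sketch with a hypothetical name (for_each_kmer); the bound is written as i + k <= foo.size() so that the last full window is included and a string shorter than k yields no windows:

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Calls visit(ptr) for every k-character window of foo. ptr[0]..ptr[k-1] are
// valid, but the window is NOT null terminated at ptr[k].
// Returns the number of windows visited.
template <typename Fn>
std::size_t for_each_kmer(const std::string& foo, std::size_t k, Fn visit) {
    assert(k > 0);
    std::size_t count = 0;
    // i + k <= size() avoids unsigned underflow when foo is shorter than k.
    for (std::size_t i = 0; i + k <= foo.size(); ++i) {
        visit(foo.data() + i);
        ++count;
    }
    return count;
}
```

A string of n characters has n - k + 1 windows of length k, so a 1000-character string has 981 windows of length 20, and none of them is copied.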

The reason you were getting the "incompatible type error" is that x and &foo[i] are of different types. Consider this:

  • foo is a string (the same reasoning holds if it is a char[], i.e. array of char)
  • therefore foo[i] is of type char
  • therefore &foo[i] is of type char* (i.e. pointer to char)

The difference between char* x and char x[20] is that in the first case x is a pointer to char and in the second case it is an array of char. In the first case you may make the pointer point at any char in your process' memory. In the second, x can often behave like a pointer, but it always refers to the beginning of its own 20 bytes of storage. An array cannot be assigned to or made to refer to memory elsewhere, which is why x = &foo[i] does not compile when x is declared as char x[20].

Assuming the size of foo is a multiple of the window size, you can step through the non-overlapping windows like this:

char foo[FOO_SIZE];
for (unsigned i = 0; i < FOO_SIZE; i += WINDOW_SIZE) {
    char first_char = foo[i];
    char last_char = foo[i + WINDOW_SIZE - 1]; // Warning: if foo size is not multiple of window size, this may exceed foo in the last window
}
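If the size is not a multiple of the window size, one option is to clamp the last window instead of reading past the end. A sketch, where window_bounds is a hypothetical helper that only computes index ranges:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Returns [begin, end) index pairs for consecutive non-overlapping windows.
// The last window is shortened when total is not a multiple of window.
std::vector<std::pair<std::size_t, std::size_t>>
window_bounds(std::size_t total, std::size_t window) {
    assert(window > 0);
    std::vector<std::pair<std::size_t, std::size_t>> out;
    for (std::size_t begin = 0; begin < total; begin += window) {
        out.emplace_back(begin, std::min(begin + window, total));
    }
    return out;
}
```

With the range in hand, the last valid character of a window is foo[end - 1] rather than foo[i + WINDOW_SIZE - 1].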

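Also, your own code is close: keep the ampersand, but declare x as a pointer (const char* x = &foo[i];) instead of as char x[20]. Removing the ampersand would not help, because [] already dereferences, so foo[i] is a single char rather than the start of a window.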

