
Removing punctuation marks using ispunct()

ispunct() works well when words are separated like this: "one, two; three". In that case the punctuation (", ;") is found and replaced with a given character.

But if the string is given as "ts='TOK_STORE_ID';", then "ts='TOK_STORE_ID';" is treated as one single token, and

"one,one, two;four$three two" is treated as three tokens: 1. "one,one" 2. "two;four$three" 3. "two"

Is there any way for "one,one, two;four$three two" to be treated as "one one two four three two", with each word a separate token?

Writing manual code like:

for(i=0;i<str.length();i++)
{
  //validating each character
}

This operation will become very costly when string is very very long.

So is there any other function like ispunct() ? or anything else?

In C we do this to compare each character:

  for (i = 0; i < str.length(); i++)
  {
      if (str[i] == ',' || str[i] == ';') // Is there any way here to compare with all punctuation in one shot?
          str[i] = ' '; // replace with a space (a char literal, not a string literal)
  }

In C++, what is the correct way to do this?

This operation will become very costly when string is very very long.

No, it won't. It will be an O(n) operation which is good for this problem. You cannot get better than this for this operation because any which way, you have to look at each and every character in the string. There is no way to do this without looking at each and every character in the string.
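As a rough illustration of that single O(n) pass, here is a minimal sketch that uses std::replace_if with std::ispunct and then splits on whitespace. The function names punct_to_spaces and tokenize are made up for this example, and it assumes a single-byte character set and the default "C" locale.

```cpp
#include <algorithm>
#include <cctype>
#include <sstream>
#include <string>
#include <vector>

// Replace every punctuation character with a space in one pass over the string.
// The lambda takes unsigned char because passing a negative char value
// to std::ispunct is undefined behavior.
std::string punct_to_spaces(std::string text)
{
    std::replace_if(text.begin(), text.end(),
                    [](unsigned char ch) { return std::ispunct(ch) != 0; },
                    ' ');
    return text;
}

// Split the cleaned text on whitespace; operator>> skips runs of spaces.
std::vector<std::string> tokenize(const std::string& text)
{
    std::istringstream in(punct_to_spaces(text));
    std::vector<std::string> tokens;
    for (std::string word; in >> word; )
        tokens.push_back(word);
    return tokens;
}
```

With this sketch, tokenize("one,one, two;four$three two") yields the six tokens one, one, two, four, three, two.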

Assuming you're dealing with a typical 8-bit character set, I'd start by building a translation table:

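// UCHAR_MAX comes from <climits>, ispunct from <cctype>.
// The table needs one entry per value in 0..UCHAR_MAX, hence the + 1.
std::vector<char> trans(UCHAR_MAX + 1);

for (int i = 0; i <= UCHAR_MAX; i++)
    trans[i] = ispunct(i) ? ' ' : static_cast<char>(i);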

Then processing a string of text can be something like this:

for (auto &ch : str)
    ch = trans[(unsigned char)ch];

For an 8-bit character set, the translation table will typically all fit in your L1 cache, and the loop has only one branch that's highly predictable (always taken except when you reach the end of the string) so it should be fairly fast.

Just to be clear, when I say "fairly fast", I mean it's extremely unlikely that this would be the bottleneck in the process you've described. You'd need a combination of a slow processor and a fast network connection to stand any chance of this being the bottleneck in processing data you're obtaining over a network.

If you have a Raspberry Pi with a 10 GbE network connection, you might need to do a little more optimization work for this to keep up (but I'm not sure even then). For any less radical mismatch, the network is clearly going to be the bottleneck.
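The table and the loop from this answer can be put together roughly as follows. This is only a sketch: the names make_punct_table and blank_punctuation are invented here, std::array is swapped in for std::vector since the size is fixed, and an 8-bit character set in the default "C" locale is assumed.

```cpp
#include <array>
#include <cctype>
#include <climits>
#include <string>

// Build the table once. It has UCHAR_MAX + 1 entries so that every
// possible unsigned char value (0..UCHAR_MAX) is a valid index.
std::array<char, UCHAR_MAX + 1> make_punct_table()
{
    std::array<char, UCHAR_MAX + 1> table{};
    for (int i = 0; i <= UCHAR_MAX; ++i)
        table[i] = std::ispunct(i) ? ' ' : static_cast<char>(i);
    return table;
}

// Translate the string in place with one table lookup per character.
void blank_punctuation(std::string& str)
{
    static const std::array<char, UCHAR_MAX + 1> table = make_punct_table();
    for (auto& ch : str)
        ch = table[static_cast<unsigned char>(ch)];
}
```

The cast to unsigned char matters: on platforms where char is signed, a byte such as 0xFF would otherwise become a negative index.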

So is there any other function like ispunct()? or anything else?

As a matter of fact, there is. man ispunct gives me this beautiful list:

int isalnum(int c);
int isalpha(int c);
int isascii(int c);
int isblank(int c);
int iscntrl(int c);
int isdigit(int c);
int isgraph(int c);
int islower(int c);
int isprint(int c);
int ispunct(int c);
int isspace(int c);
int isupper(int c);
int isxdigit(int c);

Take whichever you want.
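These functions can also be combined into a predicate of your own. One thing to watch for: in the default "C" locale, ispunct() returns true for '_', so "TOK_STORE_ID" would be split into three words. Below is a small sketch of a custom predicate that keeps underscores; is_separator and blank_separators are hypothetical names chosen for this example.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Hypothetical predicate: any punctuation character except the underscore,
// which is left alone so identifiers such as TOK_STORE_ID stay in one piece.
bool is_separator(unsigned char ch)
{
    return std::ispunct(ch) != 0 && ch != '_';
}

// Replace every separator with a space in a single pass.
std::string blank_separators(std::string text)
{
    std::replace_if(text.begin(), text.end(), is_separator, ' ');
    return text;
}
```

Applied to "ts='TOK_STORE_ID';", this leaves the two words ts and TOK_STORE_ID separated by spaces.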

You can also use std::remove_copy_if to remove the punctuation completely:

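#include <algorithm>
#include <cctype>
#include <iostream>
#include <iterator>
#include <string>

int main()
{
    std::string words = "I,love.punct-uation!";
    std::string result;  // receives the text with the punctuation stripped out

    // Copy every character that is not punctuation into result.
    // The lambda takes unsigned char because passing a negative char
    // value to std::ispunct is undefined behavior.
    std::remove_copy_if(words.begin(), words.end(),
                        std::back_inserter(result),  // store output
                        [](unsigned char ch) { return std::ispunct(ch) != 0; });

    // ta-da!
    std::cout << result << std::endl;  // prints "Ilovepunctuation"
}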
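If a second string is not wanted, the same removal can be done in place with the erase-remove idiom. This is a sketch only, and strip_punctuation is a name invented for it. Note that removing punctuation outright joins neighbouring words ("one,one" becomes "oneone"), so for the tokenizing problem in the question, replacing with a space is probably the better fit.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Remove all punctuation from text in place.
// std::remove_if shifts the kept characters to the front and returns the
// new logical end; erase() then trims the leftover tail.
void strip_punctuation(std::string& text)
{
    text.erase(std::remove_if(text.begin(), text.end(),
                              [](unsigned char ch) { return std::ispunct(ch) != 0; }),
               text.end());
}
```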


 