Regular Expression for Digits and Special Characters - C#

I use Html-Agility-Pack to extract information from some websites. The data comes back as strings, which I then use in my program.

Sometimes a single string contains a repeated detail. For example, I get the movie name "Dog Eats Dog (2012) (2012)" when it should be "Dog Eats Dog (2012)".

This is just one example of many. To correct it I tried the string.Distinct() method, but that removes every duplicate in the string: for the example above it would return "Dog Eats (2012)". That solves the initial problem by removing the second (2012), but creates a new one by changing the actual title.

I thought the problem could be solved with Regex, but I have no idea how to use it here. As far as I know, a Regex match would only tell me that the string contains duplicate items matching the pattern.

But how do I remove them? There can be a string like "Meme 2013 (2013) (2013)". Here the actual title is "Meme 2013", followed by the year (2013) and a duplicate of the year (2013). Even if I get a bool value indicating that the string has a duplicate year, I can't think of any way to actually remove the duplicate substring.

The duplicate year always comes at the end of the string. So what Regex should I use to determine that the string actually has the year in it twice, like (2012) (2012)?

If I can correctly identify that the string contains a duplicate, maybe I can use string.LastIndexOf() to try to remove the duplicate part. If there is a better way to do it, please let me know.

Thanks.

The right regex is "( \\(\\d{4}\\))\\1+" .

string pattern = @"( \(\d{4}\))\1+";
string result = Regex.Replace(s, pattern, "$1"); // Replace returns a new string; s itself is not modified

Example here : https://repl.it/Evcy/2

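Explanation:
Capture one " (dddd)" block (a space followed by four digits in parentheses) and remove all identical blocks that immediately follow it.
( \\(\\d{4}\\)) does the capture; \\1+ is a backreference that matches one or more consecutive repeats of the captured text.

Finally, the replacement "$1" substitutes the initial block alone for the initial block plus its copies.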
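
To make the steps above concrete, here is a minimal, self-contained sketch that applies the same pattern to the two titles from the question. The class name DuplicateYearDemo and the helper CollapseRepeatedYear are hypothetical names chosen for illustration; only the pattern and the "$1" replacement come from the answer.

```csharp
using System;
using System.Text.RegularExpressions;

public static class DuplicateYearDemo
{
    // Pattern from the answer: capture one " (dddd)" block,
    // then match one or more immediate repeats of that exact text.
    private static readonly Regex RepeatedYear = new Regex(@"( \(\d{4}\))\1+");

    // Hypothetical helper: replaces the block plus its copies with the block alone.
    public static string CollapseRepeatedYear(string title)
    {
        return RepeatedYear.Replace(title, "$1");
    }

    public static void Main()
    {
        Console.WriteLine(CollapseRepeatedYear("Dog Eats Dog (2012) (2012)")); // Dog Eats Dog (2012)
        Console.WriteLine(CollapseRepeatedYear("Meme 2013 (2013) (2013)"));    // Meme 2013 (2013)
        Console.WriteLine(CollapseRepeatedYear("Dog Eats Dog (2012)"));        // Dog Eats Dog (2012)
    }
}
```

Because the backreference must match the captured text exactly, a title with two different years, such as "Title (2012) (2013)", is left untouched, and a title with no duplicate passes through unchanged.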

This regex allows any amount of whitespace between the repeated blocks, even none, as in (2013)(2013):

`@"(\(\d{4}\))(?:\s*\1)+"`

I have a demo of it here
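
As a rough illustration of the whitespace-tolerant pattern, the sketch below runs it over a few spacing variants. The class name FlexibleYearDemo and the helper CollapseRepeatedYear are hypothetical; the pattern itself is the one given in this answer.

```csharp
using System;
using System.Text.RegularExpressions;

public static class FlexibleYearDemo
{
    // Pattern from this answer: capture "(dddd)", then match one or more
    // repeats of it, each preceded by any amount of whitespace (including none).
    private static readonly Regex RepeatedYear = new Regex(@"(\(\d{4}\))(?:\s*\1)+");

    // Hypothetical helper: keeps only the first year block of a repeated run.
    public static string CollapseRepeatedYear(string title)
    {
        return RepeatedYear.Replace(title, "$1");
    }

    public static void Main()
    {
        Console.WriteLine(CollapseRepeatedYear("Meme 2013 (2013)(2013)"));         // Meme 2013 (2013)
        Console.WriteLine(CollapseRepeatedYear("Meme 2013 (2013)   (2013)"));      // Meme 2013 (2013)
        Console.WriteLine(CollapseRepeatedYear("Dog Eats Dog (2012) (2012) (2012)")); // Dog Eats Dog (2012)
    }
}
```

Here the capture group excludes the leading space, so the whitespace before the first year block stays in the title and only the whitespace between the copies is consumed by the match.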
