Regular Expression for Digits and Special Characters - C#

Question

I use Html-Agility-Pack to extract information from some websites. In the process I get data in the form of strings, and I use that data in my program.

Sometimes the data I get includes the same detail more than once in a single string, as in the name of this movie: "Dog Eats Dog (2012) (2012)". The name should have been "Dog Eats Dog (2012)".

The above is one example of many. To correct the issue I tried the string.Distinct() method, but it removes all duplicates in the string, not just the year: for the example above it returns "Dog Eats (2012)". That solved my initial problem by removing the second (2012), but created a new one by changing the actual title.

I thought my problem could be solved with Regex, but I have no idea how to use it here. As far as I know, a Regex would only tell me whether the string contains duplicate items matching the pattern I define.

But how do I remove it? There can be a string like "Meme 2013 (2013) (2013)". Here the actual title is "Meme 2013", followed by the year (2013) and the duplicate year (2013). Even if I get a bool value indicating that the string has a duplicate year, I can't think of any method to actually remove the duplicate substring.

The duplicate year always comes at the end of the string. So what Regex should I use to determine that the string actually has two years in it, like (2012) (2012)?

If I can correctly identify that the string contains a duplicate, maybe I can use string.LastIndexOf() to try to remove the duplicate part. If there is a better way to do it, please let me know.

Thanks.

Answer 1

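The right regex is "( \\(\\d{4}\\))\\1+" (written here as a regular C# string literal, so each backslash is doubled; the verbatim @"..." form below needs no doubling).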

string pattern = @"( \(\d{4}\))\1+";
// Replace returns a new string; the input s itself is not modified.
string result = new Regex(pattern).Replace(s, "$1");

Example here: https://repl.it/Evcy/2

Explanation:
Capture one " (dddd)" block (a space followed by four digits in parentheses), and remove all identical blocks that immediately follow it.
( \\(\\d{4}\\)) does the capture; \\1+ matches any non-empty run of copies of that captured block.

Finally, replace the initial block and its copies by the initial block alone.
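
As a rough sketch of the steps above, here is the pattern applied to the two titles from the question. The `collapse` helper and the test strings are only illustrative and are not part of the original answer. The snippet is written as top-level statements (C# 9 or later); with an older compiler the same lines would go inside a Main method.

```csharp
using System;
using System.Text.RegularExpressions;

// Capture one " (dddd)" block (leading space included), then match one or
// more immediate repeats of exactly that captured text.
string pattern = @"( \(\d{4}\))\1+";

// Hypothetical helper: replace the block and its copies with the block alone.
Func<string, string> collapse = s => Regex.Replace(s, pattern, "$1");

Console.WriteLine(collapse("Dog Eats Dog (2012) (2012)")); // Dog Eats Dog (2012)
Console.WriteLine(collapse("Meme 2013 (2013) (2013)"));    // Meme 2013 (2013)
Console.WriteLine(collapse("Dog Eats Dog (2012)"));        // no duplicate, unchanged
```

Because the backreference \1 only matches the exact text that was captured, the bare "2013" in the title "Meme 2013" is left alone: it is not wrapped in parentheses, so it never forms part of a match.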

Answer 2

This regex allows any amount of whitespace between the repeated years, even none, as in (2013)(2013):

`@"(\(\d{4}\))(?:\s*\1)+"`

I have a demo of it here
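
A small sketch of how this pattern might be used. The answer only gives the pattern, so replacing each match with "$1" is an assumption carried over from Answer 1, and the `collapse` helper and sample strings are illustrative. It is written as top-level statements (C# 9 or later).

```csharp
using System;
using System.Text.RegularExpressions;

// One "(dddd)" block, followed by one or more copies of the same block,
// each optionally preceded by any amount of whitespace (including none).
string pattern = @"(\(\d{4}\))(?:\s*\1)+";

// Assumed replacement: keep only the first captured block.
Func<string, string> collapse = s => Regex.Replace(s, pattern, "$1");

Console.WriteLine(collapse("Meme 2013 (2013)(2013)"));    // Meme 2013 (2013)
Console.WriteLine(collapse("Meme 2013 (2013)   (2013)")); // Meme 2013 (2013)
Console.WriteLine(collapse("Meme 2013 (2013) (2014)"));   // years differ, unchanged
```

Here the capture group does not include the leading space, so the space before the first year stays outside the match and is preserved in the result.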

solution1
1 ACCEPTED 2016-12-22 16:29:30

solution2
0 2016-12-22 17:16:49