
Regular Expression for Digits and Special Characters - C#

I use Html-Agility-Pack to extract information from some websites. In the process I get data as a string, and I use that data in my program.

Sometimes the data I get packs multiple details into a single string. For example, one movie name comes back as "Dog Eats Dog (2012) (2012)". The name should have been "Dog Eats Dog (2012)".

The above is one example out of many. To correct the issue I tried string.Distinct(), but it works on individual characters: it removes every duplicate character in the string, so the title itself gets mangled (the second "Dog" disappears, for example). It gets rid of the second (2012), which was my initial problem, but it creates a new one by changing the actual title.
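A quick sketch of why Distinct() is the wrong tool here: on a string it enumerates characters, so every repeated character is dropped, not just the repeated year. `DistinctDemo` and `DistinctChars` are hypothetical names used only for this illustration.

```csharp
using System;
using System.Linq;

public static class DistinctDemo
{
    // Collapses a string to its distinct characters (LINQ's Distinct works per char on a string).
    public static string DistinctChars(string s) => new string(s.Distinct().ToArray());

    public static void Main()
    {
        // Repeated letters, spaces and digits all vanish, so the title is damaged too.
        Console.WriteLine(DistinctChars("Dog Eats Dog (2012) (2012)")); // prints "Dog Eats(201)"
    }
}
```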

I thought my problem could be solved with Regex, but I have no idea how to use it here. As far as I know, a Regex would only tell me that the string contains duplicate items matching the pattern I define.

But how do I remove it? There can be a string like "Meme 2013 (2013) (2013)". Here the actual title is "Meme 2013", followed by the year (2013) and the duplicate year (2013). Even if I get a bool value indicating that the string has a duplicate year, I can't think of any method to actually remove the duplicate substring.

The duplicate year always comes at the end of the string. So what Regex should I use to determine that the string actually has two years in it, like (2012) (2012)?

If I can correctly identify that the string contains a duplicate, maybe I can use string.LastIndexOf() to try and remove the duplicate part. If there is a better way to do it, please let me know.
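The detect-then-trim idea described in the question could look roughly like the sketch below. It assumes the duplicate is always a four-digit year in parentheses at the very end of the string; `TrailingYear`, `HasDuplicateYear` and `DropLastYear` are hypothetical names, and the pattern is one possible choice rather than the only one.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TrailingYear
{
    // Matches a string that ends with the same "(yyyy)" group twice in a row.
    // \1 is a backreference to whatever the first group captured.
    private static readonly Regex DuplicateYearAtEnd =
        new Regex(@"(\(\d{4}\))\s*\1\s*$");

    public static bool HasDuplicateYear(string title) =>
        DuplicateYearAtEnd.IsMatch(title);

    public static string DropLastYear(string title)
    {
        if (!HasDuplicateYear(title)) return title;

        // The last '(' opens the duplicate year; keep everything before it.
        int lastOpen = title.LastIndexOf('(');
        return title.Substring(0, lastOpen).TrimEnd();
    }

    public static void Main()
    {
        Console.WriteLine(DropLastYear("Dog Eats Dog (2012) (2012)"));
        Console.WriteLine(DropLastYear("Meme 2013 (2013) (2013)"));
    }
}
```

Note that this removes only one copy per call, so a string with three identical years would still keep two of them.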

Thanks.

The right regex is "( \\(\\d{4}\\))\\1+" (written here as a regular, escaped C# string literal).

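// s is the scraped title, e.g. "Dog Eats Dog (2012) (2012)"
string pattern = @"( \(\d{4}\))\1+";
// Replace returns a new string (strings are immutable), so keep the result.
string cleaned = new Regex(pattern).Replace(s, "$1");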

Example here: https://repl.it/Evcy/2

Explanation:
Capture one " (dddd)" block, and remove all identical blocks that follow it.
( \\(\\d{4}\\)) does the capture; \\1+ matches any non-empty sequence of copies of that captured block.

Finally, replace the initial block and its copies with the initial block alone.
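Put together as a small program, the approach above might look like this. The pattern and the "$1" replacement come from the answer; the `TitleCleaner` class and `RemoveRepeatedYear` method are hypothetical wrappers added for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TitleCleaner
{
    // One " (dddd)" block (leading space included), followed by one or more
    // exact copies of itself via the \1 backreference.
    private const string Pattern = @"( \(\d{4}\))\1+";

    // Replaces the block plus its copies with the single captured block ($1).
    public static string RemoveRepeatedYear(string title) =>
        Regex.Replace(title, Pattern, "$1");

    public static void Main()
    {
        Console.WriteLine(RemoveRepeatedYear("Dog Eats Dog (2012) (2012)")); // Dog Eats Dog (2012)
        Console.WriteLine(RemoveRepeatedYear("Meme 2013 (2013) (2013)"));    // Meme 2013 (2013)
        Console.WriteLine(RemoveRepeatedYear("Meme 2013 (2013)"));           // unchanged
    }
}
```

Because the pattern requires at least one copy after the captured block, a title with a single year, or with a bare year such as the "2013" in "Meme 2013", is left alone.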

This regex allows any amount of whitespace between the repeated years, even none, as in (2013)(2013):

`@"(\(\d{4}\))(?:\s*\1)+"`

I have a demo of it here.
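A minimal sketch of how this whitespace-tolerant pattern could be used. The answer gives only the pattern; replacing the match with "$1" is an assumption carried over from the other answer, and `FlexibleTitleCleaner` and `RemoveRepeatedYear` are hypothetical names.

```csharp
using System;
using System.Text.RegularExpressions;

public static class FlexibleTitleCleaner
{
    // "(dddd)" followed by one or more copies of itself, each optionally
    // preceded by whitespace. (?: ... ) groups without capturing.
    private const string Pattern = @"(\(\d{4}\))(?:\s*\1)+";

    // Assumption: keep only the first captured year, as in the other answer.
    public static string RemoveRepeatedYear(string title) =>
        Regex.Replace(title, Pattern, "$1");

    public static void Main()
    {
        Console.WriteLine(RemoveRepeatedYear("Meme 2013 (2013)(2013)"));       // Meme 2013 (2013)
        Console.WriteLine(RemoveRepeatedYear("Dog Eats Dog (2012)   (2012)")); // Dog Eats Dog (2012)
    }
}
```

Since the leading space is outside the captured group here, the space between the title and the first year is untouched by the replacement.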
