Regular Expression for Digits and Special Characters - C#
I use Html-Agility-Pack to extract information from some websites. In the process I get the data as a string, and I use that data in my program.
Sometimes the data I get includes multiple details in a single string. Take the name of this movie: "Dog Eats Dog (2012) (2012)". The name should have been "Dog Eats Dog (2012)".
The above is one example of many. To correct the issue I tried the string.Distinct() method, but it removes every duplicate character in the string; for the example above it would return "Dog Eats (2012)". So it solved my initial problem by removing the second (2012), but created a new one by changing the actual title.
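To illustrate why Distinct() is the wrong tool here, the following is a minimal sketch (the class and method names are made up for this example). A string is a sequence of characters, so Distinct() deduplicates individual characters rather than substrings, and it returns an IEnumerable<char> that has to be turned back into a string. The exact output may differ in detail from the string quoted in the question, because repeated letters, spaces, and digits are dropped as well.

```csharp
using System;
using System.Linq;

static class DistinctDemo
{
    // Hypothetical helper: rebuilds a string from its distinct characters.
    public static string DistinctChars(string s)
    {
        // string implements IEnumerable<char>, so Distinct() works per character.
        return new string(s.Distinct().ToArray());
    }

    static void Main()
    {
        string title = "Dog Eats Dog (2012) (2012)";
        // The second "Dog" disappears along with the duplicate year.
        Console.WriteLine(DistinctChars(title));
    }
}
```

Because every repeated character is removed, the title itself is damaged, which is the side effect described above.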
I thought my problem could be solved with Regex, but I have no idea how to use it here. As far as I know, if I use Regex it would tell me whether the string contains duplicate items according to the pattern I define.
But how do I remove it? There can be a string like "Meme 2013 (2013) (2013)". Here the actual title is "Meme 2013", followed by the year (2013) and the duplicate year (2013). Even if I get a bool value indicating that the string has a duplicate year, I can't think of any method to actually remove the duplicate substring.
The duplicate year always comes at the end of the string. So what Regex should I use to determine that the string actually has two years in it, like (2012) (2012)?
If I can correctly identify that the string contains a duplicate, maybe I can use string.LastIndexOf() to try to remove the duplicate part. If there is a better way to do it, please let me know.
Thanks.
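The approach outlined in the question (detect the duplicate first, then trim it with string.LastIndexOf()) could be sketched as follows. This is only an illustration under the question's stated assumption that the duplicate year is always at the end of the string; the class and method names are hypothetical, and the pattern is anchored with $ for that reason.

```csharp
using System;
using System.Text.RegularExpressions;

static class TrailingYear
{
    // Two identical " (dddd)" blocks at the very end of the string.
    // \1 is a backreference to the captured block, so " (2012) (2013)" does not match.
    private static readonly Regex DuplicateAtEnd = new Regex(@"( \(\d{4}\))\1$");

    public static bool HasDuplicateYear(string title) => DuplicateAtEnd.IsMatch(title);

    public static string RemoveDuplicateYear(string title)
    {
        // Loop so that three or more copies are reduced to a single one.
        while (DuplicateAtEnd.IsMatch(title))
        {
            // The last " (" starts the trailing duplicate; cut from there.
            int cut = title.LastIndexOf(" (", StringComparison.Ordinal);
            title = title.Substring(0, cut);
        }
        return title;
    }

    static void Main()
    {
        Console.WriteLine(HasDuplicateYear("Meme 2013 (2013) (2013)"));    // True
        Console.WriteLine(RemoveDuplicateYear("Meme 2013 (2013) (2013)")); // Meme 2013 (2013)
    }
}
```

This works, but it needs two passes over the string (match, then search again), which a single Regex.Replace call avoids.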
The right regex is "( \\(\\d{4}\\))\\1+".
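// requires: using System.Text.RegularExpressions;
string pattern = @"( \(\d{4}\))\1+";
// s is the scraped title; Replace returns a new string, so keep the result
string cleaned = new Regex(pattern).Replace(s, "$1");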
Example here: https://repl.it/Evcy/2
Explanation:
Capture one " (dddd)" block, and remove all following identical ones. ( \\(\\d{4}\\)) does the capture, and \\1+ matches any non-empty sequence of copies of that captured block.
Finally, replace the initial block and its copies with the initial block alone.
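As a self-contained illustration of the explanation above, here is a small sketch that applies the same pattern to the two titles from the question. The class name, the method name, and the third input ("Some Title (2012) (2013)") are made up for this example; the third input is there only to show that the backreference requires the years to be identical.

```csharp
using System;
using System.Text.RegularExpressions;

static class TitleCleaner
{
    // Same pattern as in the answer: one " (dddd)" block followed by
    // one or more copies of exactly that block.
    private static readonly Regex RepeatedYear = new Regex(@"( \(\d{4}\))\1+");

    // "$1" re-inserts the captured block once, dropping the copies.
    public static string Clean(string title) => RepeatedYear.Replace(title, "$1");

    static void Main()
    {
        Console.WriteLine(Clean("Dog Eats Dog (2012) (2012)")); // Dog Eats Dog (2012)
        Console.WriteLine(Clean("Meme 2013 (2013) (2013)"));    // Meme 2013 (2013)
        Console.WriteLine(Clean("Some Title (2012) (2013)"));   // unchanged: the years differ
    }
}
```

Note that "Meme 2013" keeps its bare 2013, since the pattern only matches a four-digit year wrapped in parentheses and preceded by a space.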