
Using Regex to extract part of a string from an HTML/text file

I have a C# regular expression to match author names in a text document, where each name is written as:

"author":"AUTHOR'S NAME"

The regex is as follows:

new Regex("\"author\":\"[A-Za-z0-9]*\\s?[A-Za-z0-9]*")

This returns "author":"AUTHOR'S NAME. However, I don't want the quotation marks or the word author in front of it. I just want the name.

Could anyone help me get the expected value, please?

Use regex groups to get a part of the string. ( ) acts as a capture group, and its contents can be accessed through the .Groups property of the match.

.Groups[0] contains the entire match

.Groups[1] contains the first capture group (and so on)

string pattern = "\"author\":\"([A-Za-z0-9]*\\s?[A-Za-z0-9]*)\"";
var match = Regex.Match("\"author\":\"Name123\"", pattern);
// Groups[1] is a Group object; read its Value to get the captured text.
string authorName = match.Groups[1].Value;
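
As a minimal sketch of the same capture-group idea, the snippet below checks match.Success before reading the group. It also swaps the character class inside the group for [^"]*, on the assumption that names may contain characters such as the apostrophe in AUTHOR'S NAME, which [A-Za-z0-9] would not match. The names AuthorExtractor and TryExtractAuthor are hypothetical and do not come from the question.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AuthorExtractor
{
    // Group 1 captures everything between the quotes that follow "author":
    // (assumption: the name itself never contains a double quote).
    private static readonly Regex AuthorPattern =
        new Regex("\"author\":\"([^\"]*)\"");

    // Returns true and sets name when an author field is found;
    // otherwise returns false and sets name to an empty string.
    public static bool TryExtractAuthor(string input, out string name)
    {
        Match match = AuthorPattern.Match(input);
        name = match.Success ? match.Groups[1].Value : string.Empty;
        return match.Success;
    }
}
```

Calling AuthorExtractor.TryExtractAuthor with the sample line from the question should set name to AUTHOR'S NAME, without the quotes or the "author": prefix.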

You can also use a look-around approach so that the match value itself contains only the name:

var txt = "\"author\":\"AUTHOR'S NAME\"";
var rgx = new Regex(@"(?<=""author"":"")[^""]+(?="")");
var result = rgx.Match(txt).Value;

My regex runs at about 555,020 iterations per second with this input string, which should suffice.

result will be AUTHOR'S NAME.

(?<="author":") is a look-behind that checks that "author":" appears immediately before the match. [^"]+ matches one or more characters other than a double quote, which looks safe since you only want to match alphanumerics and spaces between the quotes. (?=") is a look-ahead that checks for the trailing quote.
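
If the file contains several author entries, the same look-around pattern can be passed to Regex.Matches to collect every name. The following is only a sketch under that assumption; AuthorScanner and FindAuthors are hypothetical names, not part of the original answer.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class AuthorScanner
{
    // The look-behind and look-ahead assert the surrounding text without
    // consuming it, so each match value is just the name.
    private static readonly Regex NamePattern =
        new Regex(@"(?<=""author"":"")[^""]+(?="")");

    // Collects the value of every "author" field found in the text.
    public static List<string> FindAuthors(string text)
    {
        var names = new List<string>();
        foreach (Match m in NamePattern.Matches(text))
        {
            names.Add(m.Value);
        }
        return names;
    }
}
```

Because [^"]+ requires at least one character, an empty field such as "author":"" produces no match and is skipped.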
