Regular Expression in C#: data extraction
I have data like this:
<td><a href="/New_York_City" title="New York City">New York</a></td>
And I would like to get New York out of it.
I don't have any skill in regex whatsoever. I have tried this though:
StreamReader sr = new StreamReader("c:\\USAcityfile2.txt");
string pattern = "<td>.*</td>";
Regex r = new Regex(pattern, RegexOptions.IgnoreCase);
Regex r1 = new Regex("<a .*>.*</a>", RegexOptions.IgnoreCase);
string read = "";
while ((read = sr.ReadLine()) != null)
{
foreach (Match m in r.Matches(read))
{
foreach (Match m1 in r1.Matches(m.Value.ToString()))
Console.WriteLine(m1.Value);
}
}
sr.Close();
sr.Dispose();
This gave me <a href="/New_York_City" title="New York City">New York</a>.
How can I get at the data between <a .*> and </a>? Thanks.
If you insist on a regex for this particular case, then try this:
String pattern = @"(?<=<a[^>]*>).*?(?=</a>)";
(?<=<a[^>]*>) is a positive lookbehind assertion to ensure that there is <a[^>]*> before the wanted pattern.
(?=</a>) is a positive lookahead assertion to ensure that there is </a> after the pattern.
.*? is a lazy quantifier, matching as little as possible up to the first </a>.
A good reference for regular expressions is regular-expressions.info
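As a rough illustration of the lookaround pattern above, here is a minimal sketch run against the sample line from the question. The class name AnchorText and the helper FirstLinkText are made up for this example and are not part of the original answer.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper class, named for this example only.
public static class AnchorText
{
    // The lookbehind requires a preceding <a ...> tag and the lookahead a
    // following </a>; neither is consumed, so the match is the link text alone.
    private static readonly Regex LinkText =
        new Regex(@"(?<=<a[^>]*>).*?(?=</a>)", RegexOptions.IgnoreCase);

    // Returns the text of the first anchor in the line, or null if there is none.
    public static string FirstLinkText(string line)
    {
        Match m = LinkText.Match(line);
        return m.Success ? m.Value : null;
    }

    public static void Main()
    {
        string line = "<td><a href=\"/New_York_City\" title=\"New York City\">New York</a></td>";
        Console.WriteLine(FirstLinkText(line)); // New York
    }
}
```

Because the lookarounds are zero-width, m.Value needs no further trimming. Note that <a[^>]* would also accept other tags whose names start with "a", so this is only suitable for input as simple as the sample.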
A single regex will do:
string pattern = "<a[^>]*>(.*)</a>";
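One way this pattern might be used is to read the text from capture group 1 instead of the whole match. The sketch below assumes a single link per line, as in the question's sample; the names CaptureGroupDemo and LinkText are invented for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper class, named for this example only.
public static class CaptureGroupDemo
{
    // Returns what group 1 captured (the text between <a ...> and </a>),
    // or null when the line contains no anchor.
    public static string LinkText(string line)
    {
        Match m = Regex.Match(line, "<a[^>]*>(.*)</a>", RegexOptions.IgnoreCase);
        return m.Success ? m.Groups[1].Value : null;
    }

    public static void Main()
    {
        string line = "<td><a href=\"/New_York_City\" title=\"New York City\">New York</a></td>";
        Console.WriteLine(LinkText(line)); // New York
    }
}
```

The (.*) here is greedy, so on a line with several links it would run up to the last </a>; swapping in (.*?) would stop at the first one.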
foreach (Match m1 in r1.Matches(m.Value.ToString()))
{
//Console.WriteLine(m1.Value);
string[] res = m1.Value.Split(new char[] {'>','<'});
Console.WriteLine(res[2]);
}
Did the trick, for this particular example. Still not what I am looking for.
var g = Regex.Match(s, @"\<a[^>]+\>([^<]*)").Groups[1];
To find all values of <a> in your file you may use the following (easier) code:
var allValuesOfAnchorTag =
from line in File.ReadLines(filename)
from match in Regex.Matches(line, @"\<a[^>]+\>([^<]*)").OfType<Match>()
let @group = match.Groups[1]
where @group.Success
select @group.Value;
However, you seem to be working with XML, as @kirill-polishchuk correctly pointed out. If that is true, the code is even simpler:
var values = from e in XElement.Load(filename).Descendants("a")
select e.Value;
As per the OP's comment, the input document is HTML, so it would be better to use an HTML parser, e.g. Html Agility Pack. You can use the XPath //td/a to obtain the desired result.
Using the HTML Agility Pack (project page, nuget), this does the trick:
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml("your html here");
// or doc.Load(stream);
var nodes = doc.DocumentNode.Descendants("a");
// or var nodes = doc.DocumentNode.SelectNodes("//td/a") ?? new HtmlNodeCollection(null);
foreach (var node in nodes)
{
string city = node.InnerText;
}
// or var linkTexts = nodes.Select(node => node.InnerText);