Regular expressions in C#: data extraction

Question

I have data like this:

<td><a href="/New_York_City" title="New York City">New York</a></td>

And I would like to extract New York from it.

I don't have any skill with regex whatsoever. I have tried this, though:

StreamReader sr = new StreamReader("c:\\USAcityfile2.txt");
string pattern = "<td>.*</td>";
Regex r = new Regex(pattern, RegexOptions.IgnoreCase);
Regex r1 = new Regex("<a .*>.*</a>", RegexOptions.IgnoreCase);
string read = "";
while ((read = sr.ReadLine()) != null)
{
    foreach (Match m in r.Matches(read))
    {
        foreach (Match m1 in r1.Matches(m.Value.ToString()))
            Console.WriteLine(m1.Value);
    }
}
sr.Close();
sr.Dispose();

This gave me <a href="/New_York_City" title="New York City">New York</a>.

How can I get at the data between <a .*> and </a>? Thanks.

Answer 1

If you insist on a regex for this particular case, then try this:

string pattern = @"(?<=<a[^>]*>).*?(?=</a>)";

(?<=<a[^>]*>) is a positive lookbehind assertion: it ensures that <a[^>]*> appears immediately before the wanted text, without including it in the match.

(?=</a>) is a positive lookahead assertion: it ensures that </a> follows the wanted text, again without including it in the match.

.*? is a lazy quantifier, matching as little as possible up to the first </a>.
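To make the three pieces above concrete, here is a minimal sketch that applies the pattern to the sample line from the question. The class and method names (LookaroundDemo, ExtractLinkText) are made up for illustration. It relies on the .NET regex engine accepting a variable-length pattern such as [^>]* inside a lookbehind, which many other regex flavors do not allow.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LookaroundDemo
{
    // Hypothetical helper: returns the text between the first <a ...> tag
    // and the next </a>, or null when the input contains no such pair.
    public static string ExtractLinkText(string html)
    {
        // The lookbehind and lookahead only assert that the tags are present;
        // they are not part of the match, so m.Value is just the inner text.
        Match m = Regex.Match(html, @"(?<=<a[^>]*>).*?(?=</a>)", RegexOptions.IgnoreCase);
        return m.Success ? m.Value : null;
    }

    public static void Main()
    {
        string line = "<td><a href=\"/New_York_City\" title=\"New York City\">New York</a></td>";
        Console.WriteLine(ExtractLinkText(line)); // prints: New York
    }
}
```

Because the tags are excluded from the match itself, no capture group is needed here; the trade-off is that the pattern is harder to read than one with a plain group.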

A good reference for regular expressions is regular-expressions.info, in particular their lookaround explanation.

Answer 2

A single regular expression is enough:

string pattern = "<a[^>]*>(.*)</a>";
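With this pattern the wanted text is not the whole match but the first capture group, so it has to be read from Groups[1]. The following is a minimal sketch under that reading; the class and method names are illustrative only. Note that (.*) is greedy: on a line containing several links it would run up to the last </a>, and (.*?) would be the lazy alternative.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CaptureGroupDemo
{
    // Hypothetical helper: returns the contents of the first capture group,
    // i.e. the text between <a ...> and </a>, or null if nothing matches.
    public static string ExtractLinkText(string html)
    {
        Match m = Regex.Match(html, "<a[^>]*>(.*)</a>", RegexOptions.IgnoreCase);
        // Groups[0] is the whole match including the tags; Groups[1] is the part in parentheses.
        return m.Success ? m.Groups[1].Value : null;
    }

    public static void Main()
    {
        string line = "<td><a href=\"/New_York_City\" title=\"New York City\">New York</a></td>";
        Console.WriteLine(ExtractLinkText(line)); // prints: New York
    }
}
```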

Answer 3

foreach (Match m1 in r1.Matches(m.Value.ToString()))
{
    //Console.WriteLine(m1.Value);
    // Splitting on '<' and '>' puts the link text at index 2
    // (index 0 is empty, index 1 is the opening tag's contents).
    string[] res = m1.Value.Split(new char[] {'>','<'});
    Console.WriteLine(res[2]);
}

This did the trick for this particular example. Still not what I am looking for.

Answer 4

var g = Regex.Match(s, @"\<a[^>]+\>([^<]*)").Groups[1];

To find all values of <a> in your file you may use the following (easier) code:

        var allValuesOfAnchorTag =
            from line in File.ReadLines(filename)
            from match in Regex.Matches(line, @"\<a[^>]+\>([^<]*)").OfType<Match>()
            let @group = match.Groups[1]
            where @group.Success
            select @group.Value;

However, you seem to be working with XML, as @kirill-polishchuk correctly pointed out. If that is true, the code is even simpler:

        var values = from e in XElement.Load(filename).Descendants("a")
                         select e.Value;
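As a small self-contained illustration of the XML approach, the sketch below parses an in-memory string instead of loading a file. The sample row and the names XmlDemo and LinkTexts are assumptions made for the example. It only works when the input is well-formed XML; typical HTML with unclosed tags or entities such as &nbsp; would make XElement.Parse throw an XmlException.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class XmlDemo
{
    // Hypothetical helper: returns the text of every <a> element below the root.
    public static List<string> LinkTexts(string xml)
    {
        return XElement.Parse(xml)
                       .Descendants("a")      // all <a> elements at any depth below the root
                       .Select(e => e.Value)  // concatenated text content of each element
                       .ToList();
    }

    public static void Main()
    {
        string row = "<tr><td><a href=\"/New_York_City\" title=\"New York City\">New York</a></td></tr>";
        foreach (string city in LinkTexts(row))
            Console.WriteLine(city); // prints: New York
    }
}
```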

Answer 5

As per the OP's comment, the input document is HTML, so it would be better to use an HTML parser, e.g. Html Agility Pack. You can use the XPath //td/a to obtain the desired result.

Answer 6

Using the HTML Agility Pack (project page, nuget), this does the trick:

HtmlDocument doc = new HtmlDocument();
doc.LoadHtml("your html here"); 
// or doc.Load(stream);

var nodes = doc.DocumentNode.Descendants("a");
// or var nodes = doc.DocumentNode.SelectNodes("//td/a") ?? new HtmlNodeCollection(null);
// (SelectNodes returns null when nothing matches, hence the fallback)

foreach (var node in nodes)
{
    string city = node.InnerText;
}

// or var linkTexts = nodes.Select(node => node.InnerText);

6 answers

Answer 1
1 accepted 2012-03-20 06:58:59

Answer 2
0 2012-03-20 06:32:17

Answer 3
0

Answer 4
0 2012-03-20 06:56:22

Answer 5
0 2012-03-20 08:32:03

Answer 6
0 2012-03-20 10:59:35
