How can I extract text from a string?

Question

I have this code:

private void backgroundWorker1_DoWork(object sender, DoWorkEventArgs e)
{
    WebRequest request = WebRequest.Create(url);
    request.Method = "GET";
    WebResponse response = request.GetResponse();
    Stream stream = response.GetResponseStream();
    StreamReader reader = new StreamReader(stream);
    string content = reader.ReadToEnd();
    int start = content.IndexOf("profile/");
    int end = content.IndexOf("'");
    string result = content.Substring(start, end - start - 1);
    reader.Close();
    response.Close();
}

For example, I have a long line:

<span class="message-profile-name" ><a  href='/profile/daniel'>daniel</a></span>: <span class="message-text">hello everyone<wbr/> <img class='emoticon emoticon-tongue' src='/t.gif'/></span>

I want to build a new string with: daniel hello everyone

How can I do it? My code doesn't work; it throws this exception:

ArgumentOutOfRangeException Length cannot be less than zero. Parameter name: length

It is thrown on the line: string result = content.Substring(start, end - start - 1); In this case, start = 19572 and end = 2110.
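
The reported values point to the likely cause: content.IndexOf("'") searches from the beginning of the page, so it finds a quote at 2110, well before "profile/" at 19572, and end - start - 1 becomes negative. Below is a minimal sketch of the index-based approach, assuming the closing quote should be searched for only after the marker. The ExtractProfileName helper and the sample string are hypothetical, not part of the original code.

```csharp
using System;

// Hypothetical helper: returns the text between "profile/" and the next
// single quote, or an empty string when either marker is missing.
static string ExtractProfileName(string content)
{
    const string marker = "profile/";
    int start = content.IndexOf(marker, StringComparison.Ordinal);
    if (start < 0) return string.Empty;

    start += marker.Length;                  // skip past the marker itself
    int end = content.IndexOf('\'', start);  // search only after the marker
    if (end < 0) return string.Empty;

    return content.Substring(start, end - start);
}

// Illustrative input: a quote appears before "profile/", as in the real page.
string sample = "<p class='x'>intro</p><a  href='/profile/daniel'>daniel</a>";
Console.WriteLine(ExtractProfileName(sample)); // daniel
```

This only recovers the profile name; the message text still has to be extracted separately.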

Answer 1

Use HtmlAgilityPack instead of trying to parse manually.

var wc = new WebClient();

wc.DownloadStringCompleted += (s, e) =>
{
    HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
    doc.LoadHtml(e.Result);

    var link = doc.DocumentNode
                    .SelectSingleNode("//span[@class='message-profile-name']")
                    .Element("a")
                    .Attributes["href"].Value;
};

wc.DownloadStringAsync(new Uri("http://chatroll.com/rotternet"));

Answer 2

Use appropriate tools to split the raw character data into a structure that is meaningful to you.

You can use HtmlAgilityPack to parse the string and return a tree of meaningful nodes.

You can then iterate over those nodes and aggregate them into the result string according to your own logic.

Answer 3

It seems the string you want will always be enclosed in an <a> element whose href has the format profile/xxx. Once you have the content as text, this is simple with a regex, and the regex still works if the content contains multiple <a href=...> elements.

Match match = Regex.Match(content, @"(?<=<a\s*?href='/profile/\w*?'>\s*?)\w*?(?=\s*?<\s*?/a\s*?>)");
string result = match.Value;

This matches the first, second, and fourth lines in the list below, and .Value returns the element's text, in this case daniel. You can also prefix the regex with (?i) to make it case-insensitive, so that it also matches the last item in the list.

<a href='/profile/daniel'>daniel</a>
<a href='/profile/danielbc'>daniel</a>
<a href='/profilex/danielbc'>daniel</a>
<a href='/profile/danielbc'> daniel </a>
<a href='/profile/danielbc '>daniel</a>
<a href='/PROFILE/danielbc'> daniel </a>

UPDATE:

To get the content of any other kind of element, replace the contents of the lookbehind and lookahead groups so that they match that element: (?<=<a\s*?href='/profile/\w*?'>\s*?)\w*?(?=\s*?<\s*?/a\s*?>). In your case, for "message-text">hello everyone the pattern would be (?i)(?<="message-text"\s*?>\s*?).*?(?=\s*?<\s*?/wbr\s*?>), which gets hello everyone from the following variations. The .*? means match anything (including spaces and punctuation), but as few characters as possible. Note that I changed the ending tag from the one in your reply; if it should be <wbr/> and not </wbr>, that is a small change you can make to get it working.

"message-text">hello everyone
hello everyone
hello everyone
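
As a rough sketch of how the two patterns above could be combined to build daniel hello everyone from the sample line in the question: the name pattern is the one from this answer, while the message pattern is an assumed variant whose lookahead ends at <wbr/> (as in the question's sample) rather than </wbr>. The variable names are illustrative.

```csharp
using System;
using System.Text.RegularExpressions;

// Sample line from the question.
string line =
    "<span class=\"message-profile-name\" ><a  href='/profile/daniel'>daniel</a></span>: " +
    "<span class=\"message-text\">hello everyone<wbr/> <img class='emoticon emoticon-tongue' src='/t.gif'/></span>";

// Profile name: text of the <a> element whose href starts with /profile/.
Match name = Regex.Match(line,
    @"(?i)(?<=<a\s*?href='/profile/\w*?'>\s*?)\w*?(?=\s*?<\s*?/a\s*?>)");

// Message text: everything after "message-text"> up to the <wbr/> tag
// (assumption: the message always ends with <wbr/>, as in the sample).
Match text = Regex.Match(line,
    @"(?i)(?<=""message-text""\s*?>\s*?).*?(?=\s*?<\s*?wbr\s*?/\s*?>)");

// Join the two parts only when both patterns matched.
string result = name.Success && text.Success
    ? name.Value + " " + text.Value
    : string.Empty;

Console.WriteLine(result); // daniel hello everyone
```

Both patterns rely on the exact attribute layout of the sample, so a change in the page markup would require adjusting them.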

3 answers

solution1
1 2012-08-09 21:31:46

solution2
0 2012-08-09 21:30:22

solution3
0 ACCEPTED 2012-08-09 22:39:13
