
How can I extract text from a string?

I have this code:

private void backgroundWorker1_DoWork(object sender, DoWorkEventArgs e)
{
    WebRequest request = WebRequest.Create(url);
    request.Method = "GET";
    WebResponse response = request.GetResponse();
    Stream stream = response.GetResponseStream();
    StreamReader reader = new StreamReader(stream);
    string content = reader.ReadToEnd();
    int start = content.IndexOf("profile/");
    int end = content.IndexOf("'");
    string result = content.Substring(start, end - start - 1);
    reader.Close();
    response.Close();
}

For example, I have a long line:

<span class="message-profile-name" ><a  href='/profile/daniel'>daniel</a></span>: <span class="message-text">hello everyone<wbr/> <img class='emoticon emoticon-tongue' src='/t.gif'/></span>

I want to build a new string with: daniel hello everyone

How can I do it? My code doesn't work; I'm getting an exception that says:

ArgumentOutOfRangeException Length cannot be less than zero. Parameter name: length

It happens on the line: string result = content.Substring(start, end - start - 1); In this case, start = 19572 and end = 2110.

Use HtmlAgilityPack instead of trying to parse manually.

var wc = new WebClient();

wc.DownloadStringCompleted += (s, e) =>
{
    HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
    doc.LoadHtml(e.Result);

    var link = doc.DocumentNode
                    .SelectSingleNode("//span[@class='message-profile-name']")
                    .Element("a")
                    .Attributes["href"].Value;
};

wc.DownloadStringAsync(new Uri("http://chatroll.com/rotternet"));

Use an appropriate tool to split the raw character stream into data that is meaningful to you.

You can use HtmlAgilityPack to parse the string and return a tree of meaningful tokens.

Afterwards, you can iterate over them and aggregate them into the result string based on your own logic.
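As a minimal sketch of that idea, the following parses the sample line from the question and joins the two pieces of text. It assumes the HtmlAgilityPack NuGet package is referenced; the class name AgilityPackDemo and the method BuildLine are hypothetical names chosen for illustration, and the XPath expressions assume the class attributes look exactly like those in the question.

```csharp
using System;
using HtmlAgilityPack;

public static class AgilityPackDemo
{
    // Returns "<name> <message>" for one chat line, or null if either span is missing.
    public static string BuildLine(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Class names taken from the sample line in the question.
        HtmlNode name = doc.DocumentNode.SelectSingleNode("//span[@class='message-profile-name']");
        HtmlNode text = doc.DocumentNode.SelectSingleNode("//span[@class='message-text']");
        if (name == null || text == null)
            return null;

        // InnerText drops the tags; DeEntitize decodes entities such as &amp;.
        return HtmlEntity.DeEntitize(name.InnerText).Trim()
             + " "
             + HtmlEntity.DeEntitize(text.InnerText).Trim();
    }

    public static void Main()
    {
        string line = "<span class=\"message-profile-name\" ><a  href='/profile/daniel'>daniel</a></span>: "
                    + "<span class=\"message-text\">hello everyone<wbr/> <img class='emoticon emoticon-tongue' src='/t.gif'/></span>";
        Console.WriteLine(BuildLine(line));
    }
}
```

In a Windows Forms project, HtmlDocument may need to be written as HtmlAgilityPack.HtmlDocument to avoid a clash with System.Windows.Forms.HtmlDocument.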

It seems the string you want will always be enclosed in an <a> element whose href has the format /profile/xxx. Once you have the content in text form, this is simple with a regex, and the regex still works even if the content can contain multiple <a href=...> elements.

Match match = Regex.Match(content, @"(?<=<a\s*?href='/profile/\w*?'>\s*?)\w*?(?=\s*?<\s*?/a\s*?>)");
string result = match.Value;

Of the variations listed below, this matches the first, second, and fourth, and .Value returns the element's text, in this case daniel. You can also precede the regex with (?i) to make it case-insensitive, so that it also matches the last item in the list.

  • <a href='/profile/daniel'>daniel</a>
  • <a href='/profile/danielbc'>daniel</a>
  • <a href='/profilex/danielbc'>daniel</a>
  • <a href='/profile/danielbc'> daniel </a>
  • <a href='/profile/danielbc '>daniel</a>
  • <a href='/PROFILE/danielbc'> daniel </a>
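As a rough check of the list above, the following sketch runs the pattern over each variation, once as written and once with the (?i) prefix. The class ProfileNameDemo, the method ExtractName, and the sample array are made up for illustration; only the pattern and Regex.Match come from the answer.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ProfileNameDemo
{
    // Pattern from the answer; .NET supports the variable-length lookbehind it relies on.
    public const string Pattern =
        @"(?<=<a\s*?href='/profile/\w*?'>\s*?)\w*?(?=\s*?<\s*?/a\s*?>)";

    // Returns the matched link text, or null when the pattern does not match.
    public static string ExtractName(string html, bool ignoreCase)
    {
        string pattern = ignoreCase ? "(?i)" + Pattern : Pattern;
        Match match = Regex.Match(html, pattern);
        return match.Success ? match.Value : null;
    }

    public static void Main()
    {
        string[] samples =
        {
            "<a href='/profile/daniel'>daniel</a>",
            "<a href='/profile/danielbc'>daniel</a>",
            "<a href='/profilex/danielbc'>daniel</a>",
            "<a href='/profile/danielbc'> daniel </a>",
            "<a href='/profile/danielbc '>daniel</a>",
            "<a href='/PROFILE/danielbc'> daniel </a>",
        };

        foreach (string sample in samples)
        {
            // Print the result for the case-sensitive and case-insensitive variants.
            Console.WriteLine("{0} -> {1} | with (?i): {2}",
                sample,
                ExtractName(sample, false) ?? "(no match)",
                ExtractName(sample, true) ?? "(no match)");
        }
    }
}
```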

UPDATE:

To get the content from any other kind of element, just replace the lookbehind and lookahead sections of (?<=<a\s*?href='/profile/\w*?'>\s*?)\w*?(?=\s*?<\s*?/a\s*?>) so that they match that element. In your case, "message-text">hello everyone<wbr/> would be (?i)(?<="message-text"\s*?>\s*?).*?(?=\s*?<\s*?/wbr\s*?>), and that will get hello everyone from the variations below. The .*? means match anything (including spaces and punctuation), but as few characters as possible. Note that I changed the ending tag from the one in your reply; if it should be <wbr/> and not </wbr>, it's a tiny change you can make to get it working.

  • "message-text">hello everyone</wbr>
  • <wbr asdfjlds "message-text">hello everyone</wbr>
  • <wbr "message-text">hello everyone</wbr>
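Putting the two patterns together, a minimal sketch for building the string the question asks for might look like the following. The names ChatLineDemo and BuildLine are hypothetical, and the lookahead of the second pattern has been adapted here to the self-closing <wbr/> that appears in the question's sample line, which is an assumption about the real page content.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ChatLineDemo
{
    // Link text inside <a href='/profile/...'>, as in the answer, with (?i) added.
    const string NamePattern =
        @"(?i)(?<=<a\s*?href='/profile/\w*?'>\s*?)\w*?(?=\s*?<\s*?/a\s*?>)";

    // Message text; the lookahead targets <wbr/> instead of </wbr> (adaptation).
    const string TextPattern =
        @"(?i)(?<=""message-text""\s*?>\s*?).*?(?=\s*?<\s*?wbr\s*?/\s*?>)";

    // Returns "<name> <message>", or null if either part is not found.
    public static string BuildLine(string html)
    {
        Match name = Regex.Match(html, NamePattern);
        Match text = Regex.Match(html, TextPattern);
        if (!name.Success || !text.Success)
            return null;
        return name.Value + " " + text.Value;
    }

    public static void Main()
    {
        string line = "<span class=\"message-profile-name\" ><a  href='/profile/daniel'>daniel</a></span>: "
                    + "<span class=\"message-text\">hello everyone<wbr/> <img class='emoticon emoticon-tongue' src='/t.gif'/></span>";
        Console.WriteLine(BuildLine(line));
    }
}
```

Because each call uses Regex.Match, only the first name and the first message in the content are returned; Regex.Matches would be needed to walk through every chat line.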


 