
How to use substring to extract certain text

I have the following C# code that extracts certain text from an HTML source:

string url = txtURL.Text;
string pageCode = WorkerClass.getSourceCode(url);
int startIndex = pageCode.IndexOf("</B>");
pageCode = pageCode.Substring(startIndex, pageCode.Length - startIndex);
StreamWriter sw = new StreamWriter("websitesource.txt");
sw.Write(pageCode);
sw.Close();

The above code writes the following to the text file:

</B> WILLIAMS AJAYA L                     <BR>                                                                      
<B>Address : </B> NEW YORK            NY                                          <BR>                                        
<B>Profession : </B> ATHLETIC TRAINER                          <BR>                                                           
<B>License No: </B> 001475 <BR>                                                                                            
<B>Date of Licensure : </B> 01/12/07      <BR>                                                                                
<B>Additional Qualification : </B>     &nbsp; Not applicable in this profession                       <BR>                    
<B> <A href="http://www.op.nysed.gov/help.htm#status"> Status :</A></B> REGISTERED                                        <BR>
<B>Registered through last day of : </B> 08/15      <BR>
<HR><div class ="note">                                                                                                       
* Use of this online verification service signifies that you have read and agree to the                                       
<A href="http://www.op.nysed.gov/usage.htm">terms and conditions of use</A>.   

How can I use this code within a for loop to store the text (with any surrounding spaces trimmed) in a string array?

The string array should look like this:

string[] ar = {
"WILLIAMS AJAYA L",
"NEW YORK          NY",
"ATHLETIC TRAINER",
"001475",
"01/12/07",
"Not applicable in this profession",
"REGISTERED",
"08/15"
};

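// Requires: using System.IO; using System.Linq;
var lines = File.ReadLines("websitesource.txt")
                .Where(line => line.Contains("</B>"))   // skip lines without a label, e.g. the footer
                .Select(line =>
                    line.Substring(line.LastIndexOf("</B>") + 4)   // text after the closing </B>
                        .Replace("<BR>", "")
                        .Replace("&nbsp;", " ")
                        .Trim())
                .ToArray();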
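
Since the question asks for a for loop, here is a rough sketch of the same idea written without LINQ. The class name `LoopExtractor` and the method `ExtractValues` are made up for illustration, and the sketch assumes each value follows the last `</B>` on its line, as in the sample output above.

```csharp
using System;
using System.Collections.Generic;

public static class LoopExtractor
{
    // Hypothetical helper: takes the saved page text and returns the trimmed values.
    public static string[] ExtractValues(string pageCode)
    {
        var values = new List<string>();
        string[] lines = pageCode.Split(new[] { "\r\n", "\n" }, StringSplitOptions.None);

        for (int i = 0; i < lines.Length; i++)
        {
            int labelEnd = lines[i].LastIndexOf("</B>", StringComparison.Ordinal);
            if (labelEnd < 0)
                continue; // no label on this line (e.g. the footer), so nothing to extract

            // Take everything after "</B>" (4 characters), then strip markup and whitespace.
            string value = lines[i].Substring(labelEnd + 4)
                                   .Replace("<BR>", "")
                                   .Replace("&nbsp;", " ")
                                   .Trim();
            values.Add(value);
        }

        return values.ToArray();
    }
}
```

It could be called as `string[] ar = LoopExtractor.ExtractValues(pageCode);` right after the `Substring` call in the question, instead of writing the text to a file first.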

I did it slightly differently from Selman22, using a string split. I also replace &nbsp; with a space. This works regardless of where the line breaks fall, since HTML doesn't require any particular formatting.

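// FILENAME is the path to the saved page, e.g. "websitesource.txt".
// Requires: using System.IO; using System.Linq;
var split = File.ReadAllText(FILENAME)
                .Replace("<BR>", "").Replace("&nbsp;", " ")
                // Splitting on both tags yields alternating value / label pieces.
                .Split(new[] {"<B>", "</B>"}, StringSplitOptions.RemoveEmptyEntries)
                .Where((x, i) => i % 2 == 0)   // keep the values (even positions), drop the labels
                .Select(y => y.Trim()).ToList();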

split.ForEach(Console.WriteLine);
Console.ReadKey();

The important part is to make sure your data always arrives in this format: HTML can change frequently, and even simple changes to the DOM will completely throw off this kind of parsing.
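
One way to notice such a change early, sketched here under the assumption that every value sits between a closing `</B>` and the next `<BR>`, is to match with a regular expression and fail loudly when the number of fields is not what you expect. The names `CheckedExtractor`, `ExtractValues` and `expectedFieldCount` are hypothetical, not part of either answer above.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class CheckedExtractor
{
    // Assumption: each value lies between "</B>" and the next "<BR>".
    // Singleline lets a value span line breaks; IgnoreCase tolerates <b>/<br>.
    private static readonly Regex ValuePattern =
        new Regex(@"</B>(.*?)<BR>", RegexOptions.Singleline | RegexOptions.IgnoreCase);

    public static string[] ExtractValues(string html, int expectedFieldCount)
    {
        string[] values = ValuePattern.Matches(html)
            .Cast<Match>()
            .Select(m => m.Groups[1].Value.Replace("&nbsp;", " ").Trim())
            .ToArray();

        // A different count suggests the page layout changed; stop rather than return bad data.
        if (values.Length != expectedFieldCount)
            throw new FormatException(
                "Expected " + expectedFieldCount + " fields but found " + values.Length + ".");

        return values;
    }
}
```

For the sample page in the question, `CheckedExtractor.ExtractValues(pageCode, 8)` should return the eight values, and it throws a `FormatException` if the page ever yields a different number of fields.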

Best of luck!
