I have the following C# code that extracts certain text from an HTML source:
string url = txtURL.Text;
string pageCode = WorkerClass.getSourceCode(url);
int startIndex = pageCode.IndexOf("</B>");
pageCode = pageCode.Substring(startIndex, pageCode.Length - startIndex);
StreamWriter sw = new StreamWriter("websitesource.txt");
sw.Write(pageCode);
sw.Close();
The above code writes the following to the text file:
</B> WILLIAMS AJAYA L <BR>
<B>Address : </B> NEW YORK NY <BR>
<B>Profession : </B> ATHLETIC TRAINER <BR>
<B>License No: </B> 001475 <BR>
<B>Date of Licensure : </B> 01/12/07 <BR>
<B>Additional Qualification : </B> Not applicable in this profession <BR>
<B> <A href="http://www.op.nysed.gov/help.htm#status"> Status :</A></B> REGISTERED <BR>
<B>Registered through last day of : </B> 08/15 <BR>
<HR><div class ="note">
* Use of this online verification service signifies that you have read and agree to the
<A href="http://www.op.nysed.gov/usage.htm">terms and conditions of use</A>.
How can I use this code within a for loop to store each value (with any surrounding spaces trimmed) in a string array?
The string array should look like this:
string[] ar = {
    "WILLIAMS AJAYA L",
    "NEW YORK NY",
    "ATHLETIC TRAINER",
    "001475",
    "01/12/07",
    "Not applicable in this profession",
    "REGISTERED",
    "08/15"
};
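// Requires: using System.IO; using System.Linq;
var lines = File.ReadLines("websitesource.txt")
    // Skip lines without a label (e.g. the trailing note); otherwise LastIndexOf returns -1
    .Where(line => line.Contains("</B>"))
    .Select(line =>
        line.Substring(line.LastIndexOf("</B>") + 4) // text after the closing label tag
            .Replace("<BR>", "")
            .Trim())
    .ToArray();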
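Since the question asks for a loop, here is roughly the same extraction written as an explicit for loop over the in-memory string, without going through the text file. This is only a sketch: it assumes every value sits between a `</B>` and the next `<BR>`, as in the sample output, and `LicenseFieldExtractor` and `ExtractValues` are hypothetical names that do not appear in the original code.

```csharp
using System;
using System.Collections.Generic;

static class LicenseFieldExtractor // hypothetical helper name, not from the original post
{
    // Collects the trimmed text between each "</B>" and the following "<BR>".
    public static string[] ExtractValues(string html)
    {
        const string labelEnd = "</B>";
        const string lineEnd = "<BR>";
        var values = new List<string>();

        for (int start = html.IndexOf(labelEnd, StringComparison.Ordinal);
             start >= 0;
             start = html.IndexOf(labelEnd, start, StringComparison.Ordinal))
        {
            start += labelEnd.Length; // move past the closing label tag
            int end = html.IndexOf(lineEnd, start, StringComparison.Ordinal);
            if (end < 0)
                break; // no closing <BR>: stop rather than guess where the value ends

            values.Add(html.Substring(start, end - start).Trim());
        }

        return values.ToArray();
    }
}
```

With the `pageCode` string from the question, `string[] ar = LicenseFieldExtractor.ExtractValues(pageCode);` should give the eight values listed there for the sample text shown. The trailing note is ignored because no `</B>` follows the last `<BR>`.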
I did it slightly differently from Selman22, using a string split. I also replace each &nbsp; with a space. This will work regardless of where the line breaks fall, since HTML doesn't require any specific line formatting.
var split = File.ReadAllText(FILENAME)
    .Replace("<BR>", "").Replace("&nbsp;", " ")
    .Split(new[] {"<B>", "</B>"}, StringSplitOptions.RemoveEmptyEntries)
    .Where((x, i) => i % 2 == 0)
    .Select(y => y.Trim()).ToList();
split.ForEach(Console.WriteLine);
Console.ReadKey();
The important part is to make sure your data always arrives in this format: HTML can change frequently, and even simple changes to the DOM will completely throw off this kind of parsing.
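One way to notice such a format change early is to check the parsed result before using it. The following is a minimal sketch, not part of either answer: `ParsedFieldGuard` and `RequireFields` are hypothetical names, and the expected count of eight is an assumption taken from the sample record in the question.

```csharp
using System;
using System.IO;

static class ParsedFieldGuard // hypothetical helper, not part of the original answers
{
    // Returns the fields unchanged, or throws if they no longer match the expected layout.
    public static string[] RequireFields(string[] fields, int expectedCount)
    {
        if (fields.Length != expectedCount)
            throw new InvalidDataException(
                $"Expected {expectedCount} fields but parsed {fields.Length}; the page layout may have changed.");

        for (int i = 0; i < fields.Length; i++)
        {
            // An empty value or a leftover '<' suggests that markup leaked into the result.
            if (fields[i].Length == 0 || fields[i].Contains("<"))
                throw new InvalidDataException($"Field {i} looks malformed: \"{fields[i]}\"");
        }

        return fields;
    }
}
```

A call such as `string[] ar = ParsedFieldGuard.RequireFields(parsed, 8);`, where `parsed` stands for whatever array the extraction produced, then fails loudly instead of silently storing shifted or tag-polluted values.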
Best of luck!