
Regex to remove and replace characters

I have the following

<option value="Abercrombie">Abercrombie</option>

My file has about 2000 rows in it, and each row has a different location. I'm trying to understand regex, but nothing I read is sinking in, and I'm unsure whether this is even possible.

What I want to do is run a regex that strips the HTML above, leaving the following

Abercrombie 

I then want to prefix a particular number to the front, so the result would be, for example

2,Abercrombie 

Is this possible?

Don't use a regular expression, since HTML is not a regular language. You can use LINQ to XML instead, as long as the markup is well-formed XML. If you want to process the entire file, you can replace the elements in place:

using System.Linq;
using System.Xml.Linq;

int myNumber  = 2;
var html      = @"<html><body><option value=""Abercrombie"">Abercrombie</option><div><option value=""Forever21"">Forever21</option></div></body></html>";
var doc       = XDocument.Parse(html);

// Materialize the list first so the tree is not modified while it is being enumerated.
var options = doc.Descendants("option").ToList();
foreach (var element in options)
{
    // Replace the whole <option> element with a text node such as "2,Abercrombie".
    element.ReplaceWith(string.Format("{0},{1}", myNumber, element.Value));
}

var result = doc.ToString();

This gives:

<html>
    <body>2,Abercrombie<div>2,Forever21</div></body>
</html>

If you just want to grab the value from a single tag, you can use the following:

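using System.Linq;
using System.Xml.Linq;

int myNumber  = 2;
var html      = @"<option value=""Abercrombie"">Abercrombie</option>";
var doc       = XDocument.Parse(html);
// First() throws if there is no <option> element, instead of failing later on a null reference.
var element   = doc.Descendants("option").First();
// Reads the value attribute; element.Value would return the inner text, which is the same here.
var attribute = element.Attribute("value").Value;
var result    = string.Format("{0},{1}", myNumber, attribute);

//result == "2,Abercrombie"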
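
Since the question describes a file of about 2000 rows with one option tag per row, the same idea can be applied row by row. The following is only a sketch under that assumption: OptionConverter, ToCsvLine and ConvertLines are hypothetical names that do not appear in the answer above, and every row is assumed to be well-formed XML (a bare & in a location name, for example, would make XElement.Parse throw).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper for illustration; not part of the original answer.
public static class OptionConverter
{
    // Turns one <option ...>text</option> row into "number,text".
    public static string ToCsvLine(string optionHtml, int number)
    {
        // Assumption: the row is a single well-formed element.
        // XElement.Parse throws an XmlException otherwise.
        var element = XElement.Parse(optionHtml.Trim());
        return string.Format("{0},{1}", number, element.Value);
    }

    // Converts every non-blank row lazily, preserving the input order.
    public static IEnumerable<string> ConvertLines(IEnumerable<string> rows, int number)
    {
        return rows
            .Where(row => !string.IsNullOrWhiteSpace(row))
            .Select(row => ToCsvLine(row, number));
    }
}
```

With placeholder file names, this could be used as File.WriteAllLines("locations.csv", OptionConverter.ConvertLines(File.ReadLines("options.html"), 2)); which needs using System.IO. File.ReadLines streams the input, so the whole file is not held in memory at once.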
