So this is a rather odd question, I know that. I use a tool called pdf2htmlEX, which converts a PDF to HTML. So far the results have been pretty damn impressive. I have yet to see a single error in all the PDFs I have converted to HTML.
With this HTML, I need to replace some strings dynamically with C#. However, I can't simply say line.Replace("#SOME_STRING", "Another string"), although I wrote #SOME_STRING in the document before exporting to PDF. Why not, you might ask? Because the output of pdf2htmlEX can look something like this:
<div class="t m0 x5 h5 ya ff4 fs3 fc0 sc0 ls0 ws0">#SOME_ST<span class="_ _5"></span>RING </div>
See that empty span tag with the classes _ and _5? Yep, that prevents me from replacing my word. The _5 class simply has some width (like width: 0.9889px).
In this case, how would I replace #SOME_ST<span class="_ _5"></span>RING with something else?
Here are some cases:
(#SOME_STRING) #SOME_ST<span class="_ _5"></span>RING
(#SOME_OTHER_STRING) #SOME_<span class="_ _7"></span>OTHER_ST<span class="_ _5"></span>RING
I'm kind of lost here. I can't just remove all the _5 elements, because the class is randomized every time I change something in the document.
EDIT: So I basically need a way to ignore the HTML tags when matching my own key-value pairs, so I can replace words like #SOME_STRING -> SOMETHING_ELSE.
Try using regex to filter all empty spans:
using System.Text.RegularExpressions;

// Matches any <span> element with no content, whatever its attributes are.
var emptySpan = new Regex(@"<span[^>]*></span>");
var html = @"<div class=""t m0 x5 h5 ya ff4 fs3 fc0 sc0 ls0 ws0"">#SOME_ST<span class=""_ _5""></span>RING </div> <span></span>";

// Remove the empty spans so the placeholder is contiguous again, then replace it.
var cleaned = emptySpan.Replace(html, string.Empty);
var result = cleaned.Replace("#SOME_STRING", "Another string");
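If removing every empty span in the document is too aggressive (those spans carry pdf2htmlEX's spacing), an alternative is to leave the HTML alone and make each placeholder pattern tolerate empty spans between its characters. The following is only a minimal sketch under that assumption: PlaceholderReplacer, BuildPattern and the values dictionary are hypothetical names, not part of pdf2htmlEX or the original code, and it assumes the splitting elements are always empty span tags as in the cases shown in the question.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper: replaces placeholders even when empty spans split them.
public static class PlaceholderReplacer
{
    // Zero or more empty <span ...></span> elements (assumed to be the only splitters).
    private const string EmptySpans = @"(?:<span[^>]*></span>)*";

    // Builds a pattern that matches the key with optional empty spans
    // between any two of its characters.
    private static Regex BuildPattern(string key)
    {
        var escapedChars = key.Select(c => Regex.Escape(c.ToString()));
        return new Regex(string.Join(EmptySpans, escapedChars));
    }

    public static string Replace(string html, IDictionary<string, string> values)
    {
        // Longest keys first, so a short key cannot consume the start of a longer one.
        foreach (var pair in values.OrderByDescending(p => p.Key.Length))
        {
            var value = pair.Value;
            // A MatchEvaluator keeps '$' in the value from being treated as a substitution token.
            html = BuildPattern(pair.Key).Replace(html, _ => value);
        }
        return html;
    }
}
```

Called with a dictionary such as { "#SOME_STRING": "Another string" }, this replaces the whole matched run, including the spans inside it, and leaves every other span in the document untouched. The replacement values are inserted as-is, so they may need HTML-encoding if they can contain characters like < or &.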