
How to get all applied CSS Style info from mshtml element via C#

I'm working on a C# application that needs to parse a web page and convert it into another format. Without going too deep into the output format and use case, my problem is getting the computed CSS for any given element (in this case, most of them). I'm dealing with a combination of inline styles, CSS rules, and formatting elements like <strong>, <em>, <u>, etc.

I'm currently loading the web page into mshtml and using the IHTMLElement2 interface to access the currentStyle object. This is proving to be too slow. I have profiled it, and most of the time is spent getting the value of each style property via calls to currentStyle.XXX. Since I need to query multiple properties (background-color, font-family, font-size, text-align, text-decoration, etc.) for each element, I'm making thousands of COM calls, and it takes several minutes for a small document. All modern browsers do this in fractions of a second. I imagine it's the COM interop that's killing me?

Is there a better way? I'd like to get all the computed style rules that apply to an element in one shot. Does anybody know how to use IHTMLElementAppliedStyles? Does it do what I'm looking for, and where do you get an instance of it? Side note: I'm referencing the HTML Object Library to get mshtml, but it does not seem to be the IE9/10 version; not all the interfaces are available, e.g. IHTMLDocument7.

Thanks,

I've been working on this and have a few updates...

a) I had a bug in the code that walks back up the tree to resolve relative values like 80% or 1.2em to absolute values like pt. Fixing it resulted in a huge speed increase. It's still a bit too slow for me: down to 20~30 seconds for what equates to 3 pages of a Word document (with tables, ordered lists, etc.).
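For later readers, one step of that kind of resolution might look like the sketch below. It is not the asker's code: `FontSizeResolver` and `ToPoints` are hypothetical names, it handles only %, em, pt and px, and it assumes the parent's size has already been resolved to points (so the caller walks the tree from the root down, or recurses up and caches).

```csharp
using System;
using System.Globalization;

// Hypothetical helper, not part of mshtml: resolves one CSS font-size value
// to points, given the parent's already-resolved size in points.
public static class FontSizeResolver
{
    public static double ToPoints(string value, double parentPt)
    {
        string v = value.Trim().ToLowerInvariant();

        // Relative units scale the parent's resolved size.
        if (v.EndsWith("%", StringComparison.Ordinal))
            return parentPt * Number(v, 1) / 100.0;
        if (v.EndsWith("em", StringComparison.Ordinal))
            return parentPt * Number(v, 2);

        // Absolute units need no parent.
        if (v.EndsWith("pt", StringComparison.Ordinal))
            return Number(v, 2);
        if (v.EndsWith("px", StringComparison.Ordinal))
            return Number(v, 2) * 0.75; // CSS: 96 px per inch, 72 pt per inch

        throw new FormatException("Unsupported font-size: " + value);
    }

    // Parses the numeric part of the value, dropping the unit suffix.
    private static double Number(string v, int suffixLength)
    {
        return double.Parse(v.Substring(0, v.Length - suffixLength),
                            CultureInfo.InvariantCulture);
    }
}
```

Resolving each node once and reusing the result for its children keeps the work linear in the number of nodes, rather than re-walking the ancestor chain for every element.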

b) I added a C# wrapper class for IHTMLElement2 that caches the CSS values, so I only have to read each one once per DOM node via COM interop. That helped a bit, so I'm now down to 8~10 seconds for the same 3 pages of Word-document-equivalent HTML.
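A caching wrapper along those lines could be sketched as follows. This is an illustration, not the asker's class: `CachedStyle` is a hypothetical name, and the actual COM read is passed in as a delegate so the sketch stays independent of any particular mshtml interop assembly.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: caches style values for one DOM node so that each
// property crosses the COM boundary at most once.
public sealed class CachedStyle
{
    private readonly Func<string, string> _read;
    private readonly Dictionary<string, string> _cache =
        new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);

    // read: performs the real (slow) lookup, e.g. a currentStyle access.
    public CachedStyle(Func<string, string> read)
    {
        if (read == null) throw new ArgumentNullException("read");
        _read = read;
    }

    // Number of times the underlying reader was actually invoked.
    public int UnderlyingReads { get; private set; }

    public string Get(string property)
    {
        string value;
        if (!_cache.TryGetValue(property, out value))
        {
            value = _read(property);   // the only place the slow call happens
            UnderlyingReads++;
            _cache[property] = value;
        }
        return value;
    }
}
```

One instance would be created per DOM node, with the delegate wrapping whatever currentStyle read is already in use. The cache assumes the document does not change while it is being converted.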

c) I'm looking into creating a C++ wrapper for IHTMLElement that will load all the CSS values into an array and pass the whole array back with a single COM interop call, but so far C++ and COM wrappers look like a steep learning curve: MFC, ATL, COM, oh my.

d) Since I have no C++ experience and the wrapper idea is looking very challenging, I'm considering building a C# CSS parser and resolver, so I can dump mshtml and use HtmlAgilityPack plus my own CSS parser/resolver. That's also a big job.

Looking forward to comments, guidance, and answers. Thanks.

(Probably too late for the original asker, but hopefully useful to folk who come by later.)

Using IHTMLWindow7::getComputedStyle(IHTMLDOMNode node) returns a live IHTMLCSSStyleDeclaration object, which gives the fully-calculated styles after considering all rules and inline styles, including browser-default styles such as giving <strong> a heavier font weight.

If you want to bind to specific properties such as backgroundColor they're available directly on IHTMLCSSStyleDeclaration , IHTMLCSSStyleDeclaration2 , etc. Alternately you can access specific properties by name with IHTMLCSSStyleDeclaration::getPropertyValue(string name) . To get a list of all names defined on the element, use the length and item properties.
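As a rough sketch of the late-bound route, the helper below reads every property of the computed style into a dictionary using C# `dynamic`. `ComputedStyleReader` and `ReadAll` are hypothetical names. It assumes the window object supports IHTMLWindow7 (IE9 or later in standards mode) and that `getComputedStyle` accepts a pseudo-element name as a second argument, passed here as null; that argument and how null marshals through COM are assumptions to verify against your installed version.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: late-bound access to the computed style, for use when
// the interop assembly does not declare IHTMLWindow7 / IHTMLCSSStyleDeclaration.
public static class ComputedStyleReader
{
    // window: the document's window object; node: the DOM node to inspect.
    public static Dictionary<string, string> ReadAll(dynamic window, dynamic node)
    {
        // Assumption: the second argument is the pseudo-element name; null means none.
        string pseudoElement = null;
        dynamic style = window.getComputedStyle(node, pseudoElement);

        var result = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);

        // length and item enumerate the property names, as described above.
        int count = (int)style.length;
        for (int i = 0; i < count; i++)
        {
            string name = (string)style.item(i);
            result[name] = (string)style.getPropertyValue(name);
        }
        return result;
    }
}
```

Every `dynamic` call is still a separate COM round trip, so this does not reduce the call count by itself. It mainly avoids building a custom interop assembly, and the resulting dictionary can then be cached per node.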

The big caveat is that the IHTMLWindow7 and IHTMLCSSStyleDeclaration * interfaces aren't declared in the Primary Interop Assembly for mshtml, so by default they aren't available in a strongly-bound fashion. So you can either access them dynamically or create a custom Interop Assembly that provides access to them.

Creating a custom Interop Assembly (IA) for mshtml isn't hard, but by default the .NET member definitions often aren't ideal and the assembly is huge. If you don't mind that, find mshtml.tlb on your PC and run this from a VS developer prompt:

tlbimp mshtml.tlb /out:"custommshtml.dll" /namespace:"custommshtml" /transform:dispret /asmversion:"1.0.0.0" /tlbreference:"C:\Windows\System32\stdole2.tlb" /nologo /silence:3001 /silence:3002

That generates an IA for the version of IE you have installed. You'll get some warnings, which can be ignored as long as you don't plan to use those members. Tweak as desired, but don't use mshtml as the namespace—it makes things very confusing.

In your project, reference your IA instead of mshtml. You'll need to adjust using statements and whatnot to use the different namespace. Depending on where your original DOM objects are coming from, you may find that they have a type in the mshtml namespace. That's fine; you'll still be able to cast to your custom interfaces. Also, during debugging the Immediate Window may claim that some methods/properties don't exist even though they appear in IntelliSense. That's just because they haven't been referenced in the project, so the compiler hasn't embedded the needed definitions.
