
Excel: Export to XML - With XML in cells

I'm trying to export a spreadsheet that has some XML in some of the cells of the table.

ID (column A): 23455

FACT (column B) (this code is copied & pasted from a sample cell - they don't all have this simplicity or structure):

"<div class=""fact"">
<p><strong>FACT.</strong> The closest star to our solar system is Alpha Centauri.</p>
</div>
"

I'd like to have XML like the following:

<record>
    <ID>23455</ID>
    <FACT><div class="fact"><p><strong>FACT.</strong> The closest star to our solar system is Alpha Centauri.</p></div></FACT>
</record>

This is complex enough that I doubt Excel's native XML schema export will work (that thing is persnickety enough that I can't get it to work with even the simplest of data values).

My current thought is to write a Perl script that reads this as a CSV file and exports XML. However, I've noticed that CSV does a poor job of handling XML that's been "embedded" like this.
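For comparison, here is a minimal sketch of the CSV-to-XML idea in Python rather than Perl. It assumes the sheet was saved as CSV with a header row naming the columns `ID` and `FACT`, and that each FACT cell holds a single well-formed element. The question says not every cell is that tidy, so this would raise a parse error on the others. The `rows_to_xml` helper and the `records` wrapper element are illustrative names, not part of the original post. A CSV parser that honours quoted fields copes with the doubled quotes and embedded newlines.

```python
import csv
import io
import xml.etree.ElementTree as ET

def rows_to_xml(csv_text):
    """Build a <records> element from CSV text with ID and FACT columns."""
    root = ET.Element("records")  # hypothetical wrapper element
    reader = csv.DictReader(io.StringIO(csv_text))
    for row in reader:
        record = ET.SubElement(root, "record")
        ET.SubElement(record, "ID").text = row["ID"]
        fact = ET.SubElement(record, "FACT")
        # Parse the cell as an XML fragment so it is nested, not escaped.
        # Raises ET.ParseError if the cell is not a single well-formed element.
        fact.append(ET.fromstring(row["FACT"].strip()))
    return root

# One data row, quoted the way Excel writes a cell containing quotes and newlines.
sample = (
    'ID,FACT\n'
    '23455,"<div class=""fact"">\n'
    '<p><strong>FACT.</strong> The closest star to our solar system '
    'is Alpha Centauri.</p>\n'
    '</div>\n'
    '"\n'
)
xml_out = ET.tostring(rows_to_xml(sample), encoding="unicode")
print(xml_out)
```

Cells that contain several top-level elements or loose text would need wrapping in a container element before parsing.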

I'm hoping someone else might have a better suggestion for how to pull this information out.


Edit: I finally figured out the mistake I made with the export. I can now export and get the following:

<record>
    <ID>23455</ID>
    <FACT>&lt;div class="fact"&gt;&lt;p&gt;&lt;strong&gt;FACT.&lt;/strong&gt; The closest star to our solar system is Alpha Centauri.&lt;/p&gt;&lt;/div&gt;
    </FACT>
</record>

I think I can work with this... some regex and it might be good enough (although replacing every &lt; puts me at risk of clobbering a true less-than sign).

So I'm still open to suggestions.

Just posting this as the answer...

If you export the column as text you can get the following:

<record>
    <ID>23455</ID>
    <FACT>&lt;div class="fact"&gt;&lt;p&gt;&lt;strong&gt;FACT.&lt;/strong&gt; The closest star to our solar system is Alpha Centauri.&lt;/p&gt;&lt;/div&gt;
    </FACT>
</record>

In an XML editor I did a find-and-replace across the document to restore all the tags, using the following regex: s/&lt;(\/?[\w\s="-_]+?)&gt;/<$1>/

It's a bit dangerous if there are actual less-than or greater-than signs in the document, but a false match would need a < followed by an optional slash and text made up only of common tag characters (letters, whitespace, ="-_) and then a >. That's possible, but most equations are of the form X < Y < Z. Our content doesn't use <> all that much, so I can be fairly confident it won't hit the edge case.
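The same substitution can be sketched in Python. This is an approximation of the regex above, not a copy. Inside a character class, `"-_` is read as a range running from `"` to `_`, which also admits characters such as `&`, `;`, `/`, `<` and `>`, so the sketch moves the hyphen to the end of the class to make it literal. That stricter class is an assumption about the intent, and it would skip tags whose attributes contain characters like `/` or `.`. The names `TAG_RE` and `unescape_tags` are illustrative.

```python
import re

# Escaped tag: &lt; + optional slash + tag-like characters + &gt;
# The hyphen is placed last in the class so it is a literal, not a range.
TAG_RE = re.compile(r'&lt;(/?[\w\s="_-]+?)&gt;')

def unescape_tags(text):
    """Turn escaped tag-like sequences back into real angle-bracket tags."""
    return TAG_RE.sub(r'<\1>', text)

escaped = ('&lt;div class="fact"&gt;&lt;p&gt;&lt;strong&gt;FACT.&lt;/strong&gt; '
           'The closest star to our solar system is Alpha Centauri.'
           '&lt;/p&gt;&lt;/div&gt;')
restored = unescape_tags(escaped)
print(restored)

# A chained comparison has no &gt; to close the "tag", so it is left alone.
untouched = unescape_tags('X &lt; Y &lt; Z')
print(untouched)
```

Text such as `a &lt; b and c &gt; d` still matches, because everything between the two entities looks like tag content. That is the edge case described above, and it is why a parse check afterwards is worthwhile.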

I also "fixed" all the HTML so that empty elements are self-closed ( s/<br>/<br\/>/ and s/<img (.*?)>/<img $1\/>/ ) and checked that the result parses (in theory, an edge case from the previous step would cause a parsing error).
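A rough Python equivalent of this clean-up and parse check, assuming the tags that need fixing are HTML void elements such as `br` and `img` written without a closing slash. The helpers `close_void_tags` and `is_well_formed` are illustrative names, and the patterns only cover these two elements.

```python
import re
import xml.etree.ElementTree as ET

def close_void_tags(markup):
    """Rewrite <br> and <img ...> as self-closing so XML parsers accept them."""
    markup = re.sub(r'<br\s*>', '<br/>', markup)
    # (?<!/) skips <img .../> tags that are already self-closed.
    markup = re.sub(r'<img\s([^>]*?)(?<!/)>', r'<img \1/>', markup)
    return markup

def is_well_formed(markup):
    """Return True if the markup parses as a single XML element."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False

html_fragment = '<p>a<br>b<img src="x.png"></p>'
fixed = close_void_tags(html_fragment)
print(fixed, is_well_formed(fixed))
```

Running `is_well_formed` over every record is one way to surface the rare case where a genuine less-than sign was turned into a stray `<`.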

And yes, I now have a doc in mixed DTD that will make all true XML peeps quake with horror, but I can work with it.
