Excel: Export to XML - With XML in cells

I'm trying to export a spreadsheet that has some XML in some of the cells of the table.

ID (column A): 23455

FACT (column B) (this code is copied and pasted from a sample cell; they don't all have this simplicity or structure):

"<div class=""fact"">
<p><strong>FACT.</strong> The closest star to our solar system is Alpha Centauri.</p>
</div>
"

I'd like to have XML like the following:

<record>
    <ID>23455</ID>
    <FACT><div class="fact"><p><strong>FACT.</strong> The closest star to our solar system is Alpha Centauri.</p></div></FACT>
</record>

This is complex enough that I doubt that Excel's native XML schema export will work (that thing is persnickety enough that I can't get it to work with the simplest of data values).

My current thought is to write a Perl script to read this as a CSV file and export XML. However, I've noticed that CSV does a poor job handling XML that's been "embedded" like this.
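As a rough sketch of the CSV route (in Python rather than Perl, purely for illustration): a CSV parser that understands quoted fields, doubled quotes, and line breaks inside a cell can recover the sample FACT cell intact. The `CSV_TEXT` layout, the `records` wrapper element, and the `rows_to_xml` name are assumptions made for this example, and it only works for cells that are well-formed XML with a single top-level element.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical CSV export of the sheet: the FACT cell is quoted, spans
# several lines, and doubles its inner quotes, as in the sample cell.
CSV_TEXT = (
    'ID,FACT\r\n'
    '23455,"<div class=""fact"">\n'
    '<p><strong>FACT.</strong> The closest star to our solar system is Alpha Centauri.</p>\n'
    '</div>\n'
    '"\r\n'
)

def rows_to_xml(csv_text):
    """Build <records><record>...</record></records> from CSV text."""
    root = ET.Element("records")
    # newline="" leaves line endings untouched so the csv module can
    # handle line breaks inside quoted cells itself.
    reader = csv.DictReader(io.StringIO(csv_text, newline=""))
    for row in reader:
        record = ET.SubElement(root, "record")
        ET.SubElement(record, "ID").text = row["ID"]
        fact = ET.SubElement(record, "FACT")
        # Parse the cell as markup; this raises ET.ParseError if the cell
        # is not well-formed XML (for example, an unclosed <br>).
        fact.append(ET.fromstring(row["FACT"].strip()))
    return root

root = rows_to_xml(CSV_TEXT)
xml_out = ET.tostring(root, encoding="unicode")
print(xml_out)
```

Cells with several top-level elements, or with HTML that is not valid XML, would need extra handling before `ET.fromstring` accepts them.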

I'm hoping someone else might have a better suggestion for how to pull this information out.


Edit: Finally figured out the mistake I made with export. I can now export and get the following:

<record>
    <ID>23455</ID>
    <FACT>&lt;div class="fact"&gt;&lt;p&gt;&lt;strong&gt;FACT.&lt;/strong&gt; The closest star to our solar system is Alpha Centauri.&lt;/p&gt;&lt;/div&gt;
    </FACT>
</record>

I think I can work with this... some regex and it might be good enough (looking for all &lt; might put me at risk of killing a true less-than sign).
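One possible alternative to a regex over &lt;, sketched in Python under the assumption that each FACT cell is itself well-formed markup: an XML parser undoes exactly one level of escaping when it reads the export, so the text of FACT is the original cell content and can be parsed a second time as a fragment. A true less-than sign that was written as &lt; in the cell would come through the export double-escaped and survive as text. The `unescape_field` helper and the `wrapper` element are names invented for this sketch.

```python
import xml.etree.ElementTree as ET

# Hypothetical export in the shape shown above: the markup in FACT
# arrives escaped as plain text.
EXPORTED = (
    '<record>\n'
    '    <ID>23455</ID>\n'
    '    <FACT>&lt;div class="fact"&gt;&lt;p&gt;&lt;strong&gt;FACT.&lt;/strong&gt; '
    'The closest star to our solar system is Alpha Centauri.&lt;/p&gt;&lt;/div&gt;\n'
    '    </FACT>\n'
    '</record>'
)

def unescape_field(record, tag):
    """Replace the escaped text of <tag> with the elements it describes."""
    field = record.find(tag)
    # The parser has already turned &lt; into < exactly once, so field.text
    # holds the original cell content. The temporary wrapper allows cells
    # with leading text or several top-level elements.
    fragment = ET.fromstring("<wrapper>" + field.text.strip() + "</wrapper>")
    field.text = fragment.text
    field.extend(list(fragment))
    return record

record = unescape_field(ET.fromstring(EXPORTED), "FACT")
result = ET.tostring(record, encoding="unicode")
print(result)
```

If a cell holds a raw, unescaped < that is not part of a tag, the second parse raises `ET.ParseError` instead of silently rewriting it, which at least flags the record for manual review.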

So I'm still open to suggestions.

Just posting this as the answer...

If you export the column as text you can get the following:

<record>
    <ID>23455</ID>
    <FACT>&lt;div class="fact"&gt;&lt;p&gt;&lt;strong&gt;FACT.&lt;/strong&gt; The closest star to our solar system is Alpha Centauri.&lt;/p&gt;&lt;/div&gt;
    </FACT>
</record>

In an XML editor I did a find and replace to get all the tags using the following regex: s/&lt;(\/?[\w\s="-_]+?)&gt;/<$1>/
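For anyone who wants to run the same replacement outside an editor, here is a hedged Python transcription of that substitution. One detail worth knowing: inside the brackets, `"-_` is read as a character range from `"` to `_`, which also admits characters such as `&`, `;`, `/`, `.`, `<` and `>`. The stricter `TAG_STRICT` pattern and the `restore_tags` name are hypothetical additions for comparison, not part of the original find and replace.

```python
import re

# Direct transcription of the substitution above. Inside the brackets,
# "-_ is a range (U+0022 to U+005F), so it admits & ; / . < > and more.
TAG_AS_WRITTEN = re.compile(r'&lt;(/?[\w\s="-_]+?)&gt;')

# Hypothetical stricter variant: the hyphen goes last, so it is literal.
TAG_STRICT = re.compile(r'&lt;(/?[\w\s="_-]+?)&gt;')

def restore_tags(text, pattern=TAG_AS_WRITTEN):
    """Turn escaped tags such as &lt;p&gt; back into <p> everywhere in text."""
    return pattern.sub(r'<\1>', text)

sample = ('&lt;div class="fact"&gt;&lt;p&gt;&lt;strong&gt;FACT.&lt;/strong&gt; '
          'The closest star to our solar system is Alpha Centauri.&lt;/p&gt;&lt;/div&gt;')
restored = restore_tags(sample)
print(restored)

# A stray less-than sign before a real tag shows the difference:
print(restore_tags('X &lt; Y &lt;p&gt;'))              # -> X < Y &lt;p>
print(restore_tags('X &lt; Y &lt;p&gt;', TAG_STRICT))  # -> X &lt; Y <p>
```

The trade-off is that the stricter class no longer matches tags whose attribute values contain characters like `/`, `:` or `.` (URLs, file names), so which pattern is safer depends on the content.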

It's a bit dangerous if there are actual less-than or greater-than signs in the document, but a false match would need text of the form < /maybe and text with common tag symbols ="-_ > to appear. That is possible, but most equations are of the form X < Y < Z. Our content doesn't use <> all that much, so I can be fairly confident it won't hit the edge case.

I also "fixed" all the HTML ( s/<b>/<b/>/ and s/<img (.*?)>/<img $1/>/ ) and checked parsing (theoretically an edge case would cause a parsing error).
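A minimal Python sketch of the same two steps, self-closing HTML void tags and then checking that the result parses. The tag list in `VOID_TAGS` (including `br` as the example void element) and the helper names are assumptions for illustration; substitute whichever tags the content actually uses.

```python
import re
import xml.etree.ElementTree as ET

# Void elements to self-close; this set is an assumption for illustration.
VOID_TAGS = ("br", "img", "hr")

# Matches <br>, <img src="...">, and already self-closed forms such as <br/>.
VOID_RE = re.compile(r'<(%s)\b([^<>]*?)\s*/?>' % "|".join(VOID_TAGS), re.IGNORECASE)

def self_close(markup):
    """Rewrite HTML void tags such as <br> as <br/> so the result can parse as XML."""
    return VOID_RE.sub(r'<\1\2/>', markup)

def is_well_formed(xml_text):
    """Return True if xml_text parses as XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

before = '<FACT><p>Line one<br>Line two <img src="star.png" alt="star"></p></FACT>'
after = self_close(before)
print(after)
print(is_well_formed(before), is_well_formed(after))
```

The pattern assumes attribute values never contain a literal > character; a failed `is_well_formed` check is the cue to inspect that record by hand.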

And yes, I now have a doc in mixed DTD that will make all true XML peeps quake with horror, but I can work with it.
