Converting a "document format" / XML to CSV

Question

I am trying to convert this:

<doc id="123" url="http://url.org/thing?curid=123" title="title"> 
Title

text text text more text

</doc>

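into a CSV file (the file contains a large number of similarly formatted "documents"). If this were a regular XML file, I think I could work it out with a standard XML-to-CSV solution, but because the snippet above is not well-formed XML, I am stuck.

What I want to do is import the data into PostgreSQL, and from what I have gathered, this is easier if the data is in CSV format (if there is another way, please let me know). What I need is to separate out the "id", "url", "title" and "text/body".

Bonus question: the first line of the text/body is the document's title. Is it possible to remove or manipulate that first line during the conversion?

Thanks!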

Answer 1

In Python:

Given an XML file (thedoc.xml) such as:

<?xml version="1.0" encoding="UTF-8"?>
<docCollection>
    <doc id="123" url="http://url.org/thing?curid=123" title="Farenheit451"> 
    Farenheit451

    It was a pleasure to burn...
    </doc>

    <doc id="456" url="http://url.org/thing?curid=456" title="Sense and sensitivity"> 
    Sense and sensitivity

    It was sensibile to be sensitive &amp; nice...
    </doc>        
</docCollection>
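
Note that the sample above wraps the documents in a single <docCollection> root element, which the file in the question does not have. A minimal sketch of adding that wrapper is shown below. It assumes the raw file is UTF-8 text, has no XML declaration of its own, and is otherwise valid XML inside each <doc>; the function name wrap_in_root and the file names are hypothetical.

```python
import shutil


def wrap_in_root(src_path, dst_path, root_tag="docCollection"):
    """Copy src_path to dst_path, enclosing its contents in one root element."""
    with open(src_path, "r", encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        dst.write('<?xml version="1.0" encoding="UTF-8"?>\n')
        dst.write("<%s>\n" % root_tag)
        # copyfileobj streams the file in chunks, so large inputs are fine.
        shutil.copyfileobj(src, dst)
        dst.write("\n</%s>\n" % root_tag)
```

Calling wrap_in_root("raw_docs.txt", "thedoc.xml") would then produce a file shaped like the sample above. If the text contains bare characters such as & or <, the result will still not parse and those would need escaping first.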

and a script (thecode.py) that uses lxml, as follows:

from lxml import etree
import pandas
import html

inFile = "./thedoc.xml"
outFile = "./theprocdoc.csv"

# Your XML may well be too big to parse into memory in one go,
# so it is better to use lxml's incremental parser.
# It is set up here to fire an "end" event each time a "doc" element
# has been fully parsed.
ctx = etree.iterparse(inFile, events=("end",), tag=("doc",))

csvData = []
# For every parsed element in the "context"...
for event, elem in ctx:
    # ...pull out the tag's attributes and apply some formatting to its text.
    # You can drop html.unescape if you are not interested in decoding HTML entities.
    # The body is simply split at the newline character and rejoined
    # from the third piece onwards, which omits the title line.
    csvData.append({"id": elem.get("id"),
                    "url": elem.get("url"),
                    "title": elem.get("title"),
                    "body": html.unescape("".join(elem.text.split("\n")[2:]))})
    elem.clear()  # Important: releases the memory held by the element's parsed data.

# Finally, turn the list of dictionaries into a DataFrame and write out the CSV.
# pandas' to_csv is used here for convenience; the column order is given
# explicitly so that it does not depend on the pandas version.
pandas.DataFrame(csvData, columns=["body", "id", "title", "url"]).to_csv(outFile, index=False)

This produces a CSV (theprocdoc.csv) that looks like this:

body,id,title,url
        It was a pleasure to burn...    ,123,Farenheit451,http://url.org/thing?curid=123
        It was sensibile to be sensitive & nice...    ,456,Sense and sensitivity,http://url.org/thing?curid=456
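
The padding around the body in the output above comes from the indentation in the source file. As a rough alternative sketch, assuming only the Python standard library is available, the same conversion can be done with xml.etree.ElementTree and the csv module, trimming whitespace and dropping the first line only when it repeats the title attribute. The function name docs_to_csv is hypothetical.

```python
import csv
import xml.etree.ElementTree as ET


def docs_to_csv(xml_path, csv_path):
    """Stream <doc> elements from xml_path into a CSV of id, url, title, body."""
    with open(csv_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        writer.writerow(["id", "url", "title", "body"])
        for _event, elem in ET.iterparse(xml_path, events=("end",)):
            if elem.tag != "doc":
                continue  # skip the root element's own "end" event
            lines = [ln.strip() for ln in (elem.text or "").splitlines()]
            lines = [ln for ln in lines if ln]  # drop blank lines
            # Remove the first line only when it duplicates the title.
            if lines and lines[0] == elem.get("title"):
                lines = lines[1:]
            writer.writerow([elem.get("id"), elem.get("url"),
                             elem.get("title"), " ".join(lines)])
            elem.clear()  # discard the element's text once it is written
```

The csv module quotes any field that contains commas, quotes or newlines, so bodies with punctuation stay in one column. A file written this way can then be loaded with PostgreSQL's COPY ... WITH (FORMAT csv, HEADER true) into a table whose columns match; the table itself is up to you.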

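For more information (and because I cannot format links in inline comments), see lxml.etree.iterparse, cgi.escape (the escaping counterpart of the unescape call in the script), and pandas.DataFrame.to_csv.

Hope this helps.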

Solution 1
1 Accepted 2015-07-11 21:10:21
