Python lxml extract text when a tag exists in the middle of the text
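I am trying to parse and extract all of the text inside the claim-text tags and prepare it for a CSV, so that each claim tag gets one column containing all of its claim text.
Basically, claims come in two styles. The first, <claim id="CLM-00001" num="00001">, has claim-text tags nested inside another claim-text tag. The second style, as in <claim id="CLM-00002" num="00002">, has a <claim-ref> tag in the middle of the text (which seems to be my problem).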
<claims id="claims">
<claim id="CLM-00001" num="00001">
<claim-text>1. A method of forming an amorphous metal foam formed of an amorphous metal powder comprising:
<claim-text>mixing at least one amorphous metal powder and at least one gas-splitting propellant powder into a propellant filled amorphous metal powder mixture, such that upon decomposition of the gas-splitting propellant powder, gas-containing pores are created within the amorphous metal powder mixture;</claim-text>
<claim-text>compacting the mixture such that the amorphous metal powder particles are bonded to one another to form a gas-tight seal around the gas-splitting propellant powder particles, the mixture being compacted at a compacting temperature and pressure sufficient to allow for bonding of the mixture, wherein the temperature is below any crystalline transition temperature of the amorphous metal powder, and for a duration not exceeding a time for any crystalline transformation of said amorphous metal powder at the compacting temperature and pressure;</claim-text>
<claim-text>cooling the compacted mixture at a cooling rate sufficient that the amorphous metal powder mixture remains amorphous;</claim-text>
<claim-text>expanding the compacted amorphous metal powder mixture to form a foam material, said expansion being conducted at an expansion temperature below any crystalline transition temperature of the amorphous metal powder, but sufficiently high to allow bubble expansion, at a surrounding pressure sufficient to promote expansion arising from a difference between a pressure in the gas-containing pores and the surrounding pressure, and for a duration not exceeding the time for any crystalline transformation to take place; and</claim-text>
<claim-text>cooling the expanded foam material in order to allow the foam material to remain amorphous.</claim-text>
</claim-text>
</claim>
<claim id="CLM-00002" num="00002">
<claim-text>2. The method according to <claim-ref idref="CLM-00001">claim 1</claim-ref> wherein the gas-splitting propellant powder decomposes during expansion.</claim-text>
</claim>
<claim id="CLM-00003" num="00003">
<claim-text>3. The method according to <claim-ref idref="CLM-00001">claim 1</claim-ref> wherein the gas-splitting propellant powder decomposes during compaction.</claim-text>
</claim>
...
...
...
</claims>
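I have tried this: Python element tree - extract text from element, stripping tags
And this: python xml.etree.ElementTree remove empty tag in the middle of text
I tried the itertext() method, and for the first claim tag it gave me this (everything I need):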
['1. A method of forming an amorphous metal foam formed of an amorphous metal powder comprising:\n ', 'mixing at least one amorphous metal powder and at least one gas-splitting propellant powder into a propellant filled amorphous metal powder mixture, such that upon decomposition of the gas-splitting propellant powder, gas-containing pores are created within the amorphous metal powder mixture;', '\n ', 'compacting the mixture such that the amorphous metal powder particles are bonded to one another to form a gas-tight seal around the gas-splitting propellant powder particles, the mixture being compacted at a compacting temperature and pressure sufficient to allow for bonding of the mixture, wherein the temperature is below any crystalline transition temperature of the amorphous metal powder, and for a duration not exceeding a time for any crystalline transformation of said amorphous metal powder at the compacting temperature and pressure;', '\n ', 'cooling the compacted mixture at a cooling rate sufficient that the amorphous metal powder mixture remains amorphous;', '\n ', 'expanding the compacted amorphous metal powder mixture to form a foam material, said expansion being conducted at an expansion temperature below any crystalline transition temperature of the amorphous metal powder, but sufficiently high to allow bubble expansion, at a surrounding pressure sufficient to promote expansion arising from a difference between a pressure in the gas-containing pores and the surrounding pressure, and for a duration not exceeding the time for any crystalline transformation to take place; and', '\n ', 'cooling the expanded foam material in order to allow the foam material to remain amorphous.', '\n ', '\n ']
Now, for the next claim tag, <claim id="CLM-00002" num="00002">, it should ideally give me:
The method according to wherein the gas-splitting propellant powder decomposes during expansion.
But it gives me:
['2. The method according to ', '\n ']
The code I am using that produces this result is:
result = []
for doc in root.xpath('//claims/claim/claim-text'):
    textwork = ((doc.getparent()).itertext('claim-text'))
    b=[]
    for texts in textwork:
        b.append(texts)
    result.append([b])
write_all_to_csv(result, FILENAME_CLAIMS)
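Note: the code is a shortened version. I also extract other things from the claims, and that part works fine. I just shortened it to focus on the problem.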
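Just remove the tag name from the itertext() call and it will extract all the relevant text inside the tag. Passing 'claim-text' restricts the iteration to text that belongs to claim-text elements, so the text of <claim-ref> and the text that follows it (its tail) are skipped. Hope this helps.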
from lxml import etree

# `xml` is assumed to hold the <claims> document from the question as a string.
root = etree.fromstring(xml)
result = []
for doc in root.xpath('//claims/claim/claim-text'):
    # No tag filter: itertext() yields all text in the claim,
    # including the <claim-ref> text and the text after it.
    textwork = ''.join(doc.getparent().itertext())
    result.append([textwork])
print(result)
# write_all_to_csv(result, FILENAME_CLAIMS)
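To make the difference concrete, here is a minimal, self-contained sketch that compares itertext('claim-text') with a plain itertext() on a shortened copy of the XML from the question. The names SAMPLE_XML and claim_texts are made up for this illustration, and the whitespace cleanup with split()/join() is an extra step that is not part of the original answer; it is only one possible way to get a single tidy string per claim before writing it to a CSV.

```python
from lxml import etree

# Shortened sample based on the XML in the question (claim 1 is abbreviated).
SAMPLE_XML = """<claims id="claims">
<claim id="CLM-00001" num="00001">
<claim-text>1. A method comprising:
<claim-text>mixing the powders;</claim-text>
<claim-text>cooling the mixture.</claim-text>
</claim-text>
</claim>
<claim id="CLM-00002" num="00002">
<claim-text>2. The method according to <claim-ref idref="CLM-00001">claim 1</claim-ref> wherein the gas-splitting propellant powder decomposes during expansion.</claim-text>
</claim>
</claims>"""


def claim_texts(root):
    """Return one whitespace-normalized string per <claim> element (hypothetical helper)."""
    rows = []
    for claim in root.xpath('//claims/claim'):
        # Unfiltered itertext() walks every descendant, so the text inside
        # <claim-ref> and the text following it are both included.
        raw = ''.join(claim.itertext())
        # Collapse newlines and indentation into single spaces.
        rows.append(' '.join(raw.split()))
    return rows


root = etree.fromstring(SAMPLE_XML)
claim2 = root.xpath('//claim[@id="CLM-00002"]')[0]

# With the tag filter, only text attached to <claim-text> elements is returned,
# so the sentence is cut off at the <claim-ref>.
filtered = list(claim2.itertext('claim-text'))
texts = claim_texts(root)

print(filtered)
print(texts[1])
```

Iterating over //claims/claim directly, rather than over each claim-text and calling getparent(), is a design choice of this sketch: it yields exactly one row per claim even if a claim were to contain several top-level claim-text children.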