I am trying to create an adjacency matrix of t_lemma values, meaning which t_lemma is the parent of which, from this XML document representing a syntactic analysis of a (Czech) sentence, where t_lemma is the base form of a specific word. Other elements such as nodetype, ord, etc. can be ignored; I am including them just for completeness in case they are somehow needed.
Currently I am using the cElementTree library for Python, but I am open to using another one if what I am asking is impossible or too computationally expensive with cElementTree.
<t_tree id="t_tree-cs-s1-root">
  <atree.rf>a_tree-cs-s1-root</atree.rf>
  <ord>0</ord>
  <children id="t_tree-cs-s1-n107">
    <children>
      <LM id="t_tree-cs-s1-n108">
        <nodetype>complex</nodetype>
        <ord>1</ord>
        <t_lemma>muž</t_lemma>
        <functor>ACT</functor>
        <formeme>n:1</formeme>
        <is_clause_head>0</is_clause_head>
        <clause_number>1</clause_number>
        <a>
          <lex.rf>a_tree-cs-s1-n1</lex.rf>
        </a>
        <gram>
          <sempos>n.denot</sempos>
          <gender>anim</gender>
          <number>sg</number>
          <negation>neg0</negation>
        </gram>
      </LM>
      <LM id="t_tree-cs-s1-n109">
        <nodetype>complex</nodetype>
        <ord>3</ord>
        <t_lemma>strom</t_lemma>
        <functor>PAT</functor>
        <formeme>n:4</formeme>
        <is_clause_head>0</is_clause_head>
        <clause_number>1</clause_number>
        <a>
          <lex.rf>a_tree-cs-s1-n3</lex.rf>
        </a>
        <gram>
          <sempos>n.denot</sempos>
          <gender>inan</gender>
          <number>sg</number>
          <negation>neg0</negation>
        </gram>
      </LM>
    </children>
    <nodetype>complex</nodetype>
    <ord>2</ord>
    <t_lemma>zasadit</t_lemma>
    <functor>PRED</functor>
    <formeme>v:fin</formeme>
    <sentmod>enunc</sentmod>
    <is_clause_head>1</is_clause_head>
    <clause_number>1</clause_number>
    <a>
      <lex.rf>a_tree-cs-s1-n2</lex.rf>
    </a>
    <gram>
      <sempos>v</sempos>
      <verbmod>ind</verbmod>
      <deontmod>decl</deontmod>
      <tense>ant</tense>
      <aspect>cpl</aspect>
      <resultative>res0</resultative>
      <dispmod>disp0</dispmod>
      <iterativeness>it0</iterativeness>
      <negation>neg0</negation>
      <diathesis>act</diathesis>
    </gram>
  </children>
</t_tree>
This XML represents a tree in which zasadit is the parent of both muž and strom.
What I am trying to get is a matrix looking like this:
          muž  strom  zasadit
muž        1     0      -1
strom      0     1      -1
zasadit    1     1       1
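For reference, a matrix like the one above can be built once the parent of each word is known. The following is only a minimal sketch. It assumes the sign convention that can be read off the example table (1 on the diagonal and where the row word is the parent of the column word, -1 where the column word is the parent of the row word, 0 otherwise). The names `parent_of` and `build_matrix` are hypothetical, and the parent mapping is written by hand here rather than extracted from the XML.

```python
def build_matrix(parent_of):
    """Build a signed adjacency matrix from a child -> parent mapping.

    Assumed convention (inferred from the example table):
      1  on the diagonal and where the row node is the parent of the column node,
      -1 where the column node is the parent of the row node,
      0  otherwise.
    """
    labels = sorted(parent_of)  # fixed row/column order
    index = {label: i for i, label in enumerate(labels)}
    size = len(labels)
    matrix = [[1 if r == c else 0 for c in range(size)] for r in range(size)]
    for child, parent in parent_of.items():
        if parent in index:  # the top node's parent has no row of its own
            matrix[index[parent]][index[child]] = 1
            matrix[index[child]][index[parent]] = -1
    return labels, matrix


# Hand-written parent mapping for the example sentence (None marks the top node).
parent_of = {"muž": "zasadit", "strom": "zasadit", "zasadit": None}
labels, matrix = build_matrix(parent_of)
for label, row in zip(labels, matrix):
    print(label, row)
```

A nested list keeps the sketch dependency-free; for large trees a NumPy array or a sparse matrix would probably be a better fit.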
I have figured out an answer that works on the very big trees I have tested it on. However, I had to take the element <ord> (the position of a word in the sentence) into account to avoid the problem that would arise with sentences like: "Man and woman, walking day and night."
             walking
            /       \
        and           and
       /   \         /   \
    man   woman    day   night
Taking only <t_lemma> into account would make the (child -> parent) mapping ambiguous: there would be two "and" nodes, and the words man, woman, day and night would all point to an indistinguishable "and", like this:
element    parent
_________________
man        and
woman      and
day        and
night      and
and        walking
and        walking
Appending <ord> to each lemma turned the previous table into the following:
element    parent
_________________
man:1      and:2
woman:3    and:2
day:5      and:6
night:7    and:6
and:2      walking:4
and:6      walking:4
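The difference between the two tables can be checked with a small sketch. The edge list below is simply the second table written out by hand, and the names `edges`, `by_lemma` and `by_lemma_ord` are hypothetical. It shows that a dictionary keyed by lemma alone loses one of the "and" nodes, while a "lemma:ord" key keeps all six.

```python
# (lemma, ord, parent lemma, parent ord) for "Man and woman, walking day and night."
edges = [
    ("man", 1, "and", 2),
    ("woman", 3, "and", 2),
    ("day", 5, "and", 6),
    ("night", 7, "and", 6),
    ("and", 2, "walking", 4),
    ("and", 6, "walking", 4),
]

by_lemma = {}
by_lemma_ord = {}
for lemma, ord_, p_lemma, p_ord in edges:
    by_lemma[lemma] = p_lemma  # the second "and" overwrites the first
    by_lemma_ord[f"{lemma}:{ord_}"] = f"{p_lemma}:{p_ord}"

print(len(by_lemma))      # the two "and" nodes share one key
print(len(by_lemma_ord))  # every node keeps its own key
```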
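So, here is the working Python code:
import xml.etree.ElementTree as ET  # xml.etree.cElementTree was removed in Python 3.9

parentDictionary = {}

def getchildlemma(element, parent):
    # Identify a node as "t_lemma:ord" so that repeated lemmas stay distinct.
    lemma = element.findtext("t_lemma")
    if lemma is not None:
        order = element.findtext("ord")
        e = lemma if order is None else lemma + ":" + order
        parentDictionary[e] = parent
        parent = e
    # <children> and <LM> hold the child nodes; a wrapper without its own
    # t_lemma just passes the current parent through.
    for i in element:
        if i.tag == "children" or i.tag == "LM":
            getchildlemma(i, parent)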
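The post does not show how the function is called, so here is a self-contained sketch of one possible end-to-end run. It uses a trimmed copy of the XML above (only the elements the traversal reads) and a hypothetical helper `collect_parents` that follows the same idea but returns the dictionary instead of filling a global. The initial parent label "root" is an assumption.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the document from the question: only the elements the traversal reads.
XML = """
<t_tree id="t_tree-cs-s1-root">
  <ord>0</ord>
  <children id="t_tree-cs-s1-n107">
    <children>
      <LM id="t_tree-cs-s1-n108"><ord>1</ord><t_lemma>muž</t_lemma></LM>
      <LM id="t_tree-cs-s1-n109"><ord>3</ord><t_lemma>strom</t_lemma></LM>
    </children>
    <ord>2</ord>
    <t_lemma>zasadit</t_lemma>
  </children>
</t_tree>
"""


def collect_parents(element, parent, parents):
    """Fill `parents` with "lemma:ord" -> parent key, depth first."""
    lemma = element.findtext("t_lemma")  # direct children only
    if lemma is not None:
        key = f"{lemma}:{element.findtext('ord')}"
        parents[key] = parent
        parent = key
    for child in element:
        if child.tag in ("children", "LM"):
            collect_parents(child, parent, parents)
    return parents


root = ET.fromstring(XML)
parents = collect_parents(root, "root", {})  # "root" is an assumed label for the top
print(parents)
```

Each element is visited once, so the traversal is linear in the size of the document. For very deep trees Python's recursion limit could become a concern, in which case an explicit stack would be an alternative.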