
Going deeper with xpath node()

I'm trying to find the surrounding text of all hyperlinks within paragraphs on Wikipedia pages, and the way I'm doing that involves using the xpath tree.xpath("//p/node()") . Things work fine on most links, and I'm able to find most of them as <Element a at $mem_location$> . However, if a hyperlink is italicized (see example below), node() only sees it as an <Element i at $mem_location$> and doesn't look any deeper.

This is causing my code to miss hyperlinks, and messes up the indexing for the rest of the page.

Ex:

<p>The closely related term, <a href="/wiki/Mange" title="Mange">mange</a>,
is commonly used with <a href="/wiki/Domestic_animal" title="Domestic animal" class="mw-redirect">domestic animals</a> 
(pets) and also livestock and wild mammals, whenever hair-loss is involved. 

<i><a href="/wiki/Sarcoptes" title="Sarcoptes">Sarcoptes</a></i> 
and <i><a href="/wiki/Demodex" title="Demodex">Demodex</a></i> 
species are involved in mange, both of these genera are also involved in human skin diseases (by 
convention only, not called mange). <i>Sarcoptes</i> in humans is especially 
severe symptomatically, and causes the condition known as 
<a href="/wiki/Scabies" title="Scabies">scabies</a>.</p>

The node() grabs "Mange", "Domestic animal", and "Scabies" properly, but skips "Sarcoptes" and "Demodex" and throws off the indexing, since I'm filtering for nodes that are <Element a at $mem_location$> , not <Element i at $mem_location$> .

Is there a way to look deeper with node() ? I couldn't find anything in the documentation about it.

Edit: My xpath is "//p/node()" right now, but it's only grabbing the outermost element layer. Most of the time it's <a> , which is great, but if it's wrapped in an <i> layer, it only grabs that. I'm asking if there's a way I can check deeper, so that I might be able to find the <a> within the <i> wrapper.
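A minimal repro (assuming lxml, with a shortened version of the paragraph above) shows the behavior: node() only selects the immediate children of <p>, but each element node it returns can itself be queried deeper with a relative xpath:

```python
from lxml import etree

html = ('<p>The closely related term <a href="/wiki/Mange">mange</a> and '
        '<i><a href="/wiki/Sarcoptes">Sarcoptes</a></i>.</p>')
tree = etree.HTML(html)

nodes = tree.xpath('//p/node()')
# node() returns only the direct children of <p>: text, <a>, text, <i>, text
print([getattr(n, 'tag', 'text') for n in nodes])  # ['text', 'a', 'text', 'i', 'text']

# but any element node can be searched deeper with a relative xpath:
inner = nodes[3].xpath('.//a')   # the <a> hidden inside <i>
print(inner[0].get('href'))      # /wiki/Sarcoptes
```

So one option is to keep "//p/node()" for the outer walk, and call node.xpath('.//a') on any element whose tag isn't 'a' to dig out nested links.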

The relevant code is below:

import re
from lxml import etree

tree = etree.HTML(read)

titles = list(tree.xpath('//p//a[contains(@href, "/wiki/")]/@title'))  # titles of all hyperlinks in section paragraphs
hyperlinks = list(tree.xpath('//p//a[contains(@href, "/wiki/")]/text()'))
b = list(tree.xpath('//p/b/text()'))  # all bolded words in section paragraphs
t = list(tree.xpath('//p/node()'))

b_count = 0
a_count = 0
test = []
for items in t:
    print(items)
    items = str(items)
    if "<Element b" in items:
        test.append(b[b_count])
        b_count += 1
        continue
    if "<Element a" in items:
        test.append((hyperlinks[a_count], titles[a_count]))
        a_count += 1
        continue

    if "<Element " not in items:
        pattern = re.compile('(\t(.*?)\n)')
        look = pattern.search(items)

        if look is not None:  # if there is a match
            test.append(look.group().partition("\t")[2].partition("\n")[0])

        period_pattern = re.compile('(\t(.*?)\.)')
        look_period = period_pattern.search(items)
        if look_period is not None:
            test.append(look_period.group().partition("\t")[2])

I cannot think of a single xpath that can do the trick, but you can always loop through the contents and splice an <i> element's children back into the list, like this -

for i, x in enumerate(t):
    if getattr(x, 'tag', None) == 'i':  # getattr: text nodes in t have no .tag
        aNodes = x.findall('a')
        if aNodes:
            del t[i]
            # x.xpath('node()') so text nodes are spliced in as well as <a> elements
            for j, y in enumerate(x.xpath('node()')):
                t.insert(i + j, y)

This would handle multiple a inside a single i as well, like <i><a>something</a><a>blah</a></i>
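A self-contained sketch of that splice, using a while loop instead of enumerate (mutating a list while enumerating it can skip entries) and an example paragraph assumed from the question:

```python
from lxml import etree

html = ('<p>mange, <i><a href="/wiki/Sarcoptes">Sarcoptes</a> and '
        '<a href="/wiki/Demodex">Demodex</a></i> species</p>')
tree = etree.HTML(html)
t = list(tree.xpath('//p/node()'))

# Replace each <i> wrapper with its own child nodes, in place.
i = 0
while i < len(t):
    node = t[i]
    if getattr(node, 'tag', None) == 'i' and node.findall('a'):
        inner = node.xpath('node()')  # text nodes as well as <a> elements
        t[i:i + 1] = inner            # splice the children where <i> was
        i += len(inner)
    else:
        i += 1

hrefs = [n.get('href') for n in t if getattr(n, 'tag', None) == 'a']
print(hrefs)  # ['/wiki/Sarcoptes', '/wiki/Demodex']
```

After the splice, both previously hidden links appear at the top level of t, so the questioner's index-based bookkeeping lines up again.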
