I'm trying to find the surrounding text of all hyperlinks within paragraphs on Wikipedia pages, and the way I'm doing that involves the xpath tree.xpath("//p/node()"). Things work fine for most links, and I'm able to find most things that are <Element a at $mem_location$>. However, if a hyperlink is italicized (see the example below), the xpath node() only sees it as an <Element i at $mem_location$> and doesn't look any deeper.
This is causing my code to miss hyperlinks, and it messes up the indexing for the rest of the page.
Ex:
<p>The closely related term, <a href="/wiki/Mange" title="Mange">mange</a>,
is commonly used with <a href="/wiki/Domestic_animal" title="Domestic animal" class="mw-redirect">domestic animals</a>
(pets) and also livestock and wild mammals, whenever hair-loss is involved.
<i><a href="/wiki/Sarcoptes" title="Sarcoptes">Sarcoptes</a></i>
and <i><a href="/wiki/Demodex" title="Demodex">Demodex</a></i>
species are involved in mange, both of these genera are also involved in human skin diseases (by
convention only, not called mange). <i>Sarcoptes</i> in humans is especially
severe symptomatically, and causes the condition known as
<a href="/wiki/Scabies" title="Scabies">scabies</a>.</p>
The node() grabs "Mange", "Domestic animal", and "Scabies" properly, but it skips "Sarcoptes" and "Demodex" and throws off the indexing, since I'm filtering out nodes that are <Element a at $mem_location$> and not <Element i at $mem_location$>.
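Here is a minimal sketch of what I'm seeing (simplified HTML, not the real page):

from lxml import etree

# Simplified stand-in for the Wikipedia paragraph above, not the real page.
html = '<p>and <i><a href="/wiki/Demodex" title="Demodex">Demodex</a></i> species</p>'
tree = etree.HTML(html)
for node in tree.xpath('//p/node()'):
    print(repr(node))
# Output is roughly:
#   'and '
#   <Element i at 0x...>
#   ' species'
# The <a> element never shows up at this level; only its <i> wrapper does.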
Is there a way to look deeper with node()? I couldn't find anything about it in the documentation.
Edit: My xpath is "//p/node()" right now, but it only grabs the outermost element layer. Most of the time that layer is an <a>, which is great, but if the link is wrapped in an <i> layer, it only grabs the <i>. I'm asking whether there is a way I can check deeper, so that I can find the <a> within the <i> wrapper.
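Something along these lines is the kind of "checking deeper" I mean (a rough sketch on simplified HTML, not my actual code):

from lxml import etree

# Rough sketch of the "check deeper" idea: for element nodes returned by
# node() that are not <a>, look for <a> descendants with a relative XPath.
html = '<p>and <i><a href="/wiki/Demodex" title="Demodex">Demodex</a></i> species</p>'
tree = etree.HTML(html)
for node in tree.xpath('//p/node()'):
    tag = getattr(node, 'tag', None)      # text nodes have no .tag attribute
    if tag is not None and tag != 'a':
        for a in node.xpath('.//a'):      # <a> elements nested inside, e.g. inside <i>
            print(a.get('title'))         # prints: Demodex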
The relevant code is below:

import re
from lxml import etree

tree = etree.HTML(read)  # read is the downloaded page HTML (defined elsewhere)
titles = list(tree.xpath('//p//a[contains(@href,"/wiki/")]/@title'))  # titles of all hyperlinks in section paragraphs
hyperlinks = list(tree.xpath('//p//a[contains(@href,"/wiki/")]/text()'))
b = list(tree.xpath("//p/b/text()"))  # all bolded words in section paragraphs
t = list(tree.xpath("//p/node()"))
b_count = 0
a_count = 0
test = []
for items in t:
    print(items)
    items = str(items)
    if "<Element b" in items:
        test.append(b[b_count])
        b_count += 1
        continue
    if "<Element a" in items:
        test.append((hyperlinks[a_count], titles[a_count]))
        a_count += 1
        continue
    if "<Element " not in items:
        pattern = re.compile('(\t(.*?)\n)')
        look = pattern.search(items)
        if look is not None:  # if there is a match
            test.append(look.group().partition("\t")[2].partition("\n")[0])
        period_pattern = re.compile("(\t(.*?)\.)")
        look_period = period_pattern.search(items)
        if look_period is not None:
            test.append(look_period.group().partition("\t")[2])
I cannot think of a single XPath expression that can do the trick, but you can always loop through the contents and filter out the elements, like this:
for x in list(t):                       # iterate over a copy so t can be modified in place
    if getattr(x, 'tag', None) == 'i':  # text nodes in t have no .tag attribute
        aNodes = x.findall('a')
        if len(aNodes) > 0:
            i = t.index(x)
            del t[i]
            # x.xpath('node()') takes in the text nodes as well as the <a> elements
            for j, y in enumerate(x.xpath('node()')):
                t.insert(i + j, y)
This would also handle multiple <a> elements inside a single <i>, like <i><a>something</a><a>blah</a></i>.
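As a rough usage check (my own sketch, not part of the answer above), running that loop over the italicized links from the question's example should leave t with the nested <a> elements in place of their <i> wrappers:

from lxml import etree

html = ('<p><i><a href="/wiki/Sarcoptes" title="Sarcoptes">Sarcoptes</a></i> and '
        '<i><a href="/wiki/Demodex" title="Demodex">Demodex</a></i> species are involved.</p>')
tree = etree.HTML(html)
t = list(tree.xpath('//p/node()'))

# Same unwrapping loop as above.
for x in list(t):
    if getattr(x, 'tag', None) == 'i':
        aNodes = x.findall('a')
        if len(aNodes) > 0:
            i = t.index(x)
            del t[i]
            for j, y in enumerate(x.xpath('node()')):
                t.insert(i + j, y)

for node in t:
    print(repr(node))
# t now holds the two <a> elements directly, alongside the surrounding text nodes.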