
XPATH - how to get inner html data free from <br> tags?

This question has been asked before,

This is the HTML data:

<p>
This is some important data
<br>
Even this is data
<br>
this is useful too
</p>


<othertag>
 othertag data
</othertag>
<moretag>
 moretag data
</moretag>
....
repeating n times
....

My goal is to extract the text inside <p></p> as a single string, without it being split at the <br> tags, along with the other data.

This is my query

//p//text() | //othertag//text() | //moretag//text()

This gave

('This is is some important data', 'even this data', 'this is useful too',
'othertag data','moretag data')

Notice that the text of the <p> tag has been split in the output above.

I'd like it returned as a single unit, like this:

('This is is some important data even this data this is useful too',
'othertag data','moretag data')

If that is impossible, can I at least get it this way?

('This is is some important <br> data even this data <br> this is useful too',
'othertag data','moretag data') 

I cannot simply join the results, because it would be hard to selectively join a variable number of list items at variable indexes: nobody can predict how many <br> tags there will be, so the <p> text may be split into any number of pieces.

My Attempts (with help from other users)

string(//p//text()) | //othertag//text() | //moretag//text()

The above query gives an XPath error.

So does this attempt with a custom function:

import lxml.html, lxml.etree

ns = lxml.etree.FunctionNamespace(None)

def cat(context, a):
    return ''.join(a)
ns['cat'] = cat

This query also gave an InvalidType error:

cat(//p//text()) | //othertag//text() | //moretag//text()

I'm using Python 2.7.

If you are open to using other libraries, then you can use BeautifulSoup for this.

Demo -

>>> s = """<p>
... This is some important data
... <br>
... Even this is data
... <br>
... this is useful too
... </p>
...
...
... <othertag>
...  othertag data
... </othertag>
... <moretag>
...  moretag data
... </moretag>"""

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(s,'html.parser')

>>> soup.find('p').text
'\nThis is some important data\n\nEven this is data\n\nthis is useful too\n'

>>> print(soup.find('p').text)

This is some important data

Even this is data

this is useful too
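
The newlines left over from the markup can be collapsed afterwards in plain Python. A minimal sketch, assuming single spaces between the pieces are what you want (collapse_whitespace is a hypothetical helper name, not part of BeautifulSoup):

```python
def collapse_whitespace(text):
    # str.split() with no arguments splits on any run of whitespace
    # (spaces, tabs, newlines) and drops empty pieces, so joining the
    # result with a single space normalizes the string.
    return ' '.join(text.split())

# The raw value that soup.find('p').text returned in the demo above.
raw = '\nThis is some important data\n\nEven this is data\n\nthis is useful too\n'

joined = collapse_whitespace(raw)
print(joined)
```

The same helper can be applied to each string in a list of results, whichever library produced them.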

You can try the following custom XPath function.

Demo code:

import lxml.html, lxml.etree

source = '''your html here'''
doc = lxml.html.fromstring(source)
ns = lxml.etree.FunctionNamespace(None)

def cat(context, elements):
    return [''.join(e.xpath('.//text()')) for e in elements]
ns['concat-texts'] = cat

print repr(doc.xpath('concat-texts(//p)| //othertag//text() | //moretag//text()'))

Sample HTML input:

source = '''
<p>
This is some important data
<br>
Even this is data
<br>
this is useful too
</p>

<p>
foo
<br>
bar
<br>
baz
</p>

<othertag>
 othertag data
</othertag>
<moretag>
 moretag data
</moretag>
'''

Output:

['\nThis is some important data\n\nEven this is data\n\nthis is useful too\n', '\nfoo\n\nbar\n\nbaz\n', '\n othertag data\n', '\n moretag data\n']

I know this comes late, but somebody might still find it useful. I got it working by replacing the br tags in the original HTML. The content was a bytes object, so it had to be decoded and re-encoded, but it worked like a charm:

from lxml import html
import requests

page = requests.get("the website you are getting the html from")
content = page.content.decode('utf-8').replace("<br>", " ").encode('utf-8')
tree = html.fromstring(content)

After this, //p//text() returned 'This is is some important data even this data this is useful too', which is what you wanted.
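
Note that a literal replace of "<br>" misses variants such as <br/>, <br /> or <BR>. A hedged sketch that uses a regular expression instead; the pattern and the replace_br helper are my own assumptions, not part of lxml or requests, and replacing tags with a regex is only reasonable for a simple void tag like this one:

```python
import re

# Matches <br>, <br/>, <br />, <BR> and <br class="x">, but not <break>.
BR_TAG = re.compile(r'<br\b[^>]*>', re.IGNORECASE)

def replace_br(html_text, replacement=' '):
    # Substitute every <br> variant before the markup is parsed, so the
    # surrounding text ends up in a single text node of the parent.
    return BR_TAG.sub(replacement, html_text)

sample = ("<p>\nThis is some important data\n<br>\n"
          "Even this is data\n<BR/>\nthis is useful too\n</p>")

cleaned = replace_br(sample)
print(cleaned)
```

The cleaned string can then be passed to html.fromstring() as in the snippet above.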

You say: "I'd want it formatted as a proper unit like below,

('This is is some important data even this data this is useful too', 'othertag data','moretag data')"

But actually, XPath does not do formatting. You're suggesting that you want a sequence of three strings returned; the formatting is done later.

You're using Python which means, I assume, that you are using XPath 1.0. In XPath 1.0, there is no such thing as a sequence of three strings. You could return three nodes (the p, othertag, and moretag nodes), and then extracting the string values of these nodes becomes a Python problem rather than an XPath problem. Or you could return the three strings in three separate calls: for example, string(//p) would give you the string value of the first p element.
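
A minimal sketch of the "return the nodes, then extract their string values in Python" approach, assuming lxml.html and the sample markup from the question (element_texts is a hypothetical helper name). It also shows string(//p) evaluated as a separate call:

```python
import lxml.html

source = """<p>
This is some important data
<br>
Even this is data
<br>
this is useful too
</p>

<othertag>
 othertag data
</othertag>
<moretag>
 moretag data
</moretag>"""

doc = lxml.html.fromstring(source)

def element_texts(root, xpath):
    # Select elements (not text nodes), then build one string per element.
    # text_content() concatenates all descendant text; split/join collapses
    # the whitespace that surrounded the <br> tags.
    return [' '.join(el.text_content().split()) for el in root.xpath(xpath)]

texts = element_texts(doc, '//p | //othertag | //moretag')
print(texts)

# string() converts only the first node of the node-set to a string.
first_p = ' '.join(doc.xpath('string(//p)').split())
print(first_p)
```

Because the union selects elements, the number of <br> tags inside each <p> no longer matters: every matched element yields exactly one string, in document order.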

In your question you say the data is repeated, but you don't say which data is repeated. I don't have a clear picture of what your real source document looks like. That's probably why the answers to your question, including mine, are so incomplete.
