Extracting text from an HTML file?

Question

I have a web page which contains a bunch of text and I want to extract just the text from the page and write it to a file. I am trying to use BeautifulSoup but am not sure it easily does what I want. Here is the story: I believe that the text I want to extract lies between:

<td colspan="2" class="msg_text_cell" style="text-align: justify; background-color: rgb(212, 225, 245); background-image: none; background-repeat: repeat-x;" rowspan="2" valign="top" width="100%">

and

<p></p><div style="overflow: hidden; width: 550px; height: 48px;">

What I want to do is select just the text lines in between, not including the begin and end text above. Note that the begin HTML above is on a line by itself, but the end text sometimes occurs right after the last text I want, not on a new line.

I cannot see how to do what I want with BeautifulSoup, but it is probably my unfamiliarity getting in the way.

Also, the text I want to extract occurs, say, 50 times in the page, so I want each piece of text separated by something like '+++++++++++++++++++++' to make the output easier to read.

Thanks much for your help.

Answer 1

Simply put, you can loop over the DOM elements that contain the text you want and extract it that way. Using jQuery, something like:

$('td.msg_text_cell').each(function (idx, el) {
    // idx is the index in the array of jQuery objects found by the selector above,
    // which gets all tds with a class of msg_text_cell; el is the matched element
    ...
});

You can do this with native JS as well, so don't think that I'm pushing jQuery; it's just the framework I'm more familiar with.
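Since the question asks about Python, here is a rough equivalent of the loop above using BeautifulSoup's select(), which accepts the same CSS selector. This is only a sketch: the sample markup and the other_cell class are made up for illustration, and the real page may nest the cells differently.

```python
from bs4 import BeautifulSoup

# Made-up sample page; only the msg_text_cell class comes from the question.
html = """
<table>
  <tr><td class="msg_text_cell">first message</td></tr>
  <tr><td class="other_cell">navigation</td></tr>
  <tr><td class="msg_text_cell">second message</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# select() takes the same CSS selector as the jQuery call.
cells = soup.select("td.msg_text_cell")

# enumerate() supplies the index that jQuery passes to the callback as idx.
texts = [(idx, cell.get_text(strip=True)) for idx, cell in enumerate(cells)]

for idx, text in texts:
    print(idx, text)
```

The cell with the other class is skipped, so only the two message cells are printed with their indexes.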

Answer 2

You can do it easily with BeautifulSoup:

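from bs4 import BeautifulSoup

# Sample markup from the question, with a <p> holding the text to extract.
html = "<td colspan=\"2\" class=\"msg_text_cell\" style=\"text-align: justify; background-color: rgb(212, 225, 245); background-image: none; background-repeat: repeat-x;\" rowspan=\"2\" valign=\"top\" width=\"100%\"> <p>The text</p><div style=\"overflow: hidden; width: 550px; height: 48px;\">"
# Name the parser explicitly so the result does not depend on what is installed.
soup = BeautifulSoup(html, "html.parser")
print(soup.find('p'))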

This finds the first <p> tag, which holds the text:

Output: <p>The text</p>

You can then loop over all the matching tags and save their text to a file:

import csv

with open("data.csv", "w", newline="") as tW:
    writer = csv.writer(tW, delimiter=",")
    writer.writerow(["Ptag"])
    # Write the text of every <p> tag, one per row.
    for p in soup.find_all('p'):
        writer.writerow([p.get_text()])
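To get closer to what the question asks for (every msg_text_cell block, separated by a line of plus signs), something along these lines might work. It is a sketch under assumptions: that the closing "overflow: hidden" div sits inside the same td, and that plain text output is wanted rather than CSV. The function names and the sample markup are made up for illustration.

```python
from bs4 import BeautifulSoup

SEPARATOR = "+" * 21  # the divider suggested in the question


def extract_messages(html):
    """Return the text of every <td class="msg_text_cell"> in the page."""
    soup = BeautifulSoup(html, "html.parser")
    messages = []
    for cell in soup.find_all("td", class_="msg_text_cell"):
        # Assumption: the end-marker div is nested inside the cell, so drop it
        # before reading the text.
        for div in cell.find_all("div", style=lambda s: s and "overflow: hidden" in s):
            div.decompose()
        messages.append(cell.get_text(" ", strip=True))
    return messages


def format_messages(messages):
    """Join the messages with the separator on a line of its own."""
    return ("\n" + SEPARATOR + "\n").join(messages) + "\n"


def write_messages(messages, path):
    """Write the formatted messages to a text file."""
    with open(path, "w", encoding="utf-8") as out:
        out.write(format_messages(messages))


# Made-up sample with two message cells.
sample = (
    '<table><tr><td class="msg_text_cell">First message<p></p>'
    '<div style="overflow: hidden; width: 550px; height: 48px;">signature</div></td></tr>'
    '<tr><td class="msg_text_cell">Second <b>message</b></td></tr></table>'
)

messages = extract_messages(sample)
print(format_messages(messages), end="")
```

Calling write_messages(messages, "messages.txt") would save the same output to a file. If the end-marker div turns out to be a sibling of the td rather than a child, the decompose step simply finds nothing and can be removed.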

Answer 3

If you know some Ruby, I can point you to Nokogiri, which is an excellent gem for screen scraping.

3 answers

solution1
0 2013-07-31 01:39:03

solution2
0 2016-11-07 07:30:39

solution3
0 2011-10-01 17:53:40
