
Extracting text from an HTML file?

I have a web page which contains a bunch of text and I want to extract just the text from the page and write it to a file. I am trying to use BeautifulSoup but am not sure it easily does what I want. Here is the story: I believe that the text I want to extract lies between:

<td colspan="2" class="msg_text_cell" style="text-align: justify; background-color: rgb(212, 225, 245); background-image: none; background-repeat: repeat-x;" rowspan="2" valign="top" width="100%">

and

<p></p><div style="overflow: hidden; width: 550px; height: 48px;">

What I want to do is select just the text lines in between, not including the begin and end markers above. Note that the begin HTML above is on a line by itself, but the end marker sometimes occurs right after the last text I want rather than on a new line.

I cannot seem to see how to do what I want with BeautifulSoup, but it is probably my unfamiliarity getting in the way.

Also, the text I want to extract occurs about 50 times in the page, so I want each block separated by something like '+++++++++++++++++++++' to make it easier to read.

Thanks much for your help.

Simply put, you can loop over the DOM elements that contain the text you want and extract it that way. Using jQuery, something like $('td.msg_text_cell').each(function (idx, el) { ... }) would work, where idx is the index into the array of jQuery objects matched by the selector above, i.e. all tds with a class of msg_text_cell.

You can do this with native JS too, so don't think I'm pushing jQuery; it's just the framework I'm more familiar with.

You can do it easily with BeautifulSoup

from bs4 import BeautifulSoup as bs

# Sample markup from the question, with a <p> holding the text
html = "<td colspan=\"2\" class=\"msg_text_cell\" style=\"text-align: justify; background-color: rgb(212, 225, 245); background-image: none; background-repeat: repeat-x;\" rowspan=\"2\" valign=\"top\" width=\"100%\"> <p>The text</p><div style=\"overflow: hidden; width: 550px; height: 48px;\">"
# Name the parser explicitly to avoid bs4's "no parser specified" warning
soup = bs(html, "html.parser")
soup.find('p')

You can now find the text inside the <p> tag.

Output: <p>The text</p>

You can now add a loop to process every matching tag.
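For the page in the question, such a loop might look like the sketch below. It assumes every wanted block sits in a td with class msg_text_cell, and that the "overflow: hidden" div is closed inside that cell with nothing wanted after it. The sample HTML and the extract_messages helper are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page: two message cells, each ending with the
# "overflow: hidden" div that marks the end of the wanted text.
html = """
<table>
<tr><td class="msg_text_cell" valign="top">
First message, line one.<br>
First message, line two.<p></p><div style="overflow: hidden; width: 550px; height: 48px;">signature</div>
</td></tr>
<tr><td class="msg_text_cell" valign="top">
Second message.<p></p><div style="overflow: hidden; width: 550px; height: 48px;">signature</div>
</td></tr>
</table>
"""


def extract_messages(soup):
    """Return the text of every msg_text_cell, without the trailing div."""
    messages = []
    for cell in soup.find_all("td", class_="msg_text_cell"):
        # Remove the end-marker div so its contents are not extracted.
        end_div = cell.find("div", style=lambda s: s and "overflow: hidden" in s)
        if end_div is not None:
            end_div.decompose()
        # Join the remaining text fragments, one per line, trimmed.
        messages.append(cell.get_text(separator="\n", strip=True))
    return messages


soup = BeautifulSoup(html, "html.parser")
messages = extract_messages(soup)
for message in messages:
    print(message)
    print("+" * 21)
```

Matching on the class rather than the full style attribute keeps the lookup short, though it would also pick up any other cell on the page that happens to use the same class.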

Then you can save in a file.

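import csv

# newline="" stops the csv module from writing blank rows on Windows
with open("data.csv", "w", newline="") as tW:
    writer = csv.writer(tW, delimiter=",")
    writer.writerow(["Ptag"])  # header row
    # Iterate over every <p> tag, not over the soup object itself
    for i in soup.find_all("p"):
        p = i.get_text()
        writer.writerow([p])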
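Since the question asks for a readable text file with a divider between blocks rather than CSV, a plain-text variant is sketched here. The write_messages helper, the messages.txt filename, and the sample strings are hypothetical; in practice the list would hold the text pulled from each msg_text_cell cell.

```python
from pathlib import Path

# Divider line between extracted blocks, like the one in the question.
SEPARATOR = "+" * 21


def write_messages(messages, path):
    """Write each message to `path` with a separator line between them."""
    body = f"\n{SEPARATOR}\n".join(messages)
    Path(path).write_text(body + "\n", encoding="utf-8")


# Hypothetical sample data standing in for the extracted cell texts.
sample = ["First message.", "Second message."]
write_messages(sample, "messages.txt")
```

Joining with the separator puts the divider only between blocks, so the file does not end with a stray line of plus signs.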

If you know Ruby, I can point you to Nokogiri, an excellent gem for screen scraping.
