I need to write a script that shows me all the characters between two keywords.
Say I download an HTML page and read it (it contains 33985 characters). I need to print everything between "<td class="ml_subject"><a href="?tab=inbox"
and "</a></td>",
which is a dozen or so characters further on.
I can find the starting point with:
if '<td class="ml_subject"><a href="?tab=inbox"' in html:
    print("Success")
but what then?
Use the find() method: http://docs.python.org/library/stdtypes.html#str.find
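This would look something like this:
# html is your input string
start_marker = '<td class="ml_subject"><a href="?tab=inbox">'
start = html.find(start_marker)  # -1 if the marker is missing
end = html.find('</a></td>', start)
# skip past the opening marker so only the text in between is kept
result = html[start + len(start_marker):end]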
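A page of that size probably contains more than one such cell. The following is a minimal sketch, not a definitive implementation, that repeats the same find() search from the previous end position to collect every match. The function name all_between and the sample HTML are made up for illustration.

```python
def all_between(text, start_marker, end_marker):
    """Return every substring found between start_marker and end_marker."""
    results = []
    pos = 0
    while True:
        start = text.find(start_marker, pos)
        if start == -1:          # no further opening marker
            break
        start += len(start_marker)
        end = text.find(end_marker, start)
        if end == -1:            # opening marker without a closing one
            break
        results.append(text[start:end])
        pos = end + len(end_marker)
    return results

# Hypothetical sample input standing in for the downloaded page
sample = ('<tr><td class="ml_subject"><a href="?tab=inbox">Hello</a></td></tr>'
          '<tr><td class="ml_subject"><a href="?tab=inbox">World</a></td></tr>')
subjects = all_between(sample,
                       '<td class="ml_subject"><a href="?tab=inbox">',
                       '</a></td>')
print(subjects)
```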
string = 'how to print everything after keyword ? for instance print everything between word “Apple” and word “Pen”'
s, e = string.index('Apple') + len('Apple'), string.index('Pen')
# add len('Apple') because we do not want to capture the keyword itself
print(string[s:e])
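Note that index() raises ValueError when a keyword is missing, unlike find(), which returns -1. A small sketch of handling that case follows; the helper name between_or_default and its default argument are assumptions for illustration only.

```python
def between_or_default(text, first, second, default=''):
    """Return the text between first and second, or default if either is missing."""
    try:
        s = text.index(first) + len(first)
        e = text.index(second, s)   # look for the second keyword after the first
    except ValueError:              # raised by index() when a keyword is absent
        return default
    return text[s:e]

print(repr(between_or_default('an Apple and a Pen', 'Apple', 'Pen')))
print(repr(between_or_default('no keywords here', 'Apple', 'Pen')))
```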
Use lxml or some other HTML processing module:
from lxml.html import fragment_fromstring
from lxml.cssselect import CSSSelector
HTML = '<td class="ml_subject"><a href="?tab=inbox">Foobar</a></td>'
tree = fragment_fromstring(HTML)
selector = CSSSelector('td.ml_subject > a[href="?tab=inbox"]')
result = selector(tree)[0].text
Use find to locate the keywords in your string, then use slice notation to extract the text. find returns -1 if the string is not found, so make sure you check for that in your actual implementation.
>>> a = "stuff Apple more stuff Pen blah blah"
>>> delim1 = 'Apple'
>>> delim2 = 'Pen'
>>> i1 = a.find(delim1)
>>> i1
6
>>> i2 = a.find(delim2)
>>> i2
23
>>> a[i1+len(delim1):i2]
' more stuff '
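The -1 check mentioned above could be wrapped up roughly as follows. This is only a sketch: the function name between is hypothetical, and unlike the session above it searches for the second delimiter starting after the first one, which is an assumption about the desired behaviour.

```python
def between(text, delim1, delim2):
    """Return the text between delim1 and delim2, or None if either is missing."""
    i1 = text.find(delim1)
    if i1 == -1:
        return None
    start = i1 + len(delim1)
    i2 = text.find(delim2, start)   # only look after the first delimiter
    if i2 == -1:
        return None
    return text[start:i2]

a = "stuff Apple more stuff Pen blah blah"
print(repr(between(a, 'Apple', 'Pen')))
print(between(a, 'Apple', 'Banana'))
```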
To print all link text you could use BeautifulSoup:
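try:
    from urllib2 import urlopen
except ImportError:  # Python 3.x
    from urllib.request import urlopen
from bs4 import BeautifulSoup  # pip install beautifulsoup4
# url is the address of the page to download
soup = BeautifulSoup(urlopen(url), 'html.parser')
# the search returns Tag objects, so take each link's text before joining
print('\n'.join(a.get_text() for a in soup('a', href="?tab=inbox")))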
If the link must have a td.ml_subject parent, then you could use a function as the search criterion:
def link_inside_td(tag):
    td = tag.parent
    # bs4 returns the class attribute as a list, so test for membership
    return (tag.name == 'a' and tag.get('href') == "?tab=inbox" and
            td.name == 'td' and "ml_subject" in td.get('class', []))
print('\n'.join(a.get_text() for a in soup(link_inside_td)))