I have a simple Python script that uses BeautifulSoup to find a section of the HTML tree. For example, to find everything inside the <div id="doctext"> tags, the script does this:
html_section = str(soup.find("div", id="doctext"))
I would like the arguments to find() to vary, however, according to strings given in an input file. For example, a user could feed the script a URL followed by a string like "div", id="doctext", and the script would adjust the find accordingly. Imagine that the input file looks like this:
http://www.example.com | "div", id="doctext"
The script splits the line to get the URL, which works fine, but I want it to also grab the arguments. For example:
vars = line.split(' | ')
html = urllib2.urlopen(vars[0]).read()
soup = BeautifulSoup(html)
args = vars[1].split()
html_section = str(soup.find(*args))
This doesn't work---and probably doesn't even make sense, as I've been trying multiple approaches. How do I take the string provided by the input file and turn it into the right syntax for the soup.find() function?
You could parse line like this:
line = 'http://www.example.com | div, id=doctext'
url, args = line.split(' | ', 1)
args = args.split(',')
name = args[0]
params = dict([param.strip().split('=') for param in args[1:]])
print(name)
print(params)
yields
div
{'id': 'doctext'}
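The same parsing can be packaged into a small helper (a sketch; the name parse_find_args is mine, and it assumes the simple name, key=value format shown above):

```python
def parse_find_args(argstring):
    """Parse a string like 'div, id=doctext' into a tag name and a
    dict of keyword parameters (a sketch for the simple format above)."""
    parts = argstring.split(',')
    name = parts[0].strip()
    params = dict(param.strip().split('=') for param in parts[1:])
    return name, params

name, params = parse_find_args('div, id=doctext')
print(name)    # div
print(params)  # {'id': 'doctext'}
```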
Then you could call soup.find like this:
html = urllib2.urlopen(url).read()
soup = BeautifulSoup(html)
html_section = str(soup.find(name, **params))
WARNING: Note that if doctext (or some other argument value) contains a comma, then args = args.split(',') will split the arguments in the wrong place. This problem might arise if you are searching for some text content that contains a comma.
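A quick illustration of that failure mode (the text value here is made up):

```python
# Naive comma splitting tears apart a value that itself contains a comma:
args = 'div, text=Hello, world'.split(',')
print(args)  # ['div', ' text=Hello', ' world'] -- "Hello, world" is lost
```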
So let's look for a better solution:
To avoid the problem described above, you might consider using the JSON format for the arguments. If line looks like this:
'http://www.example.com | ["div", {"id": "doctext"}]'
Then you could parse it with
import json
line = 'http://www.example.com | ["div", {"id": "doctext"}]'
url, arguments = line.split('|', 1)
url = url.strip()
arguments = json.loads(arguments)
args = []
params = {}
for item in arguments:
    if isinstance(item, dict):
        params = item
    else:
        args.append(item)
print(args)
print(params)
which yields
[u'div']
{u'id': u'doctext'}
Then you could call soup.find with
html_section = str(soup.find(*args, **params))
An added advantage is that you can supply any number of soup.find's positional arguments (name, attrs, recursive, and text), not just the name.
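As a sketch, the JSON parsing can be wrapped in a helper (the function name parse_line is mine); note that it also sidesteps the comma problem from earlier, since a JSON string value may contain commas:

```python
import json

def parse_line(line):
    """Split a 'url | json-args' line into (url, positional args,
    keyword params). Any dict in the JSON list becomes the kwargs."""
    url, arguments = line.split('|', 1)
    args, params = [], {}
    for item in json.loads(arguments):
        if isinstance(item, dict):
            params = item
        else:
            args.append(item)
    return url.strip(), args, params

# A comma inside a value no longer confuses the parser:
print(parse_line('http://www.example.com | ["div", {"title": "a, b"}]'))
# ('http://www.example.com', ['div'], {'title': 'a, b'})
```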
Assuming the user will feed the script those arguments on the command line, you can get them with sys.argv and then use them in your code:
#foo.py
import sys
for arg in sys.argv:
    print arg
hvn@hvnatvcc: ~/test $ python foo.py http://xyz.com div doctext
foo.py
http://xyz.com
div
doctext
Your code will then look like this:
html = urllib2.urlopen(sys.argv[1]).read()
soup = BeautifulSoup(html)
html_section = str(soup.find(sys.argv[2], id=sys.argv[3]))
What is wrong with your code is this: find() will not treat id in the string id="doctext" as a keyword argument of the function. Python sees 'id="doctext"' as a single string.
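The point can be demonstrated without BeautifulSoup at all; toy_find below is a hypothetical stand-in for soup.find:

```python
def toy_find(name, **kwargs):
    """A hypothetical stand-in for soup.find, to show how keyword
    arguments differ from plain strings."""
    return name, kwargs

# A real keyword argument:
print(toy_find('div', id='doctext'))         # ('div', {'id': 'doctext'})

# The string 'id="doctext"' is just data, not a keyword argument;
# toy_find('div', 'id="doctext"') would raise a TypeError.

# To build keyword arguments at runtime from strings, unpack a dict:
print(toy_find('div', **{'id': 'doctext'}))  # ('div', {'id': 'doctext'})
```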