
In Python BeautifulSoup4, convert a string into arguments for find()

I have a simple Python script that uses BeautifulSoup to find a section of the HTML tree. For example, to find everything inside the <div id="doctext"> tags, the script does this:

html_section = str(soup.find("div", id="doctext"))

However, I would like the arguments to find() to vary according to strings given in an input file. For example, a user could feed the script a URL followed by a string like "div", id="doctext", and the script would adjust the find() call accordingly. Imagine that the input file looks like this:

http://www.example.com | "div", id="doctext"

The script splits the line to get the URL, which works fine, but I want it to also grab the arguments. For example:

vars = line.split(' | ')
html = urllib2.urlopen(vars[0]).read()
soup = BeautifulSoup(html)
args = vars[1].split()
html_section = str(soup.find(*args))
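# At this point args == ['"div",', 'id="doctext"'] -- two plain strings,
# so find() receives 'id="doctext"' as a positional argument rather than
# the keyword argument id="doctext".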

This doesn't work, and probably doesn't even make sense; I've been trying multiple approaches. How do I take the string provided by the input file and turn it into the right syntax for the soup.find() call?

You could parse the line like this:

line = 'http://www.example.com | div, id=doctext'
url, args = line.split(' | ', 1)
args = args.split(',')
name = args[0]
params = dict([param.strip().split('=') for param in args[1:]])
print(name)
print(params)

yields

div
{'id': 'doctext'}
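Note that this line format drops the quotes that appear in the question's example (div, id=doctext rather than "div", id="doctext"). If your input file keeps the quotes, a minimal tweak, assuming the quotes only ever wrap whole tokens, is to strip them:

name = args[0].strip().strip('"')
params = dict([param.strip().split('=') for param in args[1:]])
params = dict((key, value.strip('"')) for key, value in params.items())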

Then you could call soup.find like this:

html = urllib2.urlopen(url).read()
soup = BeautifulSoup(html)
html_section = str(soup.find(name, **params))

WARNING: Note that if doctext (or some other keyword argument) contains a comma, then

args = args.split(',')

will split the parameters in the wrong place. This problem might arise if you are searching for some text content that contains a comma.
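For example, a hypothetical line searching for text that contains a comma breaks the naive split:

args = 'div, text=Hello, world'.split(',')
# args == ['div', ' text=Hello', ' world']
# 'Hello, world' has been cut in two, and ' world' contains no '=',
# so the dict(...) call above raises a ValueError.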


So let's look for a better solution:

To avoid the problem described above, you might consider using the JSON format for the arguments. If the line looks like this:

'http://www.example.com | ["div", {"id": "doctext"}]'

Then you could parse it with

import json
line = 'http://www.example.com | ["div", {"id": "doctext"}]'
url, arguments = line.split('|', 1)
url = url.strip()
arguments = json.loads(arguments)
args = []
params = {}
for item in arguments:
    if isinstance(item, dict):
        params = item
    else:
        args.append(item)

print(args)
print(params)

which yields

[u'div']
{u'id': u'doctext'}

Then you could call soup.find with

html_section = str(soup.find(*args, **params))

An added advantage is that you can supply any number of soup.find's positional arguments (name, attrs, recursive, and text), not just the name.
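Putting it together, here is a minimal end-to-end sketch of the JSON approach; the input file name input.txt is just an assumption for illustration:

import json
import urllib2
from bs4 import BeautifulSoup

# Each line of input.txt (hypothetical name) looks like:
# http://www.example.com | ["div", {"id": "doctext"}]
with open('input.txt') as f:
    for line in f:
        url, arguments = line.split('|', 1)
        url = url.strip()
        arguments = json.loads(arguments)
        args = []
        params = {}
        for item in arguments:
            if isinstance(item, dict):
                params.update(item)  # keyword arguments for find()
            else:
                args.append(item)    # positional arguments for find()
        html = urllib2.urlopen(url).read()
        soup = BeautifulSoup(html)
        print(str(soup.find(*args, **params)))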

Assuming the user will feed the script those arguments on the command line, you can get them with sys.argv and then use them in your code:

# foo.py
import sys

for arg in sys.argv:
    print(arg)

hvn@hvnatvcc: ~/test $ python foo.py http://xyz.com div doctext
foo.py
http://xyz.com
div
doctext

Your code will then look like this:

html = urllib2.urlopen(sys.argv[1]).read()
soup = BeautifulSoup(html)
html_section = str(soup.find(sys.argv[2], id=sys.argv[3]))

What's wrong with your code is that find() will not treat the id in the string id="doctext" as a function keyword argument; Python sees 'id="doctext"' as one whole string.
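To see the difference, compare the two calls below (a quick illustration; soup is any parsed document):

soup.find('div', id='doctext')    # keyword argument: matches <div id="doctext">
soup.find('div', 'id="doctext"')  # one plain string: BeautifulSoup treats a bare
                                  # string in this position as a CSS class to match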
