Is it possible to use 2 different BeautifulSoup soup.select in one for loop?

Is it possible to reduce this code so that I have one for loop instead of two? I want to do this because it is a time-critical crawling loop.

i = 0
data = []
data.append([])
data.append([])

for product in soup.select('div > span.name'):
    data[0].append(product.text)
    i += 1

i = 0

for product in soup.select('div > span.value'):
    data[1].append(product.text)
    i += 1

This is the part of the HTML I want to extract the data from:

<html><body><div id="pagecontent"><div id="container"><div id="content"><div id="tab-description"><div id="attributes">
<div class="attr">
    <span class="name">Ugug</span>
    <span class="value">dfgd454</span>
</div>

You could easily collect the data using list comprehensions:

In [1]: from bs4 import BeautifulSoup

In [2]: html = """<div><span class='name'>Andrew</span><span class='value'>42</span></div>
   ...: <div><span class='name'>Bob</span><span class='value'>128</span></div>"""

In [3]: soup = BeautifulSoup(html, 'html.parser')

In [4]: patterns = ['div > span.name', 'div > span.value']

In [5]: data = [[product.text for product in soup.select(pattern)] for pattern in patterns] 

In [6]: data
Out[6]: [['Andrew', 'Bob'], ['42', '128']]

However, this code still runs a separate for loop for each select pattern. If you want to use a single loop, please provide an example of the document structure.
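
If a single explicit loop is the goal, one option is a comma-separated selector group, which lets one select call match both kinds of span in document order. This is only a minimal sketch on the sample HTML from this answer, assuming each matched span carries either the name class or the value class; the variable names are illustrative.

```python
from bs4 import BeautifulSoup

html = """<div><span class='name'>Andrew</span><span class='value'>42</span></div>
<div><span class='name'>Bob</span><span class='value'>128</span></div>"""

soup = BeautifulSoup(html, "html.parser")

names, values = [], []

# One select call with a selector group; matches are returned in document order.
for span in soup.select("div > span.name, div > span.value"):
    # span["class"] is a list of class names, so test membership.
    if "name" in span.get("class", []):
        names.append(span.text)
    else:
        values.append(span.text)

data = [names, values]
print(data)
```

Whether this is actually faster than two separate select calls depends on the document, so it is worth timing both on real pages.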


For the given document structure, I can suggest another solution:

In [7]: html = '''<html><body><div id="pagecontent"><div id="container"><div id="content"><div id="tab-description"><div id="attributes">
   ...: <div class="attr">
   ...:     <span class="name">Ugug</span>
   ...:     <span class="value">dfgd454</span>
   ...: </div>'''

In [8]: soup = BeautifulSoup(html, 'html.parser')

In [9]: attrs = soup.select('div.attr')

In [10]: attrs
Out[10]: 
[<div class="attr">
 <span class="name">Ugug</span>
 <span class="value">dfgd454</span>
 </div>]

In [11]: def parse_attr(attr):
   ....:     return {
   ....:         'name': attr.find(class_='name').text,
   ....:         'value': attr.find(class_='value').text
   ....:     }
   ....: 

In [12]: list(map(parse_attr, attrs))
Out[12]: [{'name': 'Ugug', 'value': 'dfgd454'}]
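
Note that find returns None when nothing matches, so accessing .text on the result raises an AttributeError if a div.attr block lacks one of the spans. A hedged sketch of a tolerant variant follows; parse_attr_safe is a hypothetical helper name, and the second div in the sample HTML is made up to show the missing-span case.

```python
from bs4 import BeautifulSoup

# The second block is invented sample data with no "value" span.
html = """<div class="attr">
    <span class="name">Ugug</span>
    <span class="value">dfgd454</span>
</div>
<div class="attr">
    <span class="name">NoValue</span>
</div>"""

soup = BeautifulSoup(html, "html.parser")


def parse_attr_safe(attr):
    """Hypothetical helper: like parse_attr, but yields None for a missing span."""
    name = attr.find(class_="name")
    value = attr.find(class_="value")
    return {
        "name": name.text if name is not None else None,
        "value": value.text if value is not None else None,
    }


rows = [parse_attr_safe(attr) for attr in soup.select("div.attr")]
print(rows)
```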

You may also need to handle more attributes than just name and value. In that case, you can rewrite the function parse_attr so that it picks up every span, keyed by its first class:

In [25]: def parse_attr(attr):
   ....:     return {span['class'][0]: span.text for span in attr('span')}
   ....: 

In [26]: list(map(parse_attr, attrs))
Out[26]: [{'name': 'Ugug', 'value': 'dfgd454'}]
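
If the two-list layout from the question is still needed, the list of dicts can be converted back afterwards. The following is a small sketch, assuming every div.attr block contains both a name span and a value span; the second block in the sample HTML is made up for illustration.

```python
from bs4 import BeautifulSoup

# The second block is invented sample data.
html = """<div class="attr">
    <span class="name">Ugug</span>
    <span class="value">dfgd454</span>
</div>
<div class="attr">
    <span class="name">Abc</span>
    <span class="value">123</span>
</div>"""

soup = BeautifulSoup(html, "html.parser")


def parse_attr(attr):
    # Calling a tag is shorthand for find_all; key each span by its first class.
    return {span["class"][0]: span.text for span in attr("span")}


rows = [parse_attr(attr) for attr in soup.select("div.attr")]

# Rebuild the question's layout: data[0] holds names, data[1] holds values.
data = [[row["name"] for row in rows], [row["value"] for row in rows]]
print(data)
```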
