Create an array of keywords

I am trying to create an array of keywords from a CSV column that contains HTML. In some rows the categories div is incomplete (it contains no links).

categories = []

def find_elms(soup, tag, attribute):
    """Find the block using its tag and attribute values"""
    categories_block = soup.find(tag, attribute)
    if categories_block:
        keywords = [elm.text for elm in categories_block.findAll('a')]
        return keywords
        #return [elm.text for elm in categories_block.findAll('a')]
    return []

def build_cats(categories):
    category = find_elms(soup, 'div', {'id': 'categories'})
    '''returns [x,y]'''
    for cat in category:
        categories.append(category)

build_cats(soup)

I have varied my code to achieve a result that looks like:

[category1,...,category1000]

However, my results have been [[category1,..,category25],[category26,...,category50],...[]] or a series of errors that lead down rabbit holes into darkness.

The source data resembles:

"<div id="categories">
<h3>Categories</h3>
<ul>
<li><a href="">CategoryA</a></li><li><a href="">CategoryB</a></li>
</ul></div>
","<div id="col1"><h3>File</h3></div>, <div id="col1">
<a href="">A.jpg</a>
<br/></div>
, <div id="col1">
<a href="">B.jpg</a>
<br/></div>
, <div id="col1">
<a href="">C.jpg</a>
<br/></div>
"
"<div id="categories">
<h3>Categories</h3>
</div>
","<div id="col1"><h3>File</h3></div>, <div id="col1">
<a href="">D.jpg</a>
<br/></div>
, <div id="col1">
<a href="">E.jpg</a>
<br/></div>
, <div id="col1">
<a href="">F.jpg</a>
<br/></div>
"
"<div id="categories">
<h3>Categories</h3>
<ul>
<li><a href="">CategoryC</a></li><li><a href="">CategoryD</a></li>
</ul></div>
","<div id="col1"><h3>File</h3></div>, <div id="col1">
<a href="">G.jpg</a>
<br/></div>
, <div id="col1">
<a href="">H.jpg</a>
<br/></div>
, <div id="col1">
<a href="">I.jpg</a>
<br/></div>
"
"<div id="categories">
<h3>Categories</h3>
<ul>
<li><a href="">CategoryA</a></li><li><a href="">CategoryE</a></li>
</ul></div>
","<div id="col1"><h3>File</h3></div>, <div id="col1">
<a href="">J.jpg</a>
<br/></div>
, <div id="col1">
<a href="">K.jpg</a>
<br/></div>
, <div id="col1">
<a href="">L.jpg</a>
<br/></div>
"

Any modifications or suggestions would be helpful. Thank you.

I pasted your source data into a text file and saved it as input.csv. I then ran the following code and was able to build a list of all the categories in the sample source data:

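from bs4 import BeautifulSoup

Categories = []

path = 'input.csv'
# Parse the whole file as HTML; the CSV quoting is simply treated as text.
with open(path) as html:
    bs = BeautifulSoup(html, 'html.parser')
divs = bs.find_all('div', attrs={'id': 'categories'})

for d in divs:
    # Each <a> inside a categories div holds one category label.
    for c in d.find_all('a'):
        cat_label = c.text
        if cat_label not in Categories:  # keep only the first occurrence
            Categories.append(cat_label)

print(Categories)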

The above code generates the following list of all categories that were in the source data:

['CategoryA', 'CategoryB', 'CategoryC', 'CategoryD', 'CategoryE']

Each category appears once in the list, even if it appeared several times in the source data (e.g. CategoryA).
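
As for why your own attempt produced a list of lists: that is most likely because append() adds the whole per-row list as a single element, whereas extend() adds its items one by one. Here is a minimal sketch of the difference. The per_row_keywords data is a hypothetical stand-in for what find_elms() would return for each row, not your real data, and dict.fromkeys is just one possible way to drop duplicates while keeping the order:

```python
# Hypothetical stand-in for what find_elms() returns for each CSV row;
# the second row has an empty categories div, so its list is empty.
per_row_keywords = [
    ['CategoryA', 'CategoryB'],
    [],
    ['CategoryC', 'CategoryD'],
    ['CategoryA', 'CategoryE'],
]

nested = []
flat = []
for keywords in per_row_keywords:
    nested.append(keywords)  # adds the whole list as one element
    flat.extend(keywords)    # adds each keyword individually

# dict.fromkeys keeps the first occurrence of each label, in order.
unique = list(dict.fromkeys(flat))

print(nested)
print(flat)
print(unique)
```

So in build_cats, calling extend() with the list returned by find_elms() for each row should give you a flat list. Rows without any links just contribute nothing.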
