Create an array of keywords

Question

I am trying to create an array of keywords from a CSV column that contains HTML. The CSV contains incomplete data in the categories div.

categories = []

def find_elms(soup, tag, attribute):
    """Find the block using its tag and attribute values"""
    categories_block = soup.find(tag, attribute)
    if categories_block:
        keywords = [elm.text for elm in categories_block.findAll('a')]
        return keywords
        #return [elm.text for elm in categories_block.findAll('a')]
    return []

def build_cats(categories):
    category = find_elms(soup, 'div', {'id': 'categories'})
    '''returns [x,y]'''
    for cat in category:
        categories.append(category)

build_cats(soup)

I have varied my code to achieve a result that looks like:

[category1,...,category1000]

However, my results have been [[category1,..,category25],[category26,...,category50],...[]] or a series of errors that lead down rabbit holes into darkness.

The source data resembles:

"<div id="categories">
<h3>Categories</h3>
<ul>
<li><a href="">CategoryA</a></li><li><a href="">CategoryB</a></li>
</ul></div>
","<div id="col1"><h3>File</h3></div>, <div id="col1">
<a href="">A.jpg</a>
<br/></div>
, <div id="col1">
<a href="">B.jpg</a>
<br/></div>
, <div id="col1">
<a href="">C.jpg</a>
<br/></div>
"
"<div id="categories">
<h3>Categories</h3>
</div>
","<div id="col1"><h3>File</h3></div>, <div id="col1">
<a href="">D.jpg</a>
<br/></div>
, <div id="col1">
<a href="">E.jpg</a>
<br/></div>
, <div id="col1">
<a href="">F.jpg</a>
<br/></div>
"
"<div id="categories">
<h3>Categories</h3>
<ul>
<li><a href="">CategoryC</a></li><li><a href="">CategoryD</a></li>
</ul></div>
","<div id="col1"><h3>File</h3></div>, <div id="col1">
<a href="">G.jpg</a>
<br/></div>
, <div id="col1">
<a href="">H.jpg</a>
<br/></div>
, <div id="col1">
<a href="">I.jpg</a>
<br/></div>
"
"<div id="categories">
<h3>Categories</h3>
<ul>
<li><a href="">CategoryA</a></li><li><a href="">CategoryE</a></li>
</ul></div>
","<div id="col1"><h3>File</h3></div>, <div id="col1">
<a href="">J.jpg</a>
<br/></div>
, <div id="col1">
<a href="">K.jpg</a>
<br/></div>
, <div id="col1">
<a href="">L.jpg</a>
<br/></div>
"

Any modifications or suggestions would be helpful. Thank you.

Answer 1

I pasted your source data into a text file and saved it as input.csv. I then ran the following code and was able to create a list of all the categories in the sample source data:

from bs4 import BeautifulSoup

Categories = []

path = 'input.csv'
# Parse the whole file as one HTML document; the with block closes the file afterwards
with open(path) as html:
    bs = BeautifulSoup(html, 'html.parser')
divs = bs.find_all('div', attrs = {'id': 'categories'})

for d in divs:
    cats = d.find_all('a')
    for c in cats:
        cat_label = c.text
        if cat_label not in Categories:
            Categories.append(cat_label)

print(Categories)

The above code generates the following list of all categories that were in the source data:

['CategoryA', 'CategoryB', 'CategoryC', 'CategoryD', 'CategoryE']

Each category appears once in the list, even if it appeared multiple times in the source data (e.g., CategoryA).
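As for the nested result described in the question: the loop there calls categories.append(category), which adds the whole per-row list as a single element each time. Below is a minimal sketch of the difference between append and extend, assuming per-row keyword lists shaped like what find_elms returns. The names rows, nested, flat and unique are made up for illustration, and the sample values are taken from the source data above.

```python
# Hypothetical per-row results, shaped like what find_elms() returns for each row.
rows = [
    ["CategoryA", "CategoryB"],
    [],                          # a row whose categories div has no links
    ["CategoryC", "CategoryD"],
    ["CategoryA", "CategoryE"],
]

nested = []
flat = []
for row_keywords in rows:
    nested.append(row_keywords)  # adds the whole list as one element
    flat.extend(row_keywords)    # adds each keyword individually

# dict.fromkeys keeps the first occurrence of each key in insertion order
# (guaranteed for dicts since Python 3.7), so this removes duplicates
# without reordering the keywords.
unique = list(dict.fromkeys(flat))

print(nested)
print(flat)
print(unique)
```

Here nested has the list-of-lists shape from the question, flat is the single flat list that was wanted, and unique drops the repeated CategoryA in the same way as the "not in" check above.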

solution1
0 ACCEPTED 2019-02-06 02:39:13
