
How to extract values of interest between two characters?

I'm working on web scraping a web page with the following HTML code:

Predecessors &middot; <i class="fa fa-sign-in"></i> / Successors &middot; <i class="fa fa-sign-out"></i>
</dt>

<dd>
    1931 &middot;
    <a class="active" href="../../../aus/party/1253">
                ALP </a> &middot;
    <i class="fa fa-sign-in"> </i> splinter

</dd>

<dd>
    1931 &middot;
    <a class="active" href="../../../aus/party/1905">
                NAT </a> &middot;
    <i class="fa fa-sign-in"> </i> successor

</dd>

The code I used to get the above output is as follows:

import urllib.request

url_pc = 'http://www.parlgov.org/explore/aus/party/1912/'
fp = urllib.request.urlopen(url_pc)
mybytes = fp.read()

mystr = mybytes.decode("utf8")
fp.close()
#print(mystr)

str1 = mystr[mystr.find('Predecessors'):]

str2 = str1.split("</div>", 1)[0]

str3 = str2.split("<dt> Party (name) changes</dt>", 1)[0]

print(str3)

I want to extract everything between <dd> and </dd> in each group, turn it into a string, and add it to a row of data. Is there a loop or other code that will extract all the strings between <dd> and </dd> in each of the two groups?

You can use BeautifulSoup to find all <dd> elements and get the contents of each one as a list, then join the list items into a single string. Some items are Tag objects rather than plain strings, so they need to be converted with str() first. You can also use strip() to remove surrounding whitespace, but the result may still need some cleaning.

text = '''Predecessors &middot; <i class="fa fa-sign-in"></i>
            / Successors &middot; <i class="fa fa-sign-out"></i>
          </dt>

            <dd>
              1931 &middot;
              <a class="active"
                 href="../../../aus/party/1253">
                ALP </a>

              &middot;
               <i class="fa fa-sign-in"> </i> 

               splinter 



            </dd>

            <dd>
              1931 &middot;
              <a class="active"
                 href="../../../aus/party/1905">
                NAT </a>

              &middot;
               <i class="fa fa-sign-in"> </i> 

               successor 



            </dd>'''

from bs4 import BeautifulSoup as BS

soup = BS(text, 'html.parser')

for item in soup.find_all('dd'):
    print(''.join(str(x).strip() for x in item.contents))

Result

1931 ·<a class="active" href="../../../aus/party/1253">
                ALP </a>·<i class="fa fa-sign-in"> </i>splinter
1931 ·<a class="active" href="../../../aus/party/1905">
                NAT </a>·<i class="fa fa-sign-in"> </i>successor
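If only the visible text is needed and the tags can be dropped, most of the remaining cleaning can be left to get_text(). The following is a minimal sketch, not part of the original answer; the names sample_html and dd_texts are made up for illustration, and the sample is a shortened copy of the HTML above.

```python
from bs4 import BeautifulSoup

# Shortened copy of the page fragment (hypothetical variable name).
sample_html = '''<dd>
    1931 &middot;
    <a class="active" href="../../../aus/party/1253">
                ALP </a> &middot;
    <i class="fa fa-sign-in"> </i> splinter
</dd>
<dd>
    1931 &middot;
    <a class="active" href="../../../aus/party/1905">
                NAT </a> &middot;
    <i class="fa fa-sign-in"> </i> successor
</dd>'''


def dd_texts(html):
    """Return the visible text of every <dd>, one string per element."""
    soup = BeautifulSoup(html, 'html.parser')
    # get_text(' ', strip=True) strips each text fragment, skips the empty
    # ones, and joins the rest with a single space.
    return [dd.get_text(' ', strip=True) for dd in soup.find_all('dd')]


rows = dd_texts(sample_html)
for row in rows:
    print(row)
# 1931 · ALP · splinter
# 1931 · NAT · successor
```

This loses the href, so it only fits when the link target is not needed.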

EDIT:

from bs4 import BeautifulSoup as BS

soup = BS(text, 'html.parser')

all_rows = []

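for item in soup.find_all('dd'):
    link = item.find('a')
    row = (
        item.contents[0].strip()[:-2],  # year: first text node without the trailing " ·"
        link.get_text().strip(),        # party abbreviation
        item.contents[4].strip(),       # relation: text node after the <i> icon
        link.get('href')[-4:],          # party id: last four characters of the href
    )
    row = ', '.join(row)
    print(row)
    all_rows.append(row)

# Use a new name so the HTML stored in `text` is not overwritten
joined = ' | '.join(all_rows)
print(joined)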

Result:

1931, ALP, splinter, 1253
1931, NAT, successor, 1905
1931, ALP, splinter, 1253 | 1931, NAT, successor, 1905
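The positional indexing above (contents[0], contents[4], href[-4:]) depends on the exact layout of each <dd> and on the id having four digits. As a hedged alternative, the sketch below picks the fields by role instead of by position and writes them out as CSV rows. The helper names parse_dd and parse_rows, the field names, and the shortened sample_html are assumptions made for illustration, not part of the original answer.

```python
import csv
import io
import re

from bs4 import BeautifulSoup

# Shortened copy of the page fragment (hypothetical variable name).
sample_html = '''<dd>
    1931 &middot;
    <a class="active" href="../../../aus/party/1253">
                ALP </a> &middot;
    <i class="fa fa-sign-in"> </i> splinter
</dd>
<dd>
    1931 &middot;
    <a class="active" href="../../../aus/party/1905">
                NAT </a> &middot;
    <i class="fa fa-sign-in"> </i> successor
</dd>'''


def parse_dd(dd):
    """Turn one <dd> element into a dict of fields."""
    link = dd.find('a')
    strings = list(dd.stripped_strings)        # non-empty text fragments, stripped
    year = re.search(r'\d{4}', strings[0])     # first four-digit number in the first fragment
    return {
        'year': year.group(0) if year else '',
        'party': link.get_text(strip=True),
        'relation': strings[-1],               # last text fragment, e.g. "splinter"
        'party_id': link['href'].rstrip('/').split('/')[-1],
    }


def parse_rows(html):
    soup = BeautifulSoup(html, 'html.parser')
    return [parse_dd(dd) for dd in soup.find_all('dd')]


records = parse_rows(sample_html)

# Write the rows as CSV into an in-memory buffer; swap in open('out.csv', 'w',
# newline='') to write a file instead.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=['year', 'party', 'relation', 'party_id'])
writer.writeheader()
writer.writerows(records)
print(buffer.getvalue())
```

Taking the id from the last path segment keeps working if an id has more or fewer than four digits. The sketch still assumes every <dd> contains an <a> element.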

