Extracting the URL within href on an HTML site

Question

I have already extracted the following from a web page:

 <a class="Directory-listLink" data-ya-track="todirectory" href="united-states/in">Indiana</a>,
 <a class="Directory-listLink" data-ya-track="todirectory" href="united-states/ia">Iowa</a>,
 <a class="Directory-listLink" data-ya-track="todirectory" href="united-states/ks">Kansas</a>,
 <a class="Directory-listLink" data-ya-track="todirectory" href="united-states/ky">Kentucky</a>,

I only want the href="united-states/il" part of each extracted link. Currently I am trying something like this:

for state in soup_state.find('a',href=True):
    print(state['href'])

I continually receive the error:

AttributeError: ResultSet object has no attribute 'find'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?

I want to run this in a for loop so I can extract each state's URL, but so far I have been unable to.

Answer 1

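I'm not sure how you got soup_state, but the error says it is a ResultSet, which is the list-like object that find_all() returns. It has no find() method of its own, so loop over it directly:

for state in soup_state:
    print(state['href'])

and see if that solves the problem.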
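For context, here is a minimal end-to-end sketch of that loop. It assumes soup_state came from a find_all() call on the parsed page; the html string, the exact find_all() arguments, and the hrefs list are illustrative assumptions, not something taken from the question.

```python
from bs4 import BeautifulSoup

# Sample markup copied from the question; the real page would be fetched elsewhere.
html = """
<a class="Directory-listLink" data-ya-track="todirectory" href="united-states/in">Indiana</a>
<a class="Directory-listLink" data-ya-track="todirectory" href="united-states/ia">Iowa</a>
<a class="Directory-listLink" data-ya-track="todirectory" href="united-states/ks">Kansas</a>
<a class="Directory-listLink" data-ya-track="todirectory" href="united-states/ky">Kentucky</a>
"""

soup = BeautifulSoup(html, "html.parser")

# Assumption: soup_state was built with a find_all() call along these lines.
# href=True keeps only <a> tags that actually carry an href attribute.
soup_state = soup.find_all("a", class_="Directory-listLink", href=True)

# A ResultSet behaves like a list of Tag objects, so iterate over it directly
# and read each tag's href attribute by key.
hrefs = [state["href"] for state in soup_state]

for href in hrefs:
    print(href)
```

This prints only the attribute values (united-states/in and so on), without the surrounding href="..." text.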

Answer 2

You can use a regular expression to find the href attributes in each line:

import re

lines = ['<a class="Directory-listLink" data-ya-track="todirectory" href="united-states/in">Indiana</a>',
         '<a class="Directory-listLink" data-ya-track="todirectory" href="united-states/ia">Iowa</a>',
         '<a class="Directory-listLink" data-ya-track="todirectory" href="united-states/ks">Kansas</a>',
         '<a class="Directory-listLink" data-ya-track="todirectory" href="united-states/ky">Kentucky</a>']

for line in lines:
    # Match href="..." up to the closing double quote
    print(re.search(r'href="[^"]*"', line).group())

This will give the output:

href="united-states/in"
href="united-states/ia"
href="united-states/ks"
href="united-states/ky"
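If only the URL itself is wanted, without the href="..." wrapper, a capture group is one way to get it. The sketch below is a variation on the snippet above; the names href_pattern and urls are illustrative, and it assumes every href is wrapped in double quotes as in the sample.

```python
import re

lines = ['<a class="Directory-listLink" data-ya-track="todirectory" href="united-states/in">Indiana</a>',
         '<a class="Directory-listLink" data-ya-track="todirectory" href="united-states/ia">Iowa</a>',
         '<a class="Directory-listLink" data-ya-track="todirectory" href="united-states/ks">Kansas</a>',
         '<a class="Directory-listLink" data-ya-track="todirectory" href="united-states/ky">Kentucky</a>']

# Group 1 captures only the text between the double quotes.
href_pattern = re.compile(r'href="([^"]*)"')

urls = []
for line in lines:
    match = href_pattern.search(line)
    if match:  # skip any line that has no href attribute
        urls.append(match.group(1))

print(urls)
```

Checking the match before calling group() avoids an AttributeError on lines without an href. For arbitrary HTML, an HTML parser is generally more robust than a regular expression.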

2 answers

solution1
2 ACCEPTED 2020-08-13 20:37:18

solution2
1 2020-08-13 20:41:46
