How to substitute part of the soup.find_all() results in Python 3

Question

I want to change the output of my soup.find_all call. The original source contains this:

<a href="/book/nfo/?id=4756888" class="ajax nfo"></a>

My soup.find_all call:

href = [b.get('href') for b in soup.find_all('a', href=re.compile(r'.*\?id\=\d{4,8}'))]

gives me this:

/book/nfo/?id=4756888

but I want this:

http://127.0.0.1/book/download/?id=4756888

Answer 1

You can use ordinary Python string operations to prepend the host and replace part of the path:

a = '/book/nfo/?id=4756888'
b = 'http://127.0.0.1' + a.replace('nfo', 'download')
print(b)

which gives:

http://127.0.0.1/book/download/?id=4756888

There's no need to use regex here.
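To apply the string approach above to every href returned by soup.find_all, a small helper can be mapped over the list. This is a minimal sketch rather than the answerer's code: the names to_download_url and BASE_URL are hypothetical, the hrefs list stands in for the find_all result, and urllib.parse.urljoin is swapped in for plain concatenation as one possible way to attach the host.

```python
from urllib.parse import urljoin

BASE_URL = 'http://127.0.0.1'  # host taken from the question


def to_download_url(href, base=BASE_URL):
    """Turn a relative '/book/nfo/...' href into an absolute download URL."""
    # Replace only the '/nfo/' path segment, and only its first occurrence,
    # so an 'nfo' substring elsewhere in the href is left alone.
    path = href.replace('/nfo/', '/download/', 1)
    # urljoin attaches the host to the root-relative path.
    return urljoin(base, path)


# Stand-in for: [b.get('href') for b in soup.find_all('a', href=...)]
hrefs = ['/book/nfo/?id=4756888']
download_urls = [to_download_url(h) for h in hrefs]
print(download_urls)  # ['http://127.0.0.1/book/download/?id=4756888']
```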

Answer 2

You can prepend http://127.0.0.1 to each href and then replace 'nfo' with 'download' using Python's re.sub() function:

re.sub(r'pattern_to_match', r'replacement_string', string)

You can implement it as follows:

from bs4 import BeautifulSoup
import re

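# Name the parser explicitly; otherwise bs4 picks one and emits a warning
soup = BeautifulSoup("""<a href="/book/nfo/?id=4756888" class="ajax nfo"></a>""", 'html.parser')
# Prepend the host to every href whose query string carries a 4 to 8 digit id
c = ['http://127.0.0.1' + b.get('href') for b in soup.find_all('a', href=re.compile(r'.*\?id\=\d{4,8}'))]
print([re.sub(r'nfo', r'download', q) for q in c])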

Output:

['http://127.0.0.1/book/download/?id=4756888']
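One caveat with the bare pattern r'nfo' is that it matches anywhere in the string, so a path that happened to contain "info" would be rewritten too. The sketch below shows a more tightly scoped pattern, assuming the segment to replace always appears as '/nfo/'. The helper name rewrite_href and the second href are hypothetical and only there to show the difference.

```python
import re

BASE = 'http://127.0.0.1'  # host taken from the question


def rewrite_href(href):
    # Match the whole '/nfo/' path segment and replace only its first occurrence.
    return BASE + re.sub(r'/nfo/', '/download/', href, count=1)


hrefs = [
    '/book/nfo/?id=4756888',    # href from the question
    '/book/info/nfo/?id=1234',  # hypothetical href containing 'info'
]

urls = [rewrite_href(h) for h in hrefs]
print(urls)

# For comparison, the unanchored pattern also rewrites the 'nfo' inside 'info'.
naive = re.sub(r'nfo', 'download', hrefs[1])
print(naive)  # /book/idownload/download/?id=1234
```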

Answer 3

You could compile a regular expression and apply it in a list comprehension as follows:

from bs4 import BeautifulSoup
import re

soup = BeautifulSoup('<a href="/book/nfo/?id=4756888" class="ajax nfo"></a>', 'html.parser')
re_s = re.compile(r'(.*?\/)nfo(\/.*?)').sub
hrefs = [re_s('http://127.0.0.1' + r'\1download\2', a.get('href')) for a in soup.find_all('a', href=re.compile(r'.*\?id\=\d{4,8}'))]
print(hrefs)

Giving you:

['http://127.0.0.1/book/download/?id=4756888']
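To see why the replacement template uses \1 and \2, it can help to inspect what the two groups capture on the sample href. This is only an illustrative sketch of the pattern from this answer (written without the optional backslashes before the slashes); the variable names are made up for the example.

```python
import re

pattern = re.compile(r'(.*?/)nfo(/.*?)')
href = '/book/nfo/?id=4756888'

m = pattern.search(href)
matched = m.group(0)  # '/book/nfo/' : the only part of the href that gets replaced
prefix = m.group(1)   # '/book/'     : everything up to and including the slash before 'nfo'
suffix = m.group(2)   # '/'          : the lazy '.*?' at the end of the pattern matches nothing more

# The query string '?id=4756888' lies outside the match, so sub() leaves it untouched.
rewritten = pattern.sub(r'http://127.0.0.1\1download\2', href)
print(rewritten)  # http://127.0.0.1/book/download/?id=4756888
```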

3 answers

solution1
1 2016-11-28 04:25:41

solution2
0 2016-11-28 03:51:22

solution3
0 2016-11-28 09:03:39
