
Regex to find string in list in Python 3

How do I get base.php?id=5314 from the list?

import urllib.parse
import urllib.request
from bs4 import BeautifulSoup

url = 'http://www.fansubs.ru/search.php'
# Form fields only; urllib adds the
# 'Content-Type: application/x-www-form-urlencoded' header itself for POST data.
values = {'query': 'Boku dake ga Inai Machi'}
d = {}
data = urllib.parse.urlencode(values).encode('ascii')
req = urllib.request.Request(url, data)
with urllib.request.urlopen(req) as response:
    the_page = response.read()
soup = BeautifulSoup(the_page, 'html.parser')
# Map each <a> tag to its href (None when the tag has no href attribute)
for link in soup.find_all('a'):
    d[link] = link.get('href')
x = list(d.values())

You can use the built-in function filter in combination with a regex. Example:

import re

# ... your code here ...

x = [href for href in d.values() if href]  # skip <a> tags that have no href
test = re.compile(r"base\.php\?id=", re.IGNORECASE)  # raw string keeps the backslashes literal
results = filter(test.search, x)

Update based on comment: in Python 3, filter returns a lazy iterator, so convert the results into a list to see them:

print(list(results))

Example results with the following hard-coded list:

x = ["asd/asd/asd.py", "asd/asd/base.php?id=5314",
     "something/else/here/base.php?id=666"]

You get:

['asd/asd/base.php?id=5314', 'something/else/here/base.php?id=666']

This answer is based on this page, which talks about filtering lists. It has a few more implementations of the same thing that might suit you better. Hope it helps.
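One of those alternative implementations is a list comprehension. The following is only a sketch on the hard-coded list from above, with a None entry added as an assumption to stand in for an a tag without an href (link.get('href') returns None in that case, and passing None to a compiled pattern's search raises TypeError). The names hrefs, matches and first are illustrative only.

```python
import re

# Hypothetical sample data: the hard-coded list from above, plus a None
# entry standing in for an <a> tag that has no href attribute.
hrefs = ["asd/asd/asd.py", "asd/asd/base.php?id=5314",
         "something/else/here/base.php?id=666", None]

pattern = re.compile(r"base\.php\?id=", re.IGNORECASE)

# List comprehension: keep every href that matches, skipping None/empty values.
matches = [h for h in hrefs if h and pattern.search(h)]

# next() with a default returns the first match, or None if nothing matches.
first = next((h for h in hrefs if h and pattern.search(h)), None)

print(matches)  # ['asd/asd/base.php?id=5314', 'something/else/here/base.php?id=666']
print(first)    # asd/asd/base.php?id=5314
```

The comprehension builds the list eagerly, so there is no need to wrap the result in list() as with filter.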

You can pass a regex directly to find_all, which will do the filtering for you based on the href, using href=re.compile(...):

import re

with urllib.request.urlopen(req) as response:
    the_page = response.read()
    soup = BeautifulSoup(the_page, 'html.parser')
    d = {link: link["href"] for link in soup.find_all('a', href=re.compile(re.escape('base.php?id=')))}

find_all will only return the a tags whose href attribute matches the regex.

This gives you:

In [21]: d = {link: link["href"] for link in soup.find_all('a', href=re.compile(re.escape('base.php?id=')))}

In [22]: d
Out[22]: {<a href="base.php?id=5314">Boku dake ga Inai Machi <small>(ТВ)</small></a>: 'base.php?id=5314'}
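To make the filtering explicit, here is a rough standard-library sketch of the same idea: walk the a tags and keep only those whose href matches the regex. It is not how BeautifulSoup is implemented. It swaps in html.parser.HTMLParser so it runs without bs4, and the class name HrefCollector and the sample_html string are made up for illustration.

```python
import re
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collect href values of <a> tags whose href matches a regex (illustrative helper)."""

    def __init__(self, pattern):
        super().__init__()
        self.pattern = pattern
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        # attrs is a list of (name, value) pairs; tags without href yield None here.
        href = dict(attrs).get('href')
        if href is not None and self.pattern.search(href):
            self.hrefs.append(href)


# Hypothetical markup, loosely modelled on the search result shown above.
sample_html = '''
<a href="index.php">Home</a>
<a name="top">no href here</a>
<a href="base.php?id=5314">Boku dake ga Inai Machi <small>(ТВ)</small></a>
'''

collector = HrefCollector(re.compile(re.escape('base.php?id=')))
collector.feed(sample_html)
print(collector.hrefs)  # ['base.php?id=5314']
```

Anchors with a non-matching href, or with no href at all, are dropped, which mirrors what the href= filter does in find_all.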

Considering you only seem to be looking for one link, it would make more sense just to use find:

In [36]: link = soup.find('a', href=re.compile(re.escape('base.php?id=')))

In [37]: link
Out[37]: <a href="base.php?id=5314">Boku dake ga Inai Machi <small>(ТВ)</small></a>

In [38]: link["href"]
Out[38]: 'base.php?id=5314'
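The href you get back is relative. If you want to request that page next, or only need the numeric id, something along these lines should work. It is a sketch that assumes the link is relative to the search URL from the question; absolute and show_id are illustrative names.

```python
from urllib.parse import urljoin, urlparse, parse_qs

search_url = 'http://www.fansubs.ru/search.php'  # URL from the question
href = 'base.php?id=5314'                        # value of link["href"] above

# Resolve the relative href against the page it was found on.
absolute = urljoin(search_url, href)

# Pull the id out of the query string; parse_qs maps each key to a list of values.
show_id = parse_qs(urlparse(absolute).query)['id'][0]

print(absolute)  # http://www.fansubs.ru/base.php?id=5314
print(show_id)   # 5314
```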


 