I'm trying to do some web scraping with Python and I'm having some trouble. I have a large block of scraped text and I'm trying to build a list containing every substring that appears between two specific strings.
Many lines contain something in the format "href= /profile/pc/WORD/matches", and I want to create a list of all the WORDs (every word between an instance of " /profile/pc/" and "/matches").
I tried starting with something like this, but I'm not getting any output. Any help on where to go from here?
import re
from urllib.request import Request, urlopen
url = "http://examplewebsite.com"
req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
web_byte = urlopen(req).read()
webpage = web_byte.decode('utf-8')
q = webpage.replace('"', '_')  # Replace quotation marks with underscores
print(re.split(r'href=_/profile/pc/', q))
PS: Previously I tried something like this, but I was only getting the first result.
substring1 = '<a href=_/profile/pc/' #Starting string before name
substring2 = '/matches_>' #Ending string after name
my_string = q[(q.index(substring1)+len(substring1)):q.index(substring2)]
The webpage string contains many lines, and you need to go through all of them and try to match on each one.
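import re
from urllib.request import Request, urlopen
url = "http://examplewebsite.com"
req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
web_byte = urlopen(req).read()
webpage = web_byte.decode('utf-8')
q = webpage.replace('"', '_')  # Replace quotation marks with underscores
# Iterating over a string directly yields single characters, so split it into lines first
for row in q.splitlines():
    print(re.split(r'href=_/profile/pc/', row))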
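Note that re.split only cuts the text at each occurrence of the prefix, so the pieces still carry "/matches" and whatever follows it. To collect just the WORD values, one option is re.findall with a capture group, which returns every match rather than only the first. This is a minimal sketch, assuming each WORD contains no slash or quote character; the sample text and the helper name extract_profile_names are made up for illustration and are not from the original page.

```python
import re

# Hypothetical sample standing in for the decoded page text
sample = '''<a href="/profile/pc/Alice/matches">Alice</a>
<a href="/profile/pc/Bob_42/matches">Bob</a>
<a href="/about">About</a>'''

# Capture whatever sits between the two markers;
# [^/"]+ stops at the next slash or quote (an assumption about WORD)
PATTERN = re.compile(r'/profile/pc/([^/"]+)/matches')

def extract_profile_names(text):
    """Return every WORD found between '/profile/pc/' and '/matches'."""
    return PATTERN.findall(text)

print(extract_profile_names(sample))  # ['Alice', 'Bob_42']
```

Because the pattern does not depend on the quote characters, the decoded webpage string could probably be passed in directly, without the quote-to-underscore replacement.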