I have parsed an HTML document containing JavaScript with BeautifulSoup, and have managed to isolate the JavaScript within it and convert it into a string. The JavaScript looks like this:
<script>
[irrelevant javascript code here]
sources:[{file:"http://url.com/folder1/v.html",label:"label1"},
{file:"http://url.com/folder2/v.html",label:"label2"},
{file:"http://url.com/folder3/v.html",label:"label3"}],
[irrelevant javascript code here]
</script>
I am trying to get a list containing only the URLs in this sources array, which would look like this:
urls = ['http://url.com/folder1/v.html',
'http://url.com/folder2/v.html',
'http://url.com/folder3/v.html']
The domains are unknown IPs, the folder names are of random length and consist of lowercase letters and numbers, and there are 1-5 of them in each file (usually 3). The only constants are that the URLs start with http and end with .html.
I decided to use regular expressions to deal with this problem (I am quite new to them), and my code looks like this:
urls = re.findall(r'http://[^t][^\s"]+', document)
The [^t] is there because there are other URLs in the document whose domain names start with t. My problem is that there is another URL, for a jpg on the same domain as the URLs I am extracting, which gets put into the urls list along with the others.
Example:
urls = ['http://123.45.67.89/asodibfo3ribawoifbadsoifasdf3/v.html',
'http://123.45.67.89/alwefaoewifiasdof224a/v.html',
'http://123.45.67.89/baoisdbfai235oubodsfb45/v.html',
'http://123.45.67.89/i/0123/12345/aoief243oinsdf.jpg']
How would I go about only fetching the html urls?
You can use r'"(http.*?)"' to get the URLs within your text:
>>> s="""<script>
... [irrelevant javascript code here]
... sources:[{file:"http://url.com/folder1/v.html",label:"label1"},
... {file:"http://url.com/folder2/v.html",label:"label2"},
... {file:"http://url.com/folder3/v.html",label:"label3"}],
... [irrelevant javascript code here]
... </script>"""
>>> re.findall(r'"(http.*?)"',s,re.MULTILINE|re.DOTALL)
['http://url.com/folder1/v.html', 'http://url.com/folder2/v.html', 'http://url.com/folder3/v.html']
And to keep only the .html entries in a list of URLs, you can use str.endswith:
>>> urls = ['http://123.45.67.89/asodibfo3ribawoifbadsoifasdf3/v.html',
... 'http://123.45.67.89/alwefaoewifiasdof224a/v.html',
... 'http://123.45.67.89/baoisdbfai235oubodsfb45/v.html',
... 'http://123.45.67.89/i/0123/12345/aoief243oinsdf.jpg']
>>>
>>> [i for i in urls if i.endswith('.html')]
['http://123.45.67.89/asodibfo3ribawoifbadsoifasdf3/v.html',
'http://123.45.67.89/alwefaoewifiasdof224a/v.html',
'http://123.45.67.89/baoisdbfai235oubodsfb45/v.html']
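The two steps above can probably be folded into a single pattern. This is only a minimal sketch, assuming every wanted URL sits inside double quotes and ends in .html; the sample string s and the name html_urls are made up for illustration:

```python
import re

# Made-up sample: three .html sources plus one .jpg that should be skipped.
s = '''sources:[{file:"http://123.45.67.89/abc123/v.html",label:"label1"},
{file:"http://123.45.67.89/def456/v.html",label:"label2"},
{file:"http://123.45.67.89/ghi789/v.html",label:"label3"}],
image:"http://123.45.67.89/i/0123/12345/pic.jpg",'''

# [^"]* cannot cross a closing quote, so each match stays inside one quoted
# string, and \.html must sit right before that closing quote.
html_urls = re.findall(r'"(http[^"]*\.html)"', s)
print(html_urls)
```

Using [^"]* rather than .*? means re.DOTALL is not needed, and a quoted string that does not end in .html simply fails to match.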
As another general and flexible way to handle such tasks, you can use the fnmatch module:
>>> from fnmatch import fnmatch
>>> [i for i in urls if fnmatch(i,'*.html')]
['http://123.45.67.89/asodibfo3ribawoifbadsoifasdf3/v.html',
'http://123.45.67.89/alwefaoewifiasdof224a/v.html',
'http://123.45.67.89/baoisdbfai235oubodsfb45/v.html']
If the format is always the same, with {file:"url", look for the substring between the quotes following {file: like this:
import re

s="""<script>
[irrelevant javascript code here]
sources:[{file:"http://url.com/folder1/v.html",label:"label1"},
{file:"http://url.com/folder2/v.html",label:"label2"},
{file:"http://url.com/folder3/v.html",label:"label3"}],
[irrelevant javascript code here]
</script>"""
print(re.findall(r'\{file:"(.*?)"', s))
['http://url.com/folder1/v.html', 'http://url.com/folder2/v.html', 'http://url.com/folder3/v.html']
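If the script is not always minified the same way, the same idea might be made a little more tolerant. A rough sketch, assuming the key is always spelled file and the value is always double-quoted; the function name extract_file_urls and the sample text are hypothetical:

```python
import re

def extract_file_urls(js_text):
    """Return the quoted value of every file: key found in js_text."""
    # \b keeps keys such as "profile:" from matching;
    # \s* tolerates optional whitespace around the colon;
    # [^"]+ stops at the closing quote, so no lazy quantifier is needed.
    return re.findall(r'\bfile\s*:\s*"([^"]+)"', js_text)

# Made-up sample with inconsistent spacing and an unrelated key.
sample = '''profile:"http://url.com/user.html",
sources:[{file:"http://url.com/folder1/v.html",label:"label1"},
{file : "http://url.com/folder2/v.html",label:"label2"}],'''

print(extract_file_urls(sample))
```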
You could also limit the text you search by splitting once on sources:[ and keeping only the part after it:
s="""<script>
[irrelevant javascript code here]
sources:[{file:"http://url.com/folder1/v.html",label:"label1"},
{file:"http://url.com/folder2/v.html",label:"label2"},
{file:"http://url.com/folder3/v.html",label:"label3"}],
[irrelevant javascript code here]
</script>"""
print(re.findall(r'\{file:"(.*?)"', s.split("sources:[", 1)[1]))
This removes everything before sources:[, presuming there is no other sources:[ in the document.
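Splitting only trims what comes before the array. If anything after it could also contain file: entries, one option is to capture just the array body first. A sketch under the assumption that the array contains no ] character before its closing bracket; the sample text and variable names are invented:

```python
import re

# Invented sample: a file: entry before the array and a jpg after it.
js = '''var x = {file:"http://t.example.com/ignored.html"};
sources:[{file:"http://url.com/folder1/v.html",label:"label1"},
{file:"http://url.com/folder2/v.html",label:"label2"}],
image:"http://url.com/i/pic.jpg"'''

# Lazily capture everything between "sources:[" and the first "]".
# re.DOTALL lets . span the line breaks inside the array.
match = re.search(r'sources:\[(.*?)\]', js, re.DOTALL)

# Search for file: values only inside the captured array body.
urls = re.findall(r'file:"(.*?)"', match.group(1)) if match else []
print(urls)
```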
Something like this?
re.findall(r'http://[^t][^\s"]+\.html', document)
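A quick way to check a pattern of this kind against the cases from the question. This sketch assumes the character class is meant to be [^\s"] (no whitespace, no double quote); the document string is made up:

```python
import re

# Made-up document: one wanted URL, one jpg on the same host,
# and one URL whose domain starts with "t".
document = ('file:"http://123.45.67.89/asodibfo3/v.html" '
            'poster:"http://123.45.67.89/i/0123/pic.jpg" '
            'link:"http://tracker.example/page.html"')

# [^t] rejects hosts starting with "t"; [^\s"]+ keeps the match inside
# the quoted string; \.html forces the match to end in a literal ".html".
urls = re.findall(r'http://[^t][^\s"]+\.html', document)
print(urls)
```

The jpg is skipped because the match cannot cross the closing quote and never finds .html before it.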