
How to extract specific string on a web page using Python

Here's the complete HTML code of the page I'm trying to scrape, so please take a look first: https://codepen.io/bendaggers/pen/LYpZMNv

As you can see, this is the page source of mbasic.facebook.com.

What I'm trying to do is scrape all the anchor tags that match a pattern like this:

Example:

<a class="cf" href="/profile.php?id=100044454444312&amp;fref=fr_tab">

Example with wildcard:

<a class="cf" href="*">

So I decided to add a wildcard after href="*" since the value is dynamic.

Here's my (not working) Python code:

import re

driver.get('https://mbasic.facebook.com/cheska.cabral.796/friends')
pagex = re.compile(driver.page_source)
pattern = "<a class=\"cf\" href=\"*\">"
print(pagex.findall(pattern))

Note that the page contains several patterns like this, so I need to capture all of them and print them:

<td class="w n" style="vertical-align: middle"><img src="https://scontent.fceb2-1.fna.fbcdn.net/v/t1.0-1/cp0/e15/q65/p50x50/79342209_112439723581175_5245034566049071104_o.jpg?_nc_cat=108&amp;_nc_sid=dbb9e7&amp;efg=eyJpIjoiYiJ9&amp;_nc_ohc=lADKURnNsk4AX8WTS1F&amp;_nc_ht=scontent.fceb2-1.fna&amp;_nc_tp=3&amp;oh=96f40cb2f95acbcfe9f6e4dc6cb31161&amp;oe=5EC27AEB" class="bo s" alt="Natividad Cruz, profile picture" /></td>
<td class="w t" style="vertical-align: middle"><a class="cf" href="/profile.php?id=100044454444312&amp;fref=fr_tab">Natividad Cruz</a>
<td class="w n" style="vertical-align: middle"><img src="https://scontent.fceb2-1.fna.fbcdn.net/v/t1.0-1/cp0/e15/q65/p50x50/10306248_10201945477974508_4213924286888352892_n.jpg?_nc_cat=109&amp;_nc_sid=dbb9e7&amp;efg=eyJpIjoiYiJ9&amp;_nc_ohc=Z2daQ-qGgpsAX8BmLKr&amp;_nc_ht=scontent.fceb2-1.fna&amp;_nc_tp=3&amp;oh=22f2b487166a7cd06e4ff650af4f7a7b&amp;oe=5EC34325" class="bo s" alt="John Vinas, profile picture" /></td>
<td class="w t" style="vertical-align: middle"><a class="cf" href="/john.vinas?fref=fr_tab">John Vinas</a>

My goal is to print or findall the anchor tags and display them in the terminal. I'd appreciate your help on this. Thank you!

I tried another set of code, but no luck :)

import re

driver.get('https://mbasic.facebook.com/cheska.cabral.796/friends')
pagex = driver.page_source
pattern = "<td class=\".*\" style=\"vertical-align: middle\"><a class=\".*\">"
x = re.findall(pattern, pagex)
print(x)

I think your wildcard match needs a dot in front, like .*

I'd also recommend using a library like Beautiful Soup for this; it might make your life easier.
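Applying that fix to your second attempt's idea, a minimal sketch might look like this (using a short inline snippet as a stand-in for driver.page_source, since the real page source isn't available here):

```python
import re

# Hypothetical stand-in for driver.page_source:
pagex = ('<a class="cf" href="/profile.php?id=100044454444312&amp;fref=fr_tab">'
         '<a class="cf" href="/john.vinas?fref=fr_tab">')

# ".*?" (dot + star, non-greedy) instead of a bare "*" wildcard:
pattern = r'<a class="cf" href=".*?">'
matches = re.findall(pattern, pagex)
print(matches)
```

The non-greedy `.*?` stops at the first closing quote-and-bracket, so each anchor tag is matched separately instead of one greedy match swallowing the whole string.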

You should use a parsing library, such as BeautifulSoup or requests-html. If you want to do it manually, then build on the second attempt you posted. The first won't get you what you want, because you are compiling the entire page as a regular expression.

import re

s = """<a class="cf" href="/profile.php?id=100044454444312&amp;fref=fr_tab">\n\n<h1>\n<a class="cf" href="/profile.php?id=20004666644312&amp;fref=fr_tab">"""

patt = r'<a.*?class[="]{2}cf.*?href.*?profile.*?>'
matches = re.findall(patt, s)

Output

>>>matches
['<a class="cf" href="/profile.php?id=100044454444312&amp;fref=fr_tab">',
 '<a class="cf" href="/profile.php?id=20004666644312&amp;fref=fr_tab">']
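If you only need the href values rather than the whole tags, a capturing group on the same sample does the trick (a sketch building on the snippet above):

```python
import re

s = ('<a class="cf" href="/profile.php?id=100044454444312&amp;fref=fr_tab">'
     '<a class="cf" href="/profile.php?id=20004666644312&amp;fref=fr_tab">')

# Capture only the text inside the href="..." quotes:
hrefs = re.findall(r'<a class="cf" href="([^"]*)"', s)
print(hrefs)
```

`[^"]*` matches everything up to the next quote, so each capture is exactly one href value.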

As mentioned by the previous respondent, BeautifulSoup is the best library available in Python for scraping web pages. To import Beautiful Soup and the other libraries, use the following commands:

  • from urllib.request import Request, urlopen
  • from bs4 import BeautifulSoup

After this, the set of commands below should serve your purpose:

req=Request(url,headers = {'User-Agent': 'Chrome/64.0.3282.140'})
result=urlopen(req).read()
soup = BeautifulSoup(result, "html.parser")
atags=soup('a')

url in the above command is the link you want to scrape, and the headers argument takes your browser specs/version.
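Putting this together for the question: `soup('a')` returns every anchor, so the anchors can be filtered by the cf class and their hrefs printed. A minimal sketch, using an inline HTML snippet in place of the fetched page:

```python
from bs4 import BeautifulSoup

# Inline snippet standing in for the fetched page source:
html = '''
<td><a class="cf" href="/profile.php?id=100044454444312&amp;fref=fr_tab">Natividad Cruz</a></td>
<td><a class="cf" href="/john.vinas?fref=fr_tab">John Vinas</a></td>
'''
soup = BeautifulSoup(html, "html.parser")

# class_ (with trailing underscore) filters by CSS class:
atags = soup.find_all("a", class_="cf")
for a in atags:
    print(a["href"], a.get_text())
```

Note that the parser unescapes HTML entities, so `&amp;` in the source comes back as `&` in the href attribute.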
