How to parse the text from an anchor tag?

I want to parse this anchor tag and extract its text: <a href="javascript:8==99999?popDuelloDialog(2754288):popTeam(2386)">Gnistan</a>

I have tried many ways to extract it, but none of them worked.

I don't know how to write a match for this format: the href starts with "javascript:" and is followed by numbers that differ from link to link. So I need a method that matches only on the repeating part and extracts the text in the tag body.

My code is here:

import sys
from PyQt4.QtGui import QApplication
from PyQt4.QtCore import QUrl
from PyQt4.QtWebKit import QWebPage
import bs4 as bs
import urllib.request
import re
from bs4 import BeautifulSoup

class Client(QWebPage):

    def __init__(self, url):
        self.app = QApplication(sys.argv)
        QWebPage.__init__(self)
        self.loadFinished.connect(self.on_page_load)
        self.mainFrame().load(QUrl(url))
        self.app.exec_()

    def on_page_load(self):
        self.app.quit()

url = 'http://www.mackolik.com/Genis-Iddaa-Programi'
client_response = Client(url)
source = client_response.mainFrame().toHtml()
soup = bs.BeautifulSoup(source, 'html.parser')
#pattern=re.compile(r"javascript:;")
#js_test = soup.find_all('a', href='javascript')
hreff=soup.find_all("a","javascript:;")
#js_test=soup.select('a[href^="javascript:\('(.*?)'\);"]')
#print(js_test.text)
#type(href)
for i in hreff:
    print(hreff[i])

You can do it like this. I know it's in VB, but you can take the idea...

'look for the beginning of <a href
    Dim xstr As String = "<a href=javascript:8==99999?popDuelloDialog(2754288):popTeam(2386)>Gnistan</a>"
    Dim xStart As Integer = InStr(xstr, "<a href")
    If xStart > 0 Then
        'look for the end
        Dim AHREF As Integer = InStr(xStart, xstr, ">") + 1
        'look for </a>
        Dim endAHREF As Integer = InStr(AHREF, xstr, "</a>")
        'take what you need
        Dim Result As String = Mid(xstr, AHREF, endAHREF - AHREF)


    End If
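
For readers working in Python, the same index-based idea might look roughly like the sketch below. The function name `anchor_text` is hypothetical, and the approach assumes the href value itself contains no ">" character; it is plain string slicing, not real HTML parsing.

```python
from typing import Optional


def anchor_text(html: str) -> Optional[str]:
    """Return the text between the first <a href...> and its </a>, or None."""
    # Look for the beginning of the anchor tag.
    start = html.find("<a href")
    if start == -1:
        return None
    # Look for the end of the opening tag (first ">" after "<a href").
    open_end = html.find(">", start)
    if open_end == -1:
        return None
    # Look for the closing </a>.
    close_start = html.find("</a>", open_end)
    if close_start == -1:
        return None
    # Take what lies between the opening and closing tags.
    return html[open_end + 1:close_start]


sample = '<a href="javascript:8==99999?popDuelloDialog(2754288):popTeam(2386)">Gnistan</a>'
print(anchor_text(sample))
```

This only handles the first anchor in the string, so it is best suited to a single known snippet rather than a whole page.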

IIUC, all you need is to make BeautifulSoup get every anchor tag that has "javascript" in its href attribute. However, it seems that the content you want to parse is created by JavaScript, which would require selenium and a webdriver such as ChromeDriver. With BeautifulSoup and requests alone, we can see that the content you probably want is not in the HTML source. The logic for solving your issue would be this:

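from bs4 import BeautifulSoup
import requests

url = "http://www.mackolik.com/Genis-Iddaa-Programi"
data = requests.get(url).text
soup = BeautifulSoup(data, 'html.parser')

# Print the text of every anchor whose href contains "javascript"
for tag in soup.find_all('a'):
    # .get() avoids a KeyError on anchors that have no href attribute
    if "javascript" in tag.get('href', ''):
        print(tag.text)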

The code above checks whether the substring "javascript" is in the href attribute and, if so, prints the tag's text.
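
To try the same href check offline, without a network request or third-party packages, a minimal sketch using only the standard library's `html.parser` could look like this. The class name `JsAnchorText` and the two extra anchors in the sample HTML are made up for illustration; only the first anchor comes from the question.

```python
from html.parser import HTMLParser


class JsAnchorText(HTMLParser):
    """Collect the text of <a> tags whose href contains "javascript"."""

    def __init__(self):
        super().__init__()
        self.texts = []
        self._buf = None  # text chunks while inside a matching <a>, else None

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # Anchors without an href are treated as an empty string.
            href = dict(attrs).get("href") or ""
            if "javascript" in href:
                self._buf = []

    def handle_data(self, data):
        if self._buf is not None:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._buf is not None:
            self.texts.append("".join(self._buf).strip())
            self._buf = None


html = (
    '<a href="javascript:8==99999?popDuelloDialog(2754288):popTeam(2386)">Gnistan</a>'
    '<a href="/about">About</a>'   # hypothetical non-matching link
    '<a name="top">no href</a>'    # hypothetical anchor without href
)
parser = JsAnchorText()
parser.feed(html)
parser.close()
print(parser.texts)
```

Like the BeautifulSoup version, this only sees what is in the HTML string it is given, so it does not help with content that JavaScript adds after the page loads.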

With selenium and ChromeDriver the logic is pretty much the same, but we need different methods:

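from selenium import webdriver
from selenium.webdriver.common.by import By

url = "http://www.mackolik.com/Genis-Iddaa-Programi"
driver = webdriver.Chrome()
driver.get(url)

# find_elements(By.TAG_NAME, ...) replaces find_elements_by_tag_name, which Selenium 4 removed
for tag in driver.find_elements(By.TAG_NAME, "a"):
    # get_attribute returns None when the anchor has no href
    if "javascript" in (tag.get_attribute("href") or ""):
        print(tag.text)

driver.quit()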
