Web scraping a dynamic website that uses JavaScript with Beautiful Soup and RegEx

I am trying to make an app that gives fantasy football scores for the XFL as a personal project. I was able to use Beautiful Soup to get the page source and String.split() to separate out all the players' stats. But when I try to get the rosters, I get something like this:

>**1**</td><td style="background-color:white; border-bottom:1px solid black; border-left:none; border-right:1px solid black; border-top:none; text-align:center; vertical-align:bottom; white-space:nowrap; width:89px">**Jazz**</td><td style="background-color:white; border-bottom:1px solid black; border-left:none; border-right:1px solid black; border-top:none; text-align:center; vertical-align:bottom; white-space:nowrap; width:100px">**Ferguson**</td><td style="background-color:white; border-bottom:1px solid black; border-left:none; border-right:1px solid black; border-top:none; text-align:center; vertical-align:bottom; white-space:nowrap; width:61px">**WR**

Out of this I need to get the information 1, Jazz, Ferguson and WR. String.split() will not work for something this complex. I was thinking about using regular expressions, but I am not sure how. Can anyone come up with a regex for this or, if there is a much easier way, point me in the right direction? Thank you.

EDIT: This is the portion of the code I use to get the HTML data above. It prints out the whole thing; the part shown above is only a section.

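from bs4 import BeautifulSoup
from requests_html import HTMLSession

PARSER = 'html.parser'  # assumed value; the post does not show how PARSER is defined

session = HTMLSession()
page = session.get('https://www.xfl.com/en-US/teams/dallas/renegades-articles/dallas-renegades-roster')

soup2 = BeautifulSoup(page.content, PARSER)
scripts = soup2.find_all('script')

for tag in scripts:
    # The roster table is embedded in the script that carries the article title.
    if '"title":"Dallas Renegades roster"' in tag.text:
        # Keep everything from the 'College' column header onward.
        rosterData = tag.text[tag.text.find('College'):]
        # Drop the closing cell tags and the backslash escapes.
        rosterData = rosterData.replace('</td>', '').replace('\\', '')

        print(rosterData)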

Hi, the code below gets the full table as a DataFrame; you can filter the required data from it:

import requests
import pandas as pd
url = 'https://www.xfl.com/en-US/teams/dallas/renegades-articles/dallas-renegades-roster'
html = requests.get(url).content
df_list = pd.read_html(html)
df = df_list[-1]
print(df)
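
If a regular expression is still wanted, as the question asks, something along these lines may work. This is only a sketch: the `fragment` string is a shortened stand-in for the HTML in the question (style attributes trimmed), and `CELL_RE` and `extract_cells` are hypothetical names. It also assumes the closing `</td>` tags are still present, so the `.replace('</td>', '')` step from the question would have to be skipped.

```python
import re

# Shortened stand-in for the fragment in the question (an assumption, not real site output).
fragment = (
    '<td style="text-align:center; width:40px">1</td>'
    '<td style="text-align:center; width:89px">Jazz</td>'
    '<td style="text-align:center; width:100px">Ferguson</td>'
    '<td style="text-align:center; width:61px">WR</td>'
)

# Capture the text between each opening <td ...> tag and its closing </td>.
CELL_RE = re.compile(r'<td\b[^>]*>(.*?)</td>', re.DOTALL | re.IGNORECASE)


def extract_cells(html):
    """Return the stripped text of every <td> cell found in html."""
    return [cell.strip() for cell in CELL_RE.findall(html)]


cells = extract_cells(fragment)
print(cells)  # ['1', 'Jazz', 'Ferguson', 'WR']
```

A regex like this is brittle against nested tags or malformed markup, so the DataFrame approach above is probably the easier route when the table parses cleanly.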
