Scraping specific sports data SELENIUM/BS4
I am trying to scrape data from this page: https://www.flashscore.pl/druzyna/ajax/8UOvIwnb/tabela
Q1: I wrote this code, but I don't know how to extract the data for the Ajax team only. The data should be saved as a list; later it will be saved to a CSV file. In addition, I am not interested in some values, for example the "?" sign. How can I exclude it? I'd be grateful for your help.
Q2: How can I separate the answer for "Ajax" with ";", e.g. Ajax;18;13;3;2;56:4;42;?;W;W;P;W;W;
CODE
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from bs4 import BeautifulSoup as BS
import requests
from time import sleep
driver = webdriver.Chrome()
driver.get("https://www.flashscore.pl/druzyna/ajax/8UOvIwnb/tabela/")
sleep(10)
page = driver.page_source
soup = BS(page,'html.parser')
content3 = soup.find('div',{'class':'ui-table__body'})
content_list3 = content3.find_all('div',{'class':'tableCellFormIcon tableCellFormIcon--TBD'})
for i in content3:
    print(i.text.split()[0])
RESULTS
1.PSV18141346:2443?WWWWR
2.Ajax18133256:442?WWPWW
3.Feyenoord18123342:1739?WPRWW
4.Vitesse18103525:2533?WRWWR
5.Alkmaar18102635:2332?WWWWW
6.Twente1895428:2232?RWWWR
7.Utrecht1885533:2329?RRRPW
8.Cambuur1891832:3928?RPWPW
9.Nijmegen1874724:2625?WWPPP
10.Heerenveen1874720:2525?PWRWR
11.G.A.
12.Groningen1847720:2719?PPRRW
13.Heracles18531021:2618?RWPPP
14.Willem
15.Waalwijk1837819:3016?RPPWR
16.Sparta
17.Sittard18341119:4613?PRWPP
18.Zwolle1813149:326?PPPRR
You can add it to a list:
res = []
for i in content3:
    line = i.text.split()[0]
    print(line)
    res.append(line)
https://docs.python.org/3/tutorial/datastructures.html -
list.append(x) Add an item to the end of the list. Equivalent to a[len(a):] = [x].
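Since the question mentions saving the list to a CSV file later, here is a minimal sketch using the standard csv module. The function name rows_to_csv is only illustrative, and the sample row is the Ajax line from the question, assumed to be already split into one string per field.

```python
import csv
import io


def rows_to_csv(rows, delimiter=";"):
    """Return the rows as CSV text, one line per row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter=delimiter, lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()


# Sample row taken from the question, assumed to be split per field already.
rows = [["Ajax", "18", "13", "3", "2", "56:4", "42", "W", "W", "P", "W", "W"]]

csv_text = rows_to_csv(rows)
print(csv_text, end="")
# Ajax;18;13;3;2;56:4;42;W;W;P;W;W
```

To write a real file instead of a string, the same writer can be wrapped around open("some_name.csv", "w", newline=""); the file name here is a placeholder.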
To replace the "?", add this:
line = line.replace("?", "")
https://docs.python.org/3/library/stdtypes.html#str.replace -
str.replace(old, new[, count]) Return a copy of the string with all occurrences of substring old replaced by new. If the optional argument count is given, only the first count occurrences are replaced.
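A quick illustration of both forms of str.replace, using the Ajax line from the question's output. The second string is a made-up value, only there to show the effect of count.

```python
line = "2.Ajax18133256:442?WWPWW"

# Without count, every "?" is removed.
cleaned = line.replace("?", "")
print(cleaned)
# 2.Ajax18133256:442WWPWW

# With count=1, only the first occurrence is replaced (made-up sample string).
sample = "a?b?c"
first_only = sample.replace("?", "", 1)
print(first_only)
# ab?c
```

Note that str.replace returns a new string, so the result has to be assigned back (as in line = line.replace("?", "")).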
I added a regular expression and filtered for "Ajax":
import re
...
res = []
for i in content3:
    line = i.text.split()[0]
    if re.search('Ajax', line):
        line = line.replace("?", "")
        res.append(line)
print(res)
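For a fixed word like "Ajax", a plain substring test does the same job as re.search. The sketch below compares the two on three lines copied from the question's output; the stricter pattern (rank, dot, then the team name) is just one possible way to avoid matching "Ajax" elsewhere in a line.

```python
import re

# Three lines copied from the output shown in the question.
lines = [
    "1.PSV18141346:2443?WWWWR",
    "2.Ajax18133256:442?WWPWW",
    "3.Feyenoord18123342:1739?WPRWW",
]

# Plain substring test, equivalent to re.search('Ajax', line) here.
by_substring = [line for line in lines if "Ajax" in line]

# Stricter variant: the name must follow the rank at the start of the line.
pattern = re.compile(r"^\d+\.Ajax")
by_regex = [line for line in lines if pattern.search(line)]

print(by_substring)
print(by_regex)
```

Both lists contain only the "2.Ajax..." line for this input.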
Another question on the main topic: how can I get only that result, with the values separated by ";"?
Results
['1.Ajax20153261:548WWWWP']
Expected result (separated by ";" and with a few values omitted, 20 and 48 in this example):
Ajax;15;3;2;61:5;W;W;W;W;P
Code below:
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from bs4 import BeautifulSoup as BS
import requests
from time import sleep
import re
driver = webdriver.Chrome()
driver.get("https://www.flashscore.pl/druzyna/ajax/8UOvIwnb/tabela/")
sleep(10)
page = driver.page_source
soup = BS(page,'html.parser')
content3 = soup.find('div',{'class':'ui-table__body'})
content_list3 = content3.find_all('div',{'class':'tableCellFormIcon tableCellFormIcon--TBD'})
res = []
for i in content3:
    line = i.text.split()[0]
    if re.search('Ajax', line):
        line = line.replace("?", "")
        res.append(line)
print(res)
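One possible approach: once a row's text has been glued together (as in '1.Ajax20153261:548WWWWP'), the column boundaries are lost, so it is easier to read the text of each cell separately and join afterwards. The sketch below assumes the cell texts are already in a list, for example collected per cell element of the row with BeautifulSoup's get_text(strip=True); the cell selector is not shown because the page's class names are not confirmed here. The helper name format_row and its parameters are hypothetical, and the sample values are the ones implied by the result and expected result above.

```python
def format_row(cells, skip=("?",), drop_indexes=()):
    """Join the texts of one table row with ';'.

    cells        -- list of strings, one per table cell
    skip         -- cell values to leave out (for example the "?" icon)
    drop_indexes -- positions of columns to leave out entirely
    """
    kept = []
    for index, text in enumerate(cells):
        text = text.strip()
        if not text or text in skip or index in drop_indexes:
            continue
        kept.append(text)
    return ";".join(kept)


# Assumed per-cell texts for the Ajax row from the result above.
ajax_cells = ["1.", "Ajax", "20", "15", "3", "2", "61:5", "48",
              "?", "W", "W", "W", "W", "P"]

# Drop the rank (index 0), the value 20 (index 2) and the value 48 (index 7).
result = format_row(ajax_cells, drop_indexes={0, 2, 7})
print(result)
# Ajax;15;3;2;61:5;W;W;W;W;P
```

Keeping the values in a list until the very end also makes it straightforward to pass the same row to csv.writer instead of joining it by hand.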