Scraping specific sports data SELENIUM/BS4

I am trying to scrape data from this page: https://www.flashscore.pl/druzyna/ajax/8UOvIwnb/tabela

Q1: I wrote this code, but I don't know how to extract the data for the Ajax team only. The data should be stored in a list; later it will be saved to a CSV file. Also, I am not interested in some values, for example the "?" sign. How can I exclude it? I'd be grateful for your help.

Q2: How can I separate the fields of the Ajax row, e.g. with ";"? For example: Ajax;18;13;3;2;56:4;42;?;W;W;P;W;W;

CODE

from time import sleep

from bs4 import BeautifulSoup as BS
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://www.flashscore.pl/druzyna/ajax/8UOvIwnb/tabela/")
sleep(10)  # crude wait for the JavaScript-rendered table to load
page = driver.page_source
soup = BS(page, 'html.parser')
content3 = soup.find('div', {'class': 'ui-table__body'})
# Form-icon cells with the TBD modifier (collected here, not used below)
content_list3 = content3.find_all('div', {'class': 'tableCellFormIcon tableCellFormIcon--TBD'})

# Each child of the table body is one row of the standings
for i in content3:
    print(i.text.split()[0])

RESULTS

1.PSV18141346:2443?WWWWR
2.Ajax18133256:442?WWPWW
3.Feyenoord18123342:1739?WPRWW
4.Vitesse18103525:2533?WRWWR
5.Alkmaar18102635:2332?WWWWW
6.Twente1895428:2232?RWWWR
7.Utrecht1885533:2329?RRRPW
8.Cambuur1891832:3928?RPWPW
9.Nijmegen1874724:2625?WWPPP
10.Heerenveen1874720:2525?PWRWR
11.G.A.
12.Groningen1847720:2719?PPRRW
13.Heracles18531021:2618?RWPPP
14.Willem
15.Waalwijk1837819:3016?RPPWR
16.Sparta
17.Sittard18341119:4613?PRWPP
18.Zwolle1813149:326?PPPRR
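
The truncated rows (11, 14 and 16) are consistent with `split()[0]` keeping only the first whitespace-separated token: a team name that contains a space is cut at that space, and the statistics after it are lost. A minimal sketch, assuming the text of row 14 starts with "14.Willem II" (the full name is an assumption; only "14.Willem" appears in the output above):

```python
def first_token(text):
    """Return what text.split()[0] yields: everything before the first whitespace."""
    return text.split()[0]


# The Ajax row has no space, so it survives intact.
print(first_token("2.Ajax18133256:442?WWPWW"))  # 2.Ajax18133256:442?WWPWW

# A name with a space is cut at the space (assumed row prefix, statistics omitted).
print(first_token("14.Willem II"))  # 14.Willem
```

Using `i.text` without `split()[0]` would keep the whole row text, at the cost of still having all columns run together.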

You can add it to a list:

res = []
for i in content3:
    line = i.text.split()[0]
    print(line)
    res.append(line)

From https://docs.python.org/3/tutorial/datastructures.html:

list.append(x): Add an item to the end of the list. Equivalent to a[len(a):] = [x].


To remove the "?", add this:

line = line.replace("?", "")

From https://docs.python.org/3/library/stdtypes.html#str.replace:

str.replace(old, new[, count]): Return a copy of the string with all occurrences of substring old replaced by new. If the optional argument count is given, only the first count occurrences are replaced.
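
Because `str.replace` returns a copy, the result has to be assigned back to a name; the original string is left untouched. A small illustration using the Ajax row from the results above (the variable names are only for this example):

```python
line = "2.Ajax18133256:442?WWPWW"  # row text copied from the results above

cleaned = line.replace("?", "")    # new string without the "?"
print(cleaned)  # 2.Ajax18133256:442WWPWW
print(line)     # 2.Ajax18133256:442?WWPWW (unchanged)
```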

Added a regular expression to keep only the "Ajax" row:
import re
...
res = []
for i in content3:
    line = i.text.split()[0]
    if re.search('Ajax', line):
        line = line.replace("?", "")
        res.append(line)

print(res)
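
The question also mentions saving the data to a CSV file later. A minimal sketch using the standard `csv` module, assuming each row is already a list of separate field strings. The `rows` sample is typed in by hand from the Ajax line in the question (without the "?"), not scraped, and `write_rows` is a hypothetical helper name:

```python
import csv
import io


def write_rows(rows, fileobj, delimiter=";"):
    """Write each row (a list of strings) as one delimiter-separated line."""
    writer = csv.writer(fileobj, delimiter=delimiter, lineterminator="\n")
    writer.writerows(rows)


# Hand-typed sample row; real rows would come from the scraper.
rows = [["Ajax", "18", "13", "3", "2", "56:4", "42", "W", "W", "P", "W", "W"]]

# Write into an in-memory buffer so the example leaves no file behind.
buffer = io.StringIO()
write_rows(rows, buffer)
print(buffer.getvalue(), end="")  # Ajax;18;13;3;2;56:4;42;W;W;P;W;W
```

To write a real file, pass a file object opened with `open(path, "w", newline="", encoding="utf-8")` instead of the buffer.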

A follow-up question on the main topic: how can I get only these values, separated by ";"?

Current result:

 ['1.Ajax20153261:548WWWWP']

Expected result (fields separated by ";", with some columns omitted: the values 20 and 48 in this example):

Ajax;15;3;2;61:5;W;W;W;W;P

Code so far:

from time import sleep
import re

from bs4 import BeautifulSoup as BS
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://www.flashscore.pl/druzyna/ajax/8UOvIwnb/tabela/")
sleep(10)  # crude wait for the JavaScript-rendered table to load
page = driver.page_source
soup = BS(page, 'html.parser')
content3 = soup.find('div', {'class': 'ui-table__body'})
# Form-icon cells with the TBD modifier (collected here, not used below)
content_list3 = content3.find_all('div', {'class': 'tableCellFormIcon tableCellFormIcon--TBD'})
res = []
# Keep only the row that mentions Ajax and drop the "?" placeholder
for i in content3:
    line = i.text.split()[0]
    if re.search('Ajax', line):
        line = line.replace("?", "")
        res.append(line)

print(res)
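
One possible way to get the expected `;`-separated line is to work on the individual cell texts rather than on the concatenated row text. The sketch below assumes the cells are already available as a list of strings, for instance collected with BeautifulSoup's `stripped_strings` on the row element; whether that yields exactly one string per column on this page has not been verified. `format_row` and its parameters are hypothetical names, and the sample `cells` list is split by hand from the result shown above:

```python
def format_row(cells, drop_positions=(0, 2, 7), drop_values=("?",), sep=";"):
    """Join cell texts with `sep`, skipping unwanted positions and values."""
    kept = [
        cell
        for position, cell in enumerate(cells)
        if position not in drop_positions and cell not in drop_values
    ]
    return sep.join(kept)


# '1.Ajax20153261:548?WWWWP' split by hand into its columns:
# rank, team, played, won, drawn, lost, goals, points, "?", last five results.
cells = ["1.", "Ajax", "20", "15", "3", "2", "61:5", "48", "?",
         "W", "W", "W", "W", "P"]

print(format_row(cells))  # Ajax;15;3;2;61:5;W;W;W;W;P
```

The default `drop_positions` removes the rank, the matches played (20) and the points (48), which matches the expected result in the question; pass a different tuple to keep other columns.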
