How to make the Selenium/BS4 program using Pandas and DataFrame more optimized and elegant?

I'm learning web scraping and found an interesting challenge: scraping the JavaScript Handlebars table from this page: Samsung Knox Devices

I eventually got the output I wanted, but it feels "hacky", so I'd appreciate any improvements that make it more elegant.

The desired output is a dataframe/csv table with columns = Device, Model_Nums, OS/Platform, Knox Version. Nothing else on the page is needed, and I will split/expand and melt Model_Nums separately.
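The split/expand-and-melt step mentioned above can be sketched with `str.split` and `DataFrame.explode`. This is only an illustration, assuming the model numbers arrive as one comma-separated string per device; the sample rows and the `Model_Num` column name are made up for the example.

```python
import pandas as pd

# Hypothetical sample rows; the values mimic the page's comma-separated model codes.
wide = pd.DataFrame(
    {
        "Device": ["Galaxy A52", "Gear Sport"],
        "Model_Nums": ["SM-A525F, SM-A525M", "SM-R600"],
    }
)

# Split each comma-separated string into a list, then emit one row per list item.
long = (
    wide.assign(Model_Num=wide["Model_Nums"].str.split(","))
    .explode("Model_Num")
    .drop(columns="Model_Nums")
    .reset_index(drop=True)
)
# Remove the space left behind after each comma.
long["Model_Num"] = long["Model_Num"].str.strip()

print(long)
```

Each device is repeated once per model number, which is the long shape a later melt would otherwise have to produce.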

import pandas as pd

# Libraries for this task: 
from bs4 import BeautifulSoup
from selenium import webdriver

# Because the target table is built using Javascript handlebars, we have to use Selenium and a webdriver
driver = webdriver.Edge("MY_PATH") # REPLACE WITH >YOUR< PATH!

# Point the driver at the target webpage:
driver.get('https://www.samsungknox.com/en/knox-platform/supported-devices')

# Get the page content
html = driver.page_source
# Typically I'd do something like: soup = BeautifulSoup(html, "lxml")
# Link below suggested the following, which works; I don't know if it matters
sp = BeautifulSoup(html, "html.parser")

# The "table" here is really a bunch of nested divs
tables = sp.find_all("div", class_='table-row')
# https://www.angularfix.com/2021/09/how-to-extract-text-from-inside-div-tag.html
rows = []
for t in tables:
    row = t.text
    rows.append(row)

# These are the table-row div classes within each table-row from the output at the previous step that I want:    
    # div class="supported-devices pivot-fixed"
    # div class="model"
    # div class="operating-system"
    # div class="knox-version"

# Define div class names:
targets = ["supported-devices pivot-fixed", "model", "operating-system", "knox-version"]

# Create an empty list and loop through each target div class; append to list
data = []
for t in targets:
    hold = sp.find_all("div", class_=t)
    for h in hold:
        row = h.text
        data.append({'column': t, 'value': row}) 

df = pd.DataFrame(data)

# This feels like a hack, but I got stuck and it works, so \shrug/
# Create Series from filtered df based on 'column' value (corresponding to the four "targets" above)
name = df.loc[df['column'] == 'supported-devices pivot-fixed', 'value'].reset_index(drop=True)
model = df.loc[df['column'] == 'model', 'value'].reset_index(drop=True)
os_platform = df.loc[df['column'] == 'operating-system', 'value'].reset_index(drop=True)
knox = df.loc[df['column'] == 'knox-version', 'value'].reset_index(drop=True)
# Concatenate Series into df (the original referenced undefined df_* names here)
df2 = pd.concat([name, model, os_platform, knox], axis=1)

# Make the first row the column names:
new_header = df2.iloc[0] #grab the first row for the header
sam_knox_table = df2[1:] #take the data less the header row
sam_knox_table.columns = new_header #set the header row as the df header

# Bob's your uncle
sam_knox_table.to_csv('sam_knox.csv', index=False)
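One possible way to avoid the filter-and-concatenate step is to walk the `table-row` divs once and read the four cells of each row together. The sketch below is not tested against the live page: the `html` string is a hypothetical stand-in for `driver.page_source`, the sample cell values are invented, and the `CELLS` mapping is an assumption built from the class names listed in the question.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical stand-in for driver.page_source; the real markup may differ.
html = """
<div class="table-row table-header">
  <div class="supported-devices pivot-fixed">Device</div>
  <div class="model">Model Code</div>
  <div class="operating-system">OS/Platform</div>
  <div class="knox-version">Knox Version</div>
</div>
<div class="table-row">
  <div class="supported-devices pivot-fixed">Galaxy A52</div>
  <div class="model">SM-A525F, SM-A525M</div>
  <div class="operating-system">Android</div>
  <div class="knox-version">3.7</div>
</div>
"""

# Map each output column to the CSS selector of its cell within a row.
CELLS = {
    "Device": "div.supported-devices",
    "Model_Nums": "div.model",
    "OS/Platform": "div.operating-system",
    "Knox Version": "div.knox-version",
}

soup = BeautifulSoup(html, "html.parser")
records = []
# Skip the header row, then read the four cells of each data row together.
for row in soup.select("div.table-row:not(.table-header)"):
    record = {}
    for column, selector in CELLS.items():
        cell = row.select_one(selector)
        # A missing cell becomes None instead of shifting later values up.
        record[column] = cell.get_text(strip=True) if cell else None
    records.append(record)

sam_knox_table = pd.DataFrame(records, columns=list(CELLS))
print(sam_knox_table)
```

Because every record is built from a single row, the columns stay aligned even if one cell is absent, and the header-promotion step is no longer needed.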

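To scrape the text from the DEVICE and MODEL CODE columns, you need to induce WebDriverWait for visibility_of_all_elements_located(), build a list of the desired texts with a list comprehension, and then write it into a DataFrame using Pandas. You can use the following locator strategies: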

  • Code block:

     # Imports this snippet needs, in addition to an existing `driver`:
     import pandas as pd
     from selenium.webdriver.common.by import By
     from selenium.webdriver.support.ui import WebDriverWait
     from selenium.webdriver.support import expected_conditions as EC

     driver.get("https://www.samsungknox.com/en/knox-platform/supported-devices")
     devices = [my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "div.table-row:not(.table-header) > div.supported-devices")))]
     models = [my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "div.table-row:not(.table-header) > div.model")))]
     df = pd.DataFrame(data=list(zip(devices, models)), columns=['DEVICE', 'MODEL CODE'])
     print(df)
     driver.quit()
  • Console output:

     DEVICE              MODEL CODE
     0    Galaxy A42 5G     SM-A426N, SM-A426U, SM-A4260, SM-A426B
     1    Galaxy A52        SM-A525F, SM-A525M
     2    Galaxy A52 5G     SM-A5260
     3    Galaxy A52 5G     SM-A526U, SC-53B, SM-A526W, SM-A526B
     4    Galaxy A52s 5G    SM-A528B, SM-A528N
     ..   ...               ...
     371  Gear Sport        SM-R600
     372  Gear S3 Classic   SM-R775V
     373  Gear S3 Frontier  SM-R765V
     374  Gear S2           SM-R720, SM-R730A, SM-R730S, SM-R730V
     375  Gear S2 Classic   SM-R732, SM-R735, SM-R735A, SM-R735V, SM-R735S

     [376 rows x 2 columns]
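A caveat on building the frame with `zip()`: it stops at the shortest list, so if the two locators ever return a different number of elements, rows are dropped without any error. Passing a dict of lists to `pd.DataFrame` makes pandas check the lengths instead. The lists below are hypothetical and deliberately mismatched to show the difference.

```python
import pandas as pd

# Hypothetical scraped lists; the second one is deliberately one item short.
devices = ["Galaxy A52", "Gear Sport"]
models = ["SM-A525F, SM-A525M"]

# zip() stops at the shortest input, so the unmatched device vanishes silently.
zipped = pd.DataFrame(list(zip(devices, models)), columns=["DEVICE", "MODEL CODE"])
print(len(zipped))

# A dict of lists makes pandas check the lengths and raise on a mismatch.
try:
    pd.DataFrame({"DEVICE": devices, "MODEL CODE": models})
    mismatch_detected = False
except ValueError as err:
    mismatch_detected = True
    print(f"Length mismatch detected: {err}")
```

With equal-length lists both forms give the same frame, so the dict form costs nothing and fails loudly when the scrape is incomplete.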
