Selenium - How can I scrape this table?

Question

I'm looking to scrape data from https://www.worldometers.info/coronavirus/, but the table's tr and td elements seem to keep changing from row to row. I have the code below so far, and it is not working.

public ArrayList<Data>getAllData(){
        ArrayList<Data>allData = new ArrayList<Data>();
        try {
        Thread.sleep(10000);
        WebDriver browser = load();
        int row = browser.findElements(By.xpath("/html[1]/body[1]/div[3]/div[3]/div[1]/div[4]/div[1]/div[1]/table[1]/tbody[1]/tr")).size();
        int col = browser.findElements(By.xpath("/html[1]/body[1]/div[3]/div[3]/div[1]/div[4]/div[1]/div[1]/table[1]/tbody[1]/tr[1]/td")).size();
        for ( int i = 3; i < row; i++) {
            for ( int j = 1; j < col; j++) {
        Data data = new Data();
        data.setId(browser.findElement(By.xpath("/html[1]/body[1]/div[3]/div[3]/div[1]/div[4]/div[1]/div[1]/table[1]/tbody[1]/tr["+i+"]/td["+j+"]")).getText());
        data.setCountry(browser.findElement(By.xpath("/html[1]/body[1]/div[3]/div[3]/div[1]/div[4]/div[1]/div[1]/table[1]/tbody[1]/tr["+i+"]/td["+j+"]")).getText());
        data.setTotalCases(browser.findElement(By.xpath("/html[1]/body[1]/div[3]/div[3]/div[1]/div[4]/div[1]/div[1]/table[1]/tbody[1]/tr["+i+"]/td["+j+"]")).getText());
        data.setNewCases(browser.findElement(By.xpath("/html[1]/body[1]/div[3]/div[3]/div[1]/div[4]/div[1]/div[1]/table[1]/tbody[1]/tr["+i+"]/td["+j+"]")).getText());
        data.setTotalDeaths(browser.findElement(By.xpath("/html[1]/body[1]/div[3]/div[3]/div[1]/div[4]/div[1]/div[1]/table[1]/tbody[1]/tr["+i+"]/td["+j+"]")).getText());
        data.setNewDeaths(browser.findElement(By.xpath("/html[1]/body[1]/div[3]/div[3]/div[1]/div[4]/div[1]/div[1]/table[1]/tbody[1]/tr["+i+"]/td["+j+"]")).getText());
        data.setTotalRecovered(browser.findElement(By.xpath("/html[1]/body[1]/div[3]/div[3]/div[1]/div[4]/div[1]/div[1]/table[1]/tbody[1]/tr["+i+"]/td["+j+"]")).getText());
        data.setActiveCases(browser.findElement(By.xpath("/html[1]/body[1]/div[3]/div[3]/div[1]/div[4]/div[1]/div[1]/table[1]/tbody[1]/tr["+i+"]/td["+j+"]")).getText());
        data.setSeriousCases(browser.findElement(By.xpath("/html[1]/body[1]/div[3]/div[3]/div[1]/div[4]/div[1]/div[1]/table[1]/tbody[1]/tr["+i+"]/td["+j+"]")).getText());
        data.setTotalCasesPerMillionPop(browser.findElement(By.xpath("/html[1]/body[1]/div[3]/div[3]/div[1]/div[4]/div[1]/div[1]/table[1]/tbody[1]/tr["+i+"]/td["+j+"]")).getText());
        data.setTotalDeathsPerMillionPop(browser.findElement(By.xpath("/html[1]/body[1]/div[3]/div[3]/div[1]/div[4]/div[1]/div[1]/table[1]/tbody[1]/tr["+i+"]/td["+j+"]")).getText());
        data.setTotalTests(browser.findElement(By.xpath("/html[1]/body[1]/div[3]/div[3]/div[1]/div[4]/div[1]/div[1]/table[1]/tbody[1]/tr["+i+"]/td["+j+"]")).getText());
        data.setTestsPerMillion(browser.findElement(By.xpath("/html[1]/body[1]/div[3]/div[3]/div[1]/div[4]/div[1]/div[1]/table[1]/tbody[1]/tr["+i+"]/td["+j+"]")).getText());       
        allData.add(data);
            }
        }
        browser.quit();
        browser.close();
        }
        catch(Exception e) {
            e.printStackTrace();
        }
    
        return allData;
        }

Answer 1

What you're currently doing is finding all the elements that match the rows and columns, but then you navigate the table by index position instead of using the actual elements you found.

If you look at the table body in dev tools, you can see it contains totals rows that are hidden from view. USA is row 3 (highlighted in devtools), but rows 4, 5 and 6 are totals.

If you expand them, the number of columns and their content vary.

A couple of things to suggest:

Try a smarter XPath to get all rows (this seems to skip those headers):

//table[@id="main_table_countries_today"]//tr[@role="row"]

Then iterate over the row elements you found with a foreach loop (not by XPath index). Inside that loop, get the td tags within each row.

For example:

public void GettingAllTheData() {
    // Get all the rows that match (driver is your WebDriver instance)
    var rows = driver.findElements(By.xpath("//table[@id='main_table_countries_today']//tr[@role='row']"));

    // Loop over all rows
    for (var row : rows) {
        // Then get the columns within the row object
        var cols = row.findElements(By.tagName("td"));

        // Replace this with writing out your data.
        // This is just to make sure it writes out as expected.
        // You might not need a second loop.
        for (var col : cols) {
            System.out.println(col.getText());
        }
    }
}

I didn't want to recreate your data object, so I just went with print. For me, this seems to write the results consistently.

First iteration:

1
USA
3,619,643
+2,816
140,200
+56
1,646,683
1,832,760
16,459
10,933
423
44,867,389
135,518
331,081,677

Second iteration:

2
Brazil
1,972,072
+1,163
75,568
+45
1,366,775
529,729
8,318
9,275
355
4,911,063
23,098
212,620,008
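
To go from printing to filling an object, the cell texts of each row can be mapped by position. The following is only a minimal sketch, assuming the column order seen in the output above (rank, country, total cases, new cases, and so on) and assuming that country rows start with a numeric rank. `CountryRow`, `parseCount` and `fromCells` are hypothetical names standing in for the question's `Data` class, and only the first four columns are mapped to keep it short.

```java
import java.util.List;
import java.util.Optional;
import java.util.OptionalLong;

// Hypothetical stand-in for the question's Data class; all names here are assumptions.
final class CountryRow {
    final String id;
    final String country;
    final OptionalLong totalCases;
    final OptionalLong newCases;

    private CountryRow(String id, String country, OptionalLong totalCases, OptionalLong newCases) {
        this.id = id;
        this.country = country;
        this.totalCases = totalCases;
        this.newCases = newCases;
    }

    // Parses cell text such as "3,619,643" or "+2,816".
    // Blank or non-numeric cells (for example "N/A") yield an empty OptionalLong.
    static OptionalLong parseCount(String text) {
        String cleaned = text.trim().replace(",", "").replace("+", "");
        if (!cleaned.matches("-?\\d{1,18}")) {
            return OptionalLong.empty();
        }
        return OptionalLong.of(Long.parseLong(cleaned));
    }

    // Maps one row's cell texts by position. Rows with too few cells, or without
    // a numeric rank in the first cell, are treated as non-country rows and skipped.
    static Optional<CountryRow> fromCells(List<String> cells) {
        if (cells.size() < 4 || !parseCount(cells.get(0)).isPresent()) {
            return Optional.empty();
        }
        return Optional.of(new CountryRow(
                cells.get(0).trim(),
                cells.get(1).trim(),
                parseCount(cells.get(2)),
                parseCount(cells.get(3))));
    }
}
```

Inside the foreach loop from the answer, the text of each td (col.getText()) could be collected into a list and passed to fromCells. The remaining columns can be added the same way, provided the column order on the page still matches.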

Answer 1 · 0 · accepted · 2020-07-16 13:41:45
