How to read HTML table data with Beautiful Soup? Returning 'None'

I'm using Beautiful Soup to read data from an HTML table. Why am I not getting a result from the table, and how do I fix it? My code returns 'None'.

I see that there is JavaScript in the page source and have read that this might be an issue. The URL runs a report whose results are inserted into the table.

I've used soup.prettify() to check the HTML, and it doesn't seem to give me the full source code. I'm unsure whether this is an issue.

Here's the HTML of the table and the first data row:

    <table data-toggle="table"
        data-show-columns="true"
        data-show-export="true"
        data-show-toggle="true"
        class="table-data">
        <thead>
            <tr>
                <th data-field="RouteId" data-sortable="true">Route ID</th>
                <th data-field="RouteName" data-sortable="true">Route Name</th>
                <th data-field="TripId" data-sortable="true">Trip ID</th>
                <th data-field="TripName" data-sortable="true">Trip Name</th>
                <th data-field="InstanceId" data-sortable="true">INST ID</th>
                <th data-field="InstanceDate" data-sortable="true">INST Date</th>
                <th data-field="InstanceStatus" data-sortable="true">INST Status</th>
                <th data-field="InstanceCapacity" data-sortable="true">INST Cap.</th>
                <th data-field="NumOrders" data-sortable="true">Num. ORDs</th>
                <th data-field="OrderId" data-sortable="true">ORD ID</th>
                <th data-field="OrderType" data-sortable="true">ORD Type</th>
                <th data-field="OrderStatus" data-sortable="true">ORD Status</th>
                <th data-field="VehicleYear" data-sortable="true">VEH Year</th>
                <th data-field="VehicleMake" data-sortable="true">VEH Make</th>
                <th data-field="VehicleModel" data-sortable="true">VEH Model</th>
                <th data-field="VehicleRefNo1" data-sortable="true">VEH RefNo1</th>
                <th data-field="vehicleVin" data-sortable="true">VEH Vin</th>
                <th data-field="DriverId" data-sortable="true">DRV ID</th>
                <th data-field="DriverName" data-sortable="true">DRV Name</th>
                <th data-field="ScheduledPickupDateTime" data-sortable="true">Sch. Pick</th>
                <th data-field="ActualPickupPickupDateTime" data-sortable="true">Act. Pick</th>
                <th data-field="DeliveredDateTime" data-sortable="true">Hand. Rec.</th>
                <th data-field="HandheldDateTime" data-sortable="true">Del.</th>
            </tr>
        </thead>
        <tbody>

            <tr>
                <td>160</td>
                <td>8 LEG: MEM to PRES</td>
                <td>187</td>
                <td>Trip 1 - Leg 7</td>
                <td>740685</td>
                <td>2017-02-01</td>
                <td>Active</td>
                <td>9.00000</td>
                <td>9</td>
                <td>9110734</td>
                <td>LoadLegChild</td>
                <td>InRoute</td>
                <td>2015</td>
                <td>Jeep</td>
                <td>Patriot</td>
                <td>2000047350</td>
                <td>1C4NJPFBXFD318536</td>
                <td>1</td>
                <td>User, System</td>
                <td>2017-02-01 02:05 AM</td>
                <td>2017-02-01 02:20 AM</td>
                <td></td>
                <td></td>
            </tr>

Here is my attempt with Beautiful Soup:

    from urllib.request import urlopen
    from bs4 import BeautifulSoup

    page = urlopen(url)
    soup = BeautifulSoup(page, 'lxml')
    print(soup.find('table', {'class': 'table-data table'}))

I've also tried XPath but received an empty list:

    import requests
    from lxml import html

    NewPage = requests.get(url)
    tree = html.fromstring(NewPage.content)
    tree.xpath('//*[@id="content"]/div[2]/div[2]/div[2]/div[2]/table/tbody/tr[1]/td[1]')

UPDATE: I think the table I'm trying to read is created dynamically; how would I change my code to account for this? I've also tried using find_all to check my work, but it doesn't bring back every table in the HTML, only the first one. Why is this?

    page = requests.get(url)
    pageText = page.text
    soup = BeautifulSoup(pageText, 'lxml')
    print(soup.find_all('table'))

Here's the output:

    [<table cellpadding="0" cellspacing="0" id="Login1">
    <tr>
    <td>
    <div class="row">
    <div class="col-md-6">
    <div class="form-group">
    <label for="UserName">Username</label>
    <input class="form-control" id="Login1_UserName" name="Login1$UserName" type="text"/>
    </div>
    </div>
    <div class="col-md-6">
    <div class="form-group">
    <label for="Password">Password</label>
    <input class="form-control" id="Login1_Password" name="Login1$Password" type="password"/>
    </div>
    </div>
    </div>
    <div class="row">
    <div class="col-md-6">
    <input id="Login1_RememberMe" name="Login1$RememberMe" type="checkbox"/><label for="Login1_RememberMe">Remember my login</label>
    </div>
    <div class="col-md-6 text-right">
    <input class="btn btn-default" id="Login1_Login" name="Login1$Login" type="submit" value="Login"/>
    </div>
    </div>
    <p>
    </p>
    </td>
    </tr>
    </table>]

It looks to me like you are mixing up the form used in earlier versions of Beautiful Soup with the newer one.

I would try: soup.find("table", class_="table-data")

This is the syntax for the newer versions of Beautiful Soup. Hopefully, that is what you are using.

I don't have Beautiful Soup installed, so I can't verify this, but you could give it a try.
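
For reference, here is a minimal sketch comparing the two spellings in Beautiful Soup 4. The `sample_html` string is a made-up stand-in for the real page, and the built-in `html.parser` is used only to avoid extra dependencies. If both lookups return the same element, the choice of syntax alone would not explain a 'None' result:

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the real page; only the class attribute matters here.
sample_html = '<table class="table-data"><tr><td>160</td></tr></table>'
soup = BeautifulSoup(sample_html, "html.parser")

# Keyword form and attribute-dict form of the same lookup.
by_keyword = soup.find("table", class_="table-data")
by_dict = soup.find("table", {"class": "table-data"})

# True when both lookups resolve to the same Tag in the parsed tree.
print(by_keyword is by_dict)
```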

You have an error in your find call.

You are searching for a table element that has both the table-data and table classes. But, as you can see, the table only has the class table-data, not table. Replace your code with:

    print(soup.find('table', {'class': 'table-data'}))
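
To illustrate the difference, here is a small sketch against a cut-down copy of the table from the question (the `sample_html` string is an abbreviation made for this example, and `html.parser` is used instead of lxml to keep it dependency-free). As far as I understand Beautiful Soup 4, a value containing a space is compared with the whole class attribute, while a single class name matches any one of the element's classes:

```python
from bs4 import BeautifulSoup

# Abbreviated copy of the table in the question (two columns only).
sample_html = """
<table data-toggle="table" class="table-data">
  <thead><tr><th>Route ID</th><th>Route Name</th></tr></thead>
  <tbody><tr><td>160</td><td>8 LEG: MEM to PRES</td></tr></tbody>
</table>
"""
soup = BeautifulSoup(sample_html, "html.parser")

# The original lookup: the table's class attribute is just "table-data".
missing = soup.find("table", {"class": "table-data table"})

# The corrected lookup with a single class name.
table = soup.find("table", {"class": "table-data"})

# Read the header names and the first data row from the matched table.
headers = [th.get_text(strip=True) for th in table.find_all("th")]
first_row = [td.get_text(strip=True) for td in table.find("tbody").find_all("td")]

print(missing)
print(dict(zip(headers, first_row)))
```

Note that this only helps if the table is present in the HTML your script actually receives.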

UPDATE: It seems that the webpage, as you said in the update, is dynamically created. So please print the full HTML of the webpage (or save it to a file) and work from that code. Don't use the code you see in the Google Chrome inspector or another browser's inspector; it includes code generated after the document loads.

  • If that code contains everything you need, you're done.
  • If it doesn't, consider using the Ghost WebKit web client, instead of urllib/requests, to get the dynamically created HTML of the webpage. Then you can use pure JavaScript to get the element you are searching for, or use Beautiful Soup as well.
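
The first step above (saving what the server actually returns and checking it before parsing) could be sketched as follows. The helper name `dump_and_check`, the output file name, and the marker strings are assumptions made for illustration. The `Login1` marker is taken from the find_all output in the question, which suggests the request may be receiving a login page rather than the report:

```python
import tempfile
from pathlib import Path


def dump_and_check(html_text, out_path,
                   table_marker='class="table-data"',
                   login_marker='id="Login1"'):
    """Save the fetched HTML to a file and report which markers it contains."""
    Path(out_path).write_text(html_text, encoding="utf-8")
    return {
        "has_table": table_marker in html_text,
        "has_login_form": login_marker in html_text,
    }


# Stand-in for `requests.get(url).text`; the real URL is not given in the question.
fetched = '<table cellpadding="0" cellspacing="0" id="Login1"></table>'

# Hypothetical output location; any writable path would do.
out_file = Path(tempfile.gettempdir()) / "page_dump.html"
result = dump_and_check(fetched, out_file)
print(result)
```

If `has_table` is false for the real response, no change to the find call will help, and the saved file is the place to look for what the server sent instead.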
