from bs4 import BeautifulSoup
import numpy as np
import requests
from selenium import webdriver
from nltk.tokenize import sent_tokenize, word_tokenize

html = webdriver.Firefox(executable_path=r'D:\geckodriver.exe')
html.get("https://www.tsa.gov/coronavirus/passenger-throughput")

def TSA_travel_numbers(html):
    print('NASEEF')
    soup = BeautifulSoup(html, 'lxml')
    print('naseef2')
    for i, rows in enumerate(soup.find_all('tr', class_='view-content')):
        print('naseef3')
        for texts in soup.find('td', header='view-field-2021-throughput-table-column'):
            print('naseef4')
            number = texts.text
            if number is None:
                continue
            print('Naseef')

TSA_travel_numbers(html.page_source)
As you can see, NASEEF and naseef2 get printed to the console, but not naseef3 and naseef4. There is no error; the code runs fine, so I don't know what is happening here. In other words, it never goes inside the for loops in that function. Can anyone please point out what is really happening? Sorry for taking your time, and thanks in advance!
Your page does not contain <tr> tags with a class of view-content, so find_all is correctly returning no results. If you remove the class restriction, you get many results:
>>> soup.find_all('tr', limit=2)
[<tr>
<th class="views-align-center views-field views-field-field-today-date views-align-center" id="view-field-today-date-table-column" scope="col">Date</th>
<th class="views-align-center views-field views-field-field-2021-throughput views-align-center" id="view-field-2021-throughput-table-column" scope="col">2021 Traveler Throughput </th>
<th class="views-align-center views-field views-field-field-2020-throughput views-align-center" id="view-field-2020-throughput-table-column" scope="col">2020 Traveler Throughput </th>
<th class="views-align-center views-field views-field-field-2019-throughput views-align-center" id="view-field-2019-throughput-table-column" scope="col">2019 Traveler Throughput </th>
</tr>, <tr>
<td class="views-field views-field-field-today-date views-align-center" headers="view-field-today-date-table-column">5/9/2021 </td>
<td class="views-field views-field-field-2021-throughput views-align-center" headers="view-field-2021-throughput-table-column">1,707,805 </td>
<td class="views-field views-field-field-2020-throughput views-align-center" headers="view-field-2020-throughput-table-column">200,815 </td>
<td class="views-field views-field-field-2019-throughput views-align-center" headers="view-field-2019-throughput-table-column">2,419,114 </td>
</tr>]
Once you change that, the inner loop is looking for <td> tags with a header attribute of view-field-2021-throughput-table-column. There are no such tags in the page either, but there are tags whose headers attribute (note the plural) holds that value.
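For example, filtering on the headers attribute does match. A quick, hedged check against the structure shown above (the exact value depends on when the page is fetched):

# Cells whose headers attribute includes the 2021 column id
cells = soup.find_all('td', headers='view-field-2021-throughput-table-column')
print(cells[0].get_text(strip=True))  # e.g. '1,707,805' for the row shown above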
This line is also wrong:

number = texts.text

...because texts is a NavigableString and does not have the text attribute.
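One way to sidestep that is to keep the Tag itself and call get_text() on it, instead of iterating over its children. A minimal sketch, not part of the original code:

td = soup.find('td', headers='view-field-2021-throughput-table-column')
if td is not None:  # find returns None when nothing matches
    number = td.get_text(strip=True)  # a plain str, e.g. '1,707,805'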
Additionally, the word naseef doesn't make it clear what is being printed at each step, so it's better to replace it with more descriptive strings. Finally, you don't really need the Selenium connection or the tokenizer, so for the purposes of this example we can leave those out. The resulting code looks like this:
from bs4 import BeautifulSoup
import numpy as np
import requests

html = requests.get("https://www.tsa.gov/coronavirus/passenger-throughput").text

def TSA_travel_numbers(html):
    print('Entering parsing function')
    soup = BeautifulSoup(html, 'lxml')
    print('Parsed HTML to soup')
    for i, rows in enumerate(soup.find_all('tr')):
        print('Found <tr> tag number', i)
        for texts in soup.find('td', headers='view-field-2021-throughput-table-column'):
            print('found <td> tag with headers')
            number = texts
            if number is None:
                continue
            print('Value is', number)

TSA_travel_numbers(html)
Its output looks like:
Entering parsing function
Parsed HTML to soup
Found <tr> tag number 0
found <td> tag with headers
Value is 1,707,805
Found <tr> tag number 1
found <td> tag with headers
Value is 1,707,805
Found <tr> tag number 2
found <td> tag with headers
...
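Note that the same value repeats because soup.find() always searches from the top of the document, so every row prints the first matching cell. If you want one value per row instead, a hedged variation is to search within each row and convert the text to an int (the headers value is taken from the page structure above):

for row in soup.find_all('tr'):
    td = row.find('td', headers='view-field-2021-throughput-table-column')
    if td is None:  # skip rows without a matching cell, e.g. the <th> header row
        continue
    number = int(td.get_text(strip=True).replace(',', ''))  # '1,707,805' -> 1707805
    print('Value is', number)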