
Scraping OSHA website using BeautifulSoup

I'm looking for help with two main things: (1) scraping a web page and (2) turning the scraped data into a pandas dataframe (mostly so I can output it as a .csv, but just creating a pandas df is enough for now). Here is what I have done so far for both:

(1) Scraping the web site:

  • I am trying to scrape this page: https://www.osha.gov/pls/imis/establishment.inspection_detail?id=1285328.015&id=1284178.015&id=1283809.015&id=1283549.015&id=1282631.015 . My end goal is to create a dataframe that would ideally contain only the information I am looking for (i.e., I'd be able to select only the parts of the site that I am interested in for my df); it's OK if I have to pull in all the data for now.
  • As you can see from the URL, as well as from the ID hyperlinks underneath "Quick Link Reference" at the top of the page, there are five distinct records on this page. I would like each of these IDs/records to be treated as an individual row in my pandas df. (Since the URL simply repeats the id parameter once per record, it can also be built programmatically; see the sketch after this list.)
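
A minimal sketch of building that multi-record URL and fetching the page, assuming the requests library (the five ids below are the ones from the URL above):

import requests

ids = ["1285328.015", "1284178.015", "1283809.015", "1283549.015", "1282631.015"]
base_url = "https://www.osha.gov/pls/imis/establishment.inspection_detail"
url = base_url + "?" + "&".join("id=" + i for i in ids)  # repeat id once per record
response = requests.get(url)
response.raise_for_status()  # fail loudly on a bad HTTP status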

EDIT: Thanks to a helpful comment, I'm including an example of what I would ultimately want in the table below. The first row represents column headers/names and the second row represents the first inspection.

inspection_id  open_date   inspection_type  close_conference  close_case  violations_serious_initial
1285328.015    12/28/2017  referral         12/28/2017        06/21/2018  2

Mostly relying on BeautifulSoup4, I've tried a few different options to get at the page elements I'm interested in:

# Imports and setup shared by the snippets below:
from requests import get
from bs4 import BeautifulSoup
import pandas as pd
import os

url = 'https://www.osha.gov/pls/imis/establishment.inspection_detail?id=1285328.015&id=1284178.015&id=1283809.015&id=1283549.015&id=1282631.015'
response = get(url)
html_soup = BeautifulSoup(response.text, 'html.parser')

# This is meant to give you the first instance of Case Status, which in the case of this page is "CLOSED".
# (Searching from html_soup rather than html_soup.head, since the content lives in <body>.)

case_status_template = html_soup.find('div', id="maincontain",
    class_="container").div.find('table', class_="table-bordered").find('strong').text

# I wasn't able to get the remaining Case Statuses with find_next_sibling or find_all, so I used a different method:

for table in html_soup.find_all('table', class_= "table-bordered"):
    print(table.text)

# This gave me the output I needed (i.e. the Case Status for all five records on the page), 
# but didn't give me the structure I wanted and didn't really allow me to connect to the other data on the page.

# I was also able to get to the same place with another page element, Inspection Details.
# This is the information reflected on the page after "Inspection: ", directly below Case Status.

insp_details_template = html_soup.find('div', id="maincontain",
    class_="container").div.find('table', class_="table-unbordered")

for table in html_soup.find_all('table', class_="table-unbordered"):
    print(table.text)

# Unfortunately, although I could get these two pieces of information to print,
# I realized I would have a hard time getting the rest of the information for each record.
# I also knew that it would be hard to connect/roll all of these up at the record level.
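
# One idea for rolling things up at the record level (a sketch only -- this assumes
# each inspection contributes exactly one "table-bordered" and one "table-unbordered"
# table, in matching document order, which I haven't verified):

case_tables = html_soup.find_all('table', class_="table-bordered")
detail_tables = html_soup.find_all('table', class_="table-unbordered")

for case, details in zip(case_tables, detail_tables):
    print(case.find('strong').text)  # the Case Status, e.g. "CLOSED"
    print(details.text)              # the matching Inspection Details
    print('---')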

So, I tried a slightly different approach. By focusing instead on a version of that page with a single inspection record, I thought maybe I could just hack it by using this bit of code:

url = 'https://www.osha.gov/pls/imis/establishment.inspection_detail?id=1285328.015'
response = get(url)
html_soup = BeautifulSoup(response.text, 'html.parser')
first_table = html_soup.find('table', class_="table-bordered")
first_table_rows = first_table.find_all('tr')

for tr in first_table_rows:
    td = tr.find_all('td')
    row = [i.text for i in td]
    print(row)
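
# To collect those rows into a dataframe instead of just printing them
# (a sketch; no column names are set, since they aren't known at this point):

rows = []
for tr in first_table_rows:
    td = tr.find_all('td')
    rows.append([i.text for i in td])
df_rows = pd.DataFrame(rows)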

# Then, actually using pandas to get the data into a df and out as a .csv.

dfs_osha = pd.read_html('https://www.osha.gov/pls/imis/establishment.inspection_detail?id=1285328.015',header=1)
for df in dfs_osha:
    print(df)
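
# A quick diagnostic to see which of the many tables is which
# (prints each table's index position and its (rows, columns) shape):

for i, df in enumerate(dfs_osha):
    print(i, df.shape)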

path = r'~\foo'
dfs_osha = pd.read_html('https://www.osha.gov/pls/imis/establishment.inspection_detail?id=1285328.015',header=1)
for i, df in enumerate(dfs_osha):
    # one .csv per table; os.path.expanduser resolves the ~ in the path
    df.to_csv(os.path.join(os.path.expanduser(path), 'osha_output_table{}_012320.csv'.format(i)))

# This worked better, but didn't actually give me all of the data on the page,
# and wouldn't be replicable for the other four inspection records I'm interested in.

So, finally, I found a pretty handy example here: https://levelup.gitconnected.com/quick-web-scraping-with-python-beautiful-soup-4dde18468f1f . I was trying to work through it, and had gotten as far as coming up with this code (the all_content_raw_lxml setup at the top is reconstructed from that example):

# Assumed setup, following the linked example: re-parse the page with the lxml
# parser and grab the top-level containers. (Reconstructed -- the exact selector
# in the original example may differ.)
all_content_raw_lxml = BeautifulSoup(response.text, 'lxml').find_all('div', class_="container")

for elem in all_content_raw_lxml:
    wrappers = elem.find_all('div', class_="row-fluid")
    for x in wrappers:
        case_status = x.find('div', class_="text-center")
        print(case_status)
        insp_details = x.find('div', class_="table-responsive")
        # iterate over the <tr> elements of the details table,
        # not the raw children of the div
        for tr in insp_details.find_all('tr'):
            td = tr.find_all('td')
            td_row = [i.text for i in td]
            print(td_row)
        violation_items = insp_details.find_next_sibling('div', class_="table-responsive")
        for tr in violation_items.find_all('tr'):
            tr_row = [i.text for i in tr.find_all('td')]
            print(tr_row)
        print('---------------')

Unfortunately, I ran into too many bugs with this to be able to use it, so I was forced to abandon the project until I got some further guidance. Hopefully the code I've shared so far at least shows the effort I've put in, even if it doesn't do much to get to the final output! Thanks.

For this type of page you don't really need BeautifulSoup; pandas is enough.

import pandas as pd

url = 'your url above'
# use pandas to read the tables on the page; there are lots of them...
tables = pd.read_html(url)

#Select from this list of tables only those tables you need:
incident = [] #initialize a list of inspections
for i, table in enumerate(tables): #we need to find the index position of this table in the list; more below       
    if table.shape[1]==5: #all relevant tables have this shape
        case = [] #initialize a list of inspection items you are interested in       
        case.append(table.iat[1,0]) #this is the location in the table of this particular item
        case.append(table.iat[1,2].split(' ')[2]) #the string in the cell needs to be cleaned up a bit...
        case.append(table.iat[9,1])
        case.append(table.iat[12,3])
        case.append(table.iat[13,3])
        case.append(tables[i+2].iat[0,1]) #this item lives in a table which is 2 positions down from the current one; this is where the index position of the current table comes in handy
        incident.append(case)        


columns = ["inspection_id",   "open_date",   "inspection_type", "close_conference",    "close_case",  "violations_serious_initial"]
df2 = pd.DataFrame(incident,columns=columns)
df2 

Output (pardon the formatting):

    inspection_id    open_date   inspection_type  close_conference  close_case  violations_serious_initial
0   Nr: 1285328.015  12/28/2017  Referral         12/28/2017        06/21/2018  2
1   Nr: 1283809.015  12/18/2017  Complaint        12/18/2017        05/24/2018  5
2   Nr: 1284178.015  12/18/2017  Accident         05/17/2018        09/17/2018  1
3   Nr: 1283549.015  12/13/2017  Referral         12/13/2017        05/22/2018  3
4   Nr: 1282631.015  12/12/2017  Fat/Cat          12/12/2017        11/16/2018  1
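
Since the original goal was a .csv, the dataframe can then be written straight out; a one-line sketch (the filename here is just an example):

df2.to_csv('osha_inspections.csv', index=False)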
