Python - super simple scraping with request and bs4

I am trying to get the data from the main table on this page: https://www.interactivebrokers.com/en/index.php?f=2222&exch=globex&showcategories=FUTGRP#productbuffer

I tried:

import requests
from bs4 import BeautifulSoup

address="https://www.interactivebrokers.com/en/index.php?f=2222&exch=globex&showcategories=FUTGRP#productbuffer"

r=requests.get(address)
soup = BeautifulSoup(r.text, "html.parser")

I know this is super basic, but somehow I'm stuck here.

I tried soup.find_all('table') but couldn't correctly identify the table I'm looking for (it seems to have no id or other distinguishing attribute).

I tried soup.find_all('tr'); with that I can see the rows I'm looking for, but the result also contains some unwanted rows that I don't know how to filter out.

Can anyone help me with my first steps with bs4?

It seems the problem is that the data you want actually resides outside the table tag, in a tbody tag. The site has three of these.

So working code to grab the tds would look like this:

import requests
from bs4 import BeautifulSoup

url = 'https://www.interactivebrokers.com/en/index.php?f=2222&exch=globex&showcategories=FUTGRP#productbuffer'
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
table = soup.find_all('tbody')[2]  # the third tbody holds the product rows
trs = table.find_all('tr')

Then you just have to iterate over the trs to get the content you're after. The tds come in a list of four elements; you want numbers 0, 2 and 3. Normally you could index those directly, but since number 1 is the cell containing the link (marked by 'linkexternal'), I filtered on that instead.

outfile = r'C:\output_file.txt'
with open(outfile, 'a', encoding='utf-8') as fd:
    for tr in trs:
        try:
            tds = tr.find_all('td')
            # drop the cell that contains the 'linkexternal' anchor (nr 1)
            print_elements = ",".join(td.text for td in tds
                                      if 'linkexternal' not in str(td))
            fd.write(print_elements + '\n')
        except Exception:
            # some exception handling, perhaps logging
            pass
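To see the tbody indexing and the 'linkexternal' filter in isolation, here is a minimal, self-contained sketch. The inline HTML is a hypothetical stand-in for the real page (the actual markup and column values will differ), with the product rows placed in the third tbody as described above:

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the page: three tbody blocks,
# with the product rows in the third one (index 2).
html = """
<table><tbody><tr><td>header stuff</td></tr></tbody></table>
<table><tbody><tr><td>nav stuff</td></tr></tbody></table>
<table><tbody>
  <tr><td>GE</td><td><a class="linkexternal" href="#">Eurodollar</a></td>
      <td>Eurodollar</td><td>USD</td></tr>
  <tr><td>ES</td><td><a class="linkexternal" href="#">E-mini S&amp;P</a></td>
      <td>E-mini S&amp;P 500</td><td>USD</td></tr>
</tbody></table>
"""

soup = BeautifulSoup(html, "html.parser")
trs = soup.find_all('tbody')[2].find_all('tr')  # third tbody only

rows = []
for tr in trs:
    tds = tr.find_all('td')
    # keep only the cells whose markup does not contain 'linkexternal',
    # i.e. columns 0, 2 and 3
    rows.append([td.text for td in tds if 'linkexternal' not in str(td)])

print(rows)
```

Each entry of rows then holds the symbol, description and currency cells of one product row; swapping the inline html for response.text from requests.get applies the same logic to the live page.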
