
Extracting data from a web page using BS4 in Python

I am trying to extract data from this site: http://www.afl.com.au/fixture

so that I end up with a dictionary that has the date as key and the "Preview" links as values in a list, like

dict = {"Saturday, June 07": ["preview url-1", "preview url-2", "preview url-3", "preview url-4"]}

Please help me get there. I have used the code below:

def extractData():
    lDateInfoMatchCase = False
#     lDateInfoMatchCase = []
    global gDict
    for row in table_for_players.findAll("tr"):
        for lDateRowIndex in row.findAll("th", {"colspan" : "4"}):
            ldateList.append(lDateRowIndex.text)

    print ldateList
    for index in ldateList:
        #print index
        lPreviewLinkList = []
        for row in table_for_players.findAll("tr"):
            for lDateRowIndex in row.findAll("th", {"colspan" : "4"}):

                if lDateRowIndex.text == index:
                    lDateInfoMatchCase = True
                else:
                    lDateInfoMatchCase = False

            if lDateInfoMatchCase == True:
                for lInfoRowIndex in row.findAll("td", {"class": "info"}):
                    for link in lInfoRowIndex.findAll("a", {"class" : "preview"}):
                        lPreviewLinkList.append("http://www.afl.com.au/" + link.get('href'))
        print lPreviewLinkList
        gDict[index] = lPreviewLinkList

My main aim is to get the names of all players playing in each match, for both the home and the away team, organized by date in a data structure.

I prefer using CSS selectors. Select the first table, then all rows in the tbody for ease of processing; the rows are 'grouped' by tr th rows. From there you can select all next siblings that don't contain th headers and scan these for preview links:

previews = {}

table = soup.select('table.fixture')[0]
for group_header in table.select('tbody tr th'):
    date = group_header.string
    for next_sibling in group_header.parent.find_next_siblings('tr'):
        if next_sibling.th:
            # found a next group, end scan
            break
        for preview in next_sibling.select('a.preview'):
            previews.setdefault(date, []).append(
                "http://www.afl.com.au" + preview.get('href'))

This builds a dictionary of lists; for the current version of the page this produces:

{u'Monday, June 09': ['http://www.afl.com.au/match-centre/2014/12/melb-v-coll'],
 u'Sunday, June 08': ['http://www.afl.com.au/match-centre/2014/12/gcfc-v-syd',
                      'http://www.afl.com.au/match-centre/2014/12/fre-v-adel',
                      'http://www.afl.com.au/match-centre/2014/12/nmfc-v-rich']}
