Beautiful Soup, HTML table parsing

I am currently having a bit of an issue trying to parse a table into an array.

I have a simple table ( HERE ) which I need to parse with BS4, putting the cell contents into an array. What makes this difficult is that the cells don't contain text; they contain images whose titles are "Confirm" or "Site" (these are just user-rights indicators). [I am skipping row one, which contains the checkboxes; those I can extract without problems.]

If you look at the fiddle above, all I need to do is parse it so that the resulting array becomes:

Array1[0] = User1,Confirm,Confirm,Site,Confirm
Array1[1] = User2,Confirm,Confirm,Confirm,Confirm
Array1[2] = User3,Confirm,Confirm,Confirm,Confirm
Array1[3] = User4,Confirm,Site,Site,Confirm

I can then do as I please with it. Another complication is that the number of rows will sometimes vary, so the script should adapt to this and build the array from however many rows the table has.

At the moment StackOverflow is my only hope. I have spent the last 10 hours trying this myself with little to no success, and frankly I have lost hope. The closest I got was extracting the enclosed tags, but I could not parse any further for some weird reason; perhaps it's a nesting limitation in bs4? Could anyone have a look, please, and see if they can find a way of doing this? Or at least explain how to get there?

Variable explanations: rightml is the soup for the table.

allusers = []
rows = rightml.findAll('tr')
for tr in rows:
    cols = tr.findAll('td')
    for td in cols:
        if (td.find(title="Group")) or (td.find(title="User")):
            text = ''.join(td.text.strip())
            allusers.append(text)
print allusers

gifrights = []

rows7 = rightml.findAll('td')
#print rows7
for tr7 in rows:
    cols7 = tr7.findAll('img')
    for td7 in cols7:
        if (td7.find(title="Confirm")) or (td7.find(title="Site")):
            text = ''.join(td7.text.strip())
            text2 = text.split(' ')
            print text2
            gifrights.append(text2)

I could be WAY off with this code, but I gave it the ol' college try.

Would something like this work?

rows = soup.find('tbody').findAll('tr')

for row in rows:
    cells = row.findAll('td')

    output = []

    for i, cell in enumerate(cells):
        if i == 0:
            output.append(cell.text.strip())
        elif cell.find('img'):
            output.append(cell.find('img')['title'])
        elif cell.find('input'):
            output.append(cell.find('input')['value'])
    print output

This outputs the following:

[u'Logged-in users', u'True', u'True', u'True', u'True']
[u'User 1', u'Confirm', u'Confirm', u'Site', u'Confirm']
[u'User 2', u'Confirm', u'Confirm', u'Confirm', u'Confirm']
[u'User 3', u'Confirm', u'Confirm', u'Confirm', u'Confirm']
[u'User 4', u'Confirm', u'Site', u'Site', u'Confirm']
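The same approach can be tried end to end without the original page. What follows is a minimal sketch written for Python 3, and it rests on assumptions: the HTML string is hypothetical markup standing in for the linked table (which is not reproduced here), parse_rights is a made-up helper name, and the built-in "html.parser" backend is used. Because the loop simply walks whatever rows it finds, a varying number of rows needs no special handling.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the linked table; the real page may differ.
HTML = """
<table>
  <tbody>
    <tr>
      <td>Logged-in users</td>
      <td><input type="checkbox" value="True"></td>
      <td><input type="checkbox" value="True"></td>
    </tr>
    <tr>
      <td><img title="User"> User 1</td>
      <td><img title="Confirm"></td>
      <td><img title="Site"></td>
    </tr>
    <tr>
      <td><img title="User"> User 2</td>
      <td><img title="Confirm"></td>
      <td><img title="Confirm"></td>
    </tr>
  </tbody>
</table>
"""


def parse_rights(html):
    """Return one list per user row: [name, right, right, ...]."""
    soup = BeautifulSoup(html, "html.parser")
    result = []
    # Skip the first row, which holds the checkboxes.
    for row in soup.find("tbody").find_all("tr")[1:]:
        cells = row.find_all("td")
        # The first cell carries the user name as text.
        name = cells[0].get_text(strip=True)
        # The remaining cells carry the right in the image's title attribute.
        rights = [cell.find("img")["title"] for cell in cells[1:] if cell.find("img")]
        result.append([name] + rights)
    return result


rights_table = parse_rights(HTML)
for entry in rights_table:
    print(entry)
```

Note that "html.parser" does not insert a missing tbody element, so if the real markup has none, searching for the tr elements directly on the soup would be the safer choice.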

I think it's faster to use a list comprehension over the rows, like this:

rows = soup.find('tbody').findAll('tr')

for i in rows[1:]: # the first row is thrown out
    # print the titles; a bare list comprehension here would be discarded
    print([j['title'] for j in i.findAll('img')])

Which gives you:

['User', 'Confirm', 'Confirm', 'Site', 'Confirm']
['User', 'Confirm', 'Confirm', 'Confirm', 'Confirm']
['User', 'Confirm', 'Confirm', 'Confirm', 'Confirm']
['User', 'Confirm', 'Site', 'Site', 'Confirm']

You can cut out even more steps using a nested list comprehension:

# superpythonic
[[j['title'] for j in i.findAll('img')] for i in rows[1:]]

# all together now, but not so pythonic
[[j['title'] for j in i.findAll('img')] for i in soup.find('tbody').findAll('tr')[1:]]

You don't really need a User#, since the user number is just the index + 1.

[[j['title'] for j in i.findAll('img') if j['title'] != 'User'] for i in rows[1:]]

But, if you -must- have one...

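# users is the filtered nested list from the previous snippet
users = [[j['title'] for j in i.findAll('img') if j['title'] != 'User'] for i in rows[1:]]
for i in xrange(len(users)):
    users[i].insert(0, "User " + str(i+1))  # label goes first, as in the desired output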
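The "index + 1" idea can also be expressed with enumerate instead of an index loop. This is only a sketch for Python 3, assuming the rights have already been extracted into a nested list; label_rows and the sample data are hypothetical names and values chosen for illustration.

```python
def label_rows(rights):
    """Prefix each row with 'User N', where N is the row's index plus one."""
    return [["User %d" % n] + row for n, row in enumerate(rights, start=1)]


# Made-up data shaped like the result of the nested list comprehension.
sample = [
    ["Confirm", "Confirm", "Site", "Confirm"],
    ["Confirm", "Site", "Site", "Confirm"],
]

labeled = label_rows(sample)
for row in labeled:
    print(row)
```

Building new lists this way leaves the extracted rows untouched, which may be convenient if they are reused elsewhere.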

But, if you were to insist on doing this, I would use a namedtuple as the data structure instead of a list.

from collections import namedtuple
# make these actual non-obfuscated names, not column numbers
User = namedtuple('User', 'num col_1 col_2 col_3 col_4')

And then, once you have a namedtuple instance for, say, User 1 as user, you can...

>>> user.num
1
>>> user.col_1
'Confirm'
>>> user.col_2
'Confirm'
>>> user.col_3
'Site'
>>> user.col_4
'Confirm'
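One way such instances might be built from the extracted rows is sketched below for Python 3. The sample rows are made up, the col_N field names are placeholders as noted above, and the variable names are assumptions rather than anything from the original page.

```python
from collections import namedtuple

# Placeholder field names; real column names would be more descriptive.
User = namedtuple("User", "num col_1 col_2 col_3 col_4")

# Made-up rows shaped like the output of the list comprehensions above.
rows_of_titles = [
    ["Confirm", "Confirm", "Site", "Confirm"],
    ["Confirm", "Confirm", "Confirm", "Confirm"],
]

# The user number is the row index plus one; the titles fill the remaining fields.
users = [User(n, *titles) for n, titles in enumerate(rows_of_titles, start=1)]

user = users[0]
print(user.num, user.col_3)
```

Unpacking with *titles raises a TypeError if a row does not have exactly four titles, which doubles as a cheap check on the table's shape.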
