Conditionally adding multiple items to a list of lists via Python list comprehension
I am scraping a page, pulling data from a table, with the desired end product being a list of lists.
import urllib2
from bs4 import BeautifulSoup
html = BeautifulSoup(urllib2.urlopen('http://domain.com').read(), 'lxml')
tagged_data = [row('td') for row in html('table',{'id' : 'targeted_table'})[0]('tr') ]
# One of the <td>'s has an a tag in it that I need to grab the link from, hence the conditional
clean_data = [[(item.string if item.string is not None else ([item('a')[0].string, item('a')[0]['href']])) for item in info ] for info in tagged_data ]
The above code generates the following structure:
[[[u'data 01',
'http://domain1.com'],
u'data 02',
u'data 03',
u'data 04'],
[[u'data 11',
'http://domain2.com'],
u'data 12',
u'data 13',
u'data 14'],
[[u'data 01',
'http://domain1.com'],
u'data 22',
u'data 23',
u'data 24']]
But what I'd really like is:
[[u'data 01',
u'http://domain1.com',
u'data 02',
u'data 03',
u'data 04'],
[u'data 11',
u'http://domain2.com',
u'data 12',
u'data 13',
u'data 14'],
[u'data 01',
u'http://domain1.com',
u'data 22',
u'data 23',
u'data 24']]
I also tried:
clean_data = [[(item.string if item.string is not None else (item('a')[0].string, item('a')[0]['href'])) for item in info ] for info in tagged_data ]
But that puts a tuple (I think) in the first item of each sublist:
[(u'data01',
'http://domain1.com'),
u'data02',
u'data03',
u'data04']
Any suggestions?
Example Data
<table id='targeted_table'>
<tr>
<td><a href="http://domain.com">data 01</a></td>
<td>data 02</td>
<td>data 03</td>
<td>data 04</td>
</tr>
<tr>
<td><a href="http://domain.com">data 11</a></td>
<td>data 12</td>
<td>data 13</td>
<td>data 14</td>
</tr>
<tr>
<td><a href="http://domain.com">data 01</a></td>
<td>data 22</td>
<td>data 23</td>
<td>data 24</td>
</tr>
<tr>
<td><a href="http://domain.com">data 01</a></td>
<td>data 32</td>
<td>data 33</td>
<td>data 34</td>
</tr>
</table>
The line
html = BeautifulSoup(urllib2.urlopen('http://domain.com').read(), 'lxml')
implies you have lxml installed, so you could use an XPath expression with the | (union) operator to pull out text or attribute values:
import urllib2
import lxml.html as LH
html = LH.parse(urllib2.urlopen('http://domain.com'))
clean_data = [[elt for elt in tr.xpath('td/a/text() | td/a/@href | td/text()')]
for tr in html.xpath('//table[@id="targeted_table"]/tr')]
print(clean_data)
yields
[['http://domain.com', 'data 01', 'data 02', 'data 03', 'data 04'],
['http://domain.com', 'data 11', 'data 12', 'data 13', 'data 14'],
['http://domain.com', 'data 01', 'data 22', 'data 23', 'data 24'],
['http://domain.com', 'data 01', 'data 32', 'data 33', 'data 34']]
You could also do it with a single call to the xpath method:
pieces = iter(html.xpath('''//table[@id="targeted_table"]/tr/td/a/text()
| //table[@id="targeted_table"]/tr/td/a/@href
| //table[@id="targeted_table"]/tr/td/text()'''))
clean_data = zip(*[pieces]*5)
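The zip(*[pieces]*5) call relies on a grouping idiom that is easy to misread, so here is a minimal sketch of it on plain data. The list flat is a made-up stand-in for what the XPath call would return, and the group size of 5 assumes every row contributes exactly five values (one href plus four text nodes). Because the same iterator object appears five times, zip pulls five consecutive items for each output tuple.

```python
# Hypothetical stand-in for the flat list returned by html.xpath(...).
flat = [
    'http://domain.com', 'data 01', 'data 02', 'data 03', 'data 04',
    'http://domain.com', 'data 11', 'data 12', 'data 13', 'data 14',
]

# One iterator, referenced five times: each tuple consumes the next five items.
pieces = iter(flat)

# list() is needed on Python 3, where zip returns an iterator.
rows = list(zip(*[pieces] * 5))

for row in rows:
    print(row)
```

Note that zip stops at the shortest input, so a trailing group with fewer than five items would be dropped silently, and a row with a missing or extra cell would shift every row after it. The per-row XPath version above does not have that problem.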
You're trying to have the list comprehension emit two elements some of the time, and a single element at other times.
You can do this by making both branches of your "one if [criteria] else two" expression return a list, and then iterating over that list with an inner loop, so each item contributes either one or two elements to the row:
clean_data = [[res for item in info for res in (
[item.string] if item.string is not None else
([item('a')[0].string, item('a')[0]['href']])
)]
for info in tagged_data]
Granted, I don't think this method is very clean. If you're parsing HTML / XML, I'd recommend that you use the right tools for the job and avoid messy tree traversal.
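To make the flattening pattern easier to follow without BeautifulSoup in the way, here is a minimal sketch on plain data. The rows list and the expand helper are hypothetical names introduced for illustration: a plain string stands for an ordinary cell, and a (text, href) tuple stands for a cell that contained a link.

```python
# Hypothetical stand-in for the parsed cells: a string is an ordinary cell,
# a (text, href) tuple is a cell that contained an <a> tag.
rows = [
    [('data 01', 'http://domain1.com'), 'data 02', 'data 03', 'data 04'],
    [('data 11', 'http://domain2.com'), 'data 12', 'data 13', 'data 14'],
]


def expand(cell):
    """Always return a list, so the caller can flatten every cell the same way."""
    if isinstance(cell, tuple):
        return list(cell)  # two values: link text and href
    return [cell]          # one value: the cell text


# The inner "for value in expand(cell)" is what lets one cell emit 1 or 2 items.
flat_rows = [[value for cell in row for value in expand(cell)] for row in rows]

for row in flat_rows:
    print(row)
```

Moving the conditional into a small named function keeps the comprehension itself short, which may address some of the readability concern.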