How do I scrape the info from a table on a webpage when the info is created dynamically by JavaScript?
How do I avoid data from different tabs being concatenated in one cell when I scrape a table?
I am scraping this page, https://www.capfriendly.com/teams/bruins, specifically looking for the tables under the Cap Hit tab (Forwards, Defense, Goaltenders).
I am using Python with BeautifulSoup4, and CSV as the output format.
import csv
import requests, bs4

r = requests.get('https://www.capfriendly.com/teams/bruins')
soup = bs4.BeautifulSoup(r.text, 'lxml')
table = soup.find(id="team")

with open("csvfile.csv", "w", newline='') as team_data:
    writer = csv.writer(team_data)  # create the writer once, not per row
    for tr in table('tr', class_=['odd', 'even']):  # get all tr whose class is odd or even
        row = [td.text for td in tr('td')]  # extract td's text
        writer.writerow(row)
This is the output I get:
['Krejci, David "A"', 'NMC', 'C', 'NHL', '30', '$7,250,000$7,250,000NMC', '$7,250,000$7,500,000NMC', '$7,250,000$7,500,000NMC', '$7,250,000$7,000,000Modified NTC', '$7,250,000$7,000,000Modified NTC', 'UFA', '']
['Bergeron, Patrice "A"', 'NMC', 'C', 'NHL', '31', '$6,875,000$8,750,000NMC', '$6,875,000$8,750,000NMC', '$6,875,000$6,875,000$6,000,000NMC', '$6,875,000$4,375,000$3,500,000NMC', '$6,875,000$4,375,000$1,000,000Modified NTC, NMC', '$6,875,000$4,375,000$1,000,000Modified NTC, NMC', 'UFA']
['Backes, David', 'NMC', 'C, RW', 'NHL', '32', '$6,000,000$8,000,000$3,000,000NMC', '$6,000,000$8,000,000$3,000,000NMC', '$6,000,000$6,000,000$3,000,000NMC', '$6,000,000$4,000,000$3,000,000Modified NTC', '$6,000,000$4,000,000$1,000,000Modified NTC', 'UFA', '']
['Marchand, Brad', 'M-NTC', 'LW', 'NHL', '28', '$4,500,000$5,000,000Modified NTC', '$6,125,000$8,000,000$4,000,000NMC', '$6,125,000$8,000,000$3,000,000NMC', '$6,125,000$7,500,000$4,000,000NMC', '$6,125,000$5,000,000$1,000,000NMC', '$6,125,000$6,500,000$4,000,000NMC', '$6,125,000$5,000,000$3,000,000Modified NTC']
As you can see, the data from the different tabs is concatenated together:
'$7,250,000$7,000,000Modified NTC'
Someone suggested that I use JavaScript to scrape the table. Would that solve my problem?
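According to the page source, some of the text in a given row is conditionally visible, depending on which tab you are on (as the title suggests). When a piece of text should be hidden on a particular tab, the class .hide is added to the corresponding child element of the td.
When you parse a td element to retrieve its text, you can filter out the child elements that are marked as hidden. That way you retrieve only the visible text, just as you would see it when viewing the page in a web browser.
In the snippet below, I added a parse_td function that filters the child span elements by the hide class and returns the text of the first visible one.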
import requests, bs4, csv

r = requests.get('https://www.capfriendly.com/teams/bruins')
soup = bs4.BeautifulSoup(r.text, 'lxml')
table = soup.find(id="team")

def parse_td(td):
    # Keep only direct child spans that do not carry the hide class
    filtered_data = [tag.text for tag in td.find_all('span', recursive=False)
                     if 'hide' not in tag.get('class', [])]
    # Fall back to the cell's full text when it has no visible span
    return filtered_data[0] if filtered_data else td.text

with open("csvfile.csv", "w", newline='') as team_data:
    writer = csv.writer(team_data)  # create the writer once, not per row
    for tr in table('tr', class_=['odd', 'even']):
        row = [parse_td(td) for td in tr('td')]
        writer.writerow(row)
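To see the difference without requesting the live page, here is a minimal sketch that runs the same filtering idea against a small inline HTML fragment. The fragment, the span class names other than hide, and the helper name visible_text are assumptions made up for illustration; the real page's markup may differ. It uses the built-in html.parser so that lxml is not required.

```python
import bs4

# Hypothetical markup modelled on the description above: one cell holds
# several spans, and the ones belonging to other tabs carry the "hide" class.
SAMPLE_HTML = """
<table id="team">
  <tr class="odd">
    <td>Krejci, David</td>
    <td><span class="cap">$7,250,000</span><span class="salary hide">$7,000,000</span><span class="clause hide">Modified NTC</span></td>
  </tr>
</table>
"""

def visible_text(td):
    """Return the text of the first direct child span not marked 'hide'."""
    visible = [span.get_text() for span in td.find_all('span', recursive=False)
               if 'hide' not in span.get('class', [])]
    # Cells without any visible span fall back to their full text
    return visible[0] if visible else td.get_text()

soup = bs4.BeautifulSoup(SAMPLE_HTML, 'html.parser')
cells = soup.find(id='team').find('tr').find_all('td')

naive = [td.get_text() for td in cells]        # concatenates every span
filtered = [visible_text(td) for td in cells]  # keeps only the visible span

print(naive)     # ['Krejci, David', '$7,250,000$7,000,000Modified NTC']
print(filtered)  # ['Krejci, David', '$7,250,000']
```

Using span.get('class', []) instead of indexing attrs['class'] avoids a KeyError if a span happens to have no class attribute at all.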