Beautiful Soup: scraping a table

I have this small piece of code that scrapes table data from a website and then displays it in CSV format. The problem is that the for loop prints the records multiple times. I am not sure whether it is due to a tag. By the way, I am new to Python. Thanks for your help!

#import needed libraries
import urllib
from bs4 import BeautifulSoup
import requests
import pandas as pd
import csv
import sys
import re


# read the data from a URL
url = requests.get("https://www.top500.org/list/2018/06/")

# parse the downloaded page using Beautiful Soup
soup = BeautifulSoup(url.content, 'html.parser')

newtxt= ""
for record in soup.find_all('tr'):
    tbltxt = ""
    for data in record.find_all('td'):
        tbltxt = tbltxt + "," + data.text
        newtxt= newtxt+ "\n" + tbltxt[1:]
        print(newtxt)
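
# Search only the ranking table and print each row once
from bs4 import BeautifulSoup
import requests

url = requests.get("https://www.top500.org/list/2018/06/")
soup = BeautifulSoup(url.content, 'html.parser')
# Select the table by its class attribute
table = soup.find_all('table', attrs={'class':'table table-condensed table-striped'})
for i in table:
    tr = i.find_all('tr')
    for x in tr:
        # x.text is the combined text of every cell in the row
        print(x.text)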

Alternatively, the best way to parse the table is with pandas:

import pandas as pd
# read_html returns a list of DataFrames, one per matching <table>
table = pd.read_html('https://www.top500.org/list/2018/06/', attrs={
    'class': 'table table-condensed table-striped'}, header=1)
print(table)

It prints much of the data multiple times because the newtxt variable, which you print after getting the text of each <td></td>, keeps accumulating all the values seen so far. The easiest way to get this to work is probably to move the line print(newtxt) outside of both for loops, that is, leave it completely unindented. If the line that appends to newtxt is also inside the inner loop, move it out one level so that each row is appended only once, after all of its cells have been collected. You should then see a list of all the text, with each row on its own line and the cells within a row separated by commas.
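The fix described above can be sketched as follows. This is a minimal illustration, not the asker's exact script: `SAMPLE_HTML` is a made-up stand-in for the downloaded page so the example runs offline, and `table_to_csv` is a hypothetical helper name. It also swaps manual string concatenation for the standard `csv` module, on the assumption that some cells may contain commas of their own.

```python
import csv
import io

from bs4 import BeautifulSoup

# Made-up sample markup standing in for the real page (assumption).
SAMPLE_HTML = """
<table>
  <tr><th>Rank</th><th>System</th></tr>
  <tr><td>1</td><td>Alpha, Site A</td></tr>
  <tr><td>2</td><td>Beta</td></tr>
</table>
"""


def table_to_csv(html):
    """Return the <td> cells of every row as CSV text (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for record in soup.find_all("tr"):
        # Collect all cells of the row first ...
        cells = [data.get_text(strip=True) for data in record.find_all("td")]
        # ... then write the row once. Header rows use <th>, so they
        # have no <td> cells and are skipped here.
        if cells:
            writer.writerow(cells)
    return buffer.getvalue()


csv_text = table_to_csv(SAMPLE_HTML)
# Print a single time, after both loops have finished.
print(csv_text, end="")
```

With the real page, the same function could be called on `url.content` from the question's code. The `csv` writer quotes any cell that contains a comma, so such a cell stays in one column.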
