
python web scraping and excel population

I am relatively new to programming and completely new to Stack Overflow. I thought a good way to learn would be a Python and Excel based project, but I am stuck. My plan was to scrape a website of addresses using Beautiful Soup, look up the Zillow value estimates for those addresses, and populate them into tabular form in Excel. I am unable to figure out how to get the addresses (the HTML on the site I am trying to scrape seems pretty messy), but I was able to pull the Google address links from the site. Sorry if this is a very basic question; any advice would help:

from bs4 import BeautifulSoup

from urllib.request import Request, urlopen

import re

import pandas as pd

req = Request("http://www.tjsc.com/Sales/TodaySales")
html_page = urlopen(req)

soup = BeautifulSoup(html_page, "lxml")

count = 0
links = []
for link in soup.findAll('a'):
    links.append(link.get('href'))
    count = count +1

print(links)
print("count is", count)

po = links

pd.DataFrame(po).to_excel('todaysale.xlsx', header=False, index=False)
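As an aside on the "Google address links" the question mentions: since `link.get('href')` returns `None` for anchors without an `href`, the list usually needs filtering before it is useful. A minimal sketch, assuming the map links simply contain the substring "google" (the sample hrefs below are made up for illustration):

```python
# Hypothetical hrefs standing in for what the scraping loop collects.
links = [
    "https://www.google.com/maps/place/123+Main+St",
    "/Sales/TodaySales",
    None,  # <a> tags without an href yield None from link.get('href')
]

# Keep only non-empty links that look like Google map links
# (the "google" substring test is an assumption about the link format).
google_links = [l for l in links if l and "google" in l]
print(google_links)  # ['https://www.google.com/maps/place/123+Main+St']
```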

You are on the right track. Instead of 'a', you need to use a different HTML tag, 'td', for the rows, and 'th' for the column names. Here is one way to implement it. The list_slice function converts each run of 14 elements into one row, since the original table has 14 columns.

from bs4 import BeautifulSoup as bs
import requests
import pandas as pd

url = "http://www.tjsc.com/Sales/TodaySales"
r = requests.get(url, verify=False)
text = r.text
soup = bs(text, 'lxml')

# Get column headers from the html file
header = []
for c_name in soup.find_all('th'):
    header.append(c_name)
# clean up the extracted header content
header = [h.contents[0].strip() for h in header]

# get each row of the table
row = []
for link in soup.find_all('td'):
    row.append(link.get_text().strip())

def list_slice(my_list, step):
    """This function takes any list and divides it into chunks of size "step"."""
    return [my_list[x:x + step] for x in range(0, len(my_list), step)]

# creating the final dataframe
df = pd.DataFrame(list_slice(row, 14), columns=header[:14])
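The chunking step is the only non-obvious part, so here is a standalone demo of it with dummy cell values (two columns instead of the real table's 14) standing in for the scraped `row` list:

```python
def list_slice(my_list, step):
    """Divide any list into chunks of size `step`."""
    return [my_list[x:x + step] for x in range(0, len(my_list), step)]

# Dummy flattened cells, 2 per row here; the real list comes from
# the soup.find_all('td') loop and holds 14 cells per row.
cells = ["123 Main St", "Springfield",
         "456 Oak Ave", "Shelbyville"]
rows = list_slice(cells, 2)
print(rows)  # [['123 Main St', 'Springfield'], ['456 Oak Ave', 'Shelbyville']]
```

To finish the original goal of populating Excel, `df.to_excel('todaysale.xlsx', index=False)` can then be called on the resulting DataFrame (writing .xlsx files requires the openpyxl engine to be installed).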
