
Python BeautifulSoup parsing / crawl table

For my own interest I want to crawl the table of properties from "https://thinkimmo.com/search?noReset=true". After clicking on "TABELLE" (TABLE) you can see all properties listed in a table.


With the following code I am able to see the table:
from selenium import webdriver
driver = webdriver.Chrome()  # assuming Chrome; use whichever driver you normally run
driver.get("https://thinkimmo.com/search?noReset=true")
driver.find_element_by_xpath('/html/body/div[1]/div[2]/div[2]/div/div[2]/div/div[2]/div/div/div/div[1]/div/div/button[2]/span[1]').click()  # click the "TABELLE" button
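
Because the table is rendered by JavaScript after the click, page_source can be read before the rows exist. A common safeguard is an explicit wait for the table element before parsing (a minimal sketch, assuming the table carries the MuiTable-root class used in the parsing code below):

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# wait up to 10 seconds for the MUI table to appear before reading page_source
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, 'table.MuiTable-root'))
)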

Now I am able to crawl some parts of the table with the following code:
soup = BeautifulSoup(driver.page_source, 'html.parser')
htmltable = soup.find('table', { 'class' : 'MuiTable-root' })
def tableDataText(table):       
    rows = []
    trs = table.find_all('tr')
    headerow = [td.get_text(strip=True) for td in trs[0].find_all('th')] # header row
    if headerow: # if there is a header row include first
        rows.append(headerow)
        trs = trs[1:]
    for tr in trs: # for every table row
        rows.append([td.get_text(strip=True) for td in tr.find_all('td')]) # data row
    return rows
list_table = tableDataText(htmltable)
list_table
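
For easier inspection, the nested list returned by tableDataText can be loaded into a pandas DataFrame (a small usage sketch, assuming the header row and the data rows have the same number of columns):

import pandas as pd

# first sub-list is the header row, the remaining sub-lists are data rows
df_table = pd.DataFrame(list_table[1:], columns=list_table[0])
print(df_table.head())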

The result, however, is not what I expect: I only get the first 7 headings, and all other headings are not returned.

After a closer look at the HTML of the webpage I am still not sure how to get all headings and rows of the table.

I would like to solve the problem of only getting part of the headings, and more specifically I am interested in why my approach is failing.


What I see in the result of table = soup.find("table") is that the table closes after the 7th heading.
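
A quick way to check what BeautifulSoup actually receives is to count the tables and cells it finds (a small diagnostic sketch, reusing the soup object and the MuiTable-root class from above):

# how many MUI tables does BeautifulSoup see, and how many header/data cells are in the first one?
tables = soup.find_all('table', {'class': 'MuiTable-root'})
print('tables found:', len(tables))
print('th cells in first table:', len(tables[0].find_all('th')))
print('td cells in first table:', len(tables[0].find_all('td')))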

Thanks in advance.

Steffen

The site uses a backend API whose query parameters you can edit to bulk download data:

import requests
import pandas as pd

results = 1000  # number of listings to request in one call

# the "size" parameter controls how many results the API returns
url = f'https://api.thinkimmo.com/immo?active=true&type=APARTMENTBUY&sortBy=publishDate,desc&from=0&size={results}&grossReturnAnd=false&allowUnknown=false&excludePlatforms=ebk,immowelt&favorite=false&noReset=true&excludedFields=true&geoSearches=[]&averageAggregation=buyingPrice%3BpricePerSqm%3BsquareMeter%3BconstructionYear%3BrentPrice%3BrentPricePerSqm%3BrentPricePerSqm%3BrunningTime&termsAggregation=platforms.name.keyword,60'

resp = requests.get(url).json()          # the endpoint returns JSON
df = pd.DataFrame(resp['results'])       # the listings live under the "results" key
df.to_csv('thinkimmo.csv', index=False)
print('Saved to thinkimmo.csv')

This is a lot of unstructured data, but it should help. If you want to inspect what is in this API call and only get certain parts of the returned JSON, open your browser's Developer Tools - Network - Fetch/XHR and reload the page to see all the backend requests fire. You are looking for the one that starts with "immo?"; take a look at its Payload and Preview to see all the data. That's what we are scraping above.
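
If you only need certain parts of the returned JSON, you can subset the DataFrame built above (a sketch; the column names below are assumptions taken from the aggregation fields in the URL and may differ from the actual keys in resp['results']):

# hypothetical field names - inspect resp['results'][0].keys() to see what is really there
wanted = ['buyingPrice', 'pricePerSqm', 'squareMeter', 'constructionYear']
available = [c for c in wanted if c in df.columns]
df[available].to_csv('thinkimmo_subset.csv', index=False)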
