How to scrape a table that is populated by javascript?

I am learning scraping. The site I am scraping: http://lrc.bih.nic.in/ViewRor.aspx?DistCode=36&SubDivCode=2&CircleCode=9

I am able to select:

जिला:, अनुमंडल:, अंचल:

from the dropdowns using Selenium.

I can also select from मौजा का नाम चुने:.


Afterwards, I am able to click on the खाता खोजें button.


As a result, a table is populated at the bottom by JavaScript.


The button's HTML:

<input type="submit" name="ctl00$ContentPlaceHolder1$BtnSearch" value="खाता खोजें" onclick="javascript:WebForm_DoPostBackWithOptions(new WebForm_PostBackOptions(&quot;ctl00$ContentPlaceHolder1$BtnSearch&quot;, &quot;&quot;, true, &quot;S&quot;, &quot;&quot;, false, false))" id="ctl00_ContentPlaceHolder1_BtnSearch" style="width:146px;">

Pagination is done by:

javascript:__doPostBack('ctl00$ContentPlaceHolder1$GridView1','Page$11')
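For reference, that same postback can be fired from Selenium; a minimal sketch, assuming the page's standard WebForms __doPostBack function (it is what the pager link itself calls):

# Fire the grid's pager postback directly; 'Page$2' selects page 2.
driver.execute_script("__doPostBack('ctl00$ContentPlaceHolder1$GridView1', 'Page$2')")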

I am not able to scrape this table.

What I have tried:

  1. PhantomJS is no longer supported by Selenium.
  2. The table's id, ctl00_ContentPlaceHolder1_GridView1, is not in the HTML source code. I tried some approaches, no luck so far:
from selenium.webdriver.support.ui import WebDriverWait  # needed for the wait below

#p_element = driver.find_element_by_id(id_='ctl00_ContentPlaceHolder1_GridView1')

p_element = driver.find_element_by_xpath('//*[@id="aspnetForm"]/table/tbody/tr/td/table/tbody/tr/td[2]/table/tbody/tr[3]/td/table/tbody/tr/td/table/tbody/tr[4]')

print(p_element.text)


path_for_table='//*[@id="aspnetForm"]/table/tbody/tr/td/table/tbody/tr/td[2]/table/tbody/tr[3]/td/table/tbody/tr/td/table/tbody/tr[4]'

table_list = WebDriverWait(driver, 2).until(lambda driver: driver.find_element_by_xpath(path_for_table))

print(table_list)
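As a side note on the wait above: Selenium's expected conditions are a more robust way to wait than a bare lambda. A minimal sketch, using only standard Selenium API and the table id from the question:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# The grid exists only in the live DOM after the postback completes,
# so wait for it explicitly before reading it.
table = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.ID, 'ctl00_ContentPlaceHolder1_GridView1'))
)
print(table.text)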

Pages I have looked at:

First, let's get the site. I am using BeautifulSoup to scrape along with Selenium.

import bs4 as Bs
from selenium import webdriver

DRIVER_PATH = r'D:\chromedriver.exe'  # raw string so the backslash is not an escape
driver = webdriver.Chrome(executable_path=DRIVER_PATH)

driver.get('http://lrc.bih.nic.in/ViewRor.aspx?DistCode=36&SubDivCode=2&CircleCode=9')

Then click on a village name (change according to your need):

driver.find_element_by_xpath('//*[@id="ctl00_ContentPlaceHolder1_GridView2"]/tbody/tr[3]/td[1]').click()

Click on the "खाता खोजें" button:

driver.find_element_by_xpath('//*[@id="ctl00_ContentPlaceHolder1_BtnSearch"]').click()

Get the page's source using BeautifulSoup:

page_src = Bs.BeautifulSoup(driver.page_source, 'html.parser')

Find the id ctl00_ContentPlaceHolder1_UpdatePanel2 and find all tds in it:

table_elements = page_src.find("div",{"id":"ctl00_ContentPlaceHolder1_UpdatePanel2"}).find_all("td")

Get columns and get the text out of them:

columns = table_elements[:6]
column_names = [e.text for e in columns]

columns:

[<td>क्रम</td>,
 <td>रैयतधारी का नाम</td>,
 <td style="white-space:nowrap;">पिता/पति का नाम</td>,
 <td>खाता संख्या</td>,
 <td>खेसरा संख्या</td>,
 <td>अधिकार<br/>अभिलेख</td>]

column_names:

['क्रम',
 'रैयतधारी का नाम',
 'पिता/पति का नाम',
 'खाता संख्या',
 'खेसरा संख्या',
 'अधिकारअभिलेख']

Next get the body of the table:

body_of_table = table_elements[6:-4] 

Then create chunks of 6 columns for each entry and get the text out:

chunks = [body_of_table[x:x+6] for x in range(0, len(body_of_table), 6)]
data = [[e.text.strip('\n') for e in chunk] for chunk in chunks]

data:

[['1', 'अरूण कुमार', 'शिवलाल पासवान', '55', '406', 'देखें'],
 ['2', 'इन्द्रदेव प्रसाद', '\xa0', '98', '789', 'देखें'],
 ['3', 'ईश्वर मांझी', 'चमारी मांझी', '78', '42', 'देखें'],
 ['4', 'कवलसिया देवी', 'तुलसी मांझी', '120', '41', 'देखें'],
 ['5', 'कामदेव पांडे', 'शिवदानी पांडे', '210', '457, 459, 461, 474', 'देखें'],
 ['6', 'कामेश्वर मांझी', 'उत्ती मांझी', '78', '43', 'देखें'],
 ['7', 'कारू मांझी', 'राधे मांझी', '78', '42', 'देखें'],
 ['8', 'कारू मांझी', 'मेघन मांझी', '78', '42', 'देखें'],
 ['9', 'कौशल्या देवी', 'केदार महतो', '253', '757', 'देखें'],
 ['10', 'गणेश साव', 'छेदी साव', '156', '236', 'देखें'],

....

Now import pandas and use it to create a DataFrame out of this list of lists:

import pandas as pd
df = pd.DataFrame(data, columns = column_names)

# set क्रम as index
df = df.set_index(df.columns[0])

Final result:

import time # using time.sleep for illustration only. You should use explicit wait
import bs4 as Bs
import pandas as pd
from selenium import webdriver

DRIVER_PATH = r'D:\chromedriver.exe'  # raw string so the backslash is not an escape
driver = webdriver.Chrome(executable_path=DRIVER_PATH)

driver.get('http://lrc.bih.nic.in/ViewRor.aspx?DistCode=36&SubDivCode=2&CircleCode=9')

time.sleep(4)

#click on a village name
driver.find_element_by_xpath('//*[@id="ctl00_ContentPlaceHolder1_GridView2"]/tbody/tr[3]/td[1]').click()

time.sleep(2)

# click on खाता खोजें
driver.find_element_by_xpath('//*[@id="ctl00_ContentPlaceHolder1_BtnSearch"]').click()

time.sleep(2)


# ----------- table extracting part ------------------


# get page source
page_src = Bs.BeautifulSoup(driver.page_source, 'html.parser')

# find the id: ctl00_ContentPlaceHolder1_UpdatePanel2 and find all tds in it
table_elements = page_src.find("div",{"id":"ctl00_ContentPlaceHolder1_UpdatePanel2"}).find_all("td")

# get columns and get the text out of them
columns = table_elements[:6]
column_names = [e.text for e in columns]

# get the body of the table
body_of_table = table_elements[6:-4]

# create chunks of 6 columns for each entry
chunks = [body_of_table[x:x+6] for x in range(0, len(body_of_table), 6)]

# get the text out
data = [[e.text.strip('\n') for e in chunk] for chunk in chunks]

df = pd.DataFrame(data, columns = column_names)

# set क्रम as index
df = df.set_index(df.columns[0])

print(df)



To scrape the next pages:

  • Click on the next-page button using Selenium.
  • Wait for the page to load.
  • Rerun the table-extracting part (by putting it into a function).
  • Discard the column names (we already have them).
  • Append the data to the already created data frame.
  • Repeat the above steps for all pages (you can add a while loop and try clicking on a page until an exception occurs; see try and except). A rough sketch follows below.
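A sketch of that loop, under two assumptions: the table-extracting part above is wrapped in a function (extract_table is a name introduced here for illustration; it reuses Bs, pd, column_names, and driver from the final-result code), and the pager renders page numbers as clickable links, as ASP.NET GridView pagers typically do:

def extract_table(page_source):
    # Rerun the table-extracting part above on the current page source
    # and return the rows, indexed by क्रम like the main data frame.
    src = Bs.BeautifulSoup(page_source, 'html.parser')
    tds = src.find("div", {"id": "ctl00_ContentPlaceHolder1_UpdatePanel2"}).find_all("td")
    body = tds[6:-4]
    chunks = [body[x:x + 6] for x in range(0, len(body), 6)]
    rows = [[e.text.strip('\n') for e in chunk] for chunk in chunks]
    return pd.DataFrame(rows, columns=column_names).set_index(column_names[0])

page = 2
while True:
    try:
        # Each pager link fires __doPostBack('...GridView1', 'Page$N');
        # clicking the link with that page number triggers the same postback.
        driver.find_element_by_link_text(str(page)).click()
    except Exception:
        break  # no link for this page number -> past the last page
    time.sleep(2)  # illustration only; prefer an explicit wait
    df = pd.concat([df, extract_table(driver.page_source)])
    page += 1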
