Selenium-webdriver for parsing (Ruby)

So, here is the website (translate it into English first) that uses JavaScript to show information about companies (you need to click on "address and telephone number"). I handled the clicking with Selenium, and now I'm trying to collect information about these companies (phone, address, etc.) using CSS selectors and save it to the database. But I can't save the information properly, because I can't get each company's details into the required variables.

Here is my code (it is wrong):

require 'rubygems'
require 'selenium-webdriver'

driver = Selenium::WebDriver.for :firefox
driver.get 'http://www.ypag.ru/cat/komp249/page0.html'


driver.find_elements(:css, '.p2 div a').each {|link| link.click}
driver.find_elements
(:css, '.p3 a, .firm, .p2 table tr:nth-child(1) .p, .p2 table tr:nth-child(2) .p,
p2 table tr:nth-child(3) .p, .p2 table tr:nth-child(4) .p').each {|n,r,c,k,l,m| 
name = n
region = r
field1 = c
field1 k
field1 l
field1 m }

My purpose is to save the result of each CSS selector in the right variable. Is that possible? I already asked this question, but at that time I didn't have CSS selectors for the address, phone, etc.

If I should add additional information, tell me.

The first find_elements call returns 20 items, and your code block clicks on each one.

However, your second find_elements call returns 48 items, and I do not understand what your code block is trying to achieve with them.

The '.p3 a, .firm, .p2 table tr:nth-child(1) .p, .p2 table tr:nth-child(2) .p, p2 table tr:nth-child(3) .p, .p2 table tr:nth-child(4) .p' CSS selector returns all matching elements (the "," is used as an "or" separator).

However, iterating the array only yields one element at a time. Are you thinking that you can access all the fields for one company in each iteration? If so, you can't.
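To illustrate the point about iteration, here is a minimal sketch in plain Ruby. The strings are stand-ins (an assumption for illustration) for the WebElement objects that find_elements returns, and block_params_seen is a hypothetical helper name. A block with several parameters still receives only one element per iteration, so the extra parameters end up nil.

```ruby
# Hypothetical helper: records what a three-parameter block receives
# on each iteration over a flat array.
def block_params_seen(elements)
  seen = []
  elements.each do |n, r, c|
    # Only n is filled in; r and c are nil on every iteration.
    seen << [n, r, c]
  end
  seen
end

# Stand-in strings for the WebElement objects returned by find_elements
# (illustrative values, not fetched from the page).
elements = ['Storm-Print', 'Russia » Moscow', 'http://www.storm-print.ru']
block_params_seen(elements).each { |params| p params }
```

Each printed triple contains one value followed by two nils, which is why the six block parameters in the question never line up with one company's fields.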

Using this will return the 20 address blocks:

driver.find_elements(:css, "div[id*='adressSelector']")

You can iterate over these, calling find_element on each one to get the fields you want.
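A rough sketch of that nested lookup might look like the following. The helper name address_cells and the 'td.p' selector are assumptions made for illustration (the selector is guessed from the page's markup); the driver argument only needs to respond to find_elements the way a Selenium::WebDriver driver does.

```ruby
# Hypothetical helper: for each address block, collect the text of its
# 'td.p' cells. `driver` is assumed to behave like a Selenium::WebDriver
# driver; the 'td.p' selector is an assumption about the page's markup.
def address_cells(driver)
  blocks = driver.find_elements(:css, "div[id*='adressSelector']")
  blocks.map do |block|
    # Searching from the block (not from the driver) keeps each
    # company's fields grouped together.
    block.find_elements(:css, 'td.p').map(&:text)
  end
end
```

With a real session this would be called as address_cells(driver) after the address links have been clicked, giving one array of cell texts per company.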

The HTML for the page is not very nice, i.e. there are no good identifiers to relate data. For example, only the relative positioning allows you to relate the company name to the address.

The solution below makes assumptions about the placement of text, which is brittle, but it is the best I could think of.

require 'rubygems'
require 'selenium-webdriver'

driver = Selenium::WebDriver.for :firefox
driver.navigate.to 'http://www.ypag.ru/cat/komp249/page0.html'

# The table that contains all of the data
# This XPath is not ideal (it is brittle), but I could not find a better identifier
table = driver.find_element(:xpath, '/html/body/table[4]/tbody/tr/td[2]/table')

# Expand all of the address links
table.find_elements(:css, 'a[href *= loadadress]').each(&:click)

# Get all of the rows that contain data
# Need to ignore blanks, ads, etc.
data_elements = table.find_elements(:xpath, './tbody/tr').keep_if do |row|
  row.find_elements(:css, '.p3, .p, .p2').length > 0
end

# Of the rows we have, each set of three rows represents a company
# Iterate through each set of three rows to collect data
data_elements.each_slice(3) do |company|
  name = company[0].find_element(:css, '.p3').text
  firm = company[0].find_element(:css, '.firm').text
  firm_split = firm.split(' » ')
  country = firm_split[0]
  city = firm_split[1]

  description = company[1].text

  # Get the address values, using the icon in each row to determine the row's meaning
  # Note that not every company has each detail, in which case the value will be ''
  url = ''
  email = ''
  phone = ''
  address = ''

  # Wait (up to 10 seconds) for the address block to be loaded
  wait = Selenium::WebDriver::Wait.new(timeout: 10)
  wait.until { company[2].find_element(:css, 'div[id*=adressSelector]') }

  sub_table_data = company[2].find_elements(:css, 'div[id*=adressSelector] tr') 
  sub_table_data.each do |row|
    cells = row.find_elements(:css, 'td')
    case cells[0].find_element(:css, 'img').attribute('src')
      when /papers/
        url = cells[1].text
      when /mail/
        email = cells[1].text
      when /mobile/
        phone = cells[1].text
      when /map/
        address = cells[1].text
    end
  end

  # Output the results (or whatever you want them for)
  puts name
  puts country
  puts city
  puts description
  puts url
  puts email
  puts phone
  puts address
  puts
end
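Since the goal in the question is to save each company to a database, it may be more convenient to collect the contact fields in a Hash rather than in separate local variables. The sketch below is plain Ruby and only shows the icon-to-field mapping. It assumes each row has already been reduced to an [icon_src, cell_text] pair (the img src attribute and the text of the second cell); ICON_FIELDS and contact_details are hypothetical names.

```ruby
# Maps a fragment of the icon file name to the field it marks.
# The fragments are the image names seen in the page's HTML.
ICON_FIELDS = {
  'papers' => :url,
  'mail'   => :email,
  'mobile' => :phone,
  'map'    => :address
}.freeze

# rows: array of [icon_src, cell_text] pairs for one company.
# Returns a Hash with every field present; missing details stay ''.
def contact_details(rows)
  details = { url: '', email: '', phone: '', address: '' }
  rows.each do |src, text|
    icon = File.basename(src)
    fragment = ICON_FIELDS.keys.find { |key| icon.include?(key) }
    # Rows with an unknown icon (e.g. "Report incorrect data") are skipped.
    details[ICON_FIELDS[fragment]] = text.strip if fragment
  end
  details
end

rows = [
  ['http://www.ypag.ru/fon/papers.gif', 'http://www.storm-print.ru'],
  ['http://www.ypag.ru/fon/mail.gif', 'info@storm-print.ru'],
  ['http://ypag.ru/fon/editdelete.png', 'Report incorrect data']
]
p contact_details(rows)
```

A Hash like this can be merged with the name, country, city and description and handed to whatever database layer is in use.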

As an example, the above code gives the following details for the first company (note that this is from the page translated to English):

Storm-Print
Russia »Moscow
Printing Services: stationery, flyers, leaflets, brochures.
http://www.storm-print.ru
info@storm-print.ru
+7 (495) 101-37-62 multichannel Fax: +7 (495) 101-37-62 multichannel
Russia "Moscow ul.Suschevsky shaft 16, page 4, 127018
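The sample output shows the region as "Russia »Moscow", without a space after the separator, so splitting on the exact string ' » ' may not always separate the country from the city. A slightly more tolerant variant, sketched here in plain Ruby with a hypothetical helper name split_region, splits on the separator with optional whitespace on either side:

```ruby
# Hypothetical helper: split "Country » City" on the » separator,
# tolerating missing or extra whitespace around it.
# Returns [country, city]; a missing part becomes ''.
def split_region(firm)
  country, city = firm.split(/\s*»\s*/, 2)
  [country.to_s.strip, city.to_s.strip]
end

p split_region('Russia » Moscow')
p split_region('Russia »Moscow')
```

Both calls return the same two-element array, so the rest of the loop does not need to care how the page (or its translation) spaces the separator.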

For reference, the HTML of a company looks like:

<tr>
   <td align="left" class="p3">
      <a href="http://www.msyp.ru/cat/kompaniy992511/s-779665944.html">
         <font>
            <font class="">
               Storm-Print
            </font>
         </font>
      </a>
   </td>
   <td align="right" class="firm">
      <font>
         <font>
             Russia >
             Moscow 
         </font>
      </font>
   </td>
</tr>
<tr>
   <td align="left" colspan="2" width="100%" class="p">
      <font>
         <font class="">
             Printing Services: stationery, flyers, leaflets, brochures. 
         </font>
      </font>
      <br>
   </td>
</tr>
<tr>
   <td colspan="2" align="right">
      <font>
         <font>
            Rating: 
         </font>
      </font>
      <a class="iframe2" href="reit/r.php?id=992511">
         <img src="fon/star_reit_off.png" border="0">
         <img src="fon/star_reit_off.png" border="0">
         <img src="fon/star_reit_off.png" border="0">
         <img src="fon/star_reit_off.png" border="0">
         <img src="fon/star_reit_off.png" border="0">
      </a>
   </td>
</tr>
<tr>
   <td colspan="2">
      <table class="p2" border="0" width="100%" cellpadding="0" cellspacing="0">
         <tbody>
            <tr>
               <td align="left">
                  <div id="adressSelector992511">
                     <table>
                        <tbody>
                           <tr>
                              <td>
                                 <img src="http://www.ypag.ru/fon/papers.gif" border="0">
                              </td>
                              <td class="p">
                                 <a href="http://www.storm-print.ru" target="_blank">
                                    <font>
                                       <font class="">
                                          http://www.storm-print.ru
                                       </font>
                                    </font>
                                 </a>
                              </td>
                           </tr>
                           <tr>
                              <td>
                                 <img src="http://www.ypag.ru/fon/mail.gif" border="0">
                              </td>
                              <td class="p">
                                 <a href="mailto:info@storm-print.ru">
                                    <font>
                                       <font class="">
                                          info@storm-print.ru
                                       </font>
                                    </font>
                                 </a>
                              </td>
                           </tr>
                           <tr>
                              <td>
                                 <img src="http://www.ypag.ru/fon/mobile.gif" border="0">
                              </td>
                              <td class="p">
                                 <font>
                                    <font class="">
                                       +7 (495) 101-37-62 multichannel Fax: +7 (495) 101-37-62 multichannel
                                    </font>
                                 </font>
                              </td>
                           </tr>
                           <tr>
                              <td>
                                 <img src="http://www.ypag.ru/fon/map.gif" border="0">
                              </td>
                              <td class="p">
                                 <font>
                                    <font class="">
                                       Russia "Moscow ul.Suschevsky shaft 16, page 4, 127018
                                    </font>
                                 </font>
                              </td>
                           </tr>
                           <tr>
                              <td>
                                 <img src="http://ypag.ru/fon/editdelete.png" border="0">
                              </td>
                              <td align="left" class="p">
                                 <a href="http://www.ypag.ru/edit_kompany.php?idkomp=992511&amp;c=3770450052" target="_blank" onclick="popupWin = window.open(this.href, 'contacts', 'location,width=600,height=500,top=0,scrollbars=yes'); popupWin.focus(); return false;">
                                    <font>
                                       <font>
                                          Report incorrect data
                                       </font>
                                    </font>
                                 </a>
                              </td>
                           </tr>
                        </tbody>
                     </table>
                  </div>
               </td>
            </tr>
         </tbody>
      </table>
   </td>
</tr>
