
Selenium-webdriver for parsing (Ruby)

Here is the website (translate it into English first), which uses JavaScript to show information about companies (you need to click on "address and telephone number"). I handled the clicking with Selenium, and now I'm trying to collect information about these companies (phone, address, etc.) using CSS selectors and save it to a database. However, I can't save it properly, because I can't get each company's information into the right variables.

Here is my code (it is wrong):

require 'rubygems'
require 'selenium-webdriver'

driver = Selenium::WebDriver.for :firefox
driver.get 'http://www.ypag.ru/cat/komp249/page0.html'


driver.find_elements(:css, '.p2 div a').each {|link| link.click}
driver.find_elements
(:css, '.p3 a, .firm, .p2 table tr:nth-child(1) .p, .p2 table tr:nth-child(2) .p,
p2 table tr:nth-child(3) .p, .p2 table tr:nth-child(4) .p').each {|n,r,c,k,l,m| 
name = n
region = r
field1 = c
field1 k
field1 l
field1 m }

My goal is to store the result of each CSS selector in the right variable. Is that possible? I already asked this question, but at that time I didn't have CSS selectors for address, phone, etc.

If I should add more information, tell me.

The first find_elements call returns 20 items, and your code block clicks on each one.

However, your second find_elements call returns 48 items, and I do not understand what its code block is trying to achieve.

The '.p3 a, .firm, .p2 table tr:nth-child(1) .p, .p2 table tr:nth-child(2) .p, p2 table tr:nth-child(3) .p, .p2 table tr:nth-child(4) .p' CSS selector returns all matching elements (the "," acts as an "or" separator).

However, iterating the array only returns one element at a time. Are you thinking that you can access all the fields for one company in each iteration? If so, you can't.
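
To make this concrete, here is a small plain-Ruby sketch with no browser involved. The strings are illustrative stand-ins for the element objects that find_elements would return. When a block declares several parameters but each yielded item is a single non-array object, only the first parameter is filled and the rest are nil.

```ruby
# Stand-ins for the flat list of elements returned by find_elements.
# (Illustrative strings, not real page data.)
elements = ['name 1', 'region 1', 'url 1', 'name 2', 'region 2', 'url 2']

seen = []
elements.each do |n, r, c|
  # Each iteration receives ONE element: n is set, r and c are nil.
  seen << [n, r, c]
end

p seen.first  # => ["name 1", nil, nil]
p seen.length # => 6
```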

Using this will return the 20 address blocks:

driver.find_elements(:css, "div[id*='adressSelector']")

You can iterate over these, calling find_element on each one to get the fields you want.

The HTML for the page is not very nice, i.e. there are no good identifiers to relate data. For example, only relative positioning allows you to relate the company name to the address.
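
Relying on relative position can be sketched in plain Ruby as follows. This assumes that, once blank and advertising rows are filtered out, every company occupies exactly three consecutive rows. The `company_record` struct and the sample strings are hypothetical stand-ins, not data from the page.

```ruby
# Hypothetical record type for one company.
company_record = Struct.new(:heading, :description, :details)

# Stand-ins for the filtered table rows, in page order.
rows = [
  'Company A heading', 'Company A description', 'Company A details',
  'Company B heading', 'Company B description', 'Company B details'
]

# each_slice(3) yields consecutive groups of three rows,
# so position alone ties a heading to its details.
companies = rows.each_slice(3).map { |group| company_record.new(*group) }

companies.each { |c| puts "#{c.heading} -> #{c.details}" }
```

If a single extra row slips through the filter, every later group shifts by one, which is why this kind of grouping is fragile.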

The solution below makes assumptions about the placement of text, which is brittle, but it is the best I could think of.

require 'rubygems'
require 'selenium-webdriver'

driver = Selenium::WebDriver.for :firefox
driver.navigate.to 'http://www.ypag.ru/cat/komp249/page0.html'

# The table that contains all of the data
# This xpath is not ideal (brittle), but I could not find a better identifier
table = driver.find_element(:xpath, '/html/body/table[4]/tbody/tr/td[2]/table')

# Expand all of the address links
table.find_elements(:css, 'a[href *= loadadress]').each(&:click)

# Get all of the rows that contain data
# Need to ignore blanks, ads, etc.
data_elements = table.find_elements(:xpath, './tbody/tr').keep_if do |row|
  row.find_elements(:css, '.p3, .p, .p2').length > 0
end

# Of the rows we have, each set of three rows represents a company
# Iterate through each set of three rows to collect data
data_elements.each_slice(3) do |company|
  name = company[0].find_element(:css, '.p3').text
  firm = company[0].find_element(:css, '.firm').text
  firm_split = firm.split(' » ')
  country = firm_split[0]
  city = firm_split[1]

  description = company[1].text

  # Get the address values, using the icon in each row to determine the row's meaning
  # Note that not every company has each detail, in which case the value will be ''
  url = ''
  email = ''
  phone = ''
  address = ''

  # Wait (up to 10 seconds) to ensure the address block has been loaded
  wait = Selenium::WebDriver::Wait.new(:timeout => 10)
  wait.until { company[2].find_element(:css, 'div[id*=adressSelector]') }

  sub_table_data = company[2].find_elements(:css, 'div[id*=adressSelector] tr') 
  sub_table_data.each do |row|
    cells = row.find_elements(:css, 'td')
    case cells[0].find_element(:css, 'img').attribute('src')
      when /papers/
        url = cells[1].text
      when /mail/
        email = cells[1].text
      when /mobile/
        phone = cells[1].text               
      when /map/
        address = cells[1].text                     
    end
  end

  # Output the results (or whatever you want them for)
  puts name
  puts country
  puts city
  puts description
  puts url
  puts email
  puts phone
  puts address
  puts
end

As an example, the above code will give the following details about the first company (note that this is from the page translated to English):

Storm-Print
Russia »Moscow
Printing Services: stationery, flyers, leaflets, brochures.
http://www.storm-print.ru
info@storm-print.ru
+7 (495) 101-37-62 multichannel Fax: +7 (495) 101-37-62 multichannel
Russia "Moscow ul.Suschevsky shaft 16, page 4, 127018
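
The second line of the sample output still shows the country and city joined by », which suggests the spacing around the separator is not always the ' » ' that the split expects (the translated page appears to drop a space). A more tolerant split, sketched below with a hypothetical helper name `split_location`, uses a regular expression that accepts optional whitespace on either side.

```ruby
# Hypothetical helper: split "Country » City" text on the » separator,
# tolerating missing or extra whitespace around it.
def split_location(firm_text)
  country, city = firm_text.strip.split(/\s*»\s*/, 2)
  # city is nil when no separator is present; normalise it to ''.
  [country, city.to_s]
end

p split_location('Russia » Moscow') # => ["Russia", "Moscow"]
p split_location('Russia »Moscow')  # => ["Russia", "Moscow"]
p split_location('Russia')          # => ["Russia", ""]
```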

For reference, the html of a company looks like:

<tr>
   <td align="left" class="p3">
      <a href="http://www.msyp.ru/cat/kompaniy992511/s-779665944.html">
         <font>
            <font class="">
               Storm-Print
            </font>
         </font>
      </a>
   </td>
   <td align="right" class="firm">
      <font>
         <font>
             Russia >
             Moscow 
         </font>
      </font>
   </td>
</tr>
<tr>
   <td align="left" colspan="2" width="100%" class="p">
      <font>
         <font class="">
             Printing Services: stationery, flyers, leaflets, brochures. 
         </font>
      </font>
      <br>
   </td>
</tr>
<tr>
   <td colspan="2" align="right">
      <font>
         <font>
            Rating: 
         </font>
      </font>
      <a class="iframe2" href="reit/r.php?id=992511">
         <img src="fon/star_reit_off.png" border="0">
         <img src="fon/star_reit_off.png" border="0">
         <img src="fon/star_reit_off.png" border="0">
         <img src="fon/star_reit_off.png" border="0">
         <img src="fon/star_reit_off.png" border="0">
      </a>
   </td>
</tr>
<tr>
   <td colspan="2">
      <table class="p2" border="0" width="100%" cellpadding="0" cellspacing="0">
         <tbody>
            <tr>
               <td align="left">
                  <div id="adressSelector992511">
                     <table>
                        <tbody>
                           <tr>
                              <td>
                                 <img src="http://www.ypag.ru/fon/papers.gif" border="0">
                              </td>
                              <td class="p">
                                 <a href="http://www.storm-print.ru" target="_blank">
                                    <font>
                                       <font class="">
                                          http://www.storm-print.ru
                                       </font>
                                    </font>
                                 </a>
                              </td>
                           </tr>
                           <tr>
                              <td>
                                 <img src="http://www.ypag.ru/fon/mail.gif" border="0">
                              </td>
                              <td class="p">
                                 <a href="mailto:info@storm-print.ru">
                                    <font>
                                       <font class="">
                                          info@storm-print.ru
                                       </font>
                                    </font>
                                 </a>
                              </td>
                           </tr>
                           <tr>
                              <td>
                                 <img src="http://www.ypag.ru/fon/mobile.gif" border="0">
                              </td>
                              <td class="p">
                                 <font>
                                    <font class="">
                                       +7 (495) 101-37-62 multichannel Fax: +7 (495) 101-37-62 multichannel
                                    </font>
                                 </font>
                              </td>
                           </tr>
                           <tr>
                              <td>
                                 <img src="http://www.ypag.ru/fon/map.gif" border="0">
                              </td>
                              <td class="p">
                                 <font>
                                    <font class="">
                                       Russia "Moscow ul.Suschevsky shaft 16, page 4, 127018
                                    </font>
                                 </font>
                              </td>
                           </tr>
                           <tr>
                              <td>
                                 <img src="http://ypag.ru/fon/editdelete.png" border="0">
                              </td>
                              <td align="left" class="p">
                                 <a href="http://www.ypag.ru/edit_kompany.php?idkomp=992511&amp;c=3770450052" target="_blank" onclick="popupWin = window.open(this.href, 'contacts', 'location,width=600,height=500,top=0,scrollbars=yes'); popupWin.focus(); return false;">
                                    <font>
                                       <font>
                                          Report incorrect data
                                       </font>
                                    </font>
                                 </a>
                              </td>
                           </tr>
                        </tbody>
                     </table>
                  </div>
               </td>
            </tr>
         </tbody>
      </table>
   </td>
</tr>
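
Given the markup above, the icon file name in the first cell is the only hint about what the second cell contains. Here is a minimal sketch of that mapping as a standalone function (the name `field_for_icon` is hypothetical). Rows whose icon is not recognised, such as the "Report incorrect data" row with editdelete.png, map to nil and can be skipped.

```ruby
# Hypothetical helper: map an icon URL to the field it labels.
# Returns nil for icons we do not care about.
def field_for_icon(src)
  case src
  when /papers/ then :url
  when /mail/   then :email
  when /mobile/ then :phone
  when /map/    then :address
  end
end

p field_for_icon('http://www.ypag.ru/fon/papers.gif') # => :url
p field_for_icon('http://www.ypag.ru/fon/mobile.gif') # => :phone
p field_for_icon('http://ypag.ru/fon/editdelete.png') # => nil
```

Collecting the results into a hash keyed by these symbols keeps each value tied to its meaning, which is convenient when writing a record to a database.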


 