Grabbing the text value of an XML node + Nokogiri and xpath

I have built a rake file that inserts all of the information I grab about a certain property into my database. It all runs, but the values for my keys are not being populated with any data. Am I possibly making my at_xpath calls incorrectly? I'll post an example below:

information = {
            "street_address" => property.at_xpath("/Address/AddressLine1/text()"),
            "city" => property.at_xpath("/Address/City/text()"),
            "zipcode" => property.at_xpath("/Address/PostalCode/text()"),
            "short_description" => property.at_xpath("/Information/ShortDescription/text()"),
            "long_description" => property.at_xpath("Information/LongDescription/text()"),
            "rent" => property.at_xpath("/Information/Rents/StandardRent/text()"),
            "application_fee" => property.at_xpath("/Fee/ApplicationFee/text()"),
            "bedrooms" => property.at_xpath("/Floorplan/Room[@RoomType='Bedroom']/Count/text()"),
            "bathrooms" => property.at_xpath("/Floorplan/Room[@RoomType='Bathroom']/Count/text()"),
            "bathrooms" => property.at_xpath("/ILS_Unit/Availability/VacancyClass/text()")
        }

I know everything is working apart from getting the data into the values of the hash listed above. I also know that Nokogiri and XPath are working properly, as I have narrowed the number of matched nodes down from 33,000+ to 1,068.

Any guidance would be super appreciated! Thank you :)

========================= UPDATE ============================

I thought seeing the whole loop might help add clarity:

doc.xpath("//Property/PropertyID/Identification[@OrganizationName='northsteppe']").each do |property|

        # GATHER EACH PROPERTY'S INFORMATION
        information = {
            "street_address" => property.at_xpath("/Address/AddressLine1/text()"),
            "city" => property.at_xpath("/Address/City/text()"),
            "zipcode" => property.at_xpath("/Address/PostalCode/text()"),
            "short_description" => property.at_xpath("/Information/ShortDescription/text()"),
            "long_description" => property.at_xpath("Information/LongDescription/text()"),
            "rent" => property.at_xpath("/Information/Rents/StandardRent/text()"),
            "application_fee" => property.at_xpath("/Fee/ApplicationFee/text()"),
            "bedrooms" => property.at_xpath("/Floorplan/Room[@RoomType='Bedroom']/Count/text()"),
            "bathrooms" => property.at_xpath("/Floorplan/Room[@RoomType='Bathroom']/Count/text()"),
            "bathrooms" => property.at_xpath("/ILS_Unit/Availability/VacancyClass/text()")
        }


        # CREATE NEW PROPERTY WITH INFORMATION HASH CREATED ABOVE
        if Property.create!(information)
            puts "yay!"
        else
            puts "oh no! this sucks!"
        end

    end # ENDS XPATH EACH LOOP

============================ ANOTHER UPDATE ==========================

So I tried swapping the "/text()" at the end of each at_xpath path for "/inner_text()" and received the following error:

rake aborted! Invalid expression: /Address/AddressLine1/inner_text()

I then tried switching my "at_xpath" calls to "at_css" calls and doing something like:

"street_address" => property.at_css(".AddressLine1").text

but received the following error:

rake aborted! undefined method `text' for nil:NilClass

============================= UPDATE TO SHOW XML ===========================

<Property IDValue="642da00e-9be3-4a7c-bd50-66a4f0d70af8">
  <PropertyID>
    <Identification IDValue="642da00e-9be3-4a7c-bd50-66a4f0d70af8" OrganizationName="northsteppe" IDType="property"/>
    <Identification IDValue="6e1e61523972d5f0e260e3d38eb488337424f21e" OrganizationName="northsteppe" IDType="Company"/>
    <MarketingName>Spacious House Central Campus OSU, available fall</MarketingName>
    <WebSite>http://northsteppe.appfolio.com/listings/listings/642da00e-9be3-4a7c-bd50-66a4f0d70af8</WebSite>
    <Address AddressType="property">
      <Description>Address of Available Listing</Description>
      <AddressLine1>1689 N 4th St </AddressLine1>
      <City>Columbus</City>
      <State>OH</State>
      <PostalCode>43201</PostalCode>
      <Country>US</Country>
    </Address>
    <Phone PhoneType="office">
      <PhoneNumber>(614) 299-4110</PhoneNumber>
    </Phone>
    <Email>northsteppe.nsr@gmail.com</Email>
  </PropertyID>
  <ILS_Identification ILS_IdentificationType="Apartment" RentalType="Market Rate">
    <Latitude>39.997694</Latitude>
    <Longitude>-82.99903</Longitude>
    <LastUpdate Month="11" Day="11" Year="2013"/>
  </ILS_Identification>
  <Information>
    <StructureType>Standard</StructureType>
    <UnitCount>1</UnitCount>
    <ShortDescription>Spacious House Central Campus OSU, available fall</ShortDescription>
    <LongDescription>One of our favorites! This great house is perfect for students or a single family. With huge living and sleeping rooms, there is plenty of space. The kitchen is totally modernized with new appliances, and the bathroom has been updated. Natural woodwork and brick accents are seen within the house, and the decorative mantles. Ceiling fans and mini-blinds are included, as well as a FREE stack washer and dryer. The front and side deck. On site parking available.</LongDescription>
    <Rents>
      <StandardRent>2000.00</StandardRent>
    </Rents>
    <PropertyAvailabilityURL>http://northsteppe.appfolio.com/listings/listings/642da00e-9be3-4a7c-bd50-66a4f0d70af8</PropertyAvailabilityURL>
  </Information>
  <Fee>
    <ProrateType>Standard</ProrateType>
    <LateType>Standard</LateType>
    <LatePercent>0</LatePercent>
    <LateMinFee>0</LateMinFee>
    <LateFeePerDay>0</LateFeePerDay>
    <NonRefundableHoldFee>0</NonRefundableHoldFee>
    <AdminFee>0</AdminFee>
    <ApplicationFee>30.00</ApplicationFee>
    <BrokerFee>0</BrokerFee>
  </Fee>
  <Deposit DepositType="Security Deposit">
    <Amount AmountType="Actual">
      <ValueRange Exact="2000.00" Currency="USD"/>
    </Amount>
  </Deposit>
  <Policy>
    <Pet Allowed="false"/>
  </Policy>
  <Phase IDValue="642da00e-9be3-4a7c-bd50-66a4f0d70af8">
    <Name/>
    <Description/>
    <UnitCount>1</UnitCount>
    <RentableUnits>1</RentableUnits>
    <TotalSquareFeet>0</TotalSquareFeet>
    <RentableSquareFeet>0</RentableSquareFeet>
  </Phase>
  <Building IDValue="642da00e-9be3-4a7c-bd50-66a4f0d70af8">
    <Name/>
    <Description/>
    <UnitCount>1</UnitCount>
    <SquareFeet>0</SquareFeet>
  </Building>
  <Floorplan IDValue="642da00e-9be3-4a7c-bd50-66a4f0d70af8">
    <Name/>
    <UnitCount>1</UnitCount>
    <Room RoomType="Bedroom">
      <Count>4</Count>
      <Comment/>
    </Room>
    <Room RoomType="Bathroom">
      <Count>1</Count>
      <Comment/>
    </Room>
    <SquareFeet Min="0" Max="0"/>
    <MarketRent Min="2000" Max="2000"/>
    <EffectiveRent Min="2000" Max="2000"/>
  </Floorplan>
  <ILS_Unit IDValue="642da00e-9be3-4a7c-bd50-66a4f0d70af8">
    <Units>
      <Unit>
        <Identification IDValue="642da00e-9be3-4a7c-bd50-66a4f0d70af8" OrganizationName="UL Portfolio"/>
        <MarketingName>Spacious House Central Campus OSU, available fall</MarketingName>
        <UnitBedrooms>4</UnitBedrooms>
        <UnitBathrooms>1.0</UnitBathrooms>
        <MinSquareFeet>0</MinSquareFeet>
        <MaxSquareFeet>0</MaxSquareFeet>
        <SquareFootType>internal</SquareFootType>
        <UnitRent>2000.00</UnitRent>
        <MarketRent>2000.00</MarketRent>
        <Address AddressType="property">
          <AddressLine1>1689 N 4th St </AddressLine1>
          <City>Columbus</City>
          <PostalCode>43201</PostalCode>
          <Country>US</Country>
        </Address>
      </Unit>
    </Units>
    <Availability>
      <VacateDate Month="7" Day="23" Year="2014"/>
      <VacancyClass>Unoccupied</VacancyClass>
      <MadeReadyDate Month="7" Day="23" Year="2014"/>
    </Availability>
    <Amenity AmenityType="Other">
      <Description>All new stainless steel appliances!  Refinished hardwood floors</Description>
    </Amenity>
    <Amenity AmenityType="Other">
      <Description>Ceramic tile</Description>
    </Amenity>
    <Amenity AmenityType="Other">
      <Description>Ceiling fans</Description>
    </Amenity>
    <Amenity AmenityType="Other">
      <Description>Wrap-around porch</Description>
    </Amenity>
    <Amenity AmenityType="Dryer">
      <Description>Free Washer and Dryer</Description>
    </Amenity>
    <Amenity AmenityType="Washer">
      <Description>Free Washer and Dryer</Description>
    </Amenity>
    <Amenity AmenityType="Other">
      <Description>off-street parking available</Description>
    </Amenity>
  </ILS_Unit>
  <File Active="true" FileID="820982141">
    <FileType>Photo</FileType>
    <Description>Unit Photo</Description>
    <Name/>
    <Caption/>
    <Format>image/jpeg</Format>
    <Src>http://pa.cdn.appfolio.com/northsteppe/images/31077069-6e81-4373-8a89-508c57585543/medium.jpg</Src>
    <Width>360</Width>
    <Height>300</Height>
    <Rank>1</Rank>
  </File>
  <File Active="true" FileID="820982145">
    <FileType>Photo</FileType>
    <Description>Unit Photo</Description>
    <Name/>
    <Caption/>
    <Format>image/jpeg</Format>
    <Src>http://pa.cdn.appfolio.com/northsteppe/images/84e1be40-96fd-4717-b75d-09b39231a762/medium.jpg</Src>
    <Width>350</Width>
    <Height>265</Height>
    <Rank>2</Rank>
  </File>
  <File Active="true" FileID="820982149">
    <FileType>Photo</FileType>
    <Description>Unit Photo</Description>
    <Name/>
    <Caption/>
    <Format>image/jpeg</Format>
    <Src>http://pa.cdn.appfolio.com/northsteppe/images/cd419635-c37f-4676-a43e-c72671a2a748/medium.jpg</Src>
    <Width>350</Width>
    <Height>265</Height>
    <Rank>3</Rank>
  </File>
  <File Active="true" FileID="820982152">
    <FileType>Photo</FileType>
    <Description>Unit Photo</Description>
    <Name/>
    <Caption/>
    <Format>image/jpeg</Format>
    <Src>http://pa.cdn.appfolio.com/northsteppe/images/6b68dbd5-2cde-477c-99d7-3ca33f03cce8/medium.jpg</Src>
    <Width>350</Width>
    <Height>265</Height>
    <Rank>4</Rank>
  </File>
  <File Active="true" FileID="820982155">
    <FileType>Photo</FileType>
    <Description>Unit Photo</Description>
    <Name/>
    <Caption/>
    <Format>image/jpeg</Format>
    <Src>http://pa.cdn.appfolio.com/northsteppe/images/17b6c7c0-686c-4e46-865b-11d80744354a/medium.jpg</Src>
    <Width>350</Width>
    <Height>265</Height>
    <Rank>5</Rank>
  </File>
  <File Active="true" FileID="820982157">
    <FileType>Photo</FileType>
    <Description>Unit Photo</Description>
    <Name/>
    <Caption/>
    <Format>image/jpeg</Format>
    <Src>http://pa.cdn.appfolio.com/northsteppe/images/3545ac8b-471f-404a-94b2-fcd00dd16e25/medium.jpg</Src>
    <Width>350</Width>
    <Height>265</Height>
    <Rank>6</Rank>
  </File>
  <File Active="true" FileID="820982160">
    <FileType>Photo</FileType>
    <Description>Unit Photo</Description>
    <Name/>
    <Caption/>
    <Format>image/jpeg</Format>
    <Src>http://pa.cdn.appfolio.com/northsteppe/images/02471172-2183-4bf1-a3d7-33415f902c1c/medium.jpg</Src>
    <Width>350</Width>
    <Height>265</Height>
    <Rank>7</Rank>
  </File>
</Property>

In your loop you do:

doc.xpath("//Property/PropertyID/Identification[@OrganizationName='northsteppe']").each do |property|

Then, for your values, you do things like:

property.at_xpath("/Address/AddressLine1/text()")

You can't use /Address/AddressLine1/text() relative to property with XPath, because a path that begins with / is absolute.

Nokogiri will search for /Address/AddressLine1/text() starting from the top of the document (/): find the Address node immediately below the root, then find the AddressLine1 node under that, and so on. The context node, property, is ignored.

Instead use:

Address/AddressLine1/text()

That searches relative to property, which is equivalent to the full XPath:

//Property/PropertyID/Identification[@OrganizationName='northsteppe']/Address/AddressLine1/text()

Looking at the XML you added...

The paths you want don't exist. Looking at it in Pry:

[16] (pry) main: 0> puts doc.xpath("//Property/PropertyID/Identification[@OrganizationName='northsteppe']").to_xml
<Identification IDValue="642da00e-9be3-4a7c-bd50-66a4f0d70af8" OrganizationName="northsteppe" IDType="property"/><Identification IDValue="6e1e61523972d5f0e260e3d38eb488337424f21e" OrganizationName="northsteppe" IDType="Company"/>

Neither of the nodes assigned to property has children. Each one is an empty Identification element, so the values you're looking for, which would have to be its child nodes, aren't there.

Instead, it looks like you want to find the Property node and work downward.

Your first XPath is too deep. It returns an Identification where you need a PropertyID. Try this:

doc.xpath("//Property/PropertyID[ Identification/@OrganizationName = 'northsteppe' ]").each do |property|
    # GATHER EACH PROPERTY'S INFORMATION
    information = {
        "street_address" => property.at_xpath("Address/AddressLine1/text()").to_s,
        "city" => property.at_xpath("Address/City/text()").to_s,
        "zipcode" => property.at_xpath("Address/PostalCode/text()").to_s
    }
    p information
end
