Haskell - How to parse XML response into Haskell datatypes?

I'm a beginner trying to learn Haskell by working through some simple parsing problems. I have this XML file, which is a response from the Goodreads API.

<GoodreadsResponse>
    <Request>
        <authentication>true</authentication>
        <key>API_KEY</key>
        <method>search_search</method>
    </Request>
    <search>
        <query>fantasy</query>
        <results-start>1</results-start>
        <results-end>20</results-end>
        <total-results>53297</total-results>
        <source>Goodreads</source>
        <query-time-seconds>0.15</query-time-seconds>
        <results>
            <work>
                <id type="integer">4640799</id>
                <books_count type="integer">640</books_count>
                <ratings_count type="integer">5640935</ratings_count>
                <text_reviews_count type="integer">90100</text_reviews_count>
                <original_publication_year type="integer">1997</original_publication_year>
                <original_publication_month type="integer">6</original_publication_month>
                <original_publication_day type="integer">26</original_publication_day>
                <average_rating>4.46</average_rating>
                <best_book type="Book">
                    <id type="integer">3</id>
                    <title>Harry Potter and the Sorcerer's Stone (Harry Potter, #1)</title>
                    <author>
                        <id type="integer">1077326</id>
                        <name>J.K. Rowling</name>
                    </author>
                    <image_url>https://images.gr-assets.com/books/1474154022m/3.jpg</image_url>
                    <small_image_url>https://images.gr-assets.com/books/1474154022s/3.jpg</small_image_url>
                </best_book>
            </work>
              ...
              ...
              ...
              ...

This is what I've got so far:

{-# LANGUAGE DeriveGeneric #-}

module Lib where

import           Data.ByteString.Lazy (ByteString)
import           Data.Text            (Text)
import           GHC.Generics         (Generic)
import           Network.HTTP.Conduit (simpleHttp)
import           Text.Pretty.Simple   (pPrint)
import           Text.XML.Light

data GRequest = GRequest { authentication :: Text
                         , key            :: Text
                         , method         :: Text
                         }
              deriving (Generic, Show)

data GSearch = GSearch { query              :: Text
                       , results_start      :: Int
                       , results_end        :: Int
                       , total_results      :: Int
                       , source             :: Text
                       , query_time_seconds :: Float
                       , search_results     :: GResults
                       }
             deriving (Generic, Show)

data GResults = GResults { results :: [Work] }
              deriving (Generic, Show)


data Work = Work { workID                   :: Int -- renamed from 'id' to avoid clashing with Prelude's id
                 , booksCount               :: Int
                 , ratingsCount             :: Int
                 , textReviewsCount         :: Int
                 , originalPublicationYear  :: Int
                 , originalPublicationMonth :: Int
                 , originalPublicationDay   :: Int
                 , averageRating            :: Float
                 , bestBook                 :: Book
                 }
            deriving (Generic, Show)

data Book = Book { bID            :: Int
                 , bTitle         :: Text
                 , bAuthor        :: Author
                 , bImageURL      :: Maybe Text
                 , bSmallImageURL :: Maybe Text
                 }
            deriving (Generic, Show)


data Author = Author { authorID   :: Int
                     , authorName :: Text
                     }
              deriving (Generic, Show)


data GoodreadsResponse = GoodreadsResponse { request :: GRequest
                                           , search  :: GSearch
                                           }
                         deriving (Generic, Show)



main :: IO ()
main = do
  x <- simpleHttp apiString :: IO ByteString -- apiString is the API URL
  let listOfElements = onlyElems $ parseXML x
      filteredElements = concatMap (findElements (simpleName "work")) listOfElements
      simpleName s = QName s Nothing Nothing
  pPrint $ filteredElements

Ultimately, what I want to do is put every part of each <work></work> element (inside <results> .. </results>) into workable Haskell types.

But I'm not sure how to go about doing that. I'm using the xml package to parse the response into the library's default types, but I don't know how to turn those into my custom types.

The most pertinent types to pattern match on are defined in Text.XML.Light.Types. The parseXML function from Text.XML.Light.Input returns a [Content], and you'll want to pattern match on each individual Content value. You can mostly ignore the CRef data constructor and focus on Elem, because those are the XML tags that you care about, and on Text, which holds the plain character data found inside a tag.

For example you'll want to do something like the following:

#!/usr/bin/env stack
-- stack --resolver lts-12.24 --install-ghc runghc --package xml
import Text.XML.Light
import Data.Maybe

data MyXML =
    MyXML String [MyXML] -- Nested XML elements
  | Leaf  String         -- Leaves in the XML doc
  | Unit
  deriving (Show)

c2type :: Content -> Maybe MyXML
c2type (Text s) = Just $ Leaf $ cdData s
c2type (CRef _) = Nothing
c2type (Elem e) = Just $ MyXML (qName $ elName e) (mapMaybe c2type (elContent e))

main :: IO ()
main = do
  dat <- readFile "input.xml"
  let xml = parseXML dat
--  print xml
  print $ mapMaybe c2type xml

For the above code snippet, say input.xml contains the following XML:

<work>
  <a>1</a>
  <b>2</b>
</work>

Then running the example produces:

$ ./xml.hs 
[MyXML "work" [Leaf "\n  ",MyXML "a" [Leaf "1"],Leaf "\n  ",MyXML "b" [Leaf "2"],Leaf "\n"],Leaf "\n"]
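The whitespace-only Leaf entries in that output come from the indentation between tags. If they get in the way, one option is to drop them during the conversion. Here is a minimal sketch of that idea; c2typeTrim and parseTrimmed are hypothetical names, and the unused Unit constructor is left out for brevity:

```haskell
import Data.Char (isSpace)
import Data.Maybe (mapMaybe)
import Text.XML.Light

data MyXML
  = MyXML String [MyXML] -- nested XML elements
  | Leaf String          -- text found inside an element
  deriving (Show, Eq)

-- Like c2type above, but skips text nodes that contain only whitespace.
c2typeTrim :: Content -> Maybe MyXML
c2typeTrim (Text s)
  | all isSpace (cdData s) = Nothing
  | otherwise              = Just (Leaf (cdData s))
c2typeTrim (CRef _) = Nothing
c2typeTrim (Elem e) =
  Just (MyXML (qName (elName e)) (mapMaybe c2typeTrim (elContent e)))

-- Convenience wrapper (hypothetical): parse a String and convert every
-- top-level node.
parseTrimmed :: String -> [MyXML]
parseTrimmed = mapMaybe c2typeTrim . parseXML
```

With the same input.xml, print (parseTrimmed dat) gives [MyXML "work" [MyXML "a" [Leaf "1"],MyXML "b" [Leaf "2"]]]. Note that this also discards text that is meaningfully blank, so it only suits documents where whitespace between tags carries no information.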

The functions you'll probably find most useful for your more extensive use case include:

(qName . elName) -- Get the name of a tag in String format from an Elem
elContent -- Get the direct children of an Elem as a [Content]; recurse over them yourself
elAttribs -- Can check those 'type' attributes on some of your tags
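The xml package also exports a few lookup helpers (findChild, strContent, findAttr, unqual) that save you from walking elContent and elAttribs by hand. The following is a small sketch of how they might be combined; childText, typeAttr and example are hypothetical names introduced here, and the XML literal is a fragment copied from the question:

```haskell
import Text.XML.Light

-- Text of the first direct child with the given tag name, if any.
childText :: String -> Element -> Maybe String
childText name el = strContent <$> findChild (unqual name) el

-- Value of the 'type' attribute on an element, if present.
typeAttr :: Element -> Maybe String
typeAttr = findAttr (unqual "type")

-- Look up the <id> child of an <author> element and return its text
-- together with its 'type' attribute.
example :: Maybe (String, Maybe String)
example = do
  author <- parseXMLDoc
    "<author><id type=\"integer\">1077326</id><name>J.K. Rowling</name></author>"
  idEl <- findChild (unqual "id") author
  pure (strContent idEl, typeAttr idEl)
```

Here example evaluates to Just ("1077326",Just "integer"). findChild only looks at direct children, which matters for your data because both <best_book> and its nested <author> have an <id> tag.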

To see the general structure of the data types that the XML parser returns for your input, I strongly recommend uncommenting the print xml line in the code example above and inspecting the list of Content values it prints. That alone should tell you exactly which fields you care about. For example, this is what you get for my minimal XML input:

[Elem (Element {elName = QName {qName = "work", qURI = Nothing, qPrefix = Nothing}, elAttribs = [], elContent = [Text (CData {cdVerbatim = CDataText, cdData = "\n  ", cdLine = Just 1}),Elem (Element {elName = QName {qName = "a", qURI = Nothing, qPrefix = Nothing}, elAttribs = [], elContent = [Text (CData {cdVerbatim = CDataText, cdData = "1", cdLine = Just 2})], elLine = Just 2}),Text (CData {cdVerbatim = CDataText, cdData = "\n  ", cdLine = Just 2}),Elem (Element {elName = QName {qName = "b", qURI = Nothing, qPrefix = Nothing}, elAttribs = [], elContent = [Text (CData {cdVerbatim = CDataText, cdData = "2", cdLine = Just 3})], elLine = Just 3}),Text (CData {cdVerbatim = CDataText, cdData = "\n", cdLine = Just 3})], elLine = Just 1}),Text (CData {cdVerbatim = CDataText, cdData = "\n", cdLine = Just 4})]
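Once you know which fields you care about, filling your own records is mostly a matter of chaining Maybe lookups. Below is a rough sketch for the <best_book> part of your response only. It assumes simplified versions of your Book and Author types (String instead of Text, no image URLs), and parseAuthor, parseBook, parseBooks and childText are hypothetical helper names, not part of the xml package:

```haskell
import Data.Maybe (mapMaybe, maybeToList)
import Text.Read (readMaybe)
import Text.XML.Light

-- Simplified versions of the question's types (String instead of Text).
data Author = Author
  { authorID   :: Int
  , authorName :: String
  } deriving (Show, Eq)

data Book = Book
  { bID     :: Int
  , bTitle  :: String
  , bAuthor :: Author
  } deriving (Show, Eq)

-- Text of the first direct child with the given tag name, if any.
childText :: String -> Element -> Maybe String
childText name el = strContent <$> findChild (unqual name) el

-- Each field yields Nothing if its tag is missing or a number fails to parse.
parseAuthor :: Element -> Maybe Author
parseAuthor el =
  Author <$> (childText "id" el >>= readMaybe)
         <*> childText "name" el

parseBook :: Element -> Maybe Book
parseBook el =
  Book <$> (childText "id" el >>= readMaybe)
       <*> childText "title" el
       <*> (findChild (unqual "author") el >>= parseAuthor)

-- Collect every <best_book> anywhere in the document; elements that do
-- not parse are silently skipped.
parseBooks :: String -> [Book]
parseBooks src =
  mapMaybe parseBook
    (concatMap (findElements (unqual "best_book"))
               (maybeToList (parseXMLDoc src)))
```

The same pattern should extend to Work and the other records: one parse function per type, each built from childText, readMaybe and the parsers of its nested types. If you would rather know why a record failed to parse than have it dropped, swapping Maybe for Either String is a natural next step.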
