Haskell - How to parse XML response into Haskell datatypes?
I'm a beginner trying to learn Haskell by doing some simple parsing problems. I have this XML file. It's a Goodreads API response.
<GoodreadsResponse>
<Request>
<authentication>true</authentication>
<key>API_KEY</key>
<method>search_search</method>
</Request>
<search>
<query>fantasy</query>
<results-start>1</results-start>
<results-end>20</results-end>
<total-results>53297</total-results>
<source>Goodreads</source>
<query-time-seconds>0.15</query-time-seconds>
<results>
<work>
<id type="integer">4640799</id>
<books_count type="integer">640</books_count>
<ratings_count type="integer">5640935</ratings_count>
<text_reviews_count type="integer">90100</text_reviews_count>
<original_publication_year type="integer">1997</original_publication_year>
<original_publication_month type="integer">6</original_publication_month>
<original_publication_day type="integer">26</original_publication_day>
<average_rating>4.46</average_rating>
<best_book type="Book">
<id type="integer">3</id>
<title>Harry Potter and the Sorcerer's Stone (Harry Potter, #1)</title>
<author>
<id type="integer">1077326</id>
<name>J.K. Rowling</name>
</author>
<image_url>https://images.gr-assets.com/books/1474154022m/3.jpg</image_url>
<small_image_url>https://images.gr-assets.com/books/1474154022s/3.jpg</small_image_url>
</best_book>
</work>
...
...
...
...
This is what I've got so far:
{-# LANGUAGE DeriveGeneric #-}
module Lib where
import Data.ByteString.Lazy (ByteString)
import Data.Text (Text)
import GHC.Generics (Generic)
import Network.HTTP.Conduit (simpleHttp)
import Text.Pretty.Simple (pPrint)
import Text.XML.Light
data GRequest = GRequest { authentication :: Text
                         , key :: Text
                         , method :: Text
                         }
  deriving (Generic, Show)
data GSearch = GSearch { query :: Text
                       , results_start :: Int
                       , results_end :: Int
                       , total_results :: Int
                       , source :: Text
                       , query_time_seconds :: Float
                       , search_results :: GResults
                       }
  deriving (Generic, Show)
data GResults = GResults { results :: [Work] }
  deriving (Generic, Show)
data Work = Work { id :: Int
                 , booksCount :: Int
                 , ratingsCount :: Int
                 , text_reviewsCount :: Int
                 , originalPublicationYear :: Int
                 , originalPublicationMonth :: Int
                 , originalPublicationDay :: Int
                 , averageRating :: Float
                 , bestBook :: Book
                 }
  deriving (Generic, Show)
data Book = Book { bID :: Int
                 , bTitle :: Text
                 , bAuthor :: Author
                 , bImageURL :: Maybe Text
                 , bSmallImageURL :: Maybe Text
                 }
  deriving (Generic, Show)
data Author = Author { authorID :: Int
                     , authorName :: Text
                     }
  deriving (Generic, Show)
data GoodreadsResponse = GoodreadsResponse { request :: GRequest
                                           , search :: GSearch
                                           }
  deriving (Generic, Show)
main :: IO ()
main = do
  x <- simpleHttp apiString :: IO ByteString -- apiString is the API URL
  let listOfElements = onlyElems $ parseXML x
      filteredElements = concatMap (findElements (simpleName "work")) listOfElements
      simpleName s = QName s Nothing Nothing
  pPrint $ filteredElements
Ultimately, what I want to do is put every part of each <work></work> element (from <results> .. </results>) into workable Haskell types.
But I'm not sure how to go about doing that. I'm using the xml package to parse the response into its default types, but I don't know how to get from those into my custom types.
The most pertinent types to pattern match on are the xml package's Content type and its constructors. Namely, you'll want to take the [Content] result that the parseXML function from Text.XML.Light.Input returns and pattern match on each individual Content value, mostly ignoring the CRef data constructor and focusing on the Elem constructors, because those are the XML tags that you care about (in addition to the Text constructors, which contain the non-XML strings found inside an XML tag).
For example, you'll want to do something like the following:
#!/usr/bin/env stack
-- stack --resolver lts-12.24 --install-ghc runghc --package xml
import Text.XML.Light
import Data.Maybe
data MyXML =
    MyXML String [MyXML] -- Nested XML elements
  | Leaf String -- Leaves in the XML doc
  | Unit
  deriving (Show)
c2type :: Content -> Maybe MyXML
c2type (Text s) = Just $ Leaf $ cdData s
c2type (CRef _) = Nothing
c2type (Elem e) = Just $ MyXML (qName $ elName e) (mapMaybe c2type (elContent e))
main :: IO ()
main = do
  dat <- readFile "input.xml"
  let xml = parseXML dat
  -- print xml
  print $ mapMaybe c2type xml
For the above code snippet, say input.xml contains the following XML:
<work>
<a>1</a>
<b>2</b>
</work>
Then running the example produces:
$ ./xml.hs
[MyXML "work" [Leaf "\n ",MyXML "a" [Leaf "1"],Leaf "\n ",MyXML "b" [Leaf "2"],Leaf "\n"],Leaf "\n"]
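The Leaf "\n " entries in that output are just the whitespace between tags. If they get in the way, one option is to drop whitespace-only text nodes while converting. The following is a minimal sketch of that idea, not part of the original snippet: c2typeTrim is a hypothetical variant of c2type, the Unit constructor is left out, and the XML is inlined as a string instead of being read from input.xml.

```haskell
import Data.Char (isSpace)
import Data.Maybe (mapMaybe)
import Text.XML.Light

data MyXML
  = MyXML String [MyXML] -- Nested XML elements
  | Leaf String          -- Leaves in the XML doc
  deriving (Show, Eq)

-- Hypothetical variant of c2type: text nodes that contain only
-- whitespace are dropped, everything else is converted as before.
c2typeTrim :: Content -> Maybe MyXML
c2typeTrim (Text s)
  | all isSpace (cdData s) = Nothing
  | otherwise              = Just (Leaf (cdData s))
c2typeTrim (CRef _) = Nothing
c2typeTrim (Elem e) =
  Just (MyXML (qName (elName e)) (mapMaybe c2typeTrim (elContent e)))

main :: IO ()
main =
  print (mapMaybe c2typeTrim (parseXML "<work>\n  <a>1</a>\n  <b>2</b>\n</work>\n"))
```

With the whitespace nodes gone, the result for the small example is a single MyXML "work" node whose children are only the a and b elements.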
The functions you'll probably find most useful for your more extensive use case include:
(qName . elName) -- Get the name of a tag in String format from an Element
elContent        -- Get the child Contents of an Element (recurse over them to walk the tree)
elAttribs        -- Can check those 'type' attributes on some of your tags
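As a rough illustration of these accessors, the sketch below lists the tag name, the optional type attribute, and the text of each direct child of an element. The helper names tagName, typeAttr, and describeChildren are made up for this example; findAttr, unqual, elChildren, strContent, and parseXMLDoc come from the xml package. The inlined XML is a shortened fragment of the question's response.

```haskell
import Text.XML.Light

-- Hypothetical helper: the tag name of an element as a plain String.
tagName :: Element -> String
tagName = qName . elName

-- Hypothetical helper: the value of the unqualified "type" attribute, if present.
typeAttr :: Element -> Maybe String
typeAttr = findAttr (unqual "type")

-- Hypothetical helper: (tag name, type attribute, text) for each direct child element.
describeChildren :: Element -> [(String, Maybe String, String)]
describeChildren e =
  [ (tagName c, typeAttr c, strContent c) | c <- elChildren e ]

main :: IO ()
main =
  case parseXMLDoc "<work><id type=\"integer\">4640799</id><average_rating>4.46</average_rating></work>" of
    Nothing   -> putStrLn "not an XML document"
    Just root -> mapM_ print (describeChildren root)
```

findAttr looks the attribute up in elAttribs for you, so you only need to walk elAttribs by hand if you want to see every attribute at once.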
To get a look at the general structure of the data types that the XML parser returns for your input, I strongly recommend uncommenting the print xml line in the code example above and inspecting the list of Contents it prints on the command line. That alone should tell you exactly which fields you care about. For example, this is what you get for my more minimal XML input example:
[Elem (Element {elName = QName {qName = "work", qURI = Nothing, qPrefix = Nothing}, elAttribs = [], elContent = [Text (CData {cdVerbatim = CDataText, cdData = "\n ", cdLine = Just 1}),Elem (Element {elName = QName {qName = "a", qURI = Nothing, qPrefix = Nothing}, elAttribs = [], elContent = [Text (CData {cdVerbatim = CDataText, cdData = "1", cdLine = Just 2})], elLine = Just 2}),Text (CData {cdVerbatim = CDataText, cdData = "\n ", cdLine = Just 2}),Elem (Element {elName = QName {qName = "b", qURI = Nothing, qPrefix = Nothing}, elAttribs = [], elContent = [Text (CData {cdVerbatim = CDataText, cdData = "2", cdLine = Just 3})], elLine = Just 3}),Text (CData {cdVerbatim = CDataText, cdData = "\n", cdLine = Just 3})], elLine = Just 1}),Text (CData {cdVerbatim = CDataText, cdData = "\n", cdLine = Just 4})]
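Once you know which fields you want, one way to get from Element values to your own records is to combine findChild, strContent, and readMaybe in the Maybe applicative. The sketch below is only one possible approach under a few assumptions: the records are trimmed-down versions of the ones in the question, they use String rather than Text so that only the xml package is needed, the work's id field is renamed workID to avoid clashing with Prelude's id, and the helpers childText and childRead are names invented for this example.

```haskell
import Text.Read (readMaybe)
import Text.XML.Light

-- Trimmed-down versions of the question's records (String instead of Text).
data Author = Author
  { authorID   :: Int
  , authorName :: String
  } deriving (Show, Eq)

data Book = Book
  { bID     :: Int
  , bTitle  :: String
  , bAuthor :: Author
  } deriving (Show, Eq)

data Work = Work
  { workID        :: Int
  , averageRating :: Double
  , bestBook      :: Book
  } deriving (Show, Eq)

-- Text of the first direct child with the given unqualified name.
childText :: String -> Element -> Maybe String
childText name e = strContent <$> findChild (unqual name) e

-- Same, but parsed with readMaybe; Nothing if the child is missing or malformed.
childRead :: Read a => String -> Element -> Maybe a
childRead name e = childText name e >>= readMaybe

parseAuthor :: Element -> Maybe Author
parseAuthor e = Author <$> childRead "id" e <*> childText "name" e

parseBook :: Element -> Maybe Book
parseBook e =
  Book <$> childRead "id" e
       <*> childText "title" e
       <*> (findChild (unqual "author") e >>= parseAuthor)

parseWork :: Element -> Maybe Work
parseWork e =
  Work <$> childRead "id" e
       <*> childRead "average_rating" e
       <*> (findChild (unqual "best_book") e >>= parseBook)

-- Every <work> element in the input that parses successfully;
-- works with missing or malformed fields are silently skipped.
parseWorks :: String -> [Work]
parseWorks src =
  [ w | root <- onlyElems (parseXML src)
      , el   <- findElements (unqual "work") root
      , Just w <- [parseWork el] ]

-- A shortened copy of one <work> entry from the question's response.
sample :: String
sample = unlines
  [ "<results><work>"
  , "<id type=\"integer\">4640799</id>"
  , "<average_rating>4.46</average_rating>"
  , "<best_book type=\"Book\">"
  , "<id type=\"integer\">3</id>"
  , "<title>Harry Potter and the Sorcerer's Stone (Harry Potter, #1)</title>"
  , "<author><id type=\"integer\">1077326</id><name>J.K. Rowling</name></author>"
  , "</best_book>"
  , "</work></results>"
  ]

main :: IO ()
main = mapM_ print (parseWorks sample)
```

Because findChild only looks at direct children, the id read in parseWork is the work's own id and not the one nested inside best_book. If you would rather see why a work failed to parse instead of having it skipped, the same structure carries over to Either String in place of Maybe.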