
A specific use case of algebraic data types

I wrote a generic enumerator for scraping sites as an exercise. It is complete and works fine, but I have a question about its design. The code is at https://github.com/mindreader/scrape-enumerator if you want to look at it.

The basic idea is an enumerator that emits site-defined entries from paginated sources such as search engines and blogs: you fetch a page, it holds 25 entries, and you want them one at a time. I didn't want to write the plumbing for every site, so I wanted a generic interface. What I came up with is this (it uses type families):

class SiteEnum a where
  type Result a :: *
  urlSource :: a -> InputUrls (Int,Int)
  enumResults :: a -> L.ByteString -> Maybe [Result a]

data InputUrls state =
  UrlSet [URL] |
  UrlFunc state (state -> (state,URL)) |
  UrlPageDependent URL (L.ByteString -> Maybe URL)

To work for every kind of site, this needs a URL source of some sort. It can be a list (possibly infinite) of pregenerated URLs; or an initial state plus a function that generates URLs from it (for example, URLs containing &page=1, &page=2, etc.); or, for really awkward pages like Google, an initial URL plus a function that searches the page body for the next link. A site makes a data type an instance of SiteEnum and gives Result a site-dependent type; the enumerator then deals with all the I/O, so you don't have to think about it. This works perfectly, and I have implemented one site with it.
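As an illustration of what a site instance might look like, here is a minimal sketch. It assumes URL is a plain String, and ExampleBlog, its example.com address, and the one-entry-per-line page format are all hypothetical; the real instance in the repository may look quite different.

```haskell
{-# LANGUAGE TypeFamilies #-}

import Data.Kind (Type)
import qualified Data.ByteString.Lazy as L
import qualified Data.ByteString.Lazy.Char8 as LC

-- Assumption: the question does not show URL, so a String stands in for it.
type URL = String

data InputUrls state
  = UrlSet [URL]
  | UrlFunc state (state -> (state, URL))
  | UrlPageDependent URL (L.ByteString -> Maybe URL)

class SiteEnum a where
  type Result a :: Type  -- written `*` in the question; same kind
  urlSource :: a -> InputUrls (Int, Int)
  enumResults :: a -> L.ByteString -> Maybe [Result a]

-- Hypothetical site: a blog whose pages are numbered from 1.
data ExampleBlog = ExampleBlog

instance SiteEnum ExampleBlog where
  type Result ExampleBlog = String

  -- State is (current page, entries per page); only the page advances.
  urlSource _ = UrlFunc (1, 25) step
    where
      step (page, perPage) =
        ((page + 1, perPage), "http://example.com/blog?page=" ++ show page)

  -- Toy parser: treat every non-empty line of the body as one entry,
  -- and report Nothing when the page has no entries at all.
  enumResults _ body =
    case filter (not . null) (lines (LC.unpack body)) of
      [] -> Nothing
      es -> Just es
```

Only the instance is site specific; fetching pages and handing out entries one at a time would stay in the generic enumerator.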

My question is about an annoyance in this implementation: the InputUrls datatype. When I use UrlFunc, everything is golden. When I use UrlSet or UrlPageDependent, it is less fun, because the state type is left undetermined and I have to annotate the value as :: InputUrls () for it to compile. This seems totally unnecessary, since given the way the program is written that type variable will never be used for the majority of sites, but I don't know how to get around it. I find that I want to use types like this in many different contexts, and I always end up with stray type variables that are needed only in certain constructors of the datatype. It doesn't feel like I should be using them this way. Is there a better way of doing this?
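One way this kind of error can arise is sketched below. It assumes the consuming code places a class constraint on the state type; the describe function and its Show constraint are hypothetical and stand in for whatever the real enumerator requires, and URL is again assumed to be a String.

```haskell
import qualified Data.ByteString.Lazy as L

-- Assumption: a String stands in for the question's URL type.
type URL = String

data InputUrls state
  = UrlSet [URL]
  | UrlFunc state (state -> (state, URL))
  | UrlPageDependent URL (L.ByteString -> Maybe URL)

-- Hypothetical consumer with a class constraint on the state.
describe :: Show state => InputUrls state -> String
describe (UrlSet us)            = "fixed list, first URL: " ++ show (take 1 us)
describe (UrlFunc s _)          = "generated from state " ++ show s
describe (UrlPageDependent u _) = "page dependent, starting at " ++ u

-- Rejected by GHC when compiled: nothing in the argument determines
-- `state`, so the Show constraint is ambiguous.
-- bad = describe (UrlSet ["http://example.com/"])

-- Accepted: the annotation pins the unused type variable to ().
ok :: String
ok = describe (UrlSet ["http://example.com/"] :: InputUrls ())
```

The annotation carries no information the program uses; it only gives the type checker something to pick for a variable that UrlSet never mentions.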

Why do you need the UrlFunc case at all? From what I understand, the only thing you do with the state function is use it to build a list like the one in UrlSet anyway. So instead of storing the state function, just store the resulting list. That way you can eliminate the state type variable from your data type, which should eliminate the ambiguity problems.
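A minimal sketch of that suggestion, assuming URL is a String: the state function is unfolded into a lazy list up front, so InputUrls no longer needs a type parameter. The names urlsFromState and pagedUrls, and the example.com address, are made up for illustration.

```haskell
import Data.List (unfoldr)
import qualified Data.ByteString.Lazy as L

-- Assumption: a String stands in for the question's URL type.
type URL = String

-- InputUrls without the state parameter; the UrlFunc case is gone.
data InputUrls
  = UrlSet [URL]
  | UrlPageDependent URL (L.ByteString -> Maybe URL)

-- Turn an initial state and a step function into the (lazy, infinite)
-- list of URLs that UrlFunc would have generated one at a time.
urlsFromState :: state -> (state -> (state, URL)) -> InputUrls
urlsFromState s0 step = UrlSet (unfoldr next s0)
  where
    next s = let (s', url) = step s in Just (url, s')

-- Hypothetical usage: pages numbered from 1.
pagedUrls :: InputUrls
pagedUrls =
  urlsFromState (1 :: Int)
    (\n -> (n + 1, "http://example.com/search?page=" ++ show n))
```

Because the list is built lazily, the enumerator would still compute one URL at a time as it consumes UrlSet, and the state type stays local to urlsFromState.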
