Case-insensitive tag matching with xml-conduit?

What's the best way to perform case-insensitive tag and attribute name matching using xml-conduit?

For example, consider the findNodes function from the HTML parsing example on FP Complete's School of Haskell:

https://www.fpcomplete.com/school/starting-with-haskell/libraries-and-frameworks/text-manipulation/tagsoup

-- The data we're going to search for
findNodes :: Cursor -> [Cursor]
findNodes = element "span" >=> attributeIs "class" "sb_count" >=> child

(I've modified this line so that it works with Bing's current page structure.)

My experiments indicate that element and attributeIs compare names case-sensitively. Is there an easy way to change this?
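A minimal sketch of that behaviour, assuming a small hand-written XML snippet instead of the Bing page (the names `sampleDoc` and `matchCounts` are made up for illustration). It counts how many descendants `element` matches for a lower-case and an upper-case spelling of the same tag:

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Text.XML (Document, def, parseText_)
import Text.XML.Cursor (element, fromDocument, ($//))

-- Hypothetical sample input: the tag is written in upper case.
sampleDoc :: Document
sampleDoc = parseText_ def "<ROOT><SPAN>42</SPAN></ROOT>"

-- Number of descendants matched by "span" and by "SPAN" respectively.
matchCounts :: (Int, Int)
matchCounts =
    ( length (root $// element "span")
    , length (root $// element "SPAN")
    )
  where
    root = fromDocument sampleDoc

main :: IO ()
main = print matchCounts
```

Only the spelling that matches the document exactly finds the element, so the first count is 0 and the second is 1.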

I've found a workaround, but I'm still interested in a cleaner solution.

Basically, we create our own version of Text.HTML.DOM that lowercases the tag and attribute names in the tag event stream just before the XML tree is built.

The function eventConduit begins like this:

eventConduit :: Monad m => Conduit S.ByteString m XT.Event
eventConduit =
    TS.tokenStream =$= go []
  where
    go stack = do
        mx <- await
        case fmap (entities . fmap' (decodeUtf8With lenientDecode)) mx of
            Nothing -> closeStack stack
...

We change the case fmap ... line to:

        case fmap (entities . fixNames . fmap' (decodeUtf8With lenientDecode)) mx of

where fixNames is defined as:

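-- Needs: import Control.Arrow (first) and import qualified Data.Text as T.
-- TS is the same tagstream-conduit alias that eventConduit already uses.
-- Lowercase tag and attribute names; attribute values are left untouched.
fixNames :: TS.Token' Text -> TS.Token' Text
fixNames (TS.TagOpen x pairs b) = TS.TagOpen (T.toLower x) (map (first T.toLower) pairs) b
fixNames (TS.TagClose x)        = TS.TagClose (T.toLower x)
fixNames t                      = t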

Now we just use lowercase names in element and attributeIs.

You can use laxElement to ignore case when matching elements. It will also ignore namespaces. It should be pretty easy to write a wrapper around checkName that has the exact semantics you're looking for.
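One way such a wrapper might look, as a rough sketch rather than anything xml-conduit provides: `sameNameCI`, `elementCI`, and `attributeIsCI` are hypothetical helpers, and the sample document is invented. The sketch assumes an xml-conduit version in which `elementAttributes` is a `Map Name Text`. Unlike laxElement, it still compares namespaces exactly and only folds the case of the local name.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Control.Monad ((>=>))
import qualified Data.Map as M
import Data.Text (Text)
import qualified Data.Text as T
import Text.XML (Document, Element (..), Name (..), def, parseText_)
import Text.XML.Cursor
    (Axis, Cursor, checkElement, checkName, child, content, fromDocument, ($//))

-- True when the namespaces are equal and the local names differ at most by case.
sameNameCI :: Name -> Name -> Bool
sameNameCI a b =
    nameNamespace a == nameNamespace b
        && T.toCaseFold (nameLocalName a) == T.toCaseFold (nameLocalName b)

-- Hypothetical case-insensitive counterpart of 'element'.
elementCI :: Name -> Axis
elementCI wanted = checkName (sameNameCI wanted)

-- Hypothetical case-insensitive counterpart of 'attributeIs'.
-- Only the attribute name is case-folded; the value must match exactly.
attributeIsCI :: Name -> Text -> Axis
attributeIsCI wanted value = checkElement hasMatch
  where
    hasMatch e = any matches (M.toList (elementAttributes e))
    matches (k, v) = sameNameCI wanted k && v == value

-- The query from the question, rewritten with the helpers above.
findNodes :: Cursor -> [Cursor]
findNodes = elementCI "span" >=> attributeIsCI "class" "sb_count" >=> child

-- Invented sample input with mixed-case tag and attribute names.
sampleDoc :: Document
sampleDoc = parseText_ def "<ROOT><SPAN Class=\"sb_count\">42</SPAN></ROOT>"

main :: IO ()
main = print (fromDocument sampleDoc $// (findNodes >=> content))
```

This leaves the parsed tree untouched, so the original spelling of names is preserved if the document is rendered again, at the cost of scanning an element's attributes on every attributeIsCI check.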
