Case-insensitive tag matching with xml-conduit?
What's the best way to perform case-insensitive tag and attribute name matching using xml-conduit?
For example, consider the findNodes function from the HTML parsing example on FP Complete's School of Haskell:
https://www.fpcomplete.com/school/starting-with-haskell/libraries-and-frameworks/text-manipulation/tagsoup
-- The data we're going to search for
findNodes :: Cursor -> [Cursor]
findNodes = element "span" >=> attributeIs "class" "sb_count" >=> child
(I've modified this line so that it will work with Bing's current page structure.)
My experiments indicate that element and attributeIs do not perform case-insensitive comparisons when matching names. Is there an easy way to change this?
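A minimal sketch of the kind of experiment that shows this behaviour. The sample document and the names sample, byTag and byTagAndAttr are made up for illustration, and the input is parsed as XML with Text.XML rather than fetched from Bing:

```haskell
{-# LANGUAGE OverloadedStrings #-}
import           Control.Monad ((>=>))
import           Data.Text (Text)
import           Text.XML (def, parseText_)
import           Text.XML.Cursor

-- Hypothetical sample: one upper-case tag, one upper-case attribute name.
sample :: Cursor
sample = fromDocument $ parseText_ def
  "<root><SPAN class=\"sb_count\">upper</SPAN><span CLASS=\"sb_count\">lower</span></root>"

-- element compares names exactly, so <SPAN> is skipped.
byTag :: [Text]
byTag = sample $/ element "span" &/ content

-- attributeIs compares the attribute name exactly too, so CLASS is skipped.
byTagAndAttr :: [Text]
byTagAndAttr = sample $/ element "span" >=> attributeIs "class" "sb_count" &/ content
```

Evaluating byTag and byTagAndAttr in GHCi shows which of the two spans each query picks up.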
I've found a workaround, but I'm still interested in a cleaner solution.
Basically, we create our own version of Text.HTML.DOM which fixes up the tag and attribute names in the tag event stream just before the XML tree is created.
The function eventConduit begins like this:
eventConduit :: Monad m => Conduit S.ByteString m XT.Event
eventConduit =
    TS.tokenStream =$= go []
  where
    go stack = do
        mx <- await
        case fmap (entities . fmap' (decodeUtf8With lenientDecode)) mx of
            Nothing -> closeStack stack
            ...
We change the case fmap ... line to:
case fmap (entities . fixNames . fmap' (decodeUtf8With lenientDecode)) mx of
where fixNames is defined as:
fixNames :: TS.Token' Text -> TS.Token' Text
fixNames (TS.TagOpen x pairs b) = TS.TagOpen (T.toLower x) (map (T.toLower *** id) pairs) b
fixNames (TS.TagClose x) = TS.TagClose (T.toLower x)
fixNames t = t
Now we just use lowercase names in element and attributeIs.
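To make the effect of fixNames concrete, here is a self-contained sketch that runs without the tokenizer library. The Token type below is a hypothetical stand-in reduced to three constructors (the real TS.Token' has more cases), and first from Data.Bifunctor is used in place of (*** id); both are assumptions made for illustration, not part of the workaround itself.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import           Data.Bifunctor (first)
import           Data.Text (Text)
import qualified Data.Text as T

-- Hypothetical stand-in for the tokenizer's token type.
data Token
  = TagOpen Text [(Text, Text)] Bool  -- name, attributes, self-closing flag
  | TagClose Text
  | Content Text
  deriving (Eq, Show)

-- Lowercase tag and attribute names; attribute values and text are untouched.
fixNames :: Token -> Token
fixNames (TagOpen name attrs selfClosing) =
  TagOpen (T.toLower name) (map (first T.toLower) attrs) selfClosing
fixNames (TagClose name) = TagClose (T.toLower name)
fixNames t = t

-- Example input with upper-case names and a mixed-case attribute value.
example :: Token
example = fixNames (TagOpen "SPAN" [("CLASS", "SB_Count")] False)
```

Note that only names are normalised. An attribute value such as "SB_Count" keeps its case, so the value passed to attributeIs still has to match exactly.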
You can use laxElement to ignore case when matching elements. It will also ignore namespaces. It should be pretty easy to write a wrapper around checkName that has the exact semantics you're looking for.
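As a rough sketch of such a wrapper, assuming the semantics wanted are "ignore case but still respect namespaces": the names sameFolded, ciElement, ciAttributeIs and demo are hypothetical helpers, not part of xml-conduit, and the attribute check is built on checkElement because checkName only inspects the element name.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import           Control.Monad ((>=>))
import qualified Data.Map as Map
import           Data.Text (Text)
import qualified Data.Text as T
import           Text.XML (Element (..), Name (..), def, parseText_)
import           Text.XML.Cursor

-- Compare two names after Unicode case folding.
sameFolded :: Text -> Text -> Bool
sameFolded a b = T.toCaseFold a == T.toCaseFold b

-- Like element, but the local name is compared case-insensitively.
ciElement :: Name -> Axis
ciElement wanted = checkName $ \actual ->
  nameNamespace actual == nameNamespace wanted
    && sameFolded (nameLocalName actual) (nameLocalName wanted)

-- Like attributeIs, but the attribute name is compared case-insensitively.
-- The attribute value is still compared exactly.
ciAttributeIs :: Name -> Text -> Axis
ciAttributeIs wanted value = checkElement $ \e ->
  any matches (Map.toList (elementAttributes e))
  where
    matches (k, v) =
      nameNamespace k == nameNamespace wanted
        && sameFolded (nameLocalName k) (nameLocalName wanted)
        && v == value

-- The query from the question, rewritten with the helpers above.
findNodes :: Cursor -> [Cursor]
findNodes = ciElement "span" >=> ciAttributeIs "class" "sb_count" >=> child

-- Made-up input with upper-case tag and attribute names.
demo :: [Text]
demo = fromDocument doc $/ findNodes >=> content
  where
    doc = parseText_ def "<root><SPAN CLASS=\"sb_count\">42</SPAN></root>"
```

If namespaces should be ignored as well, dropping the nameNamespace comparisons gives behaviour closer to laxElement.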