Regular Expressions (HTML parsing on iPhone)

I am trying to pull data from a website using Objective-C. This is all very new to me, so I've done some research. What I know now is that I need to use XPath, and that there is a wrapper for it on the iPhone called hpple. I've got it up and running in my project.

I am confused about the way I retrieve information from the site. Apparently I am to use regular expressions in this line of code:

NSArray * a = [doc search:@"//a[@class='sponsor']"];

This is just an example. Is the string passed to search:@"...." a regular expression? If so, I guess I can develop the hundreds of patterns that my program will need to parse the site (I need a lot of data), but is there a better way? I'm very lost in this. Any help is appreciated.

The parameter is an XPath, not a regular expression. Here's a breakdown:

  • All XPaths are interpreted relative to a context node. In this case, it's the root node.
  • // is an abbreviation meaning "this node and all its descendants" (short for /descendant-or-self::node()/)
  • a means "all child elements named 'a'" (in HTML, that's anchors)
  • [...] contains a predicate, refining just which a to match
    • @ is an abbreviation for the attribute axis, which selects attribute nodes
    • @class means an attribute named "class"
    • @class='sponsor' means a class attribute equal to "sponsor". Note this will not match nodes with a class that merely contains "sponsor", such as <a class="big sponsor" ...>; the class must be equal.

All together, we have "'a' nodes descending from the root that have class equal to 'sponsor'".
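To see the "equal, not containing" point in action, here is a minimal sketch. It uses Python's standard library rather than hpple so it can be run without an iOS project, and the markup is made up for illustration. ElementTree only implements a subset of XPath and wants a leading "." for searches relative to an element, so the query is written .//a instead of //a. In an engine that supports all of XPath 1.0, the usual idiom for matching one class among several is //a[contains(concat(' ', normalize-space(@class), ' '), ' sponsor ')]; I have not tried that with hpple itself, so treat it as an assumption there.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup for illustration; not taken from any real site.
PAGE = """
<html><body>
  <a class="sponsor" href="/acme">Acme</a>
  <a class="big sponsor" href="/globex">Globex</a>
  <a class="other" href="/about">About</a>
</body></html>
"""

root = ET.fromstring(PAGE)

# Exact match: the whole class attribute must equal "sponsor".
exact = root.findall(".//a[@class='sponsor']")
exact_names = [a.text for a in exact]

# Token match done in Python instead of XPath: split the class
# attribute on whitespace and look for "sponsor" among the tokens.
token = [a for a in root.iter("a") if "sponsor" in a.get("class", "").split()]
token_names = [a.text for a in token]

print(exact_names)  # ['Acme']
print(token_names)  # ['Acme', 'Globex']
```

The exact query skips the "big sponsor" link, which is the behaviour described in the last bullet above.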

That is an XPath expression, not a regular expression. The W3C has an XPath reference here: http://www.w3.org/TR/xpath/ . Basically you are searching for <a> elements with the class "sponsor".

Note that this is a good thing! Regular expressions are bad for parsing HTML.
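One way to see why: a pattern has to anticipate every way the same tag can be written, while a parser normalises all of them. The sketch below again uses Python's standard library rather than hpple, with made-up markup and a deliberately naive pattern, so it is an illustration of the general problem and not a claim about any particular site.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical markup: the same kind of link written two equivalent ways.
PAGE = """
<div>
  <a class="sponsor" href="/acme">Acme</a>
  <a href="/globex" class='sponsor'>Globex</a>
</div>
"""

# A naive pattern that hard-codes attribute order and quote style.
pattern = re.compile(r'<a class="sponsor"[^>]*>([^<]*)</a>')
regex_names = pattern.findall(PAGE)

# A parser sees both links the same way, however they were written.
root = ET.fromstring(PAGE)
parsed_names = [a.text for a in root.findall(".//a[@class='sponsor']")]

print(regex_names)   # ['Acme']
print(parsed_names)  # ['Acme', 'Globex']
```

The pattern could be patched to cope with this case, but each new variation (extra attributes, different quoting, nesting) needs another patch, whereas the XPath stays the same.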
