Regular Expressions (HTML parsing on iPhone)
I am trying to pull data from a website using Objective-C. This is all very new to me, so I've done some research. What I know now is that I need to use XPath, and I have a wrapper for that on the iPhone called hpple. I've got it up and running in my project.
I am confused about how to retrieve information from the site. Apparently I am to use regular expressions in this line of code:
NSArray * a = [doc search:@"//a[@class='sponsor']"];
This is just an example. Is the stuff in search:@"...." the regular expression? If so, I guess I can develop the hundreds of patterns that my program will need to parse the site (I need a lot of data), but is there a better way? I'm very lost in this. Any help is appreciated.
The parameter is an XPath expression, not a regular expression. Here's a breakdown:

// is an abbreviation meaning "all descendants" (of the document root, when it starts the expression)
a means "all child elements named 'a'" (in HTML, that's anchors)
[...] contains a predicate, refining just which a to match
@ is an abbreviation for attribute nodes
@class means an attribute named "class"
@class='sponsor' means a class attribute equal to "sponsor". Note this will not match nodes with a class containing "sponsor", such as <a class="big sponsor" ...>; the class must be equal.

All together, we have "'a' elements descending from the root that have class equal to 'sponsor'".
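If you do need to match anchors whose class list merely contains "sponsor", a common XPath 1.0 idiom is to pad the class value with spaces and test for the padded token. The sketch below is not from hpple's documentation. It assumes ARC and a recent hpple release in which the search method is named searchWithXPathQuery: (the question's snippet uses search:, which appears to be the older name for the same call). AnchorsWithClassToken is a hypothetical helper name.

```objc
#import <Foundation/Foundation.h>
#import "TFHpple.h"   // hpple header; assumed to be on the project's include path

// Hypothetical helper: returns the <a> elements whose class attribute
// contains `token` as a whole, space-separated word.
// Assumption: `token` contains no quote characters, since it is pasted
// directly into the XPath string.
static NSArray *AnchorsWithClassToken(TFHpple *doc, NSString *token) {
    // normalize-space() collapses runs of whitespace in @class, and concat()
    // pads both ends with a space, so ' sponsor ' is found inside
    // ' big sponsor ' but not inside ' sponsors '.
    NSString *query = [NSString stringWithFormat:
        @"//a[contains(concat(' ', normalize-space(@class), ' '), ' %@ ')]",
        token];
    return [doc searchWithXPathQuery:query];
}
```

Calling AnchorsWithClassToken(doc, @"sponsor") should then return both <a class="sponsor"> and <a class="big sponsor"> elements, whereas //a[@class='sponsor'] returns only the first kind.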
That is an XPath expression, not a regular expression. The W3C has an XPath reference here: http://www.w3.org/TR/xpath/. Basically you are searching for <a> elements with the class "sponsor".
Note that this is a good thing! Regular expressions are a poor tool for parsing HTML.
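To connect this back to the question, here is a minimal sketch of running the query and reading data out of the matches. It assumes ARC, that the page's HTML has already been downloaded into an NSData, and a recent hpple release that exposes searchWithXPathQuery: (older releases, like the one in the question, appear to call it search:). LogSponsorLinks is a hypothetical function name.

```objc
#import <Foundation/Foundation.h>
#import "TFHpple.h"
#import "TFHppleElement.h"

// Hypothetical function: logs the tag name and href of every
// <a class="sponsor"> element found in the given HTML.
static void LogSponsorLinks(NSData *htmlData) {
    // Parse the raw HTML bytes into a searchable document.
    TFHpple *doc = [[TFHpple alloc] initWithHTMLData:htmlData];

    // The argument is an XPath expression, not a regular expression.
    NSArray *anchors = [doc searchWithXPathQuery:@"//a[@class='sponsor']"];

    for (TFHppleElement *anchor in anchors) {
        // objectForKey: looks up an attribute of the matched element by name;
        // it returns nil if the element has no such attribute.
        NSString *href = [anchor objectForKey:@"href"];
        NSLog(@"<%@> href=%@", [anchor tagName], href);
    }
}
```

Each kind of data you want from the page gets its own XPath expression in place of //a[@class='sponsor']; the parsing and the loop stay the same.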