Parsing HTML (in Java/Android) and then extracting data from it: is that an effective way to get web page content?

Question

So, I'm using HTTP POST requests in Android Java to log into a website and then retrieve the entire HTML. After that, I use Pattern/Matcher (regex) to find all the elements I need, extract them from the HTML, and delete everything unnecessary. For instance, when I extract this:

String extractions = "<td>Good day sir</td>";

Then I use:

// Strings are immutable, so the result has to be assigned back
extractions = extractions.replaceAll("<td>", "").replaceAll("</td>", "");

I do this multiple times until I have all the data I need from that site, and then I display it in some kind of list.

I'm not particularly stuck on anything, but can you tell me whether this is an effective/efficient/fast way of getting data from a page and processing it, or are there faster ways to do this? Sometimes my program takes a long time to get certain data (although mostly that's when I'm on 3G on my phone).

Answer 1

Using regex to parse a website is always a bad idea:

How to use regular expressions to parse HTML in Java?

Using regular expressions to parse HTML: why not?
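To make the objection concrete, here is a small sketch of how a pattern like the one in the question can break on markup that a browser would treat the same way. The class name RegexFragility and the sample inputs are made up for illustration; only java.util.regex from the standard library is used.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexFragility {
    // Matches only a bare, lowercase <td> with no attributes.
    private static final Pattern TD = Pattern.compile("<td>(.*?)</td>");

    // Returns the text of the first matched cell, or null if nothing matches.
    public static String firstCell(String html) {
        Matcher m = TD.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // The exact shape from the question: matches as expected.
        System.out.println(firstCell("<td>Good day sir</td>"));
        // An attribute on the tag: the pattern no longer matches at all.
        System.out.println(firstCell("<td class=\"greeting\">Good day sir</td>"));
        // Uppercase tag names are valid HTML but are missed here.
        System.out.println(firstCell("<TD>Good day sir</TD>"));
        // Nested markup leaks into the extracted text.
        System.out.println(firstCell("<td><b>Good</b> day sir</td>"));
    }
}
```

Each of these cases can be patched with a more elaborate pattern, but the patches pile up quickly, which is the usual argument for handing the page to a real HTML parser instead.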

Answer 2

Like others have said, regex is not the best tool for this job. But in this case, the particular way you use regex is even more inefficient than it would normally be.
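One reading of that inefficiency: each call to String.replaceAll compiles its pattern again and makes another full pass over the string, on top of the Pattern/Matcher pass that found the element in the first place. If you do stay with regex, a capture group can return the cell text in a single pass with a pattern that is compiled once. The following is only a sketch; the class name TdExtractor and the sample HTML are assumptions, and it still has the usual limits of regex on real-world HTML.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TdExtractor {
    // Compiled once and reused. The capture group holds the cell text,
    // so no follow-up replaceAll calls are needed to strip the tags.
    private static final Pattern TD =
            Pattern.compile("<td>(.*?)</td>", Pattern.DOTALL);

    // Collects the trimmed text of every plain <td>...</td> cell in one pass.
    public static List<String> extractCells(String html) {
        List<String> cells = new ArrayList<>();
        Matcher m = TD.matcher(html);
        while (m.find()) {
            cells.add(m.group(1).trim());
        }
        return cells;
    }

    public static void main(String[] args) {
        String html = "<tr><td>Good day sir</td><td>Second cell</td></tr>";
        System.out.println(extractCells(html));
    }
}
```

The returned list can be handed straight to whatever adapter backs the list view, without any intermediate string clean-up.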

In any case, let me offer one more possible solution (depending on your use case).

It's called YQL (Yahoo Query Language): http://developer.yahoo.com/yql/

Here is a console for it so you can play around with it: http://developer.yahoo.com/yql/console/

YQL is the lazy developer's way to build your own API on the fly. The main inconvenience is that you have to use Yahoo as a go-between, but if you're OK with that, then I'd suggest you go that route. Using YQL is probably the quickest way to get that kind of work done (especially if the HTML you're targeting keeps changing and its tags are not always valid).

Answer 3

Have a look at the Apache Tika library for extracting text from HTML. It also provides parsers for many other formats, such as PDF: http://tika.apache.org/

3 answers

Answer 1    0    2012-04-04 08:58:25
Answer 2    0    accepted    2012-04-04 09:29:04
Answer 3    0    2012-04-04 09:34:19
