Get a tweet from HTML content in Java through either regex or at least without external libraries

How can I get the latest tweet from HTML content, either through regex or without any external libraries? I am happy to use external libraries; I would just prefer not to. I just want to know how it would be possible. I have written the HTML download part in Java, and if anyone wants it I will post it here. I'll use a bit of pseudocode so that I'm not only targeting Java developers. This is how my program looks so far:

1.)Load site("www.twitter.com/user123")
2.)Get initial string and write it to variable->buffer
3.)Loop start
4.)    Append string->buffer
5.)    If there is no more ->break
6.)print buffer

Obviously the variable buffer will now hold the raw HTML content. How can I sort through it to get the tweet? I have found a way, but it is too inconsistent: I located the string that holds the tweets and took the content enclosed by the surrounding markup. However, that markup changes too much. What I mean is that some content inside it changes, like the font size. I could write multiple if statements, but is there a neater solution?

Let me just start off by saying that jsoup is an amazing lightweight HTML parsing library. You can use things like CSS selectors and whatnot. If you ever decide to use a library, jsoup will make your life a lot easier.

You can just query for the element with the class TweetTextSize, then get its text content. This will give you all text, hashtags, and links. (The downside is that pictures are also included as links.)

Otherwise, you'll need to manually traverse the DOM. For example, use regex to find the beginning of the first TweetTextSize, and then just keep all text which is not between a < and a >.

Unfortunately, this second solution is volatile and may not work in the future, and you'll end up with a big blob of code which is overly complex and hard to debug.
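The manual approach described above could look roughly like the sketch below. It is only an illustration under assumptions: the tweet is taken to be a <p> element whose attributes contain TweetTextSize, and it is taken to end at the first following </p>. The class FirstTweetText and the method firstTweet are hypothetical names, and HTML entities such as &amp; are not decoded.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FirstTweetText {

    // Assumption: the tweet lives in a <p> tag whose attributes mention "TweetTextSize".
    private static final Pattern START = Pattern.compile("<p[^>]*TweetTextSize[^>]*>");

    // Returns the text of the first tweet, or null if no matching <p> is found.
    static String firstTweet(String html) {
        Matcher m = START.matcher(html);
        if (!m.find()) {
            return null;
        }
        // Assumption: the tweet ends at the first closing </p> after the start tag.
        int end = html.indexOf("</p>", m.end());
        if (end < 0) {
            end = html.length();
        }
        StringBuilder text = new StringBuilder();
        boolean inTag = false;
        for (int i = m.end(); i < end; i++) {
            char c = html.charAt(i);
            if (c == '<') {
                inTag = true;          // entering a nested tag such as <a ...>
            } else if (c == '>') {
                inTag = false;         // leaving the tag
            } else if (!inTag) {
                text.append(c);        // keep only characters outside tags
            }
        }
        return text.toString().trim();
    }

    public static void main(String[] args) {
        // Hypothetical sample markup, not copied from a real page.
        String html = "<div><p class=\"TweetTextSize js-tweet-text\">Hello "
                + "<a href=\"/hashtag/java\">#java</a> world</p></div>";
        System.out.println(firstTweet(html)); // prints "Hello #java world"
    }
}
```

A > character inside an attribute value would confuse this scanner, which is one reason the approach is fragile.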

A simple answer, if you want a regex and not a sophisticated third-party library:

<p[^>]+js-tweet-text[^>]*>(.*)</p>

Try the above on the "view-source" of https://twitter.com/a
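To try the idea without downloading anything, here is a small sketch that runs a variant of the pattern against a hard-coded string. The variant uses a reluctant (.*?) with DOTALL, which is my assumption rather than part of the pattern above: with a greedy (.*), two tweets on the same line would be merged into one match. The class name TweetRegexDemo, the method extractTweets, and the sample markup are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TweetRegexDemo {

    // Group 1 captures everything between the opening <p ...js-tweet-text...> and the first </p>.
    private static final Pattern TWEET = Pattern.compile(
            "<p[^>]+js-tweet-text[^>]*>(.*?)</p>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns the inner HTML of every matching <p>, in document order.
    static List<String> extractTweets(String html) {
        List<String> tweets = new ArrayList<>();
        Matcher m = TWEET.matcher(html);
        while (m.find()) {
            tweets.add(m.group(1));
        }
        return tweets;
    }

    public static void main(String[] args) {
        // Hypothetical sample markup with two tweets on one line.
        String html = "<p class=\"js-tweet-text\">first</p>"
                + "<p class=\"js-tweet-text\">second</p>";
        for (String tweet : extractTweets(html)) {
            System.out.println("Tweet Found: " + tweet);
        }
    }
}
```

The captured group still contains any nested tags (links, hashtags), so it may need a further tag-stripping pass.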

Thanks.

EDIT: Source Code:

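import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;


public class TweetSucker {

    // Group 1 is the tweet body. The reluctant (.*?) with DOTALL stops at the
    // first </p>, so several tweets on one line are not merged into one match.
    private static final Pattern TWEET_PATTERN = Pattern.compile(
            "<p[^>]+js-tweet-text[^>]*>(.*?)</p>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Pulls the charset name out of a Content-Type header value.
    private static final Pattern CHARSET_PATTERN = Pattern.compile(
            "charset=\"?([A-Za-z0-9][\\w.:+-]*)", Pattern.CASE_INSENSITIVE);

    public static void main(String[] args) throws Exception {
        URLConnection urlConnection = new URL("https://twitter.com/a").openConnection();

        // getContentEncoding() returns the Content-Encoding header (e.g. "gzip"),
        // not the character set; the charset is part of Content-Type.
        // Fall back to UTF-8 when none is declared or it is unsupported.
        Charset charset = StandardCharsets.UTF_8;
        String contentType = urlConnection.getContentType();
        if (contentType != null) {
            Matcher charsetMatcher = CHARSET_PATTERN.matcher(contentType);
            if (charsetMatcher.find() && Charset.isSupported(charsetMatcher.group(1))) {
                charset = Charset.forName(charsetMatcher.group(1));
            }
        }

        // Read the whole response body into memory.
        ByteArrayOutputStream byteArrayOutputStream = new ByteArrayOutputStream();
        try (InputStream inputStream = urlConnection.getInputStream()) {
            byte[] buffer = new byte[8192];
            int len;
            while ((len = inputStream.read(buffer)) != -1) {
                byteArrayOutputStream.write(buffer, 0, len);
            }
        }
        String htmlContent = new String(byteArrayOutputStream.toByteArray(), charset);

        Matcher matcher = TWEET_PATTERN.matcher(htmlContent);
        while (matcher.find()) {
            System.out.println("Tweet Found: " + matcher.group(1));
        }
    }
}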

I know that you don't want any libraries, but if you want something really quick, this is working code in C#:

    using (IE browser = new IE())
    {
        browser.GoTo("https://twitter.com/user");
        List tweets = browser.List(Find.ById("stream-items-id"));
        if (tweets != null)
        {
            foreach (var tweet in tweets.ListItems)
            {
                var tweetText = tweet.Paras.FirstOrDefault();
                if (tweetText != null)
                {
                    MessageBox.Show(tweetText.Text);
                }
            }
        }
    }

This program uses a library called WatiN. If you use Visual Studio, go to the Tools menu, select "NuGet Package Manager", then "Manage NuGet Packages for Solution", then "Browse", and type "WatiN" in the search box. Once you find the library, hit "Install". After it is installed, you just add a reference in your code and then a using statement:

using WatiN.Core;

You can just copy and paste the code I wrote above into a button handler and it will work; you need to change the twitter.com/XXXXXX user name to list that user's tweets. Modify the code to meet your needs.
