Printing the content of a web page in Java

I'm trying to read the content of https://example.com/ using the HttpURLConnection class. I've removed the HTML tags between angle brackets, but I can't remove the text between curly braces. Also, there are no spaces between the words that get printed.

Here is the code:

    URL url = new URL("https://example.com/");
    Scanner sc = new Scanner(url.openStream());
    StringBuffer sb = new StringBuffer();
    while(sc.hasNext()) {
        sb.append(sc.next());
    }
    String result = sb.toString();

    //Removing the HTML tags
    result = result.replaceAll("<[^>]*>", " ");
    
    System.out.println("Contents of the web page: "+result);

And this is the output I'm getting:

Contents of the web page: ExampleDomain body{background-color:#f0f0f2;margin:0;padding:0;font-family:-apple-system,system-ui,BlinkMacSystemFont,"SegoeUI","OpenSans","HelveticaNeue",Helvetica,Arial,sans-serif;}div{width:600px;margin:5emauto;padding:2em;background-color:#fdfdff;border-radius:0.5em;box-shadow:2px3px7px2pxrgba(0,0,0,0.02);}a:link,a:visited{color:#38488f;text-decoration:none;}@media(max-width:700px){div{margin:0auto;width:auto;}} ExampleDomain Thisdomainisforuseinillustrativeexamplesindocuments.Youmayusethisdomaininliteraturewithoutpriorcoordinationoraskingforpermission. Moreinformation...

How can I remove the content between the curly braces, and how do I put spaces between the words in the sentences?

To remove the content between curly braces, you can use String#replaceAll(String, String) (see the String Javadoc):

str.replaceAll("\\{.*\\}", "");

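This regex is greedy: it matches everything from the first opening brace to the last closing brace on a line (by default, . does not match line terminators). That works here because the Scanner loop joins the whole page into a single line and all the braces belong to the CSS. Note that text in front of the first brace, such as the selector body, is left in place. The missing spaces come from Scanner#next(), which splits on whitespace and discards it, so appending a space before each token restores them. So your code would be: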

import java.net.URL;
import java.util.Scanner;

URL url = new URL("https://example.com/");
StringBuilder sb = new StringBuilder();

// Scanner#next() drops the whitespace between tokens,
// so put a space back in front of each one.
try (Scanner sc = new Scanner(url.openStream())) {
    while (sc.hasNext()) {
        sb.append(' ').append(sc.next());
    }
}
String result = sb.toString();

// Removing the HTML tags
result = result.replaceAll("<[^>]*>", "");

// Removing the CSS stuff (greedy: first '{' through last '}')
result = result.replaceAll("\\{.*\\}", "");

System.out.println("Contents of the web page: " + result);
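
The brace regex leaves CSS selectors such as body behind, and it would also swallow any ordinary text that happens to sit between two separate brace blocks. As a rough alternative sketch (not part of the original answer), you could remove whole style and script elements before stripping the remaining tags, then collapse the whitespace. The class name PageText and the method visibleText are made up for illustration, and the sample HTML is a shortened stand-in for the real page:

```java
import java.util.regex.Pattern;

public class PageText {
    // (?i) case-insensitive, (?s) lets '.' span line breaks;
    // \1 makes the closing tag match the opening one.
    private static final Pattern STYLE_OR_SCRIPT =
            Pattern.compile("(?is)<(style|script)\\b[^>]*>.*?</\\1\\s*>");
    private static final Pattern TAG = Pattern.compile("<[^>]*>");
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    // Hypothetical helper: returns an approximation of the visible text.
    static String visibleText(String html) {
        // 1. Drop <style>/<script> elements together with their content.
        String text = STYLE_OR_SCRIPT.matcher(html).replaceAll(" ");
        // 2. Replace every remaining tag with a space so words do not fuse.
        text = TAG.matcher(text).replaceAll(" ");
        // 3. Collapse runs of whitespace into single spaces.
        return WHITESPACE.matcher(text).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        // Shortened sample input, assumed for this demo only.
        String html = "<html><head><title>Example Domain</title>\n"
                + "<style>body { margin: 0; }\n"
                + "@media (max-width: 700px) { div { width: auto; } }</style></head>\n"
                + "<body><h1>Example Domain</h1>\n"
                + "<p>This domain is for use in illustrative examples.</p></body></html>";
        System.out.println(visibleText(html));
    }
}
```

To use it with the code above, read the page into a string with its whitespace intact and pass that string to visibleText. Regex-based stripping is still fragile (it does not decode entities such as &amp;amp;, for example), so for anything beyond a quick experiment an HTML parser such as jsoup is probably the safer choice.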


 