Java Jsoup question: how can I split by word?
I want to get the HTML content without the tags, so that the result is
word
word
word
So I tried the following.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.URL;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Document.OutputSettings;
import org.jsoup.safety.Whitelist;

public class PreProcessing {
    public static void main(String[] args) throws Exception {
        PrintWriter out = new PrintWriter("filename.txt");
        URL url = new URL("https://en.wikipedia.org/wiki/Distributed_computing");
        BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()));
        String inputLine = "";
        String input = "";
        while ((inputLine = in.readLine()) != null)
        {
            input += inputLine;
            // System.out.println(inputLine);
        }
        //create Jsoup document from HTML
        Document jsoupDoc = Jsoup.parse(input);
        //set pretty print to false, so \n is not removed
        jsoupDoc.outputSettings(new OutputSettings().prettyPrint(false));
        //select all <br> tags and append \n after that
        // jsoupDoc.select("br").after("\\n");
        //select all <p> tags and prepend \n before that
        // jsoupDoc.select("p").before("\\n");
        //get the HTML from the document, and retaining original new lines
        String str = jsoupDoc.html().replaceAll(" ", "\n");
        // str.replaceAll("\t", "");
        String strWithNewLines = Jsoup.clean(str, "", Whitelist.none(), new OutputSettings().prettyPrint(false));
        strWithNewLines.replaceAll("\t", "\n");
        strWithNewLines.replaceAll("\"", "");
        strWithNewLines.replaceAll(".", "");
        System.out.println(strWithNewLines);
        out.print(strWithNewLines);
    }
}
This is my code. I read the English Wikipedia Distributed_computing page with a BufferedReader and parse it with jsoupDoc. I want to replace every " " with "\n", because I want output like word \n word \n word \n.
The result is
Distributed
computing
-
Wikipedia Distributed
computing From
Wikipedia,
the
free
encyclopedia Jump
to
navigation Jump
to
search "Distributed
application"
redirects
here.
For
trustless
applications,
see
But I want a result like this
Distributed
computing
-
Wikipedia
Distributed
computing
From
Wikipedia
the
free
encyclopedia
Jump
to
navigation
Jump
to
search
Distributed
application
redirects
here
For
trustless
applications
see
I tried things like
strWithNewLines.replaceAll("\"", "");
strWithNewLines.replaceAll(".", "");
but that did not work. Why doesn't it work? I searched Google but could not find a solution.
Try this for the last few lines. It will get you closer to the desired result:
String strWithNewLines = Jsoup.clean ...;
String result = strWithNewLines.replaceAll("\t", "\n")
.replaceAll("\"", "");
//.replaceAll(".", "");
System.out.println(result);
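The problem in your code is that String is immutable, so String.replaceAll does not replace anything in the original String; it returns a new string with the replacements made. But you never use that result.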
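To make the immutability point concrete, here is a minimal sketch. The class name ImmutableReplaceDemo and the helper cleanUp are hypothetical names introduced only for illustration; the sample string is made up and does not come from the Wikipedia page.

```java
// Minimal sketch: String.replaceAll returns a new String and leaves the receiver unchanged.
class ImmutableReplaceDemo {

    // Hypothetical helper: turns tabs into newlines and removes double quotes,
    // keeping the result of each call by chaining.
    static String cleanUp(String text) {
        return text.replaceAll("\t", "\n")
                   .replaceAll("\"", "");
    }

    public static void main(String[] args) {
        String original = "\"Distributed\tapplication\"";

        // Result discarded: 'original' still contains the quotes and the tab.
        original.replaceAll("\"", "");
        System.out.println(original.contains("\"")); // prints true

        // Result kept: the quotes are gone and the tab became a line break.
        String cleaned = cleanUp(original);
        System.out.println(cleaned);
    }
}
```

Assigning the return value (or chaining the calls, as the snippet in this answer does) is the only change needed for the replacements to take effect.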
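As for .replaceAll(".", ""): it is commented out because it would give you an essentially empty string. In a regular expression, . matches every character (except line terminators by default), and each match is replaced with an empty string.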
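If the goal is to drop literal periods, commas and quotes and print one word per line, one possible approach is sketched below. It is not part of the original answer: the class WordPerLineDemo, the helper toWordPerLine and the sample sentence are assumptions made for illustration, and the input is a plain string instead of jsoup output.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: remove selected punctuation, then split on any whitespace.
class WordPerLineDemo {

    // Hypothetical helper: returns the words of 'text', one per line,
    // with double quotes, periods and commas removed.
    static String toWordPerLine(String text) {
        // Inside a character class, '.' is a literal period, not "any character".
        String stripped = text.replaceAll("[\".,]", "");
        List<String> words = new ArrayList<>();
        // \\s+ matches runs of spaces, tabs and line breaks.
        for (String word : stripped.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                words.add(word);
            }
        }
        return String.join("\n", words);
    }

    public static void main(String[] args) {
        String sample = "Wikipedia, the free\tencyclopedia \"Distributed application\" redirects here.";

        // Unescaped '.' is a wildcard: every character of this one-line sample is removed.
        System.out.println("[" + sample.replaceAll(".", "") + "]"); // prints []

        // Escaping the dot, or using String.replace (no regex), removes only periods.
        System.out.println("a.b.c".replaceAll("\\.", "")); // prints abc
        System.out.println("a.b.c".replace(".", ""));      // prints abc

        System.out.println(toWordPerLine(sample));
    }
}
```

Splitting on \\s+ instead of replacing only " " also handles tabs and line breaks in one step. A standalone "-" is kept as its own line, which matches the desired output shown in the question.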