Replace all characters in a string that are between two other characters in Java

First time coding Java here, so bear with me :P I am trying to make a program in Java that opens an HTML file and edits it so that it removes all of its HTML tags, and only the tags, leaving everything else in place. I am assuming that the file already exists and I don't need to create it. For now I have been working with a .txt file that has HTML code in it, in order to get started faster. So far I have managed to edit the file so that it simply removes the <html> tag and replaces it with nothing. However, what I really want is to remove anything that is inside the opening and closing angle brackets. Here is an example of what I need:

<html>
<body>
<p> blah blah blah 
</p> 
</body> 
</html>

After my program has been executed, the txt file should have only "blah blah blah" in it. In order to replace the tag, I am using:

    if (myString.contains("<html>")) {
        // do stuff
    }

So here is my question: is there something like an escape character in Java that allows me to say:

if(myString.contains("<") && it is followed by as many characters as the file wants by (">") )
//then remove everything in between them.

For the sake of our sanity, let's assume that the HTML code inside the .txt file has no errors. I will post the code if you want me to, but it is really badly structured and I don't think it will help you understand what I am doing at all. That is because I have been trying a lot of things simultaneously and have kept whatever I might find useful as comments. Thank you for your time!

You can use String.replaceAll with a regular expression.

"<html><p>foo bar</p></html>".replaceAll("</?[A-Za-z]+>", "");

Results in:

foo bar

However, be careful not to try to parse the HTML with regular expressions.
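For the file-based task in the question, the same replaceAll call could be wrapped roughly as follows. This is only a sketch: the class name TagStripper, the method stripTags, and the temporary file are made up for illustration, and the added trim() is an assumption about wanting the leftover blank lines removed. Note that this pattern only matches plain tags such as <html> or </p>; it does not match tags with attributes or digits (for example <h1> or <a href="...">).

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class TagStripper {
    // Removes plain opening and closing tags, then trims the whitespace
    // (including newlines) left at both ends of the text.
    static String stripTags(String html) {
        return html.replaceAll("</?[A-Za-z]+>", "").trim();
    }

    public static void main(String[] args) throws IOException {
        // A temporary file stands in for the asker's existing .txt file.
        String sample = "<html>\n<body>\n<p> blah blah blah \n</p> \n</body> \n</html>\n";
        Path file = Files.createTempFile("page", ".txt");
        Files.write(file, sample.getBytes(StandardCharsets.UTF_8));

        // Read the whole file, strip the tags, and write the result back.
        String content = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
        Files.write(file, stripTags(content).getBytes(StandardCharsets.UTF_8));

        // Prints: blah blah blah
        System.out.println(new String(Files.readAllBytes(file), StandardCharsets.UTF_8));
        Files.delete(file);
    }
}
```

Reading the whole file into one string keeps the example short; for very large files a line-by-line approach may be preferable.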

With JSoup, you can strip all the tags from an HTML page very simply:

Jsoup.parse(myString).text()

Try a regular expression like this. Any substring that starts with < and ends with the next > is replaced by an empty string, so only the text (blah blah ...) remains. The pattern uses [^>]* rather than .* because .* is greedy and, when a line contains several tags, would match from the first < to the last >, removing the text in between as well.

str = str.replaceAll("<[^>]*>", "");
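To see how much greediness matters for patterns of this kind, the sketch below compares <.*> with <[^>]*> on a line that contains more than one tag. The class and method names are hypothetical and exist only for this comparison.

```java
class GreedyDemo {
    // Greedy: .* runs from the first '<' to the last '>' on the line.
    static String greedy(String s) {
        return s.replaceAll("<.*>", "");
    }

    // Negated character class: each match stops at the first '>' after '<'.
    static String perTag(String s) {
        return s.replaceAll("<[^>]*>", "");
    }

    public static void main(String[] args) {
        String html = "<p>foo</p> and <b>bar</b>";
        System.out.println("[" + greedy(html) + "]"); // prints []
        System.out.println("[" + perTag(html) + "]"); // prints [foo and bar]
    }
}
```

The greedy version happens to work on the example file in the question because each line there holds at most one tag, but it deletes the text as soon as two tags share a line.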

