
How to extract a substring from a string in Java

What I am doing is validating URLs from my code. I have a file containing URLs and I want to check whether each one exists. If it does, the web page contains XML in which there is an email address I want to extract. I loop over the URLs in a while loop, and each time a URL exists, its XML is appended to a string, so I end up with one big string containing all the XML. What I want to do is extract the email addresses from this string. I can't see how to use the methods in the String API, as they require a starting index, which I don't know because it varies each time.

What I was hoping to do was search the string for a substring starting with (e.g. "<email id>") and ending with (e.g. "</email id>") and add the text between these markers to a separate string.

Does anyone know if this is possible, or if there is an easier or different way of doing what I want?

Thanks.

If you know the structure of the XML document well, I'd recommend using XPath.

For example, with emails contained in <email>a@b.com</email>, the XPath query would be something like /root/email (depending on your XML structure).

Executing this XPath query against your XML returns all the matching <email> elements (Node) as a node list. Once you have an element, you have its content (#getTextContent, or #getNodeValue on its text child).
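The XPath approach described above could be sketched roughly as follows with the JDK's built-in javax.xml.xpath package. The class name EmailXPath, the method name extract, and the /root/email layout are assumptions for illustration; the real expression depends on the actual document structure.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class; the name is illustrative only.
class EmailXPath {
    // Evaluates the XPath expression against the XML text and returns the
    // trimmed text content of every matching node.
    static List<String> extract(String xml, String expression) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        try {
            NodeList nodes = (NodeList) xpath.evaluate(
                    expression, new InputSource(new StringReader(xml)), XPathConstants.NODESET);
            List<String> values = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                values.add(nodes.item(i).getTextContent().trim());
            }
            return values;
        } catch (XPathExpressionException e) {
            // Thrown for an invalid expression or XML that cannot be parsed.
            throw new IllegalArgumentException("Invalid XPath or malformed XML", e);
        }
    }

    public static void main(String[] args) {
        // Assumed sample document with a <root> element wrapping the emails.
        String xml = "<root><email>a@b.com</email><email>c@d.com</email></root>";
        System.out.println(extract(xml, "/root/email"));
    }
}
```

getTextContent is used here because getNodeValue called directly on an element node returns null; the value lives in the element's child text node.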

To answer the question in your title: .indexOf, or regular expressions.

But after a brief review of your question, you should really be processing the XML document properly.

A regular expression that will find and return strings between two " characters:

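import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Wrapper class so the snippet compiles; the name is illustrative.
public class QuoteExtractor {
    // Reluctant (.*?) so each match stops at the nearest closing quote.
    private static final Pattern PATTERN = Pattern.compile("\"(.*?)\"");

    // Returns every substring of source that is enclosed in double quotes.
    static List<String> stringsBetweenQuotes(String source) {
        List<String> matches = new ArrayList<>();
        Matcher matcher = PATTERN.matcher(source);
        while (matcher.find()) {
            matches.add(matcher.group(1)); // group 1 is the text between the quotes
        }
        return matches;
    }
}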
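
The same regular-expression idea, adapted to the tags from the question, might look like the sketch below. The tag name email and the class name EmailRegex are assumptions; a regex of this kind does not cope with attributes, CDATA sections, or nested elements, which is one reason the other answers suggest a real XML parser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the name is illustrative only.
class EmailRegex {
    // Assumed tag name "email". The reluctant (.*?) stops at the nearest
    // closing tag, and DOTALL lets the content span line breaks.
    private static final Pattern EMAIL =
            Pattern.compile("<email>(.*?)</email>", Pattern.DOTALL);

    // Returns the trimmed text found between each <email>...</email> pair.
    static List<String> extract(String source) {
        List<String> emails = new ArrayList<>();
        Matcher matcher = EMAIL.matcher(source);
        while (matcher.find()) {
            emails.add(matcher.group(1).trim());
        }
        return emails;
    }

    public static void main(String[] args) {
        String source = "<somedata>blah</somedata>\n<email>a.b@c.com</email>\n"
                + "<somedata>blah</somedata>\n<email>a.c@c.com</email>";
        System.out.println(extract(source));
    }
}
```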

Have you tried using a regex? A sample document would be very useful for this kind of question.

Check out the org.xml.sax API. It is very easy to use and lets you parse through the XML and do whatever you want with the contents whenever you come across anything of interest. So you could easily add some logic to look for <email> start elements and then save the contents (characters), which will contain your email address.
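
A minimal sketch of that SAX logic, assuming the element is named email and the input is a single well-formed document, could look like this. The class name EmailSaxHandler and the static extract helper are made up for the example.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler; the class name and the "email" tag are assumptions.
class EmailSaxHandler extends DefaultHandler {
    private final List<String> emails = new ArrayList<>();
    private StringBuilder current; // non-null only while inside an <email> element

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) {
        if ("email".equals(qName)) {
            current = new StringBuilder();
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // The parser may deliver one text run in several chunks, so append.
        if (current != null) {
            current.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("email".equals(qName) && current != null) {
            emails.add(current.toString().trim());
            current = null;
        }
    }

    // Parses the XML text and returns the content of every <email> element.
    static List<String> extract(String xml) {
        EmailSaxHandler handler = new EmailSaxHandler();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
        return handler.emails;
    }

    public static void main(String[] args) {
        System.out.println(extract("<root><email>a@b.com</email><x>1</x></root>"));
    }
}
```

Because SAX streams through the input, this never holds the whole document tree in memory, which may matter if the concatenated string grows large.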

If I understand your question correctly, you are extracting pieces of XML from multiple web pages and concatenating them into one big 'xml' string,

something that looks like


"<somedata>blah</somedata>
<email>a.b@c.com</email>
<somedata>blah</somedata>
<somedata>blah</somedata>
<email>a.c@c.com</email>
<somedata>blah</somedata>
<somedata>blah</somedata>
<email>a.d@c.com</email>
<somedata>blah</somedata>
<somedata>blah</somedata>
"

I'd advise turning that into a well-formed XML document by wrapping it in a root element.

"<?xml version="1.0" encoding="ISO-8859-1"?>
<newRoot>
<somedata>blah</somedata>
<email>a.b@c.com</email>
<somedata>blah</somedata>
<somedata>blah</somedata>
<email>a.c@c.com</email>
<somedata>blah</somedata>
<somedata>blah</somedata>
<email>a.d@c.com</email>
<somedata>blah</somedata>
<somedata>blah</somedata>
</newRoot>"

Then you could load that into an XML Document object and use XPath expressions to extract the email nodes and their values.
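
As a rough sketch of that idea: wrap the concatenated fragments in a root element, parse the result into a DOM Document, and run an XPath query over it. The class name WrappedFragments and the method name extractEmails are hypothetical, and the sketch assumes the fragments carry no XML declarations of their own.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper class; the name is illustrative only.
class WrappedFragments {
    // Wraps the concatenated fragments in a single root element so they form
    // one well-formed document, then queries the <email> children of that root.
    static List<String> extractEmails(String fragments) {
        String xml = "<newRoot>" + fragments + "</newRoot>";
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate("/newRoot/email", doc, XPathConstants.NODESET);
            List<String> emails = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                emails.add(nodes.item(i).getTextContent().trim());
            }
            return emails;
        } catch (ParserConfigurationException | SAXException | IOException
                | XPathExpressionException e) {
            throw new IllegalArgumentException("Could not parse or query the XML", e);
        }
    }

    public static void main(String[] args) {
        String fragments = "<somedata>blah</somedata>\n<email>a.b@c.com</email>\n"
                + "<somedata>blah</somedata>\n<email>a.c@c.com</email>";
        System.out.println(extractEmails(fragments));
    }
}
```

If each downloaded page starts with its own <?xml ...?> declaration, those would have to be stripped before concatenating, since a declaration is only allowed at the very start of a document.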

If you don't want to do that, you could use the indexOf(String str, int fromIndex) method to find the positions of <email> and </email> (or whatever the tags are called) and then take substrings based on those. That's not a particularly clean or easy-to-read way of doing it, though.
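
For completeness, the indexOf route could be sketched as below. The class name EmailIndexOf and the method name between are made up for the example, and the delimiters are passed in because the real tag names are not known here.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class; the name is illustrative only.
class EmailIndexOf {
    // Collects every substring of source that lies between an occurrence of
    // open and the next occurrence of close, scanning left to right.
    static List<String> between(String source, String open, String close) {
        List<String> found = new ArrayList<>();
        int from = 0;
        while (true) {
            int start = source.indexOf(open, from);
            if (start < 0) {
                break; // no further opening marker
            }
            int contentStart = start + open.length();
            int end = source.indexOf(close, contentStart);
            if (end < 0) {
                break; // opening marker without a matching close
            }
            found.add(source.substring(contentStart, end));
            from = end + close.length(); // resume the search after this pair
        }
        return found;
    }

    public static void main(String[] args) {
        String source = "<email>a.b@c.com</email><somedata>blah</somedata><email>a.c@c.com</email>";
        System.out.println(between(source, "<email>", "</email>"));
    }
}
```

This addresses the concern in the question about not knowing the starting index: indexOf finds it, and the fromIndex argument moves the search past each match.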

