
Java - extract from line based on regex

A small question about a Java job that extracts information from the lines of a file, please.

Setup: I have a file in which one line looks like this:

bla,bla42bla()bla=bla+blablaprimaryKey="(ZAPDBHV7120D41A,USA,blablablablablabla

The file contains many such lines (as described above). In each line, there are two particular pieces of information I am interested in: the primaryKey and the country.

In my example, these are ZAPDBHV7120D41A and USA.

Each line of the file is guaranteed to contain the primaryKey exactly once and the country exactly once, separated by a comma. The pair has no fixed position: it can appear at the start of the line, in the middle, at the end, etc.

The primary key is a combination of capital letters [A, B, C, ... Y, Z] and digits [0, 1, 2, ... 9]. It has no particular predefined length.

The primary key always appears in the form primaryKey="({primaryKey},{country}, meaning that the actual primaryKey comes right after the string made of primaryKey, an equals sign, a double quote, and an open parenthesis, and right before a comma, the three-letter country, and another comma.

I would like to write a program that extracts all the primary keys, as well as all the countries, from the file.

Input:

bla,bla42bla()bla=bla+blablaprimaryKey="(ZAPDBHV7120D41A,USA,blablablablablabla
bla++blabla()bla=bla+blablaprimaryKey="(AA45555DBMW711DD4100,ARG,bla
[...]

Result:

The primaryKey is ZAPDBHV7120D41A
The country is USA

The primaryKey is AA45555DBMW711DD4100
The country is ARG

Therefore, I tried the following:

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.regex.Pattern;

public class RegexExtract {

    public static void main(String[] args) throws Exception {
        final String             csvFile = "my_file.txt";
        try (final BufferedReader br = new BufferedReader(new FileReader(csvFile))) {
            String line;
            while ((line = br.readLine()) != null) {
                Pattern.matches("", line); // extract primaryKey and country based on regex
                String primaryKey = ""; // extract the primary from above
                String country = ""; // extract the country from above
                System.out.println("The primaryKey is " + primaryKey);
                System.out.println("The country is " + country);
            }
        }
    }
}

But I am having a hard time constructing the regular expression needed to match and extract.

May I ask what the correct code is to extract from the line based on the above information?

Thank you

Explanations after the code.

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexExtract {

    public static void main(String[] args) {
        Path path = Paths.get("my_file.txt");
        try (BufferedReader br = Files.newBufferedReader(path)) {
            Pattern pattern = Pattern.compile("primaryKey=\"\\(([A-Z0-9]+),([A-Z]+)");
            String line = br.readLine();
            while (line != null) {
                Matcher matcher = pattern.matcher(line);
                if (matcher.find()) {
                    String primaryKey = matcher.group(1);
                    String country = matcher.group(2);
                    System.out.println("The primaryKey is " + primaryKey);
                    System.out.println("The country is " + country);
                }
                line = br.readLine();
            }
        }
        catch (IOException xIo) {
            xIo.printStackTrace();
        }
    }
}

Running the above code produces the following output (using the two sample lines in your question).

The primaryKey is ZAPDBHV7120D41A
The country is USA
The primaryKey is AA45555DBMW711DD4100
The country is ARG
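
One detail of the code above is worth spelling out: it calls Matcher.find(), which searches for the pattern anywhere in the line, whereas Pattern.matches (used in the question's attempt) succeeds only when the entire line matches the pattern. The following is a minimal sketch of that difference; the class and method names are made up for illustration.

```java
import java.util.regex.Pattern;

class FindVersusMatches {

    // Same regular expression as in the answer above.
    static final String REGEX = "primaryKey=\"\\(([A-Z0-9]+),([A-Z]+)";

    // Pattern.matches requires the whole input to match the pattern.
    static boolean wholeLineMatches(String line) {
        return Pattern.matches(REGEX, line);
    }

    // Matcher.find looks for the pattern anywhere inside the input.
    static boolean foundAnywhere(String line) {
        return Pattern.compile(REGEX).matcher(line).find();
    }

    public static void main(String[] args) {
        String line = "bla,bla42bla()bla=bla+blablaprimaryKey=\"(ZAPDBHV7120D41A,USA,blablablablablabla";
        System.out.println(wholeLineMatches(line)); // false: text surrounds the match
        System.out.println(foundAnywhere(line));    // true
    }
}
```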

The regular expression looks for the following literal string:

primaryKey="(

The double quote is escaped since it is within a string literal.
The opening parenthesis is escaped because it is a regex metacharacter, and the backslash must be doubled because \( is not a valid escape sequence in a Java string literal.
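
If the double backslash feels error-prone, the literal prefix can instead be wrapped with Pattern.quote, which makes the regex engine treat every character in it literally. This is only an alternative sketch, not what the code above does; the class, constant, and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuotedPrefixDemo {

    // The literal text that precedes the key. Only the double quote
    // needs escaping here, and that is for the Java compiler.
    static final String PREFIX = "primaryKey=\"(";

    // Pattern.quote takes care of the parenthesis, so no "\\(" is needed.
    static final Pattern PATTERN =
            Pattern.compile(Pattern.quote(PREFIX) + "([A-Z0-9]+),([A-Z]+)");

    // Returns the primary key found in the line, or null if there is none.
    static String firstKey(String line) {
        Matcher m = PATTERN.matcher(line);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String line = "bla,bla42bla()bla=bla+blablaprimaryKey=\"(ZAPDBHV7120D41A,USA,blablablablablabla";
        System.out.println(firstKey(line)); // ZAPDBHV7120D41A
    }
}
```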

Then the regular expression captures, as the first group, the run of consecutive capital letters and digits that follows the previous literal, up to (but not including) the comma.

Then, after that comma, a second group captures the run of capital letters up to the next comma.
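
The question states that the country is always three letters followed by a comma, while the second group above accepts any run of capital letters. A stricter variant could encode that rule and use named groups to make the extraction easier to read. This is a sketch under that assumption; the group names key and country and the class and method names are made up for illustration.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NamedGroupExtract {

    // Assumes the country is exactly three capital letters followed by a comma.
    static final Pattern PATTERN =
            Pattern.compile("primaryKey=\"\\((?<key>[A-Z0-9]+),(?<country>[A-Z]{3}),");

    // Returns {primaryKey, country}, or an empty Optional if the line does not match.
    static Optional<String[]> extract(String line) {
        Matcher m = PATTERN.matcher(line);
        if (!m.find()) {
            return Optional.empty();
        }
        return Optional.of(new String[] { m.group("key"), m.group("country") });
    }

    public static void main(String[] args) {
        String line = "bla++blabla()bla=bla+blablaprimaryKey=\"(AA45555DBMW711DD4100,ARG,bla";
        extract(line).ifPresent(r -> {
            System.out.println("The primaryKey is " + r[0]);
            System.out.println("The country is " + r[1]);
        });
    }
}
```

Returning an Optional keeps the "no match" case explicit for the caller instead of silently skipping the line.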

Refer to the Regular Expressions lesson in Oracle's Java tutorials.
