RegEx for Complex String

Question

I am new to using RegEx and am trying to use it with the Java engine. An example string that I am trying to parse is the following:

name:"SFATG";affil:100;aup:1;bu:FALSE name:"SF TAC 1";affil:29.3478;aup:19;bu:FALSE name:"SF TAC 2";affil:22.2222;aup:14;bu:FALSE name:"SF TAC 3";affil:44.4444;aup:0;bu:FALSE name:"SF DISP 4";affil:82.4742;aup:0;bu:FALSE

What I would like the RegEx to do is extract only the values that appear after the : and before the ; . In addition, I do not want to include the quotes within the entries for name. However, in this very particular case, I would like to keep the space that appears at the end of the entry for bu. I would not want the following name field to appear in the data entry for bu, though. So I'd want "FALSE " (with the trailing space), not "FALSE name", for this field.

My ultimate goal for using this RegEx would be to create an array from all of the groups/data values so that the array would contain the following:

[0]: SFATG
[1]: 100
[2]: 1
[3]: FALSE 
[4]: SF TAC 1
...Etc.

I was thinking about creating groups for each value, because then I would be able to easily create an array by combining the Pattern and Matcher classes, like so:

String regEx = "Some really fancy RegEx that actually works";
Pattern p = Pattern.compile(regEx);
Matcher m = p.matcher("Some really really long String that follows the outlined format");
// I'd probably want to use an Object array since my data values vary by type
// I can also create 4 different arrays (one for name, another for affil, etc.),
// Any advice on which approach to take?
Object[] dataValues = new Object[m.groupCount()];

The RegEx that I've so far been able to come up with is as follows:

name:"(\w+)";affil:(\d+);aup:(\d+);bu:(\w+\s)

However, this seems to work only on the first 4 data values and none beyond that.

Would anyone be able to assist me in creating a RegEx for the data that I am working with? Any assistance would be greatly appreciated! I'm also open to ideas on other ways to approach this, such as using a different data type for storing the data afterwards (other than an Object array). The key is to somehow obtain the data values from the string that I've mentioned and store them for processing that will occur later on.

Additional question: I'd imagine that there may be external libraries better suited to this task. Is anyone aware of a library that would work for this?

Answer 1

One regex to rule them all:

\w+:(?:"([^"]+)"|(\d+)(?=;|\Z)|(\d+\.\d+)|([A-Z]+\s))

See a demo on regex101.com.

Broken down, this says:

\w+:                  # 1+ word characters, followed by :
(?:                   # a non-capturing group
    "([^"]+)"         # "(...)"
    |                 # or
    (\d+)(?=;|\Z)     # only digits (no floats)
    |                 # or
    (\d+\.\d+)        # floats
    |                 # or
    ([A-Z]+\s)        # only UPPERCASE, followed by space
)

Here you'll need to check which capture group was filled. Additionally, backslashes must be doubled in a Java string literal (i.e. \\d+ instead of \d+). Checking which group matched takes some programming logic, e.g. https://ideone.com/sbgZxY (I'm not a Java guy, though).
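The group check described above might look roughly like this in Java. It is a minimal sketch, not the code behind the ideone link: the class name GroupExtractor and the method extractValues are made up for illustration, and the sample input is a shortened version of the string from the question with a trailing space added. Note that the fourth alternative requires whitespace after the value, so a final bu entry with nothing after it would not be captured.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, named here only for illustration.
public class GroupExtractor {
    // The pattern from this answer, with backslashes and quotes escaped for a Java string literal.
    private static final Pattern VALUE_PATTERN = Pattern.compile(
            "\\w+:(?:\"([^\"]+)\"|(\\d+)(?=;|\\Z)|(\\d+\\.\\d+)|([A-Z]+\\s))");

    // Returns the captured values in the order they appear in the input.
    public static List<String> extractValues(String data) {
        List<String> values = new ArrayList<>();
        Matcher m = VALUE_PATTERN.matcher(data);
        while (m.find()) {
            // Only one alternative matches per hit, so exactly one group is non-null.
            for (int g = 1; g <= m.groupCount(); g++) {
                if (m.group(g) != null) {
                    values.add(m.group(g));
                    break;
                }
            }
        }
        return values;
    }

    public static void main(String[] args) {
        String data = "name:\"SFATG\";affil:100;aup:1;bu:FALSE "
                + "name:\"SF TAC 1\";affil:29.3478;aup:19;bu:FALSE ";
        // prints [SFATG, 100, 1, FALSE , SF TAC 1, 29.3478, 19, FALSE ]
        System.out.println(extractValues(data));
    }
}
```

All values come back as strings here. If the types matter, the group index tells you which one to parse: group 2 holds integers, group 3 holds floats, and so on.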

Answer 2

While this regex is less general-purpose than @Jan's answer, it does restrict matches to the fields in your data, so it will provide syntax checking:

name:"([^"]+)";affil:([\d.]+);aup:(\d+);bu:(TRUE|FALSE) ?

Regarding the approach to extracting the values, I'd create a thin wrapper object to provide type safety:

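import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RowParser {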
    private static final Pattern ROW_PATTERN = Pattern.compile("name:\"([^\"]+)\";affil:([\\d.]+);aup:(\\d+);bu:(TRUE|FALSE) ?");

    public static void main(String[] args) {
        String data = "name:\"SFATG\";affil:100;aup:1;bu:FALSE name:\"SF TAC 1\";affil:29.3478;aup:19;bu:FALSE name:\"SF TAC 2\";affil:22.2222;aup:14;bu:FALSE name:\"SF TAC 3\";affil:44.4444;aup:0;bu:FALSE name:\"SF DISP 4\";affil:82.4742;aup:0;bu:TRUE \n";
        System.out.println(parseRows(data));
    }

    public static List<Row> parseRows(String data) {
        Matcher matcher = ROW_PATTERN.matcher(data);
        List<Row> rows = new ArrayList<>();
        while (matcher.find()) {
            rows.add(new Row(matcher));
        }
        return rows;
    }

    // Wrapper object for individual data rows
    public static class Row {
        private String name;
        private double affil;
        private int aup;
        private boolean bu;

        Row(Matcher matcher) {
            this.name = matcher.group(1);
            this.affil = Double.parseDouble(matcher.group(2));
            this.aup = Integer.parseInt(matcher.group(3));
            this.bu = Boolean.parseBoolean(matcher.group(4));
        }

        public String getName() {
            return name;
        }

        public double getAffil() {
            return affil;
        }

        public int getAup() {
            return aup;
        }

        public boolean isBu() {
            return bu;
        }

        @Override
        public String toString() {
            return "name:\"" + name + '"' + ";affil:" + affil + ";aup:" + aup + ";bu:" + String.valueOf(bu).toUpperCase();
        }
    }
}
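One caveat on the syntax checking mentioned above: Matcher.find() silently skips any text that does not match, so a malformed row would simply be missing from the result. If you want malformed input to be reported instead, one possible approach is to check that the matches are contiguous and cover the whole input. The following is a sketch under that assumption; the class RowValidator and the method isWellFormed are hypothetical names, not part of the parser above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical companion to RowParser, named here only for illustration.
public class RowValidator {
    // Same expression as ROW_PATTERN in RowParser.
    private static final Pattern ROW_PATTERN = Pattern.compile(
            "name:\"([^\"]+)\";affil:([\\d.]+);aup:(\\d+);bu:(TRUE|FALSE) ?");

    // Returns true only if the trimmed input is a gap-free sequence of one or more rows.
    public static boolean isWellFormed(String data) {
        String trimmed = data.trim();
        Matcher matcher = ROW_PATTERN.matcher(trimmed);
        int expectedStart = 0;
        while (matcher.find()) {
            if (matcher.start() != expectedStart) {
                return false; // find() skipped over text that is not a valid row
            }
            expectedStart = matcher.end();
        }
        // Require at least one row and no unmatched text at the end.
        return expectedStart > 0 && expectedStart == trimmed.length();
    }

    public static void main(String[] args) {
        String good = "name:\"SFATG\";affil:100;aup:1;bu:FALSE name:\"SF TAC 1\";affil:29.3478;aup:19;bu:TRUE \n";
        String bad = "name:\"SFATG\";affil:abc;aup:1;bu:FALSE name:\"SF TAC 1\";affil:29.3478;aup:19;bu:TRUE";
        System.out.println(isWellFormed(good)); // prints true
        System.out.println(isWellFormed(bad));  // prints false
    }
}
```

Calling such a check before parseRows lets the caller decide whether to reject the input or accept only the rows that parsed.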
