
Regex for multiline header in C#

I'm a new C# programmer. I'm trying to make a simple C# application that extracts headers from a PDF file (a book) when they are in this format:

1.1 THE ELECTRICAL/ELECTRONICS INDUSTRY

1.2 A BRIEF HISTORY

1.3 UNITS OF MEASUREMENT

I'm using this code:

    string pattern = @"(\d+)(\.)(\d+) ([A-Z]+).([A-Z]+).([A-Z]+).([A-Z]+).([A-Z]+)";
    Regex.Match(strText, pattern);

This works fine for single-line headers but doesn't work for two-line/multiline headers. Can anyone help, please?

I'm unfamiliar with C#-style regex, but isn't . a match for any character (except a newline)?

If you need to match across new lines, then you will also have to include an actual \n at the end, probably followed by a ? as well, unless you plan to add an alternative for that case.

But I'm kind of surprised that this regex isn't causing any other issues, unless the formatting of the book happens to be perfect.
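To illustrate the point about . and newlines, here is a minimal sketch. The class and method names are made up for this example. It relies on the documented .NET behavior that . skips '\n' by default and matches it when RegexOptions.Singleline is set.

```csharp
using System.Text.RegularExpressions;

public static class DotNewlineDemo
{
    // By default "." matches any character except '\n',
    // so "A.B" does not match a string where a newline sits between A and B.
    public static bool DotMatchesNewlineByDefault() =>
        Regex.IsMatch("A\nB", "A.B");

    // RegexOptions.Singleline changes "." so that it matches '\n' as well.
    public static bool DotMatchesNewlineWithSingleline() =>
        Regex.IsMatch("A\nB", "A.B", RegexOptions.Singleline);
}
```

The first method returns false and the second returns true, which is why a pattern that separates words with . stops working as soon as a header wraps onto a second line.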

I'm assuming that you already have the required table of contents in a single string, and that the only problem is parsing the second-level headers.

The regular expression is modified to match only capital letters.

You can achieve the required result with the following code:

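    // Requires: using System.Collections.Generic; using System.Text.RegularExpressions;
    string pattern = @"((\d+\.\d+) ([A-Z\s]+)\n)+";
    var match = Regex.Match(input, pattern);

    // Group 1 is repeated by the outer +, so each header is one capture of group 1.
    var headers = new List<string>();
    for (var i = 0; i < match.Groups[1].Captures.Count; i++)
    {
        headers.Add(match.Groups[1].Captures[i].Value);
    }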

After this, headers will contain all the required data.

This assumes that input contains the input data. Also, note that \n is the newline character.
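As a rough, self-contained sketch of the loop above, the snippet below wraps it in a method and tidies each capture. The class name HeaderCaptures, the method name Extract, and the whitespace cleanup are additions for illustration and are not part of the original answer.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class HeaderCaptures
{
    public static List<string> Extract(string input)
    {
        // Same pattern as in the answer: the outer group repeats once per header.
        const string pattern = @"((\d+\.\d+) ([A-Z\s]+)\n)+";
        var headers = new List<string>();

        var match = Regex.Match(input, pattern);
        if (!match.Success)
        {
            return headers; // no header block found
        }

        foreach (Capture capture in match.Groups[1].Captures)
        {
            // Each capture still ends with the '\n' the pattern consumed, and a
            // wrapped header contains an inner line break; normalize both.
            headers.Add(Regex.Replace(capture.Value.Trim(), @"\s+", " "));
        }
        return headers;
    }
}
```

For an input such as "1.1 A BRIEF HISTORY\n1.2 UNITS OF\nMEASUREMENT\n", this should give two entries, with the wrapped header joined into "1.2 UNITS OF MEASUREMENT". Note that a title containing a character outside A-Z and whitespace, such as the / in 1.1 above, will not be matched by this pattern.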

Your regex, simplified:

(\d+\.\d+) followed by a space stands for the sequence "one or more digits", dot, "one or more digits", space.

([A-Z\s]+)\n stands for "one or more capital letters or whitespace characters", then a newline character. Because \s also matches a newline, a header that wraps onto a second line is still captured.
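The two groups described above can also be read out separately. The sketch below is one possible variant, not the answer's code: it uses Regex.Matches instead of repeated captures, and it adds / to the character class (an assumption made here so that the sample title "THE ELECTRICAL/ELECTRONICS INDUSTRY" is accepted). The names HeaderParts and Parse are hypothetical.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class HeaderParts
{
    // Group 1: section number, e.g. "1.1". Group 2: title text.
    // "/" is added to the class as an assumption; extend it if titles
    // in the book contain other punctuation.
    private static readonly Regex HeaderRegex =
        new Regex(@"(\d+\.\d+) ([A-Z/\s]+)\n");

    public static List<(string Number, string Title)> Parse(string input)
    {
        var result = new List<(string Number, string Title)>();
        foreach (Match m in HeaderRegex.Matches(input))
        {
            // Collapse any inner line break of a wrapped title into one space.
            string title = Regex.Replace(m.Groups[2].Value.Trim(), @"\s+", " ");
            result.Add((m.Groups[1].Value, title));
        }
        return result;
    }
}
```

Keeping the number and the title in separate fields makes it easier to sort or filter the headers later. Like the original pattern, this variant expects each header to be followed by a newline.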

Also, read the following article to get familiar with C# regular expressions.

