
Get n Number of words using regex in Java

I have a section of a book, complete with punctuation, line breaks, etc., and I want to extract the first n words from the text and divide them into 5 parts. Regex mystifies me. This is what I am trying; it creates an array with a single element (index 0) that contains all the input text:

public static String getNumberWords2(String s, int nWords){
    String[] m = s.split("([a-zA-Z_0-9]+\b.*?)", (nWords / 5));
    return "Part One: \n" + m[1] + "\n\n" + 
           "Part Two: \n" + m[2] + "\n\n" + 
           "Part Three: \n" + m[3] + "\n\n" +
           "Part Four: \n" + m[4] + "\n\n" + 
           "Part Five: \n" + m[5];
}

Thanks!

I think the simplest and most efficient way is to repeatedly find a "word":

Pattern p = Pattern.compile("(\\w+)");
Matcher m = p.matcher(chapter);
while (m.find()) {
  String word = m.group();
  ...
}

You can vary the definition of "word" by modifying the regex. What I wrote just uses regex's notion of word characters (\w), and I wonder if it might be more appropriate than what you're trying to do. It won't, for instance, include quote characters, which you may need to allow within a word.
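As a minimal sketch of varying the word definition, the loop above can stop after n matches and use a pattern that also accepts apostrophes inside a word. The class name FirstWords, the method firstWords and the exact pattern are assumptions made for illustration, not part of the original answer:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FirstWords {
    // Assumed word definition: a run of word characters, optionally
    // continued by apostrophe-joined runs (e.g. "don't", "o'clock").
    private static final Pattern WORD = Pattern.compile("\\w+(?:'\\w+)*");

    // Returns at most n words from the start of the text, in order.
    static List<String> firstWords(String text, int n) {
        List<String> words = new ArrayList<>();
        Matcher m = WORD.matcher(text);
        while (words.size() < n && m.find()) {
            words.add(m.group());
        }
        return words;
    }

    public static void main(String[] args) {
        // prints [Don't, panic, said, Ford]
        System.out.println(firstWords("\"Don't panic,\" said Ford.\nArthur didn't.", 4));
    }
}
```

Punctuation and line breaks are simply skipped, because find() jumps from one match to the next.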

There is a better alternative made just for this: BreakIterator. That would be the most correct way to parse for words in Java.
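A rough sketch of the BreakIterator approach might look like the following. The class and method names are hypothetical, and the rule for skipping non-word segments (keep only segments containing a letter or digit) is an assumption, since the word iterator also reports whitespace and punctuation as segments:

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class BreakIteratorWords {
    // Returns at most n words; segments without any letter or digit
    // (whitespace, punctuation) are skipped.
    static List<String> firstWords(String text, int n, Locale locale) {
        List<String> words = new ArrayList<>();
        BreakIterator it = BreakIterator.getWordInstance(locale);
        it.setText(text);
        int start = it.first();
        for (int end = it.next();
             end != BreakIterator.DONE && words.size() < n;
             start = end, end = it.next()) {
            String segment = text.substring(start, end);
            if (segment.codePoints().anyMatch(Character::isLetterOrDigit)) {
                words.add(segment);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(firstWords("Hello, world! How are you?", 3, Locale.ENGLISH));
    }
}
```

Unlike a regex, the boundaries here follow locale-aware rules, which is the main reason to prefer it for natural-language text.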

(See below the break for the next go at this. Leaving this top part here to show the thought process...)

Based on my reading of the split() javadoc, I think I know what's going on.

You want to split the string based on whitespace, up to n times.

String [] m = s.split("\\b", nWords);

Then stitch them back together with token whitespace if you must:

StringBuffer strBuf = new StringBuffer();
for (int i = 0; i < nWords; i++) {
    strBuf.append(m[i]).append(" ");
}

Finally, chop that into five equal strings:

String [] out = new String[5];
String str = strBuf.toString();
int length = str.length();
int chopLength = length / 5;
for (int i = 0; i < 5; i++) {
    int startIndex = i * chopLength;
    out[i] = str.substring(startIndex, startIndex + chopLength);
}

It's late at night for me, so you might want to check that one yourself for correctness. I think I got it somewhere in the area code of correct.


OK, here's try number 3. Having run it through a debugger, I can verify that the only problem left is the integer math of slicing strings whose lengths aren't multiples of 5 into five pieces, and how best to deal with the remaining characters.

It ain't pretty, but it works.

String[] sliceAndDiceNTimes(String victim, int slices, int wordLimit) {
    // Add one to the wordLimit here, because the rest of the input string
    // (past the number of times split() does its magic) will be in the last
    // array member
    String [] words = victim.split("\\s", wordLimit + 1);
    StringBuffer partialVictim = new StringBuffer();

    for (int i = 0; i < wordLimit; i++) {
        partialVictim.append(words[i]).append(' ');
    }

    String [] resultingSlices = new String[slices];
    String recycledVictim = partialVictim.toString().trim();
    int length = recycledVictim.length();
    int chopLength = length / slices;

    for (int i = 0; i < slices; i++) {
        int chopStartIdx = i * chopLength;
        resultingSlices[i] = recycledVictim.substring(chopStartIdx, chopStartIdx + chopLength);
    }

    return resultingSlices;
}

Important notes:

  • "\\s" is the correct regex. Using "\\b" ends up with lots of extra splits, because there are word boundaries at both the beginning and the end of every word.
  • Added one to the number of times split runs, because the last array member in the String array is the remaining input string that wasn't split. You could also just split the entire string and use the for loop as-is.
  • The integer division remainder is still an exercise left for the questioner. :-)
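One possible way to handle the leftover characters is to give the first (length % slices) pieces one extra character each, so nothing is dropped. This is only a sketch under that assumption; the class name EvenSlices and the method sliceEvenly are made up for illustration:

```java
class EvenSlices {
    // Splits s into exactly `slices` consecutive pieces whose lengths differ
    // by at most one; the first (length % slices) pieces get the extra character.
    static String[] sliceEvenly(String s, int slices) {
        String[] out = new String[slices];
        int base = s.length() / slices;
        int extra = s.length() % slices;
        int start = 0;
        for (int i = 0; i < slices; i++) {
            int end = start + base + (i < extra ? 1 : 0);
            out[i] = s.substring(start, end);
            start = end;
        }
        return out;
    }

    public static void main(String[] args) {
        // 13 characters into 5 pieces: prints abc|def|ghi|jk|lm
        System.out.println(String.join("|", sliceEvenly("abcdefghijklm", 5)));
    }
}
```

Like the method above, this cuts by character count, so a piece can still end in the middle of a word.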

I'm just going to guess what you need here; hopefully this is close:

public static void main(String[] args) {
    String text = "Lorem ipsum dolor sit amet, consectetur adipisicing elit, " +
        "sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. " +
        "Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris " +
        "nisi ut aliquip ex ea commodo consequat. Rosebud.";

    String[] words = text.split("\\s+");
    final int N = words.length;
    final int C = 5;
    final int R = (N + C - 1) / C;
    for (int r = 0; r < R; r++) {
        for (int x = r, i = 0; (i < C) && (x < N); i++, x += R) {
            System.out.format("%-15s", words[x]);
        }
        System.out.println();
    }
}

This produces:

Lorem          sed            dolore         quis           ex             
ipsum          do             magna          nostrud        ea             
dolor          eiusmod        aliqua.        exercitation   commodo        
sit            tempor         Ut             ullamco        consequat.     
amet,          incididunt     enim           laboris        Rosebud.       
consectetur    ut             ad             nisi           
adipisicing    labore         minim          ut             
elit,          et             veniam,        aliquip        

Another possible interpretation

This uses java.util.Scanner:

static String nextNwords(int n) {
    return "(\\S+\\s*){N}".replace("N", String.valueOf(n));
}   
static String[] splitFive(String text, final int N) {
    Scanner sc = new Scanner(text);
    String[] parts = new String[5];
    for (int r = 0; r < 5; r++) {
        parts[r] = sc.findInLine(nextNwords(N / 5 + (r < (N % 5) ? 1 : 0)));
    }
    return parts;
}
public static void main(String[] args) {
    String text = "Lorem ipsum dolor sit amet, consectetur adipisicing elit, " +
      "sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. " +
      "Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris " +
      "nisi ut aliquip ex ea commodo consequat. Rosebud.";

    for (String part : splitFive(text, 23)) {
        System.out.println(part);
    }
}

This prints the first 23 words of text:

Lorem ipsum dolor sit amet, 
consectetur adipisicing elit, sed do 
eiusmod tempor incididunt ut labore 
et dolore magna aliqua. 
Ut enim ad minim 

Or, with 7 instead of 23:

Lorem ipsum 
dolor sit 
amet, 
consectetur 
adipisicing 

Or, with 3:

Lorem 
ipsum 
dolor 
<blank>
<blank>

I have a really, really ugly solution:

public static Object[] getNumberWords(String s, int nWords, int offset){
    Object[] os = new Object[2];
    Pattern p = Pattern.compile("(\\w+)");
    Matcher m = p.matcher(s);
    m.region(offset, m.regionEnd());
    int wc = 0;
    String total = "";
    while (wc <= nWords && m.find()) {
      String word = m.group();
      total += word + " ";
      wc++;
    }
    os[0] = total;
    os[1] = total.lastIndexOf(" ") + offset;
    return os;
}

String foo(String s, int n){
    Object[] os = getNumberWords(s, n, 0);
    String a = (String) os[0];
    String m[] = new String[5];
    int indexCount = 0;
    int lastEndIndex = 0;
    for(int count = (n / 5); count <= n; count += (n/5)){
        if(a.length()<count){count = a.length();}
        os = getNumberWords(a, (n / 5), lastEndIndex);
        lastEndIndex = (Integer) os[1];
        m[indexCount] = (String) os[0];
        indexCount++;
    }
    return "Part One: \n" + m[0] + "\n\n" + 
    "Part Two: \n" + m[1] + "\n\n" + 
    "Part Three: \n" + m[2] + "\n\n" +
    "Part Four: \n" + m[3] + "\n\n" + 
    "Part Five: \n" + m[4];
}
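For comparison, a more compact sketch that produces the same "Part One" to "Part Five" layout is given here. It assumes words are whitespace-separated tokens (so punctuation stays attached) and reuses the N / 5 + (r < N % 5 ? 1 : 0) distribution from the Scanner answer above; the class and method names are hypothetical:

```java
import java.util.Arrays;

class FiveParts {
    private static final String[] LABELS = {"One", "Two", "Three", "Four", "Five"};

    // Takes the first nWords whitespace-separated tokens of s and spreads them
    // over five labelled parts; earlier parts receive the leftover words.
    static String firstWordsInFiveParts(String s, int nWords) {
        String[] all = s.trim().split("\\s+");
        int n = Math.min(nWords, all.length);
        StringBuilder sb = new StringBuilder();
        int start = 0;
        for (int part = 0; part < 5; part++) {
            int size = n / 5 + (part < n % 5 ? 1 : 0);
            String body = String.join(" ", Arrays.copyOfRange(all, start, start + size));
            sb.append("Part ").append(LABELS[part]).append(": \n").append(body);
            if (part < 4) {
                sb.append("\n\n");
            }
            start += size;
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(firstWordsInFiveParts("a b c d e f g h", 7));
    }
}
```

With fewer than five words available, the trailing parts are simply empty.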

