
Regex: splitting a string into an array using Java and Tika

I'm trying to take the Tika output (PDF to text) and split the result into an array of words, or groups of characters.

I'm using something like:

String str = contenthandler.toString();
String[] splitArray = str.split("\\s+");

for (String word : splitArray) {
    System.out.println(word);
}

But I'm not getting splits where I expect them, between the words. I'd like to preserve line breaks, page breaks, tabs, etc., and just remove the spaces. Sample text from Tika looks like:

"...or supplemented except by a written instrument signed by both parties.  The unenforceability of any provision on this Agreement shall not affect the enforceability of any other provision of this Agreement.  Neither this Agreement nor the disclosure of any Confidential Information pursuant to this Agreement by any party shall restrict such party from disclosing any of its Confidential Information to any third party...."

I was playing around with regex on http://java-regex-tester.appspot.com/

Patterns like [^a-zA-Z] find the spaces, while \s+ does not. How do I split on these characters?

Tabs and line breaks are whitespace, so \\s+ splits on them as well. If you simply want to split on one or more space characters, you need to do

String[] splitArray = str.split(" +");
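To make the difference concrete, here is a minimal sketch comparing the two patterns. The class name, method names and sample string are made up for illustration; only String.split and the two patterns come from the discussion above.

```java
public class SpaceVersusWhitespace {

    // Splits on runs of any regex whitespace: spaces, tabs, line breaks, etc.
    static String[] splitOnWhitespace(String text) {
        return text.split("\\s+");
    }

    // Splits on runs of ordinary spaces only; tabs and line breaks stay inside the tokens.
    static String[] splitOnSpaces(String text) {
        return text.split(" +");
    }

    public static void main(String[] args) {
        // Hypothetical sample containing a line break, a double space and a tab.
        String sample = "both parties.\nThe  unenforceability\tof";

        System.out.println(splitOnWhitespace(sample).length); // prints 5
        System.out.println(splitOnSpaces(sample).length);     // prints 3
    }
}
```

With " +" the line break and the tab survive inside the tokens ("parties.\nThe" and "unenforceability\tof"), which is what the question asks for.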

EDIT

In response to OP comments: it would appear that the spaces are not being matched by \\s+. In that case, the characters between the words are none of the ones \\s covers, i.e. none of space, \\t, \\n, \\x0B, \\f or \\r.* You could try matching \\b (a word boundary) instead. To find out what the characters really are, paste the string into a good text editor and view the raw characters (e.g. in Notepad++ it would be View -> Show all characters). Note the hex code of the characters between the words and check what it is.
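If a hex-capable editor is not at hand, the same check can be done from Java itself. This is only a sketch: the class and method names are invented for illustration, and the sample string assumes (hypothetically) that one of the separators is a non-breaking space.

```java
import java.util.LinkedHashSet;
import java.util.Set;

public class SeparatorInspector {

    // Collects every distinct character that is not a letter or digit,
    // formatted as a Unicode code point such as U+0020.
    static Set<String> separatorCodePoints(String text) {
        Set<String> found = new LinkedHashSet<>();
        text.codePoints()
            .filter(cp -> !Character.isLetterOrDigit(cp))
            .forEach(cp -> found.add(String.format("U+%04X", cp)));
        return found;
    }

    public static void main(String[] args) {
        // Hypothetical sample: "any" and "party" are joined by U+00A0, not a normal space.
        String sample = "any\u00A0party shall";
        System.out.println(separatorCodePoints(sample)); // prints [U+00A0, U+0020]
    }
}
```

Running this on the real Tika output (contenthandler.toString()) would also list punctuation, but any unexpected space-like code point should stand out.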

EDIT following OP test

By examining the hex representation of the text (via edithex.com ), OP determined that the space character was a non-breaking space (0xA0). Thus, this code satisfies the requirement:

String[] splitArray = str.split("\\xA0");

It seems that PDFs commonly encode spaces as characters other than the standard space, such as the non-breaking space (0xA0). This blogpost implies that PDFs might not encode spaces as standard spaces (ASCII code 0x20 = 32). The various options for space characters that \\s will not pick up are here.
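Since extracted text may mix ordinary and non-breaking spaces, one option is to split on both at once. The sketch below is an assumption about what the Tika output might contain, not something confirmed in the thread; the class name, method name and sample string are illustrative.

```java
public class NbspSplit {

    // Splits on runs of ordinary spaces (U+0020) and non-breaking spaces (U+00A0).
    // Tabs and line breaks are deliberately left out of the character class,
    // so they are preserved inside the resulting tokens.
    static String[] splitOnSpaces(String text) {
        return text.split("[ \\u00A0]+");
    }

    public static void main(String[] args) {
        // Hypothetical sample mixing both kinds of space and one line break.
        String sample = "any\u00A0third party.\u00A0\u00A0The\nnext";

        String[] words = splitOnSpaces(sample);
        System.out.println(words.length); // prints 4
    }
}
```

Note that if the text starts with a space, split returns an empty first element, so the input may need trimming first. If other Unicode space characters show up as well, widening the class to [ \\p{Zs}]+ (the Unicode space-separator category) is a possible alternative.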


*In the example text in the question they are ordinary spaces, but they must have been converted during copy and paste.

