
Regular expression for splitting a string

I'm trying to split a string using a regex. So far I have:

String[] words = a.replaceAll("[^a-zA-Z ]","").toLowerCase().split("\\s+");

It's almost what I want, but I also need to split the text wherever there is a newline character in the string. (By the way, should I actually use newline or return? What is the actual difference?)

To clarify, my input is:

this is a,
sample of
a file.

After splitting and running a routine that sorts the words and counts the occurrences of each, I should get this:

a: 2
file: 1
is: 1
of: 1
sample: 1
this: 1

Instead, I get:

asample: 1
file: 1
is: 1
ofa: 1
this: 1

How should I correct my regular expression to split at newlines as well?

Use the regexp \\b[A-Za-z]+\\b (written here with Java string escaping) to find the words directly instead of splitting: http://regexr.com/3ae1c
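
To illustrate the idea, here is a minimal sketch that collects the matches with java.util.regex instead of splitting. The class name WordFinder and the method findWords are made up for this example, and lowercasing each match is an assumption carried over from the question's toLowerCase() call.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original question or answers.
class WordFinder {
    // Runs of ASCII letters delimited by word boundaries.
    private static final Pattern WORD = Pattern.compile("\\b[A-Za-z]+\\b");

    static List<String> findWords(String text) {
        List<String> words = new ArrayList<>();
        Matcher m = WORD.matcher(text);
        while (m.find()) {
            // Lowercase each match, as the question does with toLowerCase().
            words.add(m.group().toLowerCase());
        }
        return words;
    }

    public static void main(String[] args) {
        // Prints [this, is, a, sample, of, a, file]
        System.out.println(findWords("this is a,\nsample of\na file."));
    }
}
```

Because the words are matched rather than split out, punctuation and newlines never need to be replaced first.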

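Your pattern deletes every character that is not a letter or a space, including the newlines, so the words on either side of a line break get glued together. You must change your replaceAll call like this: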

 a.replaceAll("[^a-zA-Z]+"," ")

Or, as suggested by Alexander, why not find the words directly? That is more to the point.
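
For the replaceAll fix, a rough sketch of the whole pipeline could look like the following. The sort-and-count routine is not shown in the question, so the TreeMap here is only an assumed stand-in for it. The class name WordCount, the method countWords, the trim() call and the empty-string check are likewise additions for this example; the last two guard against a leading or empty token.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for the asker's sort-and-count routine.
class WordCount {
    static Map<String, Integer> countWords(String a) {
        // Replace every run of non-letters (punctuation, spaces, newlines)
        // with a single space, then split on whitespace.
        String[] words = a.replaceAll("[^a-zA-Z]+", " ")
                          .trim()
                          .toLowerCase()
                          .split("\\s+");

        // TreeMap keeps the keys in sorted order.
        Map<String, Integer> counts = new TreeMap<>();
        for (String w : words) {
            if (!w.isEmpty()) {          // "".split(...) yields one empty token
                counts.merge(w, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = countWords("this is a,\nsample of\na file.");
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            System.out.println(e.getKey() + ": " + e.getValue());
        }
    }
}
```

With the sample input from the question, this prints the six expected lines, starting with "a: 2".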

Just put a space in the second argument of the replaceAll method and it should work:

replaceAll("[^a-zA-Z ]"," ") 

Or you can make it more efficient, and avoid unnecessary spaces in the string returned by replaceAll, by using the '+' quantifier as suggested by Casimir.

Both would work just fine in your case.
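
As a quick check of that claim, the sketch below runs both replacements on the sample input and compares the resulting arrays. The class and method names are invented for this example. The intermediate strings differ (the first variant leaves a double space where ",\n" was), but split("\\s+") collapses the difference.

```java
import java.util.Arrays;

// Hypothetical comparison of the two suggested replacements.
class ReplaceVariants {
    // One space per removed character; existing spaces are kept.
    static String[] splitSingle(String a) {
        return a.replaceAll("[^a-zA-Z ]", " ").toLowerCase().split("\\s+");
    }

    // One space per run of non-letters (the '+' quantifier).
    static String[] splitRuns(String a) {
        return a.replaceAll("[^a-zA-Z]+", " ").toLowerCase().split("\\s+");
    }

    public static void main(String[] args) {
        String input = "this is a,\nsample of\na file.";
        System.out.println(Arrays.toString(splitSingle(input)));
        System.out.println(Arrays.toString(splitRuns(input)));
        // Prints true
        System.out.println(Arrays.equals(splitSingle(input), splitRuns(input)));
    }
}
```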
