Converting a sentence string to a string array of words in Java

I need my Java program to take a string like:

"This is a sample sentence."

and turn it into a string array like:

{"this","is","a","sample","sentence"}

No periods or other punctuation (preferably). By the way, the string input is always one sentence.

Is there an easy way to do this that I'm not seeing? Or do we really have to search for spaces a lot and create new strings from the areas between the spaces (which are words)?

String.split() will do most of what you want. You may then need to loop over the words to pull out any punctuation.

For example:

String s = "This is a sample sentence.";
String[] words = s.split("\\s+");
for (int i = 0; i < words.length; i++) {
    // You may want to check for a non-word character before blindly
    // performing a replacement
    // It may also be necessary to adjust the character class
    words[i] = words[i].replaceAll("[^\\w]", "");
}
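
The loop above keeps the original capitalisation and can leave an empty string where a token consisted only of punctuation. Below is a minimal sketch of a helper that also lower-cases the words, as the expected output in the question suggests. The names SentenceWords and toWords are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class SentenceWords {
    // Hypothetical helper: split on whitespace, strip non-word characters,
    // lower-case each word, and skip tokens that end up empty.
    static String[] toWords(String sentence) {
        List<String> words = new ArrayList<>();
        for (String token : sentence.trim().split("\\s+")) {
            String word = token.replaceAll("[^\\w]", "").toLowerCase(Locale.ROOT);
            if (!word.isEmpty()) {
                words.add(word);
            }
        }
        return words.toArray(new String[0]);
    }

    public static void main(String[] args) {
        System.out.println(String.join(",", toWords("This is a sample sentence.")));
    }
}
```

Locale.ROOT is passed to toLowerCase so that the result does not depend on the default locale of the machine.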

Now, this can be accomplished just with split as it takes regex:

String s = "This is a sample sentence with []s.";
String[] words = s.split("\\W+");

This will give words as: {"This","is","a","sample","sentence","with","s"}

The \\W+ will match one or more consecutive non-word characters (anything other than letters, digits and the underscore). So there is no need to replace. You can try other patterns as well.
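
One caveat with splitting on \\W+: if the string starts with a non-word character, the first element of the result is an empty string, because split only discards trailing empty strings. A small sketch that filters those out follows. NonWordSplit and words are illustrative names, not part of any library.

```java
import java.util.Arrays;

class NonWordSplit {
    // Split on runs of non-word characters and drop empty strings,
    // which appear when the input starts with a delimiter.
    static String[] words(String text) {
        return Arrays.stream(text.split("\\W+"))
                .filter(w -> !w.isEmpty())
                .toArray(String[]::new);
    }

    public static void main(String[] args) {
        String text = "\"Quoted\" start";
        // Plain split: the leading quote produces an empty first element.
        System.out.println(Arrays.toString(text.split("\\W+")));
        // Filtered version: only the words remain.
        System.out.println(Arrays.toString(words(text)));
    }
}
```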

You can use BreakIterator.getWordInstance to find all words in a string.

public static List<String> getWords(String text) {
    List<String> words = new ArrayList<String>();
    BreakIterator breakIterator = BreakIterator.getWordInstance();
    breakIterator.setText(text);
    int lastIndex = breakIterator.first();
    while (BreakIterator.DONE != lastIndex) {
        int firstIndex = lastIndex;
        lastIndex = breakIterator.next();
        if (lastIndex != BreakIterator.DONE && Character.isLetterOrDigit(text.charAt(firstIndex))) {
            words.add(text.substring(firstIndex, lastIndex));
        }
    }

    return words;
}

Test:

public static void main(String[] args) {
    System.out.println(getWords("A PT CR M0RT BOUSG SABN NTE TR/GB/(G) = RAND(MIN(XXX, YY + ABC))"));
}

Output:

[A, PT, CR, M0RT, BOUSG, SABN, NTE, TR, GB, G, RAND, MIN, XXX, YY, ABC]

You can just split your string using this regular expression:

String l = "sofia, malgré tout aimait : la laitue et le choux !";
String[] words = l.split("[[ ]*|[,]*|[\\.]*|[:]*|[/]*|[!]*|[?]*|[+]*]+");

Try using the following:

String str = "This is a simple sentence";
String[] strgs = str.split(" ");

That will store one substring at each index of the resulting array of strings, using the space as the split point.
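
Note that split(" ") splits at every single space, so consecutive spaces produce empty strings in the array. The sketch below compares it with trim().split("\\s+"), which treats any run of whitespace as one separator. The class and method names are illustrative.

```java
import java.util.Arrays;

class SpaceSplitDemo {
    // Split on runs of whitespace after trimming the ends,
    // so repeated spaces do not create empty elements.
    static String[] splitOnWhitespace(String str) {
        return str.trim().split("\\s+");
    }

    public static void main(String[] args) {
        String str = "This  is a   simple sentence";

        // Splitting on one space keeps an empty string for each extra space.
        String[] naive = str.split(" ");
        String[] words = splitOnWhitespace(str);

        System.out.println(naive.length + " " + Arrays.toString(naive));
        System.out.println(words.length + " " + Arrays.toString(words));
    }
}
```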

The easiest and best answer I can think of is to use the following method defined on the Java String class:

String[] split(String regex)

And just do "This is a sample sentence".split(" "). Because it takes a regex, you can do more complicated splits as well, which can include removing unwanted punctuation and other such characters.

Use string.replace(".", "").replace(",", "").replace("?", "").replace("!","").split(" ") to split your string into an array with no periods, commas, question marks, or exclamation marks. You can add or remove as many replace calls as you want.
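
The chain of replace calls can also be collapsed into a single replaceAll with a character class. Below is a sketch under that assumption: the character set [.,?!] mirrors the four calls above, and PunctuationStrip.toWords is a hypothetical name.

```java
import java.util.Arrays;

class PunctuationStrip {
    static String[] toWords(String sentence) {
        // Remove periods, commas, question marks and exclamation marks in one pass,
        // then split on single spaces as in the chained version.
        return sentence.replaceAll("[.,?!]", "").split(" ");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(toWords("Hello, world! How are you?")));
    }
}
```

Adding another character to strip then means extending the character class rather than adding another call.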

Try this:

// requires: import java.util.regex.Pattern;
String[] stringArray = Pattern.compile("\\s+").split(
    "This is a sample sentence"
        .replaceAll("[^\\p{Alnum}\\s]+", "") // remove everything except alphanumeric chars and whitespace
);

for (int j = 0; j < stringArray.length; j++) {
    System.out.println(j + " \"" + stringArray[j] + "\"");
}

I already posted this answer elsewhere; I will post it here again. This version doesn't use any major built-in method. You get each word back as a char array, which you can convert into a String. Hope it helps!

import java.util.Scanner;

public class SentenceToWord 
{
    public static int getNumberOfWords(String sentence)
    {
        int counter=0;
        for(int i=0;i<sentence.length();i++)
        {
            if(sentence.charAt(i)==' ')
            counter++;
        }
        return counter+1;
    }

    public static char[] getSubString(String sentence,int start,int end) //method to give substring, replacement of String.substring() 
    {
        int counter=0;
        char charArrayToReturn[]=new char[end-start];
        for(int i=start;i<end;i++)
        {
            charArrayToReturn[counter++]=sentence.charAt(i);
        }
        return charArrayToReturn;
    }

    public static char[][] getWordsFromString(String sentence)
    {
        int wordsCounter=0;
        int spaceIndex=0;
        int length=sentence.length();
        char wordsArray[][]=new char[getNumberOfWords(sentence)][]; 
        for(int i=0;i<length;i++)
        {
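            if(sentence.charAt(i)==' ' || i+1==length)
            {
            int end=(sentence.charAt(i)==' ')?i:i+1; //exclude the space itself from the word
            wordsArray[wordsCounter++]=getSubString(sentence, spaceIndex,end); //get each word as substring
            spaceIndex=i+1; //move past the space to the start of the next word
            }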
        }
        return  wordsArray; //return the 2 dimensional char array
    }


    public static void main(String[] args) 
    {
    System.out.println("Please enter the String");
    Scanner input=new Scanner(System.in);
    String userInput=input.nextLine().trim();
    int numOfWords=getNumberOfWords(userInput);
    char words[][]=getWordsFromString(userInput);
    System.out.println("Total number of words found in the String is "+(numOfWords));
    for(int i=0;i<numOfWords;i++)
    {
        System.out.println(" ");
        for(int j=0;j<words[i].length;j++)
        {
        System.out.print(words[i][j]);//print out each char one by one
        }
    }
    }

}

string.replaceAll() does not work correctly with a locale other than the default one, at least in jdk7u10.

This example creates a word dictionary from a text file in the Windows Cyrillic charset CP1251:

    public static void main (String[] args) {
    String fileName = "Tolstoy_VoinaMir.txt";
    try {
        List<String> lines = Files.readAllLines(Paths.get(fileName),
                                                Charset.forName("CP1251"));
        Set<String> words = new TreeSet<>();
        for (String s: lines ) {
            for (String w : s.split("\\s+")) {
                w = w.replaceAll("\\p{Punct}","");
                words.add(w);
            }
        }
        for (String w: words) {
            System.out.println(w);
        }
    } catch (Exception e) {
        e.printStackTrace();
    }
    }
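
One possible reason for surprises with non-English text is that \\w, \\W and \\p{Punct} are ASCII-only by default in java.util.regex. Cyrillic letters therefore count as non-word characters, and typographic punctuation such as « » is not covered by \\p{Punct}. The sketch below splits on anything that is not a Unicode letter or decimal digit instead. The class and method names are illustrative, and the sample text is written with Unicode escapes only so that the source file stays ASCII.

```java
import java.util.Arrays;

class UnicodeWords {
    // Split on runs of characters that are neither Unicode letters (\p{L})
    // nor Unicode decimal digits (\p{Nd}); drop the empty string that a
    // leading delimiter would otherwise produce.
    static String[] words(String text) {
        return Arrays.stream(text.split("[^\\p{L}\\p{Nd}]+"))
                .filter(w -> !w.isEmpty())
                .toArray(String[]::new);
    }

    public static void main(String[] args) {
        // Russian sample: the title "War and Peace" in guillemets, then "volume 1."
        String text = "\u00AB\u0412\u043E\u0439\u043D\u0430 \u0438 \u043C\u0438\u0440\u00BB, "
                + "\u0442\u043E\u043C 1.";
        System.out.println(Arrays.toString(words(text)));
    }
}
```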

Following is a code snippet which splits a sentence into words and gives the count of each word too.

import java.util.HashMap;
import java.util.Map;

public class StringToword {
    public static void main(String[] args) {
        String s = "a a a A A";
        String[] splitedString = s.split(" ");
        // Map each distinct word to the number of times it occurs
        Map<String, Integer> m = new HashMap<>();
        for (String s1 : splitedString) {
            // Look up the count already stored for this word, not a shared counter
            int count = m.containsKey(s1) ? m.get(s1) + 1 : 1;
            m.put(s1, count);
        }
        for (Map.Entry<String, Integer> entry : m.entrySet()) {
            System.out.println(entry);
        }
    }
}
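
The per-word count can also be kept with Map.merge (available since Java 8), which adds 1 to an existing entry or inserts 1 for a new word. A minimal sketch follows; WordCount and count are illustrative names.

```java
import java.util.Map;
import java.util.TreeMap;

class WordCount {
    // Count occurrences of each space-separated word.
    // TreeMap is used only so that the printed order is predictable.
    static Map<String, Integer> count(String sentence) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String word : sentence.split(" ")) {
            counts.merge(word, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(count("a a a A A"));
    }
}
```

The count is case-sensitive, so "a" and "A" are tallied separately, as in the snippet above.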

Another way to do that is StringTokenizer. For example:

// requires: import java.util.StringTokenizer;
public static void main(String[] args) {

    String str = "This is a sample string";
    StringTokenizer st = new StringTokenizer(str, " ");
    String[] starr = new String[st.countTokens()];
    int i = 0; // index of the next free slot in starr
    while (st.hasMoreTokens()) {
        starr[i++] = st.nextToken(); // nextToken() returns a String, unlike nextElement()
    }
}
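
StringTokenizer treats every character of its second argument as a delimiter and never returns empty tokens, so punctuation can be dropped by listing it there. The sketch below uses the delimiter set " .,!?", chosen purely for illustration. TokenizerWords is a made-up name.

```java
import java.util.Arrays;
import java.util.StringTokenizer;

class TokenizerWords {
    static String[] toWords(String sentence) {
        // Each character in " .,!?" acts as a separator on its own.
        StringTokenizer st = new StringTokenizer(sentence, " .,!?");
        String[] words = new String[st.countTokens()];
        int i = 0;
        while (st.hasMoreTokens()) {
            words[i++] = st.nextToken();
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(toWords("This is a sample sentence.")));
    }
}
```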

You can use the following simple code:

String str = "This is a sample sentence.";
String[] words = str.split("[ .]+"); // split on runs of spaces and periods
for (int i = 0; i < words.length; i++)
    System.out.print(words[i] + " ");
Most of the answers here convert the String to a String array, as the question asked. In practice we more often work with a List, so this may be more useful:

String dummy = "This is a sample sentence.";
List<String> wordList= Arrays.asList(dummy.split(" "));
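
Keep in mind that Arrays.asList returns a fixed-size list backed by the array, so add and remove throw UnsupportedOperationException. If the list needs to grow or shrink, copying it into an ArrayList is one option. The names below are illustrative.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class WordListDemo {
    // Returns a resizable list of the space-separated words.
    static List<String> toWordList(String sentence) {
        return new ArrayList<>(Arrays.asList(sentence.split(" ")));
    }

    public static void main(String[] args) {
        List<String> wordList = toWordList("This is a sample sentence.");
        wordList.remove("a");   // allowed: the ArrayList copy is resizable
        wordList.add("indeed");
        System.out.println(wordList);
    }
}
```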

Here is a solution in plain and simple C++ code with no fancy functions: use dynamic memory allocation (DMA) to allocate a dynamic string array, and append characters to the current word until you find a space. Please refer to the commented code below. I hope it helps.

#include<bits/stdc++.h>
using namespace std;

int main()
{

string data="hello there how are you"; // a_size=5, char count =23
//getline(cin,data); 
int count=0; // initialize a count to count total number of spaces in string.
int len=data.length();
for (int i = 0; i < (int)data.length(); ++i)
{
    if(data[i]==' ')
    {
        ++count;
    }
}
//declare a string array +1 greater than the size 
// num of space in string.
string* str = new string[count+1];

int i, start=0;
for (int index=0; index<count+1; ++index) // index array to increment index of string array and feed data.
{   string temp="";
    for ( i = start; i <len; ++i)
    {   
        if(data[i]!=' ') //increment temp stored word till you find a space.
        {
            temp=temp+data[i];
        }else{
            start=i+1; // increment i counter to next to the space
            break;
        }
    }str[index]=temp;
}


//print data 
for (int i = 0; i < count+1; ++i)
{
    cout<<str[i]<<" ";
}

    delete[] str; // release the dynamically allocated array
    return 0;
}

This should help:

 String s = "This is a sample sentence";
 String[] words = s.split(" ");

This will create an array whose elements are the substrings separated by " ".

Try this:

import java.util.Scanner;

public class test {
    public static void main(String[] args) {

        Scanner t = new Scanner(System.in);
        String x = t.nextLine();

        System.out.println(x);

        String[] starr = x.split(" ");

        System.out.println("reg no: "+ starr[0]);
        System.out.println("name: "+ starr[1]);
        System.out.println("district: "+ starr[2]);

    }
}
