How do I get the set of all letters in Java/Clojure?

In Python, I can do this:

>>> import string
>>> string.letters
'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'

Is there any way to do something similar in Clojure (apart from copying and pasting the above characters somewhere)? I looked through both the Clojure standard library and the Java standard library and couldn't find it.

If you just want ASCII chars,

(map char (concat (range 65 91) (range 97 123)))

will yield,

(\A \B \C \D \E \F \G \H \I \J \K \L \M \N \O \P \Q \R \S \T \U \V \W \X \Y \Z 
 \a \b \c \d \e \f \g \h \i \j \k \l \m \n \o \p \q \r \s \t \u \v \w \x \y \z)

A properly non-ASCII-centric implementation:

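import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

// Collects every letter in the basic multilingual plane that the named charset can encode.
private static String allLetters(String charsetName)
{
    CharsetEncoder ce = Charset.forName(charsetName).newEncoder();
    StringBuilder result = new StringBuilder();
    // The bound must stay strict: a char wraps around at MAX_VALUE, so c<=MAX_VALUE would never end.
    for(char c=0; c<Character.MAX_VALUE; c++)
    {
        // Keep only characters that are both encodable and classified as letters.
        if(ce.canEncode(c) && Character.isLetter(c))
        {
            result.append(c);
        }
    }
    return result.toString();
}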

Call this with "US-ASCII" and you'll get the desired result (except that uppercase letters come first). You could call it with Charset.defaultCharset().name(), but I suspect that you'd get far more than the ASCII letters on most systems, even in the USA.

Caveat: only considers the basic multilingual plane. Wouldn't be too hard to extend to the supplementary planes, but it would take a lot longer, and the utility is questionable.
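As a rough sketch of the extension mentioned in the caveat, the loop can walk int code points up to Character.MAX_CODE_POINT instead of char values. The class name AllCodePointLetters is made up for this illustration, and testing encodability through canEncode(CharSequence) on a one-code-point String is my assumption about a reasonable way to handle surrogate pairs, not something taken from the answer above.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

// Hypothetical class name, used only for this illustration.
class AllCodePointLetters {

    // Collects every letter the named charset can encode, across all Unicode planes.
    static String allLetters(String charsetName) {
        CharsetEncoder ce = Charset.forName(charsetName).newEncoder();
        StringBuilder result = new StringBuilder();
        for (int cp = 0; cp <= Character.MAX_CODE_POINT; cp++) {
            if (!Character.isLetter(cp)) {
                continue; // also skips lone surrogates, which are not letters
            }
            // A supplementary code point needs two chars, so test it as a String.
            String s = new String(Character.toChars(cp));
            if (ce.canEncode(s)) {
                result.appendCodePoint(cp);
            }
        }
        return result.toString();
    }

    public static void main(String[] args) {
        System.out.println(allLetters("US-ASCII"));
        String utf8 = allLetters("UTF-8");
        System.out.println(utf8.codePointCount(0, utf8.length()) + " letters encodable in UTF-8");
    }
}
```

With "US-ASCII" this should give the same 52 letters as the BMP-only version. With "UTF-8" the result also includes supplementary-plane letters, and the exact count depends on the Unicode version your JDK ships with.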

Based on Michael's imperative Java solution, this is an idiomatic (lazy-sequence) Clojure solution:

(ns stackoverflow
  (:import (java.nio.charset Charset CharsetEncoder)))

(defn all-letters [charset]
  (let [encoder (. (Charset/forName charset) newEncoder)]
    (letfn [(valid-char? [c]
              (and (.canEncode encoder (char c)) (Character/isLetter c)))
            (all-letters-lazy [c]
              (when (<= c (int Character/MAX_VALUE))
                (if (valid-char? c)
                  (lazy-seq
                   (cons (char c) (all-letters-lazy (inc c))))
                  (recur (inc c)))))]
      (all-letters-lazy 0))))

Update: Thanks to cgrand for this preferable high-level solution:

(defn letters [charset-name]
  (let [ce (-> charset-name java.nio.charset.Charset/forName .newEncoder)]
    (->> (range 0 (int Character/MAX_VALUE)) (map char)
         (filter #(and (.canEncode ce %) (Character/isLetter %))))))

But the performance comparison between my first approach

user> (time (doall (stackoverflow/all-letters "ascii")))
"Elapsed time: 33.333336 msecs"
(\A \B \C \D \E \F \G \H \I \J \K \L \M \N \O \P \Q \R \S \T \U \V \W \X \Y \Z
 \a \b \c \d \e \f \g \h \i \j \k \l \m \n \o \p \q \r \s \t \u \v \w \x \y \z)

and your solution

user> (time (doall (stackoverflow/letters "ascii")))
"Elapsed time: 666.666654 msecs"
(\A \B \C \D \E \F \G \H \I \J \K \L \M \N \O \P \Q \R \S \T \U \V \W \X \Y \Z
 \a \b \c \d \e \f \g \h \i \j \k \l \m \n \o \p \q \r \s \t \u \v \w \x \y \z)

is quite interesting.

No, because that is just printing out the ASCII letters rather than the full set. Of course, it's trivial to print out the 26 lowercase and uppercase letters using two for loops, but the fact is that there are many more "letters" outside of the first 128 code points. Java's Character.isLetter method returns true for these and many others.

string.letters: The concatenation of the strings lowercase and uppercase described below. The specific value is locale-dependent, and will be updated when locale.setlocale() is called.

I modified the answer from Michael Borgwardt. My implementation keeps two buffers, lowerCases and upperCases, for two reasons:

  1. string.letters is the lowercase letters followed by the uppercase letters.

  2. Java's Character.isLetter(char) accepts more than just uppercase and lowercase letters, so using Character.isLetter(char) returns too many results under some charsets, for example "windows-1252".

From the API doc for Character.isLetter(char):

A character is considered to be a letter if its general category type, provided by Character.getType(ch), is any of the following:

 * UPPERCASE_LETTER
 * LOWERCASE_LETTER
 * TITLECASE_LETTER
 * MODIFIER_LETTER
 * OTHER_LETTER

Not all letters have case. Many characters are letters but are neither uppercase nor lowercase nor titlecase.
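To make the quoted categories concrete, here is a small sketch that prints the general category of a few sample characters. The class LetterCategories, the helper categoryName and the sample code points are my own picks for illustration; only Character.getType, Character.isLetter and the category constants come from the JDK.

```java
// Hypothetical class name, used only for this illustration.
class LetterCategories {

    // Maps a code point to the name of its letter category, if it has one.
    static String categoryName(int codePoint) {
        switch (Character.getType(codePoint)) {
            case Character.UPPERCASE_LETTER: return "UPPERCASE_LETTER";
            case Character.LOWERCASE_LETTER: return "LOWERCASE_LETTER";
            case Character.TITLECASE_LETTER: return "TITLECASE_LETTER";
            case Character.MODIFIER_LETTER:  return "MODIFIER_LETTER";
            case Character.OTHER_LETTER:     return "OTHER_LETTER";
            default:                         return "not a letter";
        }
    }

    public static void main(String[] args) {
        // 'A', 'a', U+01C5 (a titlecase digraph), U+02B0 (modifier small h),
        // U+05D0 (Hebrew alef) and the digit '1' as a non-letter.
        int[] samples = {'A', 'a', 0x01C5, 0x02B0, 0x05D0, '1'};
        for (int cp : samples) {
            System.out.printf("U+%04X %-17s isLetter=%b%n",
                    cp, categoryName(cp), Character.isLetter(cp));
        }
    }
}
```

Hebrew alef (U+05D0), for example, is a letter but is neither uppercase nor lowercase, so a filter built on isUpperCase and isLowerCase drops it while one built on isLetter keeps it.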

So if string.letters should only return lowercase and uppercase letters, the TITLECASE_LETTER, MODIFIER_LETTER and OTHER_LETTER chars have to be ignored.

public static String allLetters(final Charset charset) {
    final CharsetEncoder encoder = charset.newEncoder();
    final StringBuilder lowerCases = new StringBuilder();
    final StringBuilder upperCases = new StringBuilder();
    for (char c = 0; c < Character.MAX_VALUE; c++) {
        if (encoder.canEncode(c)) {
            if (Character.isUpperCase(c)) {
                upperCases.append(c);
            } else if (Character.isLowerCase(c)) {
                lowerCases.append(c);
            }
        }
    }
    return lowerCases.append(upperCases).toString();
}

Additionally, the behaviour of string.letters changes when the locale changes. This probably does not carry over to my solution, because changing the default locale does not change the default charset. From the API doc:

The default charset is determined during virtual-machine startup and typically depends upon the locale and charset of the underlying operating system.

I guess the default charset cannot be changed once the JVM has started. So the "change locale" behaviour of string.letters cannot be reproduced with just Locale.setDefault(Locale). But changing the default locale is a bad idea anyway:

Since changing the default locale may affect many different areas of functionality, this method should only be used if the caller is prepared to reinitialize locale-sensitive code running within the same Java Virtual Machine.
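A quick way to check the guess about the default charset is to compare Charset.defaultCharset() before and after a call to Locale.setDefault. This is only a sketch: the class LocaleVersusCharset and its helper are hypothetical names, and the helper restores the original locale afterwards precisely because of the warning quoted above.

```java
import java.nio.charset.Charset;
import java.util.Locale;

// Hypothetical class name, used only for this illustration.
class LocaleVersusCharset {

    // Returns true if the default charset is unchanged after switching the default locale.
    static boolean charsetSurvivesLocaleChange(Locale other) {
        Locale original = Locale.getDefault();
        Charset before = Charset.defaultCharset();
        try {
            Locale.setDefault(other);
            return before.equals(Charset.defaultCharset());
        } finally {
            // Put the original locale back so the rest of the JVM is unaffected.
            Locale.setDefault(original);
        }
    }

    public static void main(String[] args) {
        System.out.println("default charset: " + Charset.defaultCharset());
        System.out.println("unchanged after locale switch: "
                + charsetSurvivesLocaleChange(Locale.JAPAN));
    }
}
```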

I'm pretty sure these letters are not available in the standard library, so you will probably just have to build them by hand.

The following code gives the same result as in your question. Unlike in Python, you have to build the string by hand:

public class Letters {

    public static String asString() {
        StringBuffer buffer = new StringBuffer();
        for (char c = 'a'; c <= 'z'; c++)
            buffer.append(c);
        for (char c = 'A'; c <= 'Z'; c++)
            buffer.append(c);
        return buffer.toString();
    }

    public static void main(String[] args) {
        System.out.println(Letters.asString());
    }

}

In case you don't remember the code point ranges, here is the brute-force way :-P :

user> (require '[clojure.contrib.str-utils2 :as stru2])
nil
user> (set (stru2/replace (apply str (map char (range 0 256))) #"[^A-Za-z]" ""))
#{\A \a \B \b \C \c \D \d \E \e \F \f \G \g \H \h \I \i \J \j \K \k \L \l \M \m \N \n \O \o \P \p \Q \q \R \r \S \s \T \t \U \u \V \v \W \w \X \x \Y \y \Z \z}
user> 
