How do I get the set of all letters in Java/Clojure?

In Python, I can do this:

>>> import string
>>> string.letters
'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'

Is there any way to do something similar in Clojure (apart from copying and pasting the above characters somewhere)? I looked through both the Clojure standard library and the Java standard library and couldn't find it.

If you just want ASCII characters,

(map char (concat (range 65 91) (range 97 123)))

will yield:

(\A \B \C \D \E \F \G \H \I \J \K \L \M \N \O \P \Q \R \S \T \U \V \W \X \Y \Z 
 \a \b \c \d \e \f \g \h \i \j \k \l \m \n \o \p \q \r \s \t \u \v \w \x \y \z)

A properly non-ASCII-centric implementation:

private static String allLetters(String charsetName)
{
    CharsetEncoder ce = Charset.forName(charsetName).newEncoder();
    StringBuilder result = new StringBuilder();
    for(char c=0; c<Character.MAX_VALUE; c++)
    {
        if(ce.canEncode(c) && Character.isLetter(c))
        {
            result.append(c);
        }
    }
    return result.toString();
}

Call this with "US-ASCII" and you'll get the desired result (except that uppercase letters come first). You could call it with Charset.defaultCharset().name(), but I suspect that you'd get far more than the ASCII letters on most systems, even in the USA.

Caveat: this only considers the Basic Multilingual Plane. It wouldn't be too hard to extend to the supplementary planes, but it would take a lot longer, and the utility is questionable.
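To make the caveat concrete, here is a minimal sketch (not part of the original answer) of how the loop might be extended beyond the Basic Multilingual Plane by iterating over int code points instead of char values. The class name AllCodePointLetters is made up for illustration, and the charset names in main are only examples.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

class AllCodePointLetters {

    // Hypothetical variant of allLetters: walks every Unicode code point
    // instead of stopping at the last char value.
    static String allLetters(String charsetName) {
        CharsetEncoder ce = Charset.forName(charsetName).newEncoder();
        StringBuilder result = new StringBuilder();
        for (int cp = 0; cp <= Character.MAX_CODE_POINT; cp++) {
            if (!Character.isLetter(cp)) {
                continue;
            }
            // Supplementary code points need two chars (a surrogate pair),
            // so ask the encoder about a String rather than a single char.
            String s = new String(Character.toChars(cp));
            if (ce.canEncode(s)) {
                result.appendCodePoint(cp);
            }
        }
        return result.toString();
    }

    public static void main(String[] args) {
        System.out.println(allLetters("US-ASCII"));
        // Count code points, not chars: supplementary letters take two chars each.
        String utf8 = allLetters("UTF-8");
        System.out.println(utf8.codePointCount(0, utf8.length()));
    }
}
```

Using an int loop variable also sidesteps the overflow problem a char counter would have at the upper bound. The trade-off is the one the answer mentions: the loop covers roughly seventeen times as many values.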

Based on Michael's imperative Java solution, this is an idiomatic (lazy sequences) Clojure solution:

(ns stackoverflow
  (:import (java.nio.charset Charset CharsetEncoder)))

(defn all-letters [charset]
  (let [encoder (. (Charset/forName charset) newEncoder)]
    (letfn [(valid-char? [c]
              (and (.canEncode encoder (char c)) (Character/isLetter c)))
            (all-letters-lazy [c]
              (when (<= c (int Character/MAX_VALUE))
                (if (valid-char? c)
                  (lazy-seq
                   (cons (char c) (all-letters-lazy (inc c))))
                  (recur (inc c)))))]
      (all-letters-lazy 0))))

Update: Thanks to cgrand for this preferable high-level solution:

(defn letters [charset-name]
  (let [ce (-> charset-name java.nio.charset.Charset/forName .newEncoder)]
    (->> (range 0 (int Character/MAX_VALUE)) (map char)
         (filter #(and (.canEncode ce %) (Character/isLetter %))))))

But the performance comparison between my first approach

user> (time (doall (stackoverflow/all-letters "ascii")))
"Elapsed time: 33.333336 msecs"
(\A \B \C \D \E \F \G \H \I \J \K \L \M \N \O \P \Q \R \S \T \U \V \W \X \Y \Z
 \a \b \c \d \e \f \g \h \i \j \k \l \m \n \o \p \q \r \s \t \u \v \w \x \y \z)

and your solution

user> (time (doall (stackoverflow/letters "ascii")))
"Elapsed time: 666.666654 msecs"
(\A \B \C \D \E \F \G \H \I \J \K \L \M \N \O \P \Q \R \S \T \U \V \W \X \Y \Z
 \a \b \c \d \e \f \g \h \i \j \k \l \m \n \o \p \q \r \s \t \u \v \w \x \y \z)

is quite interesting.

No, because that is just printing out the ASCII letters rather than the full set. Of course, it's trivial to print out the 26 lowercase and uppercase letters using two for loops, but the fact is that there are many more "letters" beyond the first 128 code points (the ASCII range). Java's Character.isLetter method returns true for these and many others.
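As a rough illustration of that point, the sketch below counts how many char values are both letters and encodable in a given charset. It is only an assumption-laden example: the class name LetterCount and the choice of ISO-8859-1 as the comparison charset are mine, not from the answers here.

```java
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

class LetterCount {

    // Hypothetical helper: counts BMP chars that the encoder accepts and
    // that Character.isLetter classifies as letters.
    static int countLetters(CharsetEncoder encoder) {
        int count = 0;
        for (char c = 0; c < Character.MAX_VALUE; c++) {
            if (encoder.canEncode(c) && Character.isLetter(c)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // US-ASCII contains only A-Z and a-z, so this prints 52.
        System.out.println(countLetters(StandardCharsets.US_ASCII.newEncoder()));
        // ISO-8859-1 adds accented Latin letters such as é, so the count is higher.
        System.out.println(countLetters(StandardCharsets.ISO_8859_1.newEncoder()));
    }
}
```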

string.letters: The concatenation of the strings lowercase and uppercase described below. The specific value is locale-dependent, and will be updated when locale.setlocale() is called.

I modified the answer from Michael Borgwardt. My implementation keeps two buffers, lowerCases and upperCases, for two reasons:

  1. string.letters is the lowercase letters followed by the uppercase letters.

  2. Java's Character.isLetter(char) covers more than just uppercase and lowercase letters, so using it returns too many results under some charsets, for example "windows-1252".

From the API docs for Character.isLetter(char):

A character is considered to be a letter if its general category type, provided by Character.getType(ch), is any of the following:

 * UPPERCASE_LETTER
 * LOWERCASE_LETTER
 * TITLECASE_LETTER
 * MODIFIER_LETTER
 * OTHER_LETTER

Not all letters have case. Many characters are letters but are neither uppercase nor lowercase nor titlecase.

So if string.letters should only return lowercase and uppercase letters, the TITLECASE_LETTER, MODIFIER_LETTER and OTHER_LETTER chars have to be ignored.
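A small sketch of that distinction, assuming one wants to test the general category directly: the helper isUpperOrLowerLetter and the class name are hypothetical, and the check uses Character.getType, which is a different test from Character.isUpperCase and Character.isLowerCase and may disagree with them for a few characters.

```java
class LetterCategories {

    // Hypothetical helper: true only for the two cased categories that
    // string.letters is meant to cover.
    static boolean isUpperOrLowerLetter(char c) {
        int type = Character.getType(c);
        return type == Character.UPPERCASE_LETTER
            || type == Character.LOWERCASE_LETTER;
    }

    public static void main(String[] args) {
        char[] samples = {
            'A',       // UPPERCASE_LETTER
            'a',       // LOWERCASE_LETTER
            '\u01C5',  // TITLECASE_LETTER (Latin capital D with small z with caron)
            '\u02B0',  // MODIFIER_LETTER (modifier letter small h)
            '\u05D0'   // OTHER_LETTER (Hebrew letter alef)
        };
        for (char c : samples) {
            // Every sample is a letter, but only the first two pass the filter.
            System.out.printf("U+%04X isLetter=%b upperOrLower=%b%n",
                    (int) c, Character.isLetter(c), isUpperOrLowerLetter(c));
        }
    }
}
```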

public static String allLetters(final Charset charset) {
    final CharsetEncoder encoder = charset.newEncoder();
    final StringBuilder lowerCases = new StringBuilder();
    final StringBuilder upperCases = new StringBuilder();
    for (char c = 0; c < Character.MAX_VALUE; c++) {
        if (encoder.canEncode(c)) {
            if (Character.isUpperCase(c)) {
                upperCases.append(c);
            } else if (Character.isLowerCase(c)) {
                lowerCases.append(c);
            }
        }
    }
    return lowerCases.append(upperCases).toString();
}

Additionally: the behaviour of string.letters changes when the locale changes. This may not apply to my solution, because changing the default locale does not change the default charset. From the API docs:

The default charset is determined during virtual-machine startup and typically depends upon the locale and charset of the underlying operating system.

I guess the default charset cannot be changed within a running JVM, so the "change locale" behaviour of string.letters cannot be reproduced with just Locale.setDefault(Locale). But changing the default locale is a bad idea anyway:

Since changing the default locale may affect many different areas of functionality, this method should only be used if the caller is prepared to reinitialize locale-sensitive code running within the same Java Virtual Machine.
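The claim that the default charset is unaffected by Locale.setDefault can be checked with a short experiment. This is only a sketch: the class and method names are invented for illustration, Locale.FRANCE is an arbitrary choice, and the original default locale is restored afterwards precisely because of the warning quoted above.

```java
import java.nio.charset.Charset;
import java.util.Locale;

class LocaleVsCharset {

    // Hypothetical check: switch the default locale and report whether
    // Charset.defaultCharset() still returns the same charset.
    static boolean charsetSurvivesLocaleChange(Locale other) {
        Locale original = Locale.getDefault();
        Charset before = Charset.defaultCharset();
        try {
            Locale.setDefault(other);
            return before.equals(Charset.defaultCharset());
        } finally {
            // Put the default locale back so other code is not affected.
            Locale.setDefault(original);
        }
    }

    public static void main(String[] args) {
        System.out.println(charsetSurvivesLocaleChange(Locale.FRANCE));
    }
}
```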

I'm pretty sure the letters aren't available in the standard library, so you're probably left with the manual approach.

The following code produces the same result as in your question, although unlike the Python solution it has to be written by hand:

public class Letters {

    public static String asString() {
        StringBuffer buffer = new StringBuffer();
        for (char c = 'a'; c <= 'z'; c++)
            buffer.append(c);
        for (char c = 'A'; c <= 'Z'; c++)
            buffer.append(c);
        return buffer.toString();
    }

    public static void main(String[] args) {
        System.out.println(Letters.asString());
    }

}

In case you don't remember the code point ranges, here is the brute-force way :-P

user> (require '[clojure.contrib.str-utils2 :as stru2])
nil
user> (set (stru2/replace (apply str (map char (range 0 256))) #"[^A-Za-z]" ""))
#{\A \a \B \b \C \c \D \d \E \e \F \f \G \g \H \h \I \i \J \j \K \k \L \l \M \m \N \n \O \o \P \p \Q \q \R \r \S \s \T \t \U \u \V \v \W \w \X \x \Y \y \Z \z}
user> 
