Regex to replace characters that Windows doesn't accept in a filename

I'm trying to build a regular expression that will detect any character that Windows does not accept as part of a file name. (Are these the same for other operating systems? I don't know, to be honest.)

These symbols are:

\ / : * ? " < > |

Anyway, this is what I have: [\\/:*?\"<>|]

The tester over at http://gskinner.com/RegExr/ shows this to be working. For the string Allo*ha, the * symbol lights up, signalling it's been found. If I enter Allo**ha, however, only the first * lights up. So I think I need to modify this regex to find all occurrences of the listed characters, but I'm not sure.

You see, in Java, I'm lucky enough to have the function String.replaceAll(String regex, String replacement). The description says:

Replaces each substring of this string that matches the given regular expression with the given replacement.

So in other words, even if the regex only finds the first and then stops searching, this function will still find them all.

For instance: String.replaceAll("[\\\\/:*?\"<>|]","")

However, I don't feel like I can take that risk. So does anybody know how I can extend this?
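To check the behaviour that the quoted description promises, here is a minimal sketch. The class name ReplaceAllDemo and the helper strip are made up for illustration; the only real API used is String.replaceAll, and the character class is the one from the question written as a Java string literal.

```java
final class ReplaceAllDemo {
    // Java literal for the regex [\\/:*?"<>|], i.e. the characters \ / : * ? " < > |
    static final String FORBIDDEN = "[\\\\/:*?\"<>|]";

    // Hypothetical helper: removes every forbidden character, not just the first.
    static String strip(String name) {
        return name.replaceAll(FORBIDDEN, "");
    }

    public static void main(String[] args) {
        System.out.println(strip("Allo**ha"));    // prints "Alloha"
        System.out.println(strip("a\\b/c:d|e"));  // prints "abcde"
    }
}
```

Both asterisks disappear, which suggests the single highlighted match in the online tester is a setting of that tool rather than a limit of the expression itself.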

Since no answer was good enough, I did it myself. Hope this helps ;)

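public static boolean validateFileName(String fileName) {
    // First character: neither a dot nor forbidden; every later character: not forbidden.
    // Because the first class is mandatory, the empty string is rejected as well.
    return fileName.matches("[^.\\\\/:*?\"<>|][^\\\\/:*?\"<>|]*");
}

public static String getValidFileName(String fileName) {
    // Remove forbidden characters first, then leading dots. replaceFirst takes a regex;
    // String.replace would search for the literal text "^\.+" and change nothing.
    String newFileName = fileName.replaceAll("[\\\\/:*?\"<>|]", "").replaceFirst("^\\.+", "");
    if (newFileName.isEmpty())
        throw new IllegalStateException(
                "File name " + fileName + " results in an empty file name!");
    return newFileName;
}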

Windows filename rules are tricky. You're only scratching the surface.

For example, here are some things that are not valid filenames, in addition to the characters you listed:

                                    (yes, that's an empty string)
.
.a
a.
 a                                  (that's a leading space)
a                                   (or a trailing space)
com1
prn.txt
[anything over 240 characters]
[any control characters]
[any non-ASCII characters that don't fit in the system codepage,
 if the filesystem is FAT32]

Removing special characters in a single regex substitution like String.replaceAll() isn't enough; you can easily end up with something invalid like an empty string or a trailing '.' or ' '. Replacing something like "[^A-Za-z0-9_.]+" with '_' would be a better first step. But you will still need higher-level processing on whatever platform you're using.
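As a rough illustration of what that higher-level processing might look like, here is a sketch in Java. Everything in it is an assumption rather than an official rule set: the class name FileNameSanitizer, the choice to prefix reserved names with "_", the 240-character cap borrowed from the list above, and the reserved-name set (CON, PRN, AUX, NUL, COM1 to COM9, LPT1 to LPT9). It needs Java 9 or later for Set.of.

```java
import java.util.Locale;
import java.util.Set;

final class FileNameSanitizer {
    // Device names that Windows reserves, with or without an extension (assumed list).
    private static final Set<String> RESERVED = Set.of(
            "CON", "PRN", "AUX", "NUL",
            "COM1", "COM2", "COM3", "COM4", "COM5", "COM6", "COM7", "COM8", "COM9",
            "LPT1", "LPT2", "LPT3", "LPT4", "LPT5", "LPT6", "LPT7", "LPT8", "LPT9");

    // Length cap taken from the list of invalid names above; adjust as needed.
    private static final int MAX_LENGTH = 240;

    static String sanitize(String raw) {
        // 1. Replace each run of characters outside a conservative whitelist with "_".
        //    This also turns spaces and control characters into underscores.
        String name = raw.replaceAll("[^A-Za-z0-9_.]+", "_");

        // 2. Drop leading and trailing dots.
        name = name.replaceAll("^\\.+|\\.+$", "");

        // 3. Prefix reserved device names; only the part before the first dot counts.
        int dot = name.indexOf('.');
        String stem = (dot < 0 ? name : name.substring(0, dot)).toUpperCase(Locale.ROOT);
        if (RESERVED.contains(stem)) {
            name = "_" + name;
        }

        // 4. Enforce the length cap, then remove any trailing dot the cut exposed.
        if (name.length() > MAX_LENGTH) {
            name = name.substring(0, MAX_LENGTH).replaceAll("\\.+$", "");
        }

        // 5. Never hand back an empty name.
        return name.isEmpty() ? "_" : name;
    }

    public static void main(String[] args) {
        System.out.println(sanitize("prn.txt"));   // prints "_prn.txt"
        System.out.println(sanitize("Allo**ha"));  // prints "Allo_ha"
        System.out.println(sanitize("."));         // prints "_"
    }
}
```

The whitelist is deliberately stricter than Windows requires; the trade-off is that non-ASCII names are flattened to underscores.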

I use a plain and simple regular expression. I list the characters that may occur and, by negating the class with "^", replace every other character with "_".

String fileName = someString.replaceAll("[^a-zA-Z0-9\\.\\-]", "_");

For example, if you do not want to allow "." in the name, remove the "\\." from the expression:

String fileName = someString.replaceAll("[^a-zA-Z0-9\\-]", "_");

For the record, POSIX-compliant systems (including UNIX and Linux) support all characters in filenames except the null character ('\0') and the forward slash ('/'). Special characters such as space and asterisk must be escaped on the command line so that they do not take on their usual roles.
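For comparison with the Windows rules, the POSIX restriction is small enough to check without a regex. The sketch below is illustrative only: PosixNameCheck and isValidComponent are made-up names, and the extra rejection of the empty string, "." and ".." is my addition, since those cannot name a new file even though their characters are legal. Length limits are not checked.

```java
final class PosixNameCheck {
    // True when the string could serve as a single path component on a POSIX system.
    static boolean isValidComponent(String name) {
        return !name.isEmpty()
                && !name.equals(".")       // refers to the current directory
                && !name.equals("..")      // refers to the parent directory
                && name.indexOf('/') < 0   // path separator
                && name.indexOf('\0') < 0; // string terminator
    }

    public static void main(String[] args) {
        System.out.println(isValidComponent("Allo*ha?.txt")); // prints "true"
        System.out.println(isValidComponent("a/b"));          // prints "false"
    }
}
```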

I extract all word characters and whitespace characters from the original string, and I also make sure that no whitespace character is present at the end of the string. Here is my code snippet in Java.

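// Inside a character class "|" is a literal pipe, so it is left out to avoid keeping pipes.
temp_string = original.replaceAll("[^\\w\\s]", "");
// "\\s+$" removes all trailing whitespace, not only the last character.
final_string = temp_string.replaceAll("\\s+$", "");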

I think I helped someone.

Java has a replaceAll function, but most programming languages have a way to do something similar. Perl, for example, uses the g switch to signify a global replacement. Python's sub function allows you to specify the number of replacements to make. If, for some reason, your language doesn't have an equivalent, you can always do something like this:

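// bad_characters is a regex string such as "[\\\\/:*?\"<>|]"; one match is removed per pass
while (!filename.equals(filename.replaceFirst(bad_characters, "")))
    filename = filename.replaceFirst(bad_characters, "");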

I made one very simple method that works for me in most common cases:

// replace special characters that windows doesn't accept
private String replaceSpecialCharacters(String string) {
    return string.replaceAll("[\\*/\\\\!\\|:?<>]", "_")
            .replaceAll("(%22)", "_");
}

%22 is the URL-encoded form of a quote ( " ), for the case where your file names contain one.

The required regex / syntax (JS):

.trim().replace(/[\\/:*?\"<>|]/g,"").substring(0,240);

where the last bit is optional; use it only when you want to limit the length to 240.

Other useful functions (JS):

.toUpperCase();
.toLowerCase();
.replace(/ {2,}/g,' ');  //normalising multiple spaces to one, add before substring.
.includes("str");        //check if a string segment is included in the filename
.split(".").pop();       //get extension, given the entire filename contains a .

You could try allowing only what the user should be able to enter, such as A-Z, a-z and 0-9.

You cannot do this with a single regexp match, because a regexp always matches one substring of the input. Consider the word Alo*h*a: there is no substring that contains all the *s and no other character. So if you can use the replaceAll function, just stick with it.
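To make the "one match is one substring" point concrete, the sketch below asks Java's Matcher for matches repeatedly and records where each one starts. FindAllDemo and positions are illustrative names; Pattern, Matcher.find and Matcher.start are the standard java.util.regex calls.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class FindAllDemo {
    // Same character class as in the question: \ / : * ? " < > |
    private static final Pattern FORBIDDEN = Pattern.compile("[\\\\/:*?\"<>|]");

    // Returns the index of every forbidden character, one match per find() call.
    static List<Integer> positions(String input) {
        List<Integer> found = new ArrayList<>();
        Matcher m = FORBIDDEN.matcher(input);
        while (m.find()) {
            found.add(m.start());
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(positions("Alo*h*a")); // prints "[3, 5]"
    }
}
```

Each call to find() yields a separate one-character match; replaceAll presumably performs the same kind of loop internally, which is why it handles every occurrence.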

BTW, the set of forbidden characters is different in other OSes.

Windows also does not accept "%" as a file name.

If you are building a general expression that may affect files that will eventually be moved to another operating system, I suggest that you include more characters that may cause problems there.

For example, in Linux (many distributions I know), some users may have problems with files containing & ! ] [ / - ( ). The symbols are allowed in file names, but they may need special treatment by users, and some programs have bugs triggered by their presence.
