
Regex for UTF-8 valid filenames

I am trying to process the names of the files my users upload. I want to support all valid UTF-8 characters except those that might pose a problem for display on an HTML page, access from a CLI, or storage and retrieval on a filesystem.

I came up with the following lenient function and I'm wondering whether it's safe enough to use. I use prepared statements for all database queries and I always HTML-encode my output, but I'd still like to know whether this is a well-thought-out approach.

// $filename = $_FILES['file']['name'];

$filename = 'Filename 123;".\'"."la\l[a]*(/.jpg
∮ E⋅da = Q,  n → ∞, ∑ f(i) = ∏ g(i), ∀x∈ℝ: ⌈x⌉ = −⌊−x⌋, α ∧ ¬β = ¬(¬α ∨ β),
  ℕ ⊆ ℕ₀ ⊂ ℤ ⊂ ℚ ⊂ ℝ ⊂ ℂ, ⊥ < a ≠ b ≡ c ≤ d ≪ ⊤ ⇒ (A ⇔ B),
  2H₂ + O₂ ⇌ 2H₂O, R = 4.7 kΩ, ⌀ 200 mm
sfajs,-=[];\',./09μετράει
าวนั้นเป็นชน
Καλημέρα κόσμε, コンニチハ
()_+{}|":?><';


// Replace symbols (S), punctuation (P), and "other" characters (C: control, format, unassigned, private-use), e.g. \n or [BEL]
$filename = preg_replace('~[\p{S}\p{P}\p{C}]+~u', ' ', $filename);
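
For illustration, here is a minimal sketch that wraps the pattern above in a function so its effect is easy to check. The function name `strip_unsafe_chars` is hypothetical and not part of the original code. One thing it makes visible: `\p{P}` also matches `.`, `-` and `_`, so the dot before the extension is replaced as well.

```php
<?php
// Hypothetical wrapper around the pattern from the question.
function strip_unsafe_chars(string $name): string
{
    // Replace each run of symbols, punctuation, or "other" characters with one space.
    $clean = preg_replace('~[\p{S}\p{P}\p{C}]+~u', ' ', $name);

    // With the /u modifier, preg_replace returns null if $name is not valid UTF-8.
    if ($clean === null) {
        throw new InvalidArgumentException('Filename is not valid UTF-8');
    }

    return $clean;
}

echo strip_unsafe_chars('my_file-v2.jpg'), "\n"; // prints "my file v2 jpg"
```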

Is this approach safe for me, and suitable for my users?

Update

To clarify, I do not use the filename as the name of the file on the filesystem. I generate a unique hash and use that. I just need to save the original name for the users' benefit, since that is how they recognize their files. A SHA1 hash or UUID doesn't mean a thing to them.

The very first thing you need to do is check that your input is valid UTF-8.

mb_internal_encoding and mb_check_encoding are your friends.
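
As a minimal sketch of that check, assuming the mbstring extension is loaded: `mb_check_encoding` returns `false` for byte sequences that are not well-formed in the given encoding. The helper name `is_valid_utf8_name` is made up for this example.

```php
<?php
// Hypothetical helper: true only when $name is well-formed UTF-8.
function is_valid_utf8_name(string $name): bool
{
    // Passing the encoding explicitly avoids depending on mb_internal_encoding().
    return mb_check_encoding($name, 'UTF-8');
}

var_dump(is_valid_utf8_name('Καλημέρα.jpg'));     // well-formed UTF-8
var_dump(is_valid_utf8_name("bad\xC3\x28.jpg"));  // 0xC3 is not followed by a continuation byte
```

Rejecting (or re-encoding) the name at this point means the `/u` regex that follows never sees malformed input.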

You are using a blacklist, when it's good security practice to use a whitelist of allowed input.
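
A whitelist version of the same idea could look like the sketch below. The allowed set here (letters, combining marks, decimal digits, space, dot, underscore, hyphen) is only an assumption for illustration, and `whitelist_filename` is a hypothetical name; adjust the character class to whatever your users actually need.

```php
<?php
// Hypothetical whitelist filter: anything NOT explicitly allowed becomes a space.
function whitelist_filename(string $name): string
{
    // Allowed: letters (L), combining marks (M), decimal digits (Nd), space, '.', '_', '-'.
    $clean = preg_replace('~[^\p{L}\p{M}\p{Nd} ._-]+~u', ' ', $name);

    // With the /u modifier, preg_replace returns null on invalid UTF-8 input.
    if ($clean === null) {
        throw new InvalidArgumentException('Filename is not valid UTF-8');
    }

    return trim($clean);
}

echo whitelist_filename('a/b*c.jpg'), "\n"; // prints "a b c.jpg"
```

Unlike the blacklist pattern, this keeps the dot before the extension, and any Unicode category you did not think about is rejected by default.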

Edit after the clarification:

You should be safe. Remember to filter Lm and No as well, along with the combining-mark categories (Mn, Me), if you don't want to summon Zalgo.
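
Zalgo-style text is built by stacking many combining marks (the `\p{M}` categories) on one base character. Stripping every mark would also damage scripts such as Thai, which appear in the sample filename and rely on them, so one option is to cap the length of each run instead. The sketch below assumes a cap of two marks is acceptable; both that limit and the name `limit_combining_marks` are hypothetical.

```php
<?php
// Hypothetical helper: keep at most $max consecutive combining marks.
function limit_combining_marks(string $name, int $max = 2): string
{
    if ($max < 0) {
        throw new InvalidArgumentException('$max must be >= 0');
    }

    // Capture the first $max marks of a run and drop every mark after them.
    $pattern = '~(\p{M}{' . $max . '})\p{M}+~u';
    $clean = preg_replace($pattern, '$1', $name);

    // With the /u modifier, preg_replace returns null on invalid UTF-8 input.
    if ($clean === null) {
        throw new InvalidArgumentException('Filename is not valid UTF-8');
    }

    return $clean;
}

// "a" followed by five combining acute accents is reduced to "a" plus two accents.
$zalgo = 'a' . str_repeat("\u{0301}", 5);
echo mb_strlen(limit_combining_marks($zalgo), 'UTF-8'), "\n"; // prints 3
```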
