How do I remove emoji characters from a string?

I've got a text input from a mobile device. It contains emoji. In C#, I have the text as

Text 🍫🌐 text

Simply put, I want the output text to be

Text text

I'm trying to remove all such emoji from the text with a regex, except I'm not sure how to convert an emoji into its Unicode sequence. How do I do that?

Edit:

I'm trying to save the user input into MySQL. It looks like MySQL's utf8 character set doesn't support all Unicode characters (it cannot store the 4-byte UTF-8 sequences that emoji use), and the right way to fix this would be to change the schema, but I don't think that is an option for me. So I'm trying to remove all the emoji characters before saving the text to the database.

This is my schema for the relevant column:

[screenshot not preserved]

I'm using NHibernate as my ORM, and the generated insert query looks like this:

Insert into `Content` (ContentTypeId, Comments, DateCreated) 
values (?p0, ?p1, ?p2);
?p0 = 4 [Type: Int32 (0)]. ?p1 = 'Text 🍫🌐 text' [Type: String (20)], ?p2 = 19/01/2015 10:38:23 [Type: DateTime (0)]

When I copy this query from the logs and run it on MySQL directly, I get this error:

1 warning(s): 1366 Incorrect string value: '\xF0\x9F\x98\x80 t...' for column 'Comments' at row 1   0.000 sec

Also, I've tried to convert it into encoded bytes, and that doesn't really work either.

[screenshot not preserved]

Assuming you just want to remove all non-BMP characters, i.e. anything with a Unicode code point of U+10000 or higher, you can use a regex to remove all UTF-16 surrogate code units from the string. For example:

using System;
using System.Text.RegularExpressions;

class Test
{
    static void Main(string[] args)
    {
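        // U+1F310 is outside the BMP, so it occupies two UTF-16 code units.
        string text = "x\U0001F310y";
        Console.WriteLine(text.Length); // 4
        // Remove every surrogate code unit (Unicode category Cs).
        string result = Regex.Replace(text, @"\p{Cs}", "");
        Console.WriteLine(result.Length); // 2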
    }
}

Here "Cs" is the Unicode category for "surrogate".

It appears that Regex works on UTF-16 code units rather than Unicode code points, which is why matching surrogates works here. Otherwise you'd need a different approach.
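To make the difference between code units and code points concrete, here is a minimal sketch that prints both for the sample string. The `CodePointDemo` class and its `CodePoints` helper are hypothetical names, not part of the original answer, and the sketch assumes well-formed input (`char.ConvertToUtf32` throws on an unpaired surrogate).

```csharp
using System;
using System.Collections.Generic;

static class CodePointDemo
{
    // Hypothetical helper: decodes a well-formed UTF-16 string into code points.
    public static List<int> CodePoints(string s)
    {
        var result = new List<int>();
        for (int i = 0; i < s.Length; i++)
        {
            // ConvertToUtf32 combines a surrogate pair into a single code point.
            result.Add(char.ConvertToUtf32(s, i));
            if (char.IsHighSurrogate(s[i]))
            {
                i++; // skip the low surrogate that was just consumed
            }
        }
        return result;
    }

    static void Main()
    {
        string text = "x\U0001F310y";

        // Four code units: U+0078, U+D83C, U+DF10, U+0079
        foreach (char c in text)
        {
            Console.WriteLine("code unit  U+{0:X4}", (int)c);
        }

        // Three code points: U+0078, U+1F310, U+0079
        foreach (int cp in CodePoints(text))
        {
            Console.WriteLine("code point U+{0:X4}", cp);
        }
    }
}
```

The two middle code units (U+D83C and U+DF10) are the surrogate pair that `\p{Cs}` matches, one code unit at a time.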

Note that there are non-BMP characters other than emoji, but I suspect you'll find they have the same problem when you try to store them.
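As a rough illustration of that caveat, the sketch below wraps the same regex in a helper and runs it on one non-BMP character that is not an emoji and one symbol that lives inside the BMP. `NonBmpFilter` and `StripNonBmp` are hypothetical names chosen for this example.

```csharp
using System;
using System.Text.RegularExpressions;

static class NonBmpFilter
{
    // Hypothetical helper wrapping the surrogate-removal regex from the answer.
    public static string StripNonBmp(string input)
    {
        return Regex.Replace(input, @"\p{Cs}", "");
    }

    static void Main()
    {
        // U+1D11E (musical symbol G clef) is non-BMP but not an emoji: removed.
        Console.WriteLine(StripNonBmp("a\U0001D11Eb").Length); // 2

        // U+2603 (snowman) is inside the BMP: left untouched.
        Console.WriteLine(StripNonBmp("a\u2603b").Length); // 3
    }
}
```

So the filter is broader than "emoji" in one direction and narrower in the other: it drops every non-BMP character, and it keeps pictographic symbols that happen to sit in the BMP.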
