

How to detect whether text is human readable?

I am wondering if there is a way to tell whether a given text is human readable. By human readable, I mean that it carries some meaning and is formatted like an article written by a person, or at least generated by a software translator and intended to be read by a human.

Here's the background: I recently built an app that lets users upload short texts to a database. In the early stage of deployment I noticed that some users kept uploading corrupted text because of an encoding problem. That problem has since been fixed, but it left me wondering whether there is a way to catch non-human-readable text before serving it back to users.

Any advice will be appreciated. Covering other languages might make the scope too large, so for now let's limit the discussion to English.

You can try a language identification tool, or something similar.

Basically, you count the characters, or groups of characters (character n-grams), in the submitted text and compare their distribution with the distribution found in a collection of texts written in good English. (Make sure that this collection is representative of the expected input.)
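A minimal sketch of that comparison, assuming character trigrams and cosine similarity as the distance measure (the answer does not prescribe either). `REFERENCE_TEXT` is a hypothetical stand-in for a real corpus of good English, and `english_score` is an illustrative name:

```python
from collections import Counter
import math


def char_ngrams(text, n=3):
    """Count overlapping character n-grams in the lowercased text."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine_similarity(a, b):
    """Cosine similarity between two n-gram count profiles (0.0 to 1.0)."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Assumption: a tiny placeholder corpus. In practice this should be a large
# collection of texts that is representative of the expected input.
REFERENCE_TEXT = (
    "the quick brown fox jumps over the lazy dog and then the users of the "
    "app upload a short text to the database so that it can be read by "
    "other people who are interested in what they have written"
)
REFERENCE_PROFILE = char_ngrams(REFERENCE_TEXT)


def english_score(text, n=3):
    """Similarity between the text's n-gram profile and the reference profile."""
    return cosine_similarity(char_ngrams(text, n), REFERENCE_PROFILE)
```

Text with a score well below what typical submissions get could be flagged for review. Where to put the cutoff depends on the corpus and would need to be tuned on real data.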

As a follow-up to the n-gram approach, you might want to try a dictionary-based approach and check for the presence of 'stop words' (e.g. 'the', 'a', 'an', 'of') in the input text.
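A rough sketch of the stop-word check, using only the standard library. The `STOP_WORDS` set is a short hand-picked list and the `0.15` threshold is an arbitrary assumption; both would need adjusting for real input:

```python
import re

# Assumption: a small illustrative list, not a complete stop-word dictionary.
STOP_WORDS = frozenset({
    "the", "a", "an", "of", "and", "to", "in", "is", "it", "that",
    "for", "on", "with", "as", "was", "at", "by", "be", "this", "are",
})


def stop_word_ratio(text):
    """Fraction of the words in the text that are common English stop words."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0
    hits = sum(1 for word in words if word in STOP_WORDS)
    return hits / len(words)


def looks_like_english(text, threshold=0.15):
    """Heuristic: True when enough of the words are stop words."""
    return stop_word_ratio(text) >= threshold
```

Very short inputs (a title, a single word) may legitimately contain no stop words, so this works better on sentences than on fragments.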

Most NLP libraries will do the job (spaCy is a very common one). You can also use language detection: langdetect (https://pypi.org/project/langdetect/) supports this, as do many others. If you need something less language-specific (more math than language), look into phonotactics (with BLICK for Python: https://github.com/mmcauliffe/python-BLICK), which examines how characters are ordered within a string.

Do a hexdump and make sure every character is less than or equal to 0x7f.
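The same check can be done in code instead of reading a hexdump by eye. This is a small sketch; `is_ascii_only` is an illustrative name, and it accepts either a `str` or raw `bytes`:

```python
def is_ascii_only(data):
    """Return True if every character (or byte) is <= 0x7F."""
    if isinstance(data, str):
        return all(ord(ch) <= 0x7F for ch in data)
    # Iterating over bytes or bytearray yields integers in the range 0..255.
    return all(byte <= 0x7F for byte in data)
```

Note that this also rejects legitimate English text containing characters such as curly quotes or accented letters ("café"), so it is a strict filter.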
