How to determine if a string is an English word?
I have an input string, parts of which do not contain actual words (for example, it contains mathematical formulas such as x^2 = y_2 + 4). I would like a way to split my input string according to whether a substring consists of actual English words. For example:
If my string was:
"Taking the derivative of: f(x) = \int_{0}^{1} z^3, we can see that we always get x^2 = y_2 + 4 which is the same as taking the double integral of g(x)"
then I would like it split into a list like:
["Taking the derivative of: ", "f(x) = \int_{0}^{1} z^3, ", "we can see that we always get ", "x^2 = y_2 + 4 ", "which is the same as taking the double integral of ", "g(x)"]
How can I accomplish this? I don't think regex will work for this, or at least I'm not aware of any method in regex that detects the longest substrings of English words (including commas, periods, semicolons, etc.).
You can simply use the pyenchant library, as mentioned in this post:
import enchant
d = enchant.Dict("en_US")
print(d.check("Hello"))
Output:
True
You can install it by typing pip install pyenchant in your command line. In your case, you have to loop through all the space-separated tokens in the string and check whether each one is an English word. Here is the full code to do it:
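import enchant

d = enchant.Dict("en_US")
# Raw string, so the backslash in "\int" is kept literally
string = r"Taking the derivative of: f(x) = \int_{0}^{1} z^3, we can see that we always get x^2 = y_2 + 4 which is the same as taking the double integral of g(x)"
# split() without arguments also drops empty tokens from repeated spaces
stringlst = string.split()
wordlst = []
for token in stringlst:
    # Keep only the tokens the dictionary recognises
    if d.check(token):
        wordlst.append(token)
print(wordlst)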
Output:
['Taking', 'the', 'derivative', 'we', 'can', 'see', 'that', 'we', 'always', 'get', '4', 'which', 'is', 'the', 'same', 'as', 'taking', 'the', 'double', 'integral', 'of']
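Note that this only filters out the words; it does not yet produce the list of alternating runs asked for in the question. A rough sketch of one way to get there is to group consecutive tokens by whether they pass the word check, using itertools.groupby. The names split_by_english, demo_is_word and DEMO_WORDS are made up for this illustration, and the small word set is only a stand-in so the snippet runs without pyenchant; in practice you would presumably pass enchant.Dict("en_US").check as the predicate. Stripping the surrounding punctuation and requiring purely alphabetic tokens are assumptions of this sketch, chosen so that "of:" counts as a word and "4" stays with the formula.

```python
from itertools import groupby
import string

# Hypothetical stand-in vocabulary so the sketch runs without pyenchant.
# With pyenchant installed, pass enchant.Dict("en_US").check instead.
DEMO_WORDS = {
    "taking", "the", "derivative", "of", "we", "can", "see", "that",
    "always", "get", "which", "is", "same", "as", "double", "integral",
}


def demo_is_word(token):
    """Case-insensitive lookup in the stand-in vocabulary."""
    return token.lower() in DEMO_WORDS


def split_by_english(text, is_word):
    """Group consecutive whitespace-separated tokens into word / non-word runs."""

    def looks_english(token):
        # Strip leading/trailing punctuation so "of:" is checked as "of".
        core = token.strip(string.punctuation)
        # Require letters only, so "4" or "z^3" never count as words.
        return bool(core) and core.isalpha() and is_word(core)

    # groupby starts a new group each time looks_english flips its result.
    return [" ".join(run) for _, run in groupby(text.split(), key=looks_english)]


text = r"Taking the derivative of: f(x) = \int_{0}^{1} z^3, we can see that we always get x^2 = y_2 + 4 which is the same as taking the double integral of g(x)"
for part in split_by_english(text, demo_is_word):
    print(repr(part))
```

Unlike the list in the question, the pieces come back without trailing spaces, because the tokens are re-joined with single spaces. A real dictionary may also accept single letters such as "a" or "I", so a lone variable name inside a formula could be misclassified; a minimum-length rule might help there.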
Hope that this helps!