As the title states, I've got a dataset of strings that are either English or Spanish. Prior to preprocessing, I want to remove any row that includes Spanish words.
Should I just take a Spanish corpus and loop through the entire dataset to see if any Spanish words exist in each sentence?
Any help would be much appreciated.
I think the library you'll want to use is langdetect. Here's some example code I just whipped up, along with its output.
from langdetect import detect

sentences = ["hello, how are you",
             "Hola cómo estás",
             "I've had a great day"]

for sentence in sentences:
    print(detect(sentence))  # outputs 'en', 'es', 'en'
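To go from printing labels to actually dropping rows, something like the following might work. It is only a rough sketch: the helper names keep_english and filter_with_langdetect are made up for illustration, and it assumes each row is a plain string. Two things about langdetect to keep in mind: detection is probabilistic, so very short strings can be misclassified and results can vary between runs unless DetectorFactory.seed is set, and detect raises LangDetectException on text with no usable features (for example an empty string). Rows that mix English and Spanish are labelled with a single language, so a mostly English row containing a few Spanish words would likely be kept.

```python
from typing import Callable, Iterable, List


def keep_english(rows: Iterable[str], detect_lang: Callable[[str], str]) -> List[str]:
    """Return only the rows that detect_lang labels as 'en' (hypothetical helper)."""
    kept = []
    for text in rows:
        try:
            lang = detect_lang(text)
        except Exception:
            # Assumption: rows the detector cannot classify (e.g. empty
            # strings, which make langdetect raise) are dropped.
            continue
        if lang == "en":
            kept.append(text)
    return kept


def filter_with_langdetect(rows: Iterable[str]) -> List[str]:
    """Apply keep_english using langdetect's detect (requires langdetect)."""
    from langdetect import DetectorFactory, detect

    # Fix the seed so repeated runs give the same labels.
    DetectorFactory.seed = 0
    return keep_english(rows, detect)
```

Calling filter_with_langdetect(sentences) on the list above should return only the rows detected as English. Keeping the detector as a parameter of keep_english makes it easy to swap in a different library later.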
Hope this helps. Happy to answer any follow-up questions.