Python - How to remove a sentence if it contains Spanish words

Question

As the title states, I've got a dataset of strings that are either English or Spanish. Prior to preprocessing, I want to remove any row that includes Spanish words.

Should I just take a Spanish corpus and loop through the entire dataset to see if any Spanish words exist in each sentence?

Any help would be much appreciated.

Answer 1

I think the library you'll want to use is langdetect. Its detect function returns a language code for a string, such as 'en' or 'es'. Here's some example code I just whipped up, along with its output.

from langdetect import detect

sentences = ["hello, how are you",
             "Hola cómo estás",
             "I've had a great day"]

for sentence in sentences:
    print(detect(sentence)) # outputs 'en', 'es', 'en'
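
To actually drop the Spanish rows rather than just print the codes, something like the rough sketch below might work. The helper names drop_spanish and load_langdetect are made up for illustration and are not part of langdetect. Two assumptions are baked in: rows that the detector cannot classify are kept, and a fixed seed is set because langdetect can give different answers between runs on short text. Note that detect reports the dominant language of a string, so a mostly English sentence containing one or two Spanish words would probably not be removed.

```python
from typing import Callable, Iterable, List, Optional


def load_langdetect() -> Callable[[str], str]:
    """Return langdetect's detect function with a fixed seed."""
    # Third-party dependency: pip install langdetect
    from langdetect import DetectorFactory, detect

    # Without a fixed seed, results on short strings can vary between runs.
    DetectorFactory.seed = 0
    return detect


def drop_spanish(
    sentences: Iterable[str],
    detect_fn: Optional[Callable[[str], str]] = None,
) -> List[str]:
    """Keep every sentence whose detected language is not Spanish ('es')."""
    if detect_fn is None:
        detect_fn = load_langdetect()

    kept = []
    for sentence in sentences:
        try:
            language = detect_fn(sentence)
        except Exception:
            # Detection can fail on text with no usable features
            # (empty strings, digits only). Assumption: keep such rows.
            language = None
        if language != "es":
            kept.append(sentence)
    return kept
```

Calling drop_spanish(sentences) on the list above should leave only the two English sentences, provided detection matches the output shown. The detect_fn parameter is only there so a different detector can be swapped in.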

Hope this helps. Happy to answer any follow-up questions.

solution1
2 ACCEPTED 2018-04-23 17:59:04
