Splitting strings in tuples within a pandas dataframe column

I have a pandas dataframe where a column contains tuples:

import pandas as pd

p = pd.DataFrame({"sentence": [("A.Hi",   "B.My",   "C.Friend"),
                               ("AA.How", "BB.Are", "CC.You")]})

I'd like to split each string in the tuple on the period (`.`), take the second part of each split string, and check how many of those parts match a list of strings:

p["tmp"] = p["sentence"].apply(lambda x: [i.split(".")[1] for i in x])
p["tmp"].apply(lambda x: len(set(x).intersection({"Hi", "My"})) > 0)

This works as intended, but my dataframe has more than 100k rows, and apply doesn't seem very efficient at that size. Is there a way to optimize/vectorize the above code?

Use a nested list comprehension with a set comprehension, and for the test convert the sets to `bool` - an empty set returns `False`:

s = set(["Hi", "My"])
p["tmp"] = [bool(set(i.split(".")[1] for i in x).intersection(s)) for x in p["sentence"]]

print(p)
                   sentence    tmp
0    (A.Hi, B.My, C.Friend)   True
1  (AA.How, BB.Are, CC.You)  False
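A possible micro-optimization (not from the original answer, but standard Python): since you only need to know whether *any* split part is in `s`, `set.isdisjoint` can short-circuit on the first match instead of materializing a set and its intersection:

```python
import pandas as pd

p = pd.DataFrame({"sentence": [("A.Hi",   "B.My",   "C.Friend"),
                               ("AA.How", "BB.Are", "CC.You")]})
s = {"Hi", "My"}

# isdisjoint accepts any iterable and stops at the first common element,
# so no intermediate set or intersection set is built per row
p["tmp"] = [not s.isdisjoint(i.split(".")[1] for i in x) for x in p["sentence"]]
print(p)
```

The output is the same as above (`True` for the first row, `False` for the second); whether it is measurably faster depends on your data, so time it on your 100k rows.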

EDIT:

If splitting can yield either one or two parts, you can select the last part by indexing with `[-1]`:

p = pd.DataFrame({"sentence": [("A.Hi",   "B.My",   "C.Friend"),
                               ("AA.How", "BB.Are", "You")]})

print(p)
                 sentence
0  (A.Hi, B.My, C.Friend)
1   (AA.How, BB.Are, You)

s = set(["Hi", "My"])

p["tmp"] = [bool(set(i.split(".")[-1] for i in x).intersection(s)) for x in p["sentence"]]
print(p)
                 sentence    tmp
0  (A.Hi, B.My, C.Friend)   True
1   (AA.How, BB.Are, You)  False
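If you prefer to stay within pandas' vectorized string methods, here is a sketch of an alternative (my own, not part of the original answer) that explodes the tuples, takes the text after the last `.` with the `.str` accessor, and aggregates back per row:

```python
import pandas as pd

p = pd.DataFrame({"sentence": [("A.Hi",   "B.My",   "C.Friend"),
                               ("AA.How", "BB.Are", "You")]})
s = {"Hi", "My"}

# explode keeps the original row index, so groupby(level=0) folds the
# per-token membership tests back into one boolean per original row
tokens = p["sentence"].explode().str.split(".").str[-1]
p["tmp"] = tokens.isin(s).groupby(level=0).any()
print(p)
```

This produces the same `tmp` column. Note that the plain-Python comprehension in the accepted answer is often just as fast here, because `explode` and the `.str` accessor still loop in Python under the hood.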
