Splitting strings in tuples within a pandas dataframe column

I have a pandas dataframe where a column contains tuples:

import pandas as pd

p = pd.DataFrame({"sentence": [("A.Hi",   "B.My",   "C.Friend"),
                               ("AA.How", "BB.Are", "CC.You")]})

I'd like to split each string in the tuple on the period (`.`), take the second part of each split string, and check how many of those parts match a list of strings:

p["tmp"] = p["sentence"].apply(lambda x: [i.split(".")[1] for i in x])
p["tmp"].apply(lambda x: len(set(x).intersection({"Hi", "My"})) > 0)

This works as intended, but my dataframe has more than 100k rows, and apply doesn't seem very efficient at that size. Is there a way to optimize/vectorize the above code?

Use a nested list comprehension with a set comprehension, and for the test convert the sets to `bool` - an empty set returns `False`:

s = set(["Hi", "My"])
p["tmp"] = [bool(set(i.split(".")[1] for i in x).intersection(s)) for x in p["sentence"]]

print(p)
                   sentence    tmp
0    (A.Hi, B.My, C.Friend)   True
1  (AA.How, BB.Are, CC.You)  False
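A possible micro-optimization (not from the original answer, but standard Python): since you only need to know whether *any* split part is in `s`, `set.isdisjoint` can short-circuit on the first match instead of materializing a set and its intersection:

```python
import pandas as pd

p = pd.DataFrame({"sentence": [("A.Hi",   "B.My",   "C.Friend"),
                               ("AA.How", "BB.Are", "CC.You")]})
s = {"Hi", "My"}

# isdisjoint accepts any iterable and stops at the first common element,
# so no intermediate set or intersection set is built per row
p["tmp"] = [not s.isdisjoint(i.split(".")[1] for i in x) for x in p["sentence"]]
print(p)
```

The output is the same as above (`True` for the first row, `False` for the second); whether it is measurably faster depends on your data, so time it on your 100k rows.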

EDIT:

If splitting can yield either one or two parts, you can select the last part by indexing with `[-1]`:

p = pd.DataFrame({"sentence": [("A.Hi",   "B.My",   "C.Friend"),
                               ("AA.How", "BB.Are", "You")]})

print(p)
                 sentence
0  (A.Hi, B.My, C.Friend)
1   (AA.How, BB.Are, You)

s = set(["Hi", "My"])

p["tmp"] = [bool(set(i.split(".")[-1] for i in x).intersection(s)) for x in p["sentence"]]
print(p)
                 sentence    tmp
0  (A.Hi, B.My, C.Friend)   True
1   (AA.How, BB.Are, You)  False
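If you prefer to stay within pandas' vectorized string methods, here is a sketch of an alternative (my own, not part of the original answer) that explodes the tuples, takes the text after the last `.` with the `.str` accessor, and aggregates back per row:

```python
import pandas as pd

p = pd.DataFrame({"sentence": [("A.Hi",   "B.My",   "C.Friend"),
                               ("AA.How", "BB.Are", "You")]})
s = {"Hi", "My"}

# explode keeps the original row index, so groupby(level=0) folds the
# per-token membership tests back into one boolean per original row
tokens = p["sentence"].explode().str.split(".").str[-1]
p["tmp"] = tokens.isin(s).groupby(level=0).any()
print(p)
```

This produces the same `tmp` column. Note that the plain-Python comprehension in the accepted answer is often just as fast here, because `explode` and the `.str` accessor still loop in Python under the hood.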
