I have a pandas dataframe where a column contains tuples:
p = pd.DataFrame({"sentence" : [("A.Hi", "B.My", "C.Friend"), \
("AA.How", "BB.Are", "CC.You")]})
I'd like to split each string in the tuple on a punctuation .
, take the second part of the split/string and see how many match list of strings:
p["tmp"] = p["sentence"].apply(lambda x: [i.split(".")[1] for i in x])
p["tmp"].apply(lambda x: [True if len(set(x).intersection(set(["Hi", "My"])))>0 else False])
This works as intended, but my dataframe has more than 100k rows - and apply
doesn't seem very efficient at these sizes. Is there a way to optize/vectorize the above code?
Use nested list and set comprehension and for test convert sets to bool
s - empty set
return False
:
s = set(["Hi", "My"])
p["tmp"] = [bool(set(i.split(".")[1] for i in x).intersection(s)) for x in p["sentence"]]
print (p)
sentence tmp
0 (A.Hi, B.My, C.Friend) True
1 (AA.How, BB.Are, CC.You) False
EDIT:
If there are only 1 or 2 length values after split, you can select last value by indexing [-1]
:
p = pd.DataFrame({"sentence" : [("A.Hi", "B.My", "C.Friend"), \
("AA.How", "BB.Are", "You")]})
print (p)
sentence
0 (A.Hi, B.My, C.Friend)
1 (AA.How, BB.Are, You)
s = set(["Hi", "My"])
p["tmp"] = [bool(set(i.split(".")[-1] for i in x).intersection(s)) for x in p["sentence"]]
print (p)
sentence tmp
0 (A.Hi, B.My, C.Friend) True
1 (AA.How, BB.Are, You) False
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.