
Retrieve cell string values in a column between two unknown indexes based on substring location

I need to locate the first position where the word 'their' appears in the Words table. I'm trying to write code that concatenates all strings in the 'text' column from that position up to and including the first text that contains the substring '666' or '999'. In this case that is the combination of their, stoma22, fe156, sligh334 and pain666, so the desired subtrings_output = 'theirstoma22fe156sligh334pain666'. I've tried:

their_loc = np.where(words['text'].str.contains(r'their', na =True))[0][0]
666_999_loc = np.where(words['text'].str.contains(r'666', na =True))[0][0]
subtrings_output = Words['text'].loc[Words.index[their_loc:666_999_loc]]

As you can see, I'm not sure how to extend the condition for 666_999_loc so that it matches either substring, 666 or 999. Slicing the index between the two variables also raises an error. Many thanks.

Words table:

page no   text       font
1         they       0
1         ate        0
1         apples     0
2         and        0
2         then       1
2         their      0
2         stoma22    0
2         fe156      1
2         sligh334   0
2         pain666    1
2         given      0
2         the        1
3         fruit      0

You just need to add one to the end of the slice, because a slice excludes its end position, and add an or condition with the | operator to the np.where call that finds the 666-or-999 location.

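import numpy as np

text_col = words['text']

# Position of the first row containing 'their' (na=False: missing values never match)
their_loc = np.where(text_col.str.contains(r'their', na=False))[0][0]

# Position of the first row containing '666' or '999'
contains_666_or_999_loc = np.where(text_col.str.contains('666', na=False) |
                                   text_col.str.contains('999', na=False))[0][0]

# Add 1 so the row that matched '666' or '999' is included in the slice
subtrings_output = ''.join(text_col.loc[words.index[their_loc:contains_666_or_999_loc + 1]])

print(subtrings_output)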

Output:

theirstoma22fe156sligh334pain666
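
For anyone who wants to run this end to end, here is a minimal self-contained sketch of the same idea. It rebuilds the Words table from the question and makes two substitutions that are my own assumptions rather than part of the answer above: a single regex alternation '666|999' instead of two masks joined with |, and positional slicing with iloc. The names start_pos, end_pos and substrings_output are illustrative only.

```python
import numpy as np
import pandas as pd

# Sample data copied from the Words table in the question.
words = pd.DataFrame({
    "page no": [1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3],
    "text": ["they", "ate", "apples", "and", "then", "their", "stoma22",
             "fe156", "sligh334", "pain666", "given", "the", "fruit"],
    "font": [0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0],
})

text_col = words["text"]

# Boolean masks; na=False makes missing values count as "no match".
start_mask = text_col.str.contains("their", na=False)
end_mask = text_col.str.contains("666|999", na=False)  # regex alternation

# np.flatnonzero returns the positions of the True entries; [0] takes the first.
start_pos = np.flatnonzero(start_mask.to_numpy())[0]
end_pos = np.flatnonzero(end_mask.to_numpy())[0]

# iloc slices by position and excludes the end, hence the + 1.
substrings_output = "".join(text_col.iloc[start_pos:end_pos + 1])
print(substrings_output)
```

Because both positions come from positional masks and are consumed by iloc, this version should not depend on what the DataFrame's index labels happen to be.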

IIUC, use pandas.Series.idxmax with "".join().

Series.idxmax(axis=0, skipna=True, *args, **kwargs)
Return the row label of the maximum value. If multiple values equal the maximum, the first row label with that value is returned.
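
As a small illustration of the quoted behaviour on a boolean mask (the Series below are made-up examples, not data from the question): True is the maximum value, so idxmax returns the label of the first True. One caveat worth knowing is that a mask with no True at all still returns the first label, so it may be worth checking .any() first.

```python
import pandas as pd

# A boolean mask with string labels, to make clear that idxmax returns a label.
mask = pd.Series([False, False, True, True], index=["a", "b", "c", "d"])
first_true = mask.idxmax()   # label of the first True

# If nothing matches, every value ties at False and the first label comes back.
no_match = pd.Series([False, False, False], index=["a", "b", "c"])
fallback = no_match.idxmax()

# Checking .any() distinguishes "matched the first row" from "matched nothing".
has_match = bool(no_match.any())
```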

So, assuming Words is your dataframe, try this:

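# idxmax on a boolean mask returns the index label of the first True
their_loc = Words["text"].str.contains("their", na=False).idxmax()

# "666|999" is a regex alternation, so either substring ends the range
_666_999_loc = Words["text"].str.contains("666|999", na=False).idxmax()

# Using the labels as positions here assumes the default 0..n-1 index
subtrings_output = "".join(Words["text"].loc[Words.index[their_loc:_666_999_loc+1]])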

Output:

print(subtrings_output)
#theirstoma22fe156sligh334pain666

#their stoma22 fe156 sligh334 pain666 # <- with " ".join()
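
If the table may have a non-default index, or the markers may be missing, a slightly more defensive variant could look like the sketch below. The helper name join_between and its parameters are hypothetical, not part of pandas or of either answer. It works purely by position, only looks for the end marker at or after the start marker, and returns None when either marker is absent.

```python
import pandas as pd

def join_between(text, start_pat, end_pat, sep=""):
    """Join values from the first start_pat match through the first
    end_pat match at or after it; return None if either is missing."""
    text = text.reset_index(drop=True)            # work by position only
    start_hits = text.str.contains(start_pat, na=False)
    if not start_hits.any():
        return None
    start = int(start_hits.to_numpy().argmax())   # first True position
    end_hits = text.iloc[start:].str.contains(end_pat, na=False)
    if not end_hits.any():
        return None
    end = start + int(end_hits.to_numpy().argmax())
    # dropna so missing cells inside the range do not break str.join
    return sep.join(text.iloc[start:end + 1].dropna())

# Same text column as the question, but with a non-default index (10..22).
Words = pd.DataFrame(
    {"text": ["they", "ate", "apples", "and", "then", "their", "stoma22",
              "fe156", "sligh334", "pain666", "given", "the", "fruit"]},
    index=range(10, 23),
)

joined = join_between(Words["text"], "their", "666|999")
spaced = join_between(Words["text"], "their", "666|999", sep=" ")
missing = join_between(Words["text"], "banana", "666|999")
```

Restricting the end search to rows from the start onward avoids picking up a '666' or '999' that appears before 'their', which the two-independent-searches approach would not guard against.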
