Retrieve cell string values in a column between two unknown indexes based on substring location
I need to find the first position where the word 'then' appears in the Words table. I'm trying to write code that concatenates all strings in the 'text' column from this position up to the first text containing the substring '666' or '999' (in this case a combination of their, stoma22, fe156, sligh334, pain666; the desired subtrings_output = 'theirfe156sligh334pain666'). I've tried:
their_loc = np.where(words['text'].str.contains(r'their', na =True))[0][0]
666_999_loc = np.where(words['text'].str.contains(r'666', na =True))[0][0]
subtrings_output = Words['text'].loc[Words.index[their_loc:666_999_loc]]
As you can see, I'm not sure how to extend the condition for 666_999_loc to include the substring 666 or 999. Also, slicing the index between the two variables raises an error. Many thanks.
Words table:
page no | text | font |
---|---|---|
1 | they | 0 |
1 | ate | 0 |
1 | apples | 0 |
2 | and | 0 |
2 | then | 1 |
2 | their | 0 |
2 | stoma22 | 0 |
2 | fe156 | 1 |
2 | sligh334 | 0 |
2 | pain666 | 1 |
2 | given | 0 |
2 | the | 1 |
3 | fruit | 0 |
You just need to add one to the end of the slice, and add an or condition to the np.where of the 666_or_999_loc using the | operator.
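import numpy as np

text_col = words['text']
# position of the first row containing 'their'
their_loc = np.where(text_col.str.contains(r'their', na=True))[0][0]
# position of the first row containing '666' or '999'
contains_666_or_999_loc = np.where(text_col.str.contains('666', na=True) |
                                   text_col.str.contains('999', na=True))[0][0]
# + 1 because the end of a positional slice is exclusive
subtrings_output = ''.join(text_col.loc[words.index[their_loc:contains_666_or_999_loc + 1]])
print(subtrings_output)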
Output:
theirstoma22fe156sligh334pain666
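For anyone who wants to run this end to end, here is a self-contained sketch of the same approach. It rebuilds the Words table from the question and assumes the 'text' column has no missing values. Folding the two str.contains calls into a single regex alternation ('666|999') and slicing with iloc are my own variations, not part of the answer above.

```python
import numpy as np
import pandas as pd

# Rebuild the Words table from the question
words = pd.DataFrame({
    "page no": [1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3],
    "text": ["they", "ate", "apples", "and", "then", "their", "stoma22",
             "fe156", "sligh334", "pain666", "given", "the", "fruit"],
    "font": [0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0],
})
text_col = words["text"]

# Positional index of the first row containing 'their'
start = np.where(text_col.str.contains("their", na=False))[0][0]
# Positional index of the first row containing '666' or '999' (regex alternation)
end = np.where(text_col.str.contains("666|999", na=False))[0][0]

# iloc slices by position and excludes the end, hence the + 1
subtrings_output = "".join(text_col.iloc[start:end + 1])
print(subtrings_output)
```

Because np.where returns positions rather than labels, iloc keeps working even if the DataFrame index is not the default 0..n-1 range.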
IIUC, use pandas.Series.idxmax with "".join().
Series.idxmax(axis=0, skipna=True, *args, **kwargs)
Return the row label of the maximum value. If multiple values equal the maximum, the first row label with that value is returned.
So, assuming (Words) is your dataframe, try this:
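their_loc = Words["text"].str.contains("their").idxmax()
# "666|999" is a regex alternation, so either substring ends the span
_666_999_loc = Words["text"].str.contains("666|999").idxmax()
# idxmax returns index labels; using them as positions assumes the default 0..n-1 index
subtrings_output = "".join(Words["text"].loc[Words.index[their_loc:_666_999_loc+1]])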
Output:
print(subtrings_output)
#theirstoma22fe156sligh334pain666
#their stoma22 fe156 sligh334 pain666 # <- with " ".join()
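One caveat with idxmax on a boolean mask: if nothing matches, every value is False and idxmax still returns the first row label, so the slice would silently start or end at the first row. Below is a minimal sketch of a guard for that case. The helper first_match_label is hypothetical (it is not a pandas function), and searching for the end marker only from the start row onward is my own addition.

```python
import pandas as pd

def first_match_label(series, pattern):
    """Hypothetical helper: label of the first row matching pattern, or None."""
    mask = series.str.contains(pattern, na=False)
    # idxmax alone would return the first label even when nothing matches
    return mask.idxmax() if mask.any() else None

# The 'text' column of the Words table from the question
Words = pd.DataFrame({
    "text": ["they", "ate", "apples", "and", "then", "their", "stoma22",
             "fe156", "sligh334", "pain666", "given", "the", "fruit"],
})

start = first_match_label(Words["text"], "their")
# Look for the end marker only at or after the start row
end = (first_match_label(Words["text"].loc[start:], "666|999")
       if start is not None else None)

if start is not None and end is not None:
    # Label-based .loc slicing includes both endpoints, so no + 1 is needed
    subtrings_output = "".join(Words["text"].loc[start:end])
else:
    subtrings_output = ""
print(subtrings_output)
```

Slicing with labels throughout also avoids mixing the labels that idxmax returns with positional indexing.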