How to get column and row from partial string in pandas dataframe efficiently

How to get the column, row, and value from a partial string efficiently with Pandas

I have a pandas DataFrame with about 150 index entries and 8 columns. What I want to do is efficiently get the column and index of each cell that contains a partial string. What I came up with is as follows:

import pandas as pd

df = pd.DataFrame([["foo", "foo", "foo", "foo"], ["foo", "bar", "foo", "foo"], ["bar", "foo", "foo", "bar"],
                   ["foo", "foo", "foo", "bar"]])

Output:

     0    1    2    3
0  foo  foo  foo  foo
1  foo  bar  foo  foo
2  bar  foo  foo  bar
3  foo  foo  foo  bar

Here, if I'm looking for just the entries that contain the substring "ar", I use:

setup_mask = df.applymap(lambda x: "ar" in str(x))
values_hold = []
for x in df.index:
    for y in df.columns:
        if setup_mask.loc[x, y].any() == bool(True):
            if [x, y] not in values_hold:
                values_hold.append([x, y])

This works and returns a list of [index, column] pairs: [[1, 1], [2, 0], [2, 3], [3, 3]].

This feels unpythonic and really just plain messy. Is there a more pythonic way to do something like this?

P.S. I know I could cut out the mask, but I felt that if there is a more pythonic way, it would rely on a mask.

Pandas supports vectorized string operations, but only on one column (a Series) at a time. So:

df.apply(lambda ser: ser.str.contains('ar'))

Will give you this:

       0      1      2      3
0  False  False  False  False
1  False   True  False  False
2   True  False  False   True
3  False  False  False   True

And it's pretty efficient so long as you have fewer columns than rows (which you do).
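The per-column call above assumes every cell is a string. A minimal sketch, assuming the frame may also hold non-string cells such as numbers (df_mixed and mask_mixed are hypothetical names, not from the question): cast each column to str first, as the question's lambda does with str(x), and pass regex=False so that "ar" is matched as a literal substring.

```python
import pandas as pd

# Hypothetical frame mixing strings with numbers.
df_mixed = pd.DataFrame([["foo", 12, "bar"], ["barn", 3.5, "foo"]])

# astype(str) mirrors the str(x) call in the question's lambda;
# regex=False treats "ar" as a literal substring rather than a pattern.
mask_mixed = df_mixed.apply(
    lambda ser: ser.astype(str).str.contains("ar", regex=False)
)
print(mask_mixed)
```

Without the cast, the .str accessor would raise on a purely numeric column, so the cast is a guard rather than an optimization.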

If you store the result of the df.apply call in mask, then:

np.transpose(np.where(mask))

Gives you your answer:

array([[1, 1],
       [2, 0],
       [2, 3],
       [3, 3]])
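Note that np.where returns integer positions, which coincide with the labels here only because the example frame uses the default 0..n-1 index and columns. A minimal sketch, assuming a frame with non-default labels (the labels "a".."d" and "w".."z" are made up for illustration), that maps the positions back to labels:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [["foo", "foo", "foo", "foo"], ["foo", "bar", "foo", "foo"],
     ["bar", "foo", "foo", "bar"], ["foo", "foo", "foo", "bar"]],
    index=["a", "b", "c", "d"],      # hypothetical row labels
    columns=["w", "x", "y", "z"],    # hypothetical column labels
)

mask = df.apply(lambda ser: ser.str.contains("ar"))

# np.where yields positional row/column indices of the True cells.
rows, cols = np.where(mask.to_numpy())

# Translate positions into index and column labels.
pairs = list(zip(df.index[rows], df.columns[cols]))
print(pairs)
```

For the question's frame with its default integer labels, the positions and labels are identical, so the extra mapping step is optional there.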

You can use transform with str.contains and stack:

In [5352]: s = df.transform(lambda x: x.str.contains('ar')).stack()

In [5353]: s.index[s].tolist()
Out[5353]: [(1L, 1L), (2L, 0L), (2L, 3L), (3L, 3L)]

Or, as a list of lists:

In [5366]: [list(map(int, x)) for x in s.index[s]]
Out[5366]: [[1, 1], [2, 0], [2, 3], [3, 3]]
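Since the question also asks for the matching value, the stack idea can be extended slightly. The following is a sketch rather than part of the original answer, assuming all cells are strings (stacked, hits, and triples are illustrative names): stack the frame first, then filter the resulting Series so that each hit keeps its (row, column) label and its value.

```python
import pandas as pd

df = pd.DataFrame([["foo", "foo", "foo", "foo"], ["foo", "bar", "foo", "foo"],
                   ["bar", "foo", "foo", "bar"], ["foo", "foo", "foo", "bar"]])

# stack() reshapes the frame into a Series indexed by (row, column).
stacked = df.stack()

# Keep only the cells whose text contains the substring.
hits = stacked[stacked.str.contains("ar")]

# Each entry is (row label, column label, matching value).
triples = [(row, col, val) for (row, col), val in hits.items()]
print(triples)
```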
