
Getting column and row label for cells with specific value

I have CSV files that contain cross-references: the rows are labeled, the columns are labeled, and a cell contains an "X" where both labels apply. Imagine colors and flavors of sweets: each file is a certain kind of candy, the red ones taste like strawberry, the green ones like apple, and so on. That would look like this:

Candy Q     red     green       blue
apple               X
strawberry  X
smurf                           X
dunno lol   X       X           X

I can load them into pandas dataframes, read them, and iterate over them, but I didn't manage to get the descriptors for cells containing an X. I've tried the three different iterators pandas offers, but never got where I needed to get. I've also tried using iterators and a counter for index-based value checking, but it got rather confusing and I discarded it.

Ideally, the output would be {apple: green}, {strawberry: red}, {smurf: blue}, {dunno lol: [red, green, blue]}.

How would I go about getting these references?

Edit: I might need to add: I do not know the column or row names in advance. They are not uniform; they follow a certain logic, but generally I can't define a strict schema.

Update #2: Code, as per the combined solutions of coldspeed and Scott Boston (plus a tiny fix):

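import glob
from collections import defaultdict

import numpy as np
import pandas as pd

# mappings_path and logger are assumed to be defined elsewhere in the script
files = glob.glob(mappings_path + '\\*.csv')
# iterate over the list getting each file
for file in files:
    # open each file
    with open(file, 'r') as f:
        # read content into pandas dataframe
        df = pd.read_csv(f, delimiter=";", encoding='utf-8')
        # use the first column as the index (the column itself stays in the frame)
        df = df.set_index(df.iloc[:, 0])

        # map each row label to the labels of its non-empty columns
        d = defaultdict(list)
        for x, y in zip(*np.where(df.notnull())):
            d[df.index[x]].append(df.columns[y])

        res = dict(d)
        # drop the descriptor column name, which comes first in every list
        for k, v in res.items():
            del v[0]
        logger.info(res)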

This fixes the issue of the descriptor (Candy Q in the example) turning up first in every result list: {'apple': ['Candy Q', 'green'], 'strawberry': ['Candy Q', 'red'], ...} and so on. Here's a link to the CSV files in case you need them or want to know what this is about, or use the fourth download on this page if you don't trust links people post somewhere on the internet.
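One possible way to avoid the descriptor showing up in the first place (and with it the `del v[0]` step) is to let `read_csv` put the first column into the index via `index_col=0`, so it is no longer one of the data columns. The sketch below is only an illustration: `csv_text` is a made-up stand-in for one of the semicolon-delimited files, not the real data.

```python
import io
from collections import defaultdict

import numpy as np
import pandas as pd

# Hypothetical stand-in for one semicolon-delimited mapping file.
csv_text = """Candy Q;red;green;blue
apple;;X;
strawberry;X;;
smurf;;;X
dunno lol;X;X;X
"""

# index_col=0 moves the first column into the index, so it is not part of
# the data columns and cannot appear in the result lists.
df = pd.read_csv(io.StringIO(csv_text), delimiter=";", index_col=0)

# Same np.where loop as above, now without any post-processing.
d = defaultdict(list)
for x, y in zip(*np.where(df.notnull())):
    d[df.index[x]].append(df.columns[y])

res = dict(d)
print(res)
```

Because the position of the index column is given rather than its name, this should also work when the column names are not known in advance.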

Thanks everyone for the help!

df

      Candy Q  red green blue
0       apple  NaN     X  NaN
1  strawberry    X   NaN  NaN
2       smurf  NaN   NaN    X
3   dunno lol    X     X    X

df = df.set_index('Candy Q')

Slightly hacky, but really fast.

j = df.notnull()\
      .dot(df.columns + '_')\
      .str.strip('_')\
      .str.split('_')\
      .to_dict()

print(j)
{
    "dunno lol": [
        "red",
        "green",
        "blue"
    ],
    "smurf": [
        "blue"
    ],
    "strawberry": [
        "red"
    ],
    "apple": [
        "green"
    ]
} 

This performs a "dot" product between a mask (which marks whether each cell has an X) and the column names.
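To make the dot-product step less opaque, here is a small sketch that rebuilds the example frame by hand and prints the intermediate string before it is stripped and split. The variable names `mask`, `labels`, and `joined` are chosen only for this illustration.

```python
import numpy as np
import pandas as pd

# Example frame rebuilt by hand, already indexed by 'Candy Q'.
df = pd.DataFrame(
    {
        "red": [np.nan, "X", np.nan, "X"],
        "green": ["X", np.nan, np.nan, "X"],
        "blue": [np.nan, np.nan, "X", "X"],
    },
    index=pd.Index(["apple", "strawberry", "smurf", "dunno lol"], name="Candy Q"),
)

mask = df.notnull()        # True where a cell holds an X
labels = df.columns + "_"  # 'red_', 'green_', 'blue_'

# Per row: True * 'red_' keeps the label, False * 'red_' gives '',
# and the summation of the dot product concatenates what is left.
joined = mask.dot(labels)
print(joined["dunno lol"])  # 'red_green_blue_' for the last row

# Strip the trailing separator and split back into a list per row.
j = joined.str.strip("_").str.split("_").to_dict()
print(j)
```

The multiplication and addition here are ordinary Python string operations applied element by element, which is why the separator matters.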

The caveat here is that the separator used to join the column names (_, an underscore, in this case) must not occur inside any column name. If it does, choose any separator that does not appear in the column names, and this should work.
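If no safe separator is available, one alternative is to skip the string round trip and select the column labels directly with each row of the mask. This is a sketch of a different technique, not the dot-product approach, and it is likely slower on large frames; the frame with underscores in its column names is invented for the illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical frame whose column names already contain underscores.
df = pd.DataFrame(
    {
        "dark_red": ["X", np.nan],
        "light_green": ["X", "X"],
    },
    index=["dunno lol", "apple"],
)

mask = df.notnull()

# For every row, keep the column labels at the positions where the mask is True.
result = {
    label: list(mask.columns[row.to_numpy()])
    for label, row in mask.iterrows()
}
print(result)
```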

Where df is:

            red green blue
Candy Q                   
apple       NaN     X  NaN
strawberry    X   NaN  NaN
smurf       NaN   NaN    X
dunno lol     X     X    X

You can use np.where to return indexes:

from collections import defaultdict

import numpy as np

# np.where returns the row and column positions of all non-null cells
d = defaultdict(list)
for x, y in zip(*np.where(df.notnull())):
    d[df.index[x]].append(df.columns[y])

dict(d)

Output:

{'apple': ['green'],
 'dunno lol': ['red', 'green', 'blue'],
 'smurf': ['blue'],
 'strawberry': ['red']}
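The question asks for a bare value when a row has a single match and a list otherwise. If that exact shape is needed, the lists can be collapsed afterwards; the sketch below starts from the output above, and `collapsed` is a name made up for the illustration.

```python
# Result of the np.where loop, copied from the output above.
d = {
    "apple": ["green"],
    "dunno lol": ["red", "green", "blue"],
    "smurf": ["blue"],
    "strawberry": ["red"],
}

# Unwrap single-element lists, leave longer lists untouched.
collapsed = {k: v[0] if len(v) == 1 else v for k, v in d.items()}
print(collapsed)
```

Keeping every value as a list is usually easier to work with downstream, since consumers then do not have to check the type of each value.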

Thanks @cᴏʟᴅsᴘᴇᴇᴅ, I appreciate the edit and simplification.
