Getting column and row label for cells with specific value

I have CSV files that contain cross-references: the rows are labeled, the columns are labeled, and a cell contains an "X" where both labels apply. Imagine colors and flavors of sweets, with one file per kind of candy: red ones taste like strawberry, green ones like apple, and so on. That would look like this:

Candy Q     red     green       blue
apple               X
strawberry  X
smurf                           X
dunno lol   X       X           X

I can load them into pandas dataframes, read them and iterate over them, but I didn't manage to get the row and column labels for the cells containing an X. I've tried the three different iterators pandas offers, but never got where I needed to be. I've also tried iterating with a manually incremented counter for index-based value checking, but that got rather confusing and I discarded it.

Ideally, the output would be {apple: green}, {strawberry: red}, {smurf: blue}, {dunno lol: [red, green, blue]}.

How would I go about getting these references?

Edit: I should add that I do not know the column or row names in advance. They are not uniform; they follow a certain logic, but in general I can't define a strict schema.

Update #2: Code, as per the combined solutions of coldspeed and Scott Boston (plus a tiny fix):

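import glob
import logging
import os
from collections import defaultdict

import numpy as np
import pandas as pd

logger = logging.getLogger(__name__)

# mappings_path is the directory holding the CSV files (defined elsewhere)
files = glob.glob(os.path.join(mappings_path, '*.csv'))
# iterate over the list getting each file
for file in files:
    # open each file
    with open(file, 'r', encoding='utf-8') as f:
        # read content into pandas dataframe
        df = pd.read_csv(f, delimiter=";")
        # use the first column as the index (the column itself stays in the frame)
        df = df.set_index(df.iloc[:, 0])

        # collect the column label of every non-null cell, keyed by row label
        d = defaultdict(list)
        for x, y in zip(*np.where(df.notnull())):
            d[df.index[x]].append(df.columns[y])

        res = dict(d)
        # drop the first entry of each list, which is the descriptor column
        for v in res.values():
            del v[0]
        logger.info(res)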

The del v[0] loop remediates the issue of the descriptor (Candy Q in the example) turning up first in every result list: {'apple': ['Candy Q', 'green'], 'strawberry': ['Candy Q', 'red']} and so on. Here's a link to the CSV files in case you need them or want to know what this is about, or, the fourth download on this page if you don't trust links people post somewhere on the internet.
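As a side note, the descriptor shows up in every list because set_index(df.iloc[:, 0]) leaves the first column in the frame, so its (always non-null) cells are matched too. A minimal sketch of one way to avoid the del v[0] step, assuming the first CSV column always holds the row labels, is to pass index_col=0 to read_csv. The inline sample data below is made up to mirror the example table; it is not taken from the linked files.

```python
import io
from collections import defaultdict

import numpy as np
import pandas as pd

# Hypothetical stand-in for one of the CSV files (semicolon-delimited).
sample = """Candy Q;red;green;blue
apple;;X;
strawberry;X;;
smurf;;;X
dunno lol;X;X;X
"""

# index_col=0 moves the first column into the index while reading,
# so it is no longer part of the data that gets scanned for non-null cells.
df = pd.read_csv(io.StringIO(sample), delimiter=";", index_col=0)

d = defaultdict(list)
for x, y in zip(*np.where(df.notnull())):
    d[df.index[x]].append(df.columns[y])

res = dict(d)
print(res)
```

With the index set at read time, no list starts with the descriptor, so nothing has to be deleted afterwards.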

Thanks everyone for the help!

df

      Candy Q  red green blue
0       apple  NaN     X  NaN
1  strawberry    X   NaN  NaN
2       smurf  NaN   NaN    X
3   dunno lol    X     X    X

df = df.set_index('Candy Q')

Slightly hacky, but really fast.

j = df.notnull()\
      .dot(df.columns + '_')\
      .str.strip('_')\
      .str.split('_')\
      .to_dict()

print(j)
{
    "dunno lol": [
        "red",
        "green",
        "blue"
    ],
    "smurf": [
        "blue"
    ],
    "strawberry": [
        "red"
    ],
    "apple": [
        "green"
    ]
} 

This involves performing a "dot" product between a boolean mask (which specifies whether each cell has an X or not) and the column names. Multiplying a name by True keeps it, multiplying by False gives an empty string, and the "sum" concatenates the results per row.

The caveat here is that the separator appended to the column names (_, an underscore, in this case) must not occur as part of any column name. If it does, choose any separator that does not appear in the column names, and this should still work.
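If no safe separator can be guaranteed, a separator-free variant is possible. The sketch below is not the dot-product method itself but a plain boolean-indexing alternative: each row of the mask is used to select the matching column labels directly. The sample frame is rebuilt inline from the example and the variable names are only illustrative.

```python
import io

import pandas as pd

# Illustrative sample mirroring the example table; one column name
# deliberately contains an underscore to show that no separator is involved.
csv_text = """Candy Q;red;dark_green;blue
apple;;X;
strawberry;X;;
smurf;;;X
dunno lol;X;X;X
"""

df = pd.read_csv(io.StringIO(csv_text), sep=";", index_col=0)

# Boolean mask: True where a cell holds a value (an "X"), False for NaN.
mask = df.notnull()

# For each row, index the column labels with that row's boolean array.
result = {
    label: df.columns[row].tolist()
    for label, row in zip(mask.index, mask.to_numpy())
}
print(result)
```

This loops over rows in Python, so it will likely be slower than the dot-product version on large frames, but it does not depend on the contents of the column names.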

Where df:

            red green blue
Candy Q                   
apple       NaN     X  NaN
strawberry    X   NaN  NaN
smurf       NaN   NaN    X
dunno lol     X     X    X

You can use np.where to return the row and column positions of the non-null cells:

from collections import defaultdict

import numpy as np

d = defaultdict(list)
for x, y in zip(*np.where(df.notnull())):
    d[df.index[x]].append(df.columns[y])

dict(d)

Output:

{'apple': ['green'],
 'dunno lol': ['red', 'green', 'blue'],
 'smurf': ['blue'],
 'strawberry': ['red']}
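To make it clearer what the loop iterates over, here is a small sketch of what np.where returns for the example frame: two parallel arrays of row and column positions, in row-major order. The frame is constructed by hand here, and the names rows, cols and pairs are only for illustration.

```python
import numpy as np
import pandas as pd

# Hand-built copy of the example frame, with "Candy Q" as the index.
df = pd.DataFrame(
    {
        "red": [np.nan, "X", np.nan, "X"],
        "green": ["X", np.nan, np.nan, "X"],
        "blue": [np.nan, np.nan, "X", "X"],
    },
    index=pd.Index(["apple", "strawberry", "smurf", "dunno lol"], name="Candy Q"),
)

# With a single argument, np.where returns one position array per axis.
rows, cols = np.where(df.notnull().to_numpy())
print(rows.tolist())  # [0, 1, 2, 3, 3, 3]
print(cols.tolist())  # [1, 0, 2, 0, 1, 2]

# Zipping the two arrays yields one (row, column) pair per "X",
# which can then be translated back to labels.
pairs = [(df.index[r], df.columns[c]) for r, c in zip(rows, cols)]
print(pairs)
```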

Thanks @cᴏʟᴅsᴘᴇᴇᴅ, I appreciate the edit and simplification.
