
Pandas: count unique value in each column, by looping through them?

I have a very large dataframe and I want to generate the unique values from each column. This is just a sample; in total there are over 20 columns.

          CRASH_DT        CRASH_MO_NO     CRASH_DAY_NO
          1/1/2013        01              01    
          1/1/2013        01              01
          1/5/2013        03              05

My desired output is like so:

<variable = "CRASH_DT">
   <code>1/1/2013</code>
   <count>2</count>
   <code>1/5/2013</code>
   <count>1</count>
</variable>
<variable = "CRASH_MO_NO">
   <code>01</code>
   <count>2</count>
   <code>03</code>
   <count>1</count>
</variable>
<variable = "CRASH_DAY_NO">
   <code>01</code>
   <count>2</count>
   <code>05</code>
   <count>1</count>
</variable>

I have been trying the .sum() and .unique() functions, as suggested by many other questions on this topic that I have already looked at.

None of them seem to apply to this problem: they all say that to generate unique values from every column, you should either use a groupby, or select individual columns. I have a very large number of columns (over 20), so it doesn't really make sense to group them together just by writing out df.unique['col1','col2'...'col20']

I have tried .unique(), .value_counts(), and .count(), but I can't figure out how to apply any of them across multiple columns, as opposed to the groupby-based approaches suggested in the questions linked above.

My question is: how can I generate a count of the unique values in each column of a truly massive dataframe, preferably by looping through the columns themselves? (I apologize if this is a duplicate; I have looked through a lot of questions on this topic, and while they seem like they should work for my problem too, I can't figure out exactly how to adapt them.)

This is my code so far:

import pyodbc
import pandas as pd

conn = pyodbc.connect('DRIVER={Microsoft Access Driver (*.mdb, *.accdb)};DBQ=C:\\Users\\<filename>.accdb')

sql_crash = "SELECT * FROM CRASH"
df_crash = pd.read_sql(sql_crash, conn)
df_c_head = df_crash.head()
df_c_desc = df_c_head.describe()

for k in df_c_desc:
    df_c_unique = df_c_desc[k].unique()
    print(df_c_unique.value_counts())  # AttributeError: 'numpy.ndarray' object has no attribute 'value_counts'
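For reference, that last error occurs because Series.unique() returns a plain NumPy ndarray, which has no value_counts method. The same per-column counts are also available directly from NumPy via np.unique(..., return_counts=True); a minimal sketch using the sample data from the question:

```python
import numpy as np
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "CRASH_DT": ["1/1/2013", "1/1/2013", "1/5/2013"],
    "CRASH_MO_NO": ["01", "01", "03"],
    "CRASH_DAY_NO": ["01", "01", "05"],
})

for col in df:
    # return_counts=True gives the count alongside each unique value
    values, counts = np.unique(df[col].to_numpy(), return_counts=True)
    print(col, dict(zip(values, counts)))
```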

I would loop over value_counts().items() per column:

>>> df["CRASH_DAY_NO"].value_counts()
01    2
05    1
dtype: int64
>>> df["CRASH_DAY_NO"].value_counts().items()
<zip object at 0x7fabf49f05c8>
>>> for value, count in df["CRASH_DAY_NO"].value_counts().items():
...     print(value, count)
...     
01 2
05 1

So something like

def vc_xml(df):
    for col in df:
        yield '<variable = "{}">'.format(col)
        for k,v in df[col].value_counts().items():
            yield "   <code>{}</code>".format(k)
            yield "   <count>{}</count>".format(v)
        yield '</variable>'

with open("out.xml", "w") as fp:
    for line in vc_xml(df):
        fp.write(line + "\n")

gives me

<variable = "CRASH_DAY_NO">
   <code>01</code>
   <count>2</count>
   <code>05</code>
   <count>1</count>
</variable>
<variable = "CRASH_DT">
   <code>1/1/2013</code>
   <count>2</count>
   <code>1/5/2013</code>
   <count>1</count>
</variable>
<variable = "CRASH_MO_NO">
   <code>01</code>
   <count>2</count>
   <code>03</code>
   <count>1</count>
</variable>

Here is an answer inspired by the answer to this question. But I don't know whether it will be scalable enough in your case.

import pandas as pd

df = pd.DataFrame({
    'CRASH_DAY_NO': [1, 1, 5, 2, 2],
    'CRASH_DT': ['10/2/2014 5:00:08 PM',
                 '5/28/2014 1:29:28 PM',
                 '5/28/2014 1:29:28 PM',
                 '7/14/2014 5:42:03 PM',
                 '6/3/2014 10:33:22 AM'],
    'CRASH_ID': [1486150, 1486152, 1486224, 1486225, 1486226],
    'SEG_PT_LRS_MEAS': [79.940226960000004,
                        297.80989999000002,
                        140.56460290999999,
                        759.43600000000004,
                        102.566036],
    'SER_NO': [1, 3, 4, 5, 6],
})

df = df.apply(lambda x: x.value_counts(sort=False))
df.index = df.index.astype(str)
# Transforming to XML by hand ...
def func(row):
    xml = ['<variable = "{0}">'.format(row.name)]
    for field in row.index:
        if not pd.isnull(row[field]):
            xml.append('  <code>{0}</code>'.format(field))
            xml.append('  <count>{0}</count>'.format(row[field]))
    xml.append('</variable>')
    return '\n'.join(xml)

print('\n'.join(df.apply(func, axis=0)))

<variable = "CRASH_DAY_NO">
  <code>1</code>
  <count>2.0</count>
  <code>2</code>
  <count>2.0</count>
  <code>5</code>
  <count>1.0</count>
</variable>
<variable = "CRASH_DT">
  <code>5/28/2014 1:29:28 PM</code>
  <count>2.0</count>
  <code>7/14/2014 5:42:03 PM</code>
  <count>1.0</count>
  <code>10/2/2014 5:00:08 PM</code>
  <count>1.0</count>
  <code>6/3/2014 10:33:22 AM</code>
  <count>1.0</count>
</variable>
....
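The 2.0 counts appear because the apply/value_counts step inserts NaN for values that are absent from a column, which forces the counts to float64. If integer counts matter, pandas' nullable Int64 dtype (available in pandas 0.24+) keeps them whole; a sketch:

```python
import pandas as pd

# Sample data from the question (two columns for brevity)
df = pd.DataFrame({
    "CRASH_MO_NO": ["01", "01", "03"],
    "CRASH_DAY_NO": ["01", "01", "05"],
})

counts = df.apply(lambda x: x.value_counts(sort=False))
# NaN (a value absent from one column) forces float64; Int64 restores integers
counts = counts.astype("Int64")
print(counts)
```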
