
List to dataframe, list to multiple lists, single column to dataframe

Still figuring out programming, help is appreciated. I have a single column of information that I would ultimately like to turn into a dataframe. I could transpose it, but the address information varies: it is either 2 lines or 3 lines (some have suite numbers, etc.).

It generally looks like this.

name x,  
ID 1,  
123-xyz,  
ID 2,  
abcdefg,  
ACTIVITY,  
ggg,  
TYPE,  
C,  
COUNTY,  
orange county,  
ADDRESS,  
123 stack st,  
city state zip,  
PHONE,  
111-111-1111,  
EXPIRES,  
date,  
name y,  
ID 1,  
456-abc,  
ID 2,  
cvbnmnb,  
ACTIVITY,  
ggg,  
TYPE,  
A,  
COUNTY,  
dakota county,  
ADDRESS,  
234 overflow st, 
lot a,   
city state zip,  
PHONE,  
000-000-0000,  
EXPIRES,  
date,  
name z,  
...,  

I was thinking of creating new lists for all desired columns and conditionally appending values with a for loop.

for i in list  

if value = ID  
 append previous value to name list  
 append next value to ID list  

elif value = phone  
 send next value to phone   

elif value = address  
 evaluate 3 rows down  
  if value = phone  
   concatenate previous two values and append to address list  
  if value != phone  
   concatenate current and previous 2 values and append to address list  

else print error message  

Would this be a decently efficient option for lists of around 20,000 values?
I don't really know how to write this. I am using Python in a Jupyter notebook. Looking for solutions, but also looking to learn more!

-EDIT-

A user suggested a while loop. The original data sample I gave was simplified and contained 4 fields; my actual set contains 9. I tried adapting the loop but unfortunately wasn't able to figure it out on my own.

count = 0 #Pointer to start of a cluster
lengthdf = len(df) #Getting the length of the existing dataframe to use it as the terminating condition
while count != lengthdf: 
    name = id1 = id2 = activity = type = county = address = phone = expires = "" #Reset the fields for every cluster of information
    name = df[0][count] #Name is always the first line of cluster
    id1 = df[0][count+2] #id is always third line of cluster
    id2 = df[0][count+4]
    activity = df[0][count+6]
    type = df[0][count+8]
    county = df[0][count+10]
    n=11
    while df[0][count+n] != "Phone": #While row is not 'PHONE', everything else in between is the address, appended and separated by comma.
        address=address+df[0][count+n]+", "
        n+=1
    phone = df[0][count+n+1] #Phone number is always the row after 'PHONE', and is only of 1 line.
    expires = df[0][count+n+3]
    n+=2
    newdf = newdf.append({'NAME': name, 'ID 1': id1, 'ID 2': id2, 'ACTIVITY': activity, 'TYPE': type, 'COUNTY': county, 'ADDRESS': address, 'Phone': phone, 'Expires': expires}, ignore_index=True) #Append the data into the new dataframe
    count=count+n

You seem to have a basic understanding of what you need to do, judging by the pseudocode you provided!

I'm assuming that your xlsx file looks like your sample, with one value per row in a single column and without the commas.

Based on your sample data, this is what I can come up with for you. I'll refer to each user's data as a 'cluster'.

This code works under a few assumptions:

  1. The PHONE field always has exactly 1 line of data.
  2. The data is complete for every cluster (or, if a value is missing, a blank row exists in its place).
  3. The data is always in this particular order (i.e. name, ID, address, phone).

count acts as a pointer to the start of a cluster, while n is the offset from count. Read the comments for the explanations.

import pandas as pd

df = pd.read_excel(r'test.xlsx', header=None)  # Import the xlsx file; the data sits in column 0
rows = []  # One dict per cluster (DataFrame.append was removed in pandas 2.0)

count = 0  # Pointer to the start of a cluster
lengthdf = len(df)  # Length of the existing dataframe, used as the terminating condition
while count < lengthdf:
    this_name = df[0][count]  # Name is always the first line of a cluster
    this_id = df[0][count+2]  # The id is always the third line of a cluster
    address_lines = []  # Reset the address for every cluster of information
    n = 4
    while df[0][count+n] != "PHONE":  # Every row before 'PHONE' belongs to the address
        address_lines.append(df[0][count+n])
        n += 1
    this_phone = df[0][count+n+1]  # Phone number is always the row after 'PHONE', and is only 1 line
    n += 2
    rows.append({'name': this_name, 'id': this_id, 'address': ", ".join(address_lines), 'phone': this_phone})
    count = count + n

newdf = pd.DataFrame(rows, columns=['name', 'id', 'address', 'phone'])  # Build the new dataframe in one step
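
The same pointer-and-offset idea can be adapted to the nine-field layout from the question's edit. What follows is a minimal sketch, not a definitive implementation: the helper name parse_clusters is hypothetical, it takes a plain Python list (for example df[0].tolist()), and it assumes every cluster follows the order shown in the sample. Compared with the attempt in the edit, it starts the address at offset 12 so the ADDRESS label itself is skipped, it matches the uppercase PHONE label, and it advances past the EXPIRES value before starting the next cluster.

```python
# Hypothetical helper; the name and the output keys are illustrative.
def parse_clusters(values):
    """Turn a flat list of label/value rows into one dict per cluster."""
    rows = []
    count = 0  # pointer to the start of the current cluster
    while count < len(values):
        row = {
            'NAME': values[count],        # name is the first line
            'ID 1': values[count + 2],    # each value follows its label
            'ID 2': values[count + 4],
            'ACTIVITY': values[count + 6],
            'TYPE': values[count + 8],
            'COUNTY': values[count + 10],
        }
        n = 12  # first address line (the 'ADDRESS' label sits at offset 11)
        address_lines = []
        while values[count + n] != 'PHONE':  # address is 2 or 3 lines long
            address_lines.append(values[count + n])
            n += 1
        row['ADDRESS'] = ', '.join(address_lines)
        row['PHONE'] = values[count + n + 1]    # row after the 'PHONE' label
        row['EXPIRES'] = values[count + n + 3]  # row after the 'EXPIRES' label
        rows.append(row)
        count += n + 4  # skip PHONE, its value, EXPIRES and its value
    return rows

# The two clusters from the question, without the trailing commas.
sample = [
    'name x', 'ID 1', '123-xyz', 'ID 2', 'abcdefg', 'ACTIVITY', 'ggg',
    'TYPE', 'C', 'COUNTY', 'orange county', 'ADDRESS', '123 stack st',
    'city state zip', 'PHONE', '111-111-1111', 'EXPIRES', 'date',
    'name y', 'ID 1', '456-abc', 'ID 2', 'cvbnmnb', 'ACTIVITY', 'ggg',
    'TYPE', 'A', 'COUNTY', 'dakota county', 'ADDRESS', '234 overflow st',
    'lot a', 'city state zip', 'PHONE', '000-000-0000', 'EXPIRES', 'date',
]
rows = parse_clusters(sample)
```

With pandas installed, pd.DataFrame(rows) turns the list of dicts into a dataframe with one column per key.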

As for performance, I honestly do not think there is much optimisation that can be done given the nature of the dataset (I might be wrong). As you may have noticed, my solution is pretty "hard-coded" to reduce the need for if-else statements, but 20,000 lines should not be much of a problem for Jupyter Notebook. It may take a couple of minutes, but that should be alright.
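
On the "hard-coded" point, a label-driven single pass, closer to the for loop in the question's pseudocode, is one possible alternative. The sketch below is an illustration only and is not the method used above. The names parse_by_labels and LABELS are hypothetical, and it assumes that the label strings never occur as data values, that each cluster begins with a name row, and that EXPIRES is the last field of a cluster with exactly one value. It visits each row once, so the work grows linearly with the number of rows.

```python
# Hypothetical label set, taken from the sample data in the question.
LABELS = {'ID 1', 'ID 2', 'ACTIVITY', 'TYPE', 'COUNTY',
          'ADDRESS', 'PHONE', 'EXPIRES'}

def parse_by_labels(values):
    """Single pass: a label row switches the field that is being filled."""
    rows = []
    current = None  # cluster being built, or None between clusters
    field = None    # label whose values are currently being collected
    for value in values:
        if current is None:
            # The first row of a cluster is assumed to be the name.
            current = {'NAME': [value]}
            field = 'NAME'
        elif value in LABELS:
            field = value
            current[field] = []
        else:
            current[field].append(value)
            if field == 'EXPIRES':
                # Assumed last field: join multi-line values and close the cluster.
                rows.append({k: ', '.join(v) for k, v in current.items()})
                current = None
    return rows

# The two clusters from the question, without the trailing commas.
sample = [
    'name x', 'ID 1', '123-xyz', 'ID 2', 'abcdefg', 'ACTIVITY', 'ggg',
    'TYPE', 'C', 'COUNTY', 'orange county', 'ADDRESS', '123 stack st',
    'city state zip', 'PHONE', '111-111-1111', 'EXPIRES', 'date',
    'name y', 'ID 1', '456-abc', 'ID 2', 'cvbnmnb', 'ACTIVITY', 'ggg',
    'TYPE', 'A', 'COUNTY', 'dakota county', 'ADDRESS', '234 overflow st',
    'lot a', 'city state zip', 'PHONE', '000-000-0000', 'EXPIRES', 'date',
]
parsed = parse_by_labels(sample)
```

Because nothing depends on fixed offsets, an address of 2 or 3 lines is handled by the same branch, at the cost of relying on the labels being unambiguous.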

I hope this gets you started on tackling other scenarios you may encounter with the remaining datasets!


 