
How to create a sub data-frame from a given pandas data-frame?

I have written code that reads the given dataset and converts the whole .txt file into a pandas data frame (after some pre-processing):

  • The latitudes represent the rows and are present in a list.
  • The longitudes represent the columns and are present in a separate list.

Now, I want to create a smaller data frame from the original one (so that the data is easier to understand and interpret) and perform calculations on it. For that, I created a smaller column list of size 18 by taking every 10th element of the longitudes. This worked fine. Let's call this new list new_column.

What I want to do now is iterate over every row and, for every value at row k and new_column j, add it to a new matrix or data frame.
E.g., if row 10 and new_column 12 has the value 'x', I want to add this 'x' at the same position in a new data frame (or matrix).

I have written the following code, but I don't know how to write the part that does the above.

import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from scipy import interpolate
# open the file for reading and read it linewise
with open("Aug-2016-potential-temperature-180x188.txt", "r") as dataset:
    buffer = dataset.readlines()

# pre-process the data to get the columns
column = buffer[8]
column = column[3 : -1]

# get the longitudes as features
features = column.split("\t")

# convert the features to float data-type
longitude = []

for i in features:
    if "W" in i:
        longitude.append(-float(i[:-1]))   # append -ve sign if "W", drop the "W" symbol
    else:
        longitude.append(float(i[:-1]))    # append +ve sign if "E", drop the "E" symbol

# append the longitude as columns to the dataframe
df = pd.DataFrame(columns = longitude)

# convert the rows into float data-type
latitude = []

for i in buffer[9:]:
    i = i[:-1]
    i = i.split("\t")

    if i[0] != "":          # if the first entry in the row is not null/blank
        if "S" in i[0]:
            latitude.append(-float(i[0][:-1]))  # append it to latitude list; -ve for "S"
            df.loc[-float(i[0][:-1])] = i[1:]   # add the row to the data frame; -ve for "S", drop the symbol
        else:
            latitude.append(float(i[0][:-1]))   # +ve for "N", drop the symbol
            df.loc[float(i[0][:-1])] = i[1:]

print(df.head(5))

temp_col = []
temp_row = []
temp_list = []

temp_col = longitude[::10]   # every 10th longitude

for iter1 in temp_col:
    for iter2 in latitude:
        print(df.loc[iter2])

I am also providing the link to the dataset here

(Download the file that ends with .txt and run the code from the same directory as the .txt file)

I am new to numpy, pandas and python and writing this small piece of code has been a huge task for me. It would be great if I could get some help in this regard.

Welcome to the world of NumPy/Pandas :) One of the really cool things about it is the way it abstracts actions on a matrix into simple commands, removing in the vast majority of cases any need to write loops.

A lot of your hard work would be unnecessary with more pandorable code. The following is my attempt to reproduce what you described. I may have misunderstood, but hopefully it will get you closer or point you in the right direction. Feel free to ask for clarification!

import pandas as pd

df = pd.read_csv('Aug-2016-potential-temperature-180x188.txt', skiprows=range(7))
df.columns=['longitude'] #renaming
df = df.longitude.str.split('\t', expand=True)
smaller = df.iloc[::10,:] # taking every 10th row
df.head()
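
Since the question asks for every 10th column rather than every 10th row, the same slicing idea can be applied to the column axis. The following is only a minimal sketch on a small synthetic frame: the 6 x 25 shape, the evenly spaced latitude/longitude labels and the names sub_df and sub_df_by_position are assumptions for illustration, not the real dataset.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real data: 6 latitudes x 25 longitudes (assumed shape).
latitude = [float(x) for x in range(-3, 3)]
longitude = [float(x) for x in range(0, 25)]
values = np.arange(len(latitude) * len(longitude), dtype=float)
values = values.reshape(len(latitude), len(longitude))
df = pd.DataFrame(values, index=latitude, columns=longitude)

# Every 10th longitude label, as in the question.
new_column = longitude[::10]

# Label-based selection: keeps every row, only the chosen columns.
sub_df = df.loc[:, new_column]

# Position-based selection gives the same result without building the list.
sub_df_by_position = df.iloc[:, ::10]
```

Each value keeps its original row and column labels, so sub_df.loc[k, j] equals df.loc[k, j] for every latitude k and every j in new_column, and no explicit loop is needed.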

So, if I understand you correctly (just to be sure): you have a large dataset with latitudes as rows and longitudes as columns. You want to take a sub-sample of it to work with (calculation, exploration, etc.), so you create a sub-list of rows and want to create a new dataframe based on those rows. Is this correct?

If so:

df['temp_col'] = [1 if x % 10 == 0 else 0 for x in range(len(df))]   # flag every 10th row
new_df = df[df['temp_col'] > 0].drop(['temp_col'], axis = 1)

And if you also want to drop some columns:

keep_columns = new_df.columns.values[0 : len(new_df.columns) : 10]
to_be_dropped = list(set(new_df.columns.values) - set(keep_columns))
new_df = new_df.drop(to_be_dropped, axis = 1)
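
If the goal is simply every 10th row and every 10th column, both steps can probably be collapsed into a single positional slice, without the helper column. This is a minimal sketch on made-up data; the 30 x 40 shape and the default integer labels are assumptions, not the real file.

```python
import numpy as np
import pandas as pd

# Made-up 30 x 40 frame standing in for the latitude/longitude grid.
df = pd.DataFrame(np.arange(30 * 40).reshape(30, 40))

# Rows 0, 10, 20 and columns 0, 10, 20, 30 in one step.
new_df = df.iloc[::10, ::10]

print(new_df.shape)  # (3, 4)
```

The slice keeps the original row and column labels, so values stay at the same labelled position as in the full frame.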
