How to create a sub data-frame from a given pandas data-frame?

I have written code that reads the given dataset and, after some pre-processing, converts the whole .txt file into a pandas data-frame.

  • The latitudes represent the rows and are stored in a list.
  • The longitudes represent the columns and are stored in a separate list.

Now I want to create a smaller data frame from the original one (so that the data is easier to understand and interpret) and perform calculations on it. To do that, I built a shorter list of 18 column labels by taking every 10th element of the longitude list. This worked fine. Let's call this new list new_column.

What I want to do now is iterate over every row and, for every value at row k and new_column j, add it to a new matrix or data frame.
For example, if row 10 and new_column 12 holds the value 'x', I want to put this 'x' at the same position in a new data frame (or matrix).

I have written the following code, but I don't know how to write the part that does the above.

import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from scipy import interpolate
# open the file for reading
dataset = open("Aug-2016-potential-temperature-180x188.txt", "r+")

# read the file linewise
buffer = dataset.readlines()

# pre-process the data to get the columns
column = buffer[8]
column = column[3 : -1]

# get the longitudes as features
features = column.split("\t")

# convert the features to float data-type
longitude = []

for i in features:
    if "W" in i:                           # test the current label, not the whole list
        longitude.append(-float(i[:-1]))   # append -ve sign if "W", drop the "W" symbol
    else:
        longitude.append(float(i[:-1]))    # append +ve sign if "E", drop the "E" symbol

# append the longitude as columns to the dataframe
df = pd.DataFrame(columns = longitude)

# convert the rows into float data-type
latitude = []

for i in buffer[9:]:
    i = i[:-1]
    i = i.split("\t")

    if i[0] != "":          # if the first entry in the row is not null/blank
        if "S" in i[0]:
            latitude.append(-float(i[0][:-1]))  # append it to latitude list; -ve for "S"
            df.loc[-float(i[0][:-1])] = i[1:]   # add the row to the data frame; -ve for "S", drop the symbol
        else:
            latitude.append(float(i[0][:-1]))   # +ve otherwise, drop the symbol
            df.loc[float(i[0][:-1])] = i[1:]    # use the same +ve label as in the latitude list

print(df.head(5))

temp_col = []
temp_row = []
temp_list = []

temp_col = longitude[0 : ((len(longitude) + 1)) : 10]

for iter1 in temp_col:
    for iter2 in latitude:
        print(df.loc[iter2])

I am also providing the link to the dataset here

(Download the file that ends with .txt and run the code from the same directory as the .txt file)

I am new to numpy, pandas and python, and writing this small piece of code has been a huge task for me. It would be great if I could get some help in this regard.

Welcome to the world of NumPy/Pandas :) One of the really cool things about it is the way it abstracts actions on a matrix into simple commands, removing, in the vast majority of cases, any need to write loops.
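As a small illustration of that point, here is a rough sketch of how the longitude-parsing loop from the question could be written without an explicit loop. The sample labels are made up for the example; it assumes every label is a number followed by a single "E" or "W" letter, as the question's code implies.

```python
import numpy as np
import pandas as pd

# Hypothetical labels in the "<number><hemisphere>" shape the question describes
labels = pd.Series(["10.5E", "20.5W", "0.5E", "179.5W"])

# Numeric part: everything except the trailing hemisphere letter
magnitude = labels.str[:-1].astype(float)

# Sign: -1 for western longitudes, +1 for everything else
sign = np.where(labels.str.endswith("W"), -1.0, 1.0)

longitude = (magnitude * sign).tolist()
print(longitude)  # [10.5, -20.5, 0.5, -179.5]
```

The same pattern should carry over to the latitude labels by testing for "S" instead of "W".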

A lot of your hard work would be unnecessary with more pandorable code. The following is my attempt to reproduce what you said. I may have misunderstood, but hopefully it will get you closer or point you in the right direction. Feel free to ask for clarification!

import pandas as pd

df = pd.read_csv('Aug-2016-potential-temperature-180x188.txt', skiprows=range(7))
df.columns=['longitude'] #renaming
df = df.longitude.str.split('\t', expand=True)
smaller = df.iloc[::10,:] # taking every 10th row
df.head()
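The snippet above keeps every 10th row. The question asks for every row but only every 10th column, so a minimal sketch of that variant may help. It uses a small synthetic grid in place of the real file; the values, the 6 x 25 shape and the label lists are assumptions made up for the example, while the name new_column comes from the question.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real grid: 6 latitudes (rows) x 25 longitudes (columns)
latitude = [-2.5, -1.5, -0.5, 0.5, 1.5, 2.5]
longitude = [float(x) for x in range(25)]
df = pd.DataFrame(np.arange(6 * 25).reshape(6, 25), index=latitude, columns=longitude)

# Every 10th longitude label, as in the question
new_column = longitude[::10]

# Select all rows and only those columns, by label ...
by_label = df.loc[:, new_column]

# ... or, equivalently here, by position
by_position = df.iloc[:, ::10]

print(by_label.shape)            # (6, 3)
print(by_label.loc[0.5, 10.0])   # 85
```

Each value keeps its row and column label in the smaller frame, so no nested loop over rows and columns is needed.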

So, if I understand you right (just to be sure): you have a huge dataset with latitude and longitude as rows and columns. You want to take a sub-sample of it to work with (calculation, exploration, etc.). So you create a sub-list of rows and you want to create a new dataframe based on those rows. Is this correct?

If so:

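# flag every 10th row (the flag list needs one entry per row of df)
df['temp_col'] = [ 1 if x%10 == 0 else 0 for x in range(len(df))]
# keep the flagged rows, then drop the helper column
new_df = df[df['temp_col']>0].drop(['temp_col'],axis = 1)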

And if you also want to drop some columns:

keep_columns = df.columns.values[0 :len(df.columns) : 10]
to_be_droped = list(set(df.columns.values) - set(keep_columns))
new_df = new_df.drop(to_be_droped, axis = 1)
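For comparison, positional slicing with iloc can likely do both reductions in one step, without a helper column or a drop list. This is only a sketch on a made-up 30 x 40 frame with default integer labels, not the real dataset.

```python
import numpy as np
import pandas as pd

# Made-up 30 x 40 frame with default integer row and column labels
rows, cols = 30, 40
df = pd.DataFrame(np.arange(rows * cols).reshape(rows, cols))

# Keep every 10th row and every 10th column in a single positional slice
sub = df.iloc[::10, ::10]

print(sub.shape)           # (3, 4)
print(list(sub.index))     # [0, 10, 20]
print(list(sub.columns))   # [0, 10, 20, 30]
```

Dropping either step (for example `df.iloc[:, ::10]`) leaves that axis untouched, which covers the rows-only and columns-only cases as well.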
