How to read multiple data sets, and create a single dataframe with a year column

I would like to read multiple data sets and combine them into a single Pandas dataframe with a year column.

My sample data sets include newyork2000.txt, newyork2001.txt, and newyork2002.txt.

Each data set contains 'address' and 'price'.

Below is newyork2000.txt:

253 XXX st, 150000
2567 YYY st, 200000
...
3896 ZZZ rd, 350000   

My final single dataframe should look like this:

year address      price
2000 253 XXX st   150000
2000 2567 YYY st  200000
...
2000 3896 ZZZ rd  350000
...
2002 789 XYZ ave  450000

So, I need to combine all data sets, create the year column, and name the columns.

Here is my code to create a single dataframe:

import pandas as pd

years = [2000, 2001, 2002]
df = []
for i in years:
    df.append(pd.read_csv("newyork" + str(i) + ".txt", header=None))
dfs = pd.concat(df)

But I could not create the year column or name the columns. Please help me solve this problem.

  • It is better to extract the year from the filename programmatically than to create a list of years by hand.
  • Use pathlib with .glob to find the files, use the .stem attribute to get the filename without its extension, and then slice the year from the stem with [-4:]. This works provided the filenames are consistent, with the year as the last 4 characters of the stem.
    • .stem returns the final path component (e.g. 'newyork2000') without its suffix (e.g. '.txt').
  • Use pandas.DataFrame.insert to add the 'year' column at a specific position in the dataframe. This method inserts the column in place and returns None, so do not write x = x.insert(...).
import pandas as pd
from pathlib import Path

# set the file path
file_path = Path('e:/PythonProjects/stack_overflow/data/example')

# find your files
files = file_path.glob('newyork*.txt')

# create a list of dataframes
df_list = list()

for f in files:
    # extract year from filename, by slicing the last four characters off the stem
    year = (f.stem)[-4:]
    
    # read the file and add column names
    x = pd.read_csv(f, header=None, names=['address', 'price'])
    
    # add a year column at index 0; use int(year) if the year should be an int, otherwise use only year
    x.insert(0, 'year', int(year))
    
    # append to the list
    df_list.append(x)
    
# create one dataframe from the list of dataframes
df = pd.concat(df_list).reset_index(drop=True)

Result

 year      address   price
 2000   253 XXX st  150000
 2000  2567 YYY st  200000
 2000  3896 ZZZ rd  350000
 2001  456 XYZ ave  650000
 2002  789 XYZ ave  450000

Sample data files

  • 'newyork2000.txt'
253 XXX st, 150000
2567 YYY st, 200000
3896 ZZZ rd, 350000 
  • 'newyork2001.txt'
456 XYZ ave, 650000
  • 'newyork2002.txt'
789 XYZ ave, 450000
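
To check the pathlib approach end to end without touching real data, the sample files listed above can be written to a temporary directory first. This is only a minimal sketch: `load_yearly_files` is a hypothetical helper name, and wrapping the glob in `sorted()` is an added assumption so that the years come out in order, since `.glob` itself does not promise any ordering.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Sample contents copied from the data files listed above
SAMPLES = {
    "newyork2000.txt": "253 XXX st, 150000\n2567 YYY st, 200000\n3896 ZZZ rd, 350000\n",
    "newyork2001.txt": "456 XYZ ave, 650000\n",
    "newyork2002.txt": "789 XYZ ave, 450000\n",
}


def load_yearly_files(folder: Path) -> pd.DataFrame:
    """Hypothetical helper: read newyork*.txt in `folder` and prepend a year column."""
    frames = []
    # sorted() keeps the files in name order; glob alone gives no ordering guarantee
    for f in sorted(folder.glob("newyork*.txt")):
        x = pd.read_csv(f, header=None, names=["address", "price"])
        # the last four characters of the stem are the year; insert works in place
        x.insert(0, "year", int(f.stem[-4:]))
        frames.append(x)
    return pd.concat(frames, ignore_index=True)


# Write the samples to a throwaway directory, then load them back
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    for name, text in SAMPLES.items():
        (folder / name).write_text(text)
    demo = load_yearly_files(folder)

print(demo)
```

If the filenames do not all end in a four-digit year, the `[-4:]` slice will either fail in `int()` or silently pick up the wrong characters, so it is worth checking the names before relying on it.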

You can read the files first, then insert the columns corresponding to years, and then concatenate them:

import pandas as pd

years = [2000,2001,2002]

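# Read all CSV files and name the columns
dfs = [pd.read_csv(f"newyork{year}.txt", header=None, names=['address', 'price']) for year in years]

# Insert the year column at the beginning of each dataframe
for year, df in zip(years, dfs):
    df.insert(0, 'year', year)

# Concatenate all and renumber the rows
df = pd.concat(dfs, ignore_index=True)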
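
The read, insert, and concatenate steps above can also be folded into a single expression with `DataFrame.assign`, which returns a new dataframe instead of modifying one in place. The sketch below is one possible variant, not part of the original answer; the `io.StringIO` buffers and the `sources` name are assumptions that stand in for the real files so the example runs on its own.

```python
import io

import pandas as pd

# In-memory stand-ins for newyork2000.txt etc. (assumption, so the demo is runnable)
sources = {
    2000: io.StringIO("253 XXX st, 150000\n2567 YYY st, 200000\n"),
    2001: io.StringIO("456 XYZ ave, 650000\n"),
    2002: io.StringIO("789 XYZ ave, 450000\n"),
}

combined = pd.concat(
    [
        # assign() returns a copy with the extra column appended at the end
        pd.read_csv(src, header=None, names=["address", "price"]).assign(year=year)
        for year, src in sources.items()
    ],
    ignore_index=True,
)[["year", "address", "price"]]  # reorder so that year comes first

print(combined)
```

With real files, each buffer would be replaced by a path such as f"newyork{year}.txt".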
