How to read multiple data sets and create a single dataframe with a year column

Question

I would like to read multiple data sets and combine them into a single pandas dataframe with a year column.

My sample data sets include newyork2000.txt, newyork2001.txt, and newyork2002.txt.

Each data set contains 'address' and 'price'.

Below is newyork2000.txt:

253 XXX st, 150000
2567 YYY st, 200000
...
3896 ZZZ rd, 350000

My final single dataframe should look like this:

year address      price
2000 253 XXX st   150000
2000 2567 YYY st  200000
...
2000 3896 ZZZ rd  350000
...
2002 789 XYZ ave  450000

So, I need to combine all data sets, create the year column, and name the columns.

Here is my code to create a single dataframe:

import pandas as pd

years = [2000, 2001, 2002]
df = []
for i in years:
    # read each file without a header row
    df.append(pd.read_csv("newyork" + str(i) + ".txt", header=None))
dfs = pd.concat(df)

However, I could not create the year column or name the columns. Please help me solve this problem.

Answer 1

It is preferable to extract the year from the filename programmatically rather than to create a list of years manually.
Use pathlib with .glob to find the files, use the .stem attribute to get the filename without its suffix, and then slice the year from the stem with [-4:]. This works provided the filenames are consistent, with the year as the last 4 characters of the stem.
- .stem returns the final path component (e.g. 'newyork2000') without its suffix (e.g. '.txt').
Use pandas.DataFrame.insert to add the 'year' column at a specific location in the dataframe. This method inserts the column in place and returns None, so do not use x = x.insert(...).

import pandas as pd
from pathlib import Path

# set the file path
file_path = Path('e:/PythonProjects/stack_overflow/data/example')

# find your files
files = file_path.glob('newyork*.txt')

# create a list of dataframes
df_list = list()

for f in files:
    # extract year from filename, by slicing the last four characters off the stem
    year = (f.stem)[-4:]
    
    # read the file and add column names
    x = pd.read_csv(f, header=None, names=['address', 'price'])
    
    # add a year column at index 0; use int(year) if the year should be an int, otherwise use only year
    x.insert(0, 'year', int(year))
    
    # append to the list
    df_list.append(x)
    
# create one dataframe from the list of dataframes
df = pd.concat(df_list).reset_index(drop=True)

Result

 year      address   price
 2000   253 XXX st  150000
 2000  2567 YYY st  200000
 2000  3896 ZZZ rd  350000
 2001  456 XYZ ave  650000
 2002  789 XYZ ave  450000
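
As a side note on the in-place behaviour of DataFrame.insert mentioned above, here is a minimal sketch using a one-row dataframe made up for illustration. It shows why assigning the return value back to x would lose the dataframe.

```python
import pandas as pd

# a tiny illustrative dataframe (values taken from the sample data)
x = pd.DataFrame({"address": ["253 XXX st"], "price": [150000]})

# insert modifies x in place and returns None
result = x.insert(0, "year", 2000)

print(result)            # None
print(list(x.columns))   # ['year', 'address', 'price']
```

Writing x = x.insert(...) would therefore rebind x to None.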

Sample data files

'newyork2000.txt'

253 XXX st, 150000
2567 YYY st, 200000
3896 ZZZ rd, 350000

'newyork2001.txt'

456 XYZ ave, 650000

'newyork2002.txt'

789 XYZ ave, 450000
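
If the filenames might not all end in a four-digit year, slicing with [-4:] can silently produce a wrong value. The following is a sketch of a slightly more defensive variant, not part of the original answer: the helper names year_from_name and load_with_year are hypothetical, and skipinitialspace=True is an added assumption to strip the space after each comma in the sample files.

```python
import re
from pathlib import Path

import pandas as pd


def year_from_name(path):
    """Return the trailing 4-digit year in the file stem as an int, or None if absent."""
    match = re.search(r"(\d{4})$", Path(path).stem)
    return int(match.group(1)) if match else None


def load_with_year(folder, pattern="newyork*.txt"):
    """Read every matching file into one dataframe with a leading 'year' column."""
    frames = []
    for f in sorted(Path(folder).glob(pattern)):
        year = year_from_name(f)
        if year is None:
            continue  # skip files whose stem does not end in a year
        x = pd.read_csv(f, header=None, names=["address", "price"],
                        skipinitialspace=True)
        x.insert(0, "year", year)  # in place, at column position 0
        frames.append(x)
    # note: pd.concat raises ValueError if no file matched
    return pd.concat(frames, ignore_index=True)
```

Calling load_with_year on the folder that holds the sample files should give the same shape of result as above, with files such as a hypothetical newyork_notes.txt skipped rather than misread.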

Answer 2

You can read the files first, then insert a year column into each dataframe, and then concatenate them:

import pandas as pd

years = [2000,2001,2002]

# Read all CSV files and name the columns
dfs = [pd.read_csv(f"newyork{year}.txt", header=None, names=['address', 'price']) for year in years]

# Insert column in the beginning
for i, df in enumerate(dfs):
   df.insert(0, 'year', years[i])

# Concatenate all
df = pd.concat(dfs)
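
The same idea can also be written with DataFrame.assign, which returns a new dataframe instead of modifying one in place. This is only a sketch under some assumptions: combine_by_year is a hypothetical helper, it takes a mapping of year to path (or file-like object), and it passes skipinitialspace=True and ignore_index=True, neither of which appears in the answer above.

```python
import io

import pandas as pd


def combine_by_year(sources):
    """Combine sources (a mapping of year -> path or file-like) into one dataframe."""
    frames = [
        pd.read_csv(src, header=None, names=["address", "price"],
                    skipinitialspace=True).assign(year=year)
        for year, src in sources.items()
    ]
    combined = pd.concat(frames, ignore_index=True)
    # assign appends 'year' as the last column, so reorder to put it first
    return combined[["year", "address", "price"]]


# usage with in-memory stand-ins for the sample files
sample = {
    2000: io.StringIO("253 XXX st, 150000\n2567 YYY st, 200000\n"),
    2001: io.StringIO("456 XYZ ave, 650000\n"),
}
df = combine_by_year(sample)
```

With real files, the mapping could be built as {year: f"newyork{year}.txt" for year in years}.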
