How to read multiple data sets and create a single dataframe with a year column
I would like to read multiple data sets and combine them into a single pandas dataframe with a year column.
My sample data sets include newyork2000.txt, newyork2001.txt, and newyork2002.txt.
Each data set contains 'address' and 'price'.
Below is newyork2000.txt:
253 XXX st, 150000
2567 YYY st, 200000
...
3896 ZZZ rd, 350000
My final single dataframe should look like this:
year address price
2000 253 XXX st 150000
2000 2567 YYY st 200000
...
2000 3896 ZZZ rd 350000
...
2002 789 XYZ ave 450000
So, I need to combine all data sets, create the year column, and name the columns.
Here is my code to create a single dataframe:
import pandas as pd

years = [2000, 2001, 2002]
df = []
for i in years:
    df.append(pd.read_csv("newyork" + str(i) + ".txt", header=None))
dfs = pd.concat(df)
But I could not create the year column or name the columns. Please help me solve this problem.
It is simpler to extract the year from the filename than to manually create a list of years.
Use pathlib with .glob to find the files, use the .stem attribute to get the filename without its extension, and then slice the year from the stem with [-4:]. This works provided the filenames are consistent, with the year as the last 4 characters of the stem.
The .stem attribute returns the final path component (e.g. 'newyork2000') without its suffix (e.g. '.txt').
Use pandas.DataFrame.insert to add the 'year' column at a specific location in the dataframe. This method inserts the column in place and returns None, so do not use x = x.insert(...).
import pandas as pd
from pathlib import Path
# set the file path
file_path = Path('e:/PythonProjects/stack_overflow/data/example')
# find your files
files = file_path.glob('newyork*.txt')
# create a list of dataframes
df_list = list()
for f in files:
    # extract year from filename, by slicing the last four characters off the stem
    year = f.stem[-4:]
    # read the file and add column names
    x = pd.read_csv(f, header=None, names=['address', 'price'])
    # add a year column at index 0; use int(year) if the year should be an int, otherwise use only year
    x.insert(0, 'year', int(year))
    # append to the list
    df_list.append(x)
# create one dataframe from the list of dataframes
df = pd.concat(df_list).reset_index(drop=True)
year address price
2000 253 XXX st 150000
2000 2567 YYY st 200000
2000 3896 ZZZ rd 350000
2001 456 XYZ ave 650000
2002 789 XYZ ave 450000
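As a quick check of the in-place behaviour of insert mentioned in this answer, the following minimal sketch uses a one-row dataframe made up for illustration (it is not read from the sample files). It shows why assigning the return value of insert back to the variable would lose the dataframe.

```python
import pandas as pd

# One illustrative row; the values mirror the sample data in the question
x = pd.DataFrame({"address": ["253 XXX st"], "price": [150000]})

# insert() modifies x in place and returns None
result = x.insert(0, "year", 2000)

print(result)           # None
print(list(x.columns))  # ['year', 'address', 'price']
```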
'newyork2000.txt'
253 XXX st, 150000
2567 YYY st, 200000
3896 ZZZ rd, 350000
'newyork2001.txt'
456 XYZ ave, 650000
'newyork2002.txt'
789 XYZ ave, 450000
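The [-4:] slice relies on every filename ending in a four-digit year. If that might not hold, a regular expression can make the assumption explicit and flag files that do not match. This is only a sketch: the helper name year_from_stem and the filename notes.txt are hypothetical and not part of the original answer.

```python
import re
from pathlib import Path
from typing import Optional


def year_from_stem(path: Path) -> Optional[int]:
    """Hypothetical helper: return the trailing four-digit year of the stem, or None."""
    match = re.search(r"(\d{4})$", path.stem)
    return int(match.group(1)) if match else None


# Illustrative filenames only; nothing is read from disk here
for name in ["newyork2000.txt", "newyork2001.txt", "notes.txt"]:
    print(name, year_from_stem(Path(name)))
```

Files for which the helper returns None can then be skipped before calling pd.read_csv.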
You can read the files first, then insert the columns corresponding to years, and then concatenate them:
import pandas as pd
years = [2000,2001,2002]
# Read all CSV files
dfs = [pd.read_csv(f"newyork{year}.txt", header=None) for year in years]
# Insert column in the beginning
for i, df in enumerate(dfs):
    df.insert(0, 'year', years[i])
# Concatenate all
df = pd.concat(dfs)
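The snippet in this answer leaves the columns unnamed and keeps each file's own row index. One possible variant, sketched below, names the columns while reading, adds the year with DataFrame.assign (which returns a new dataframe rather than modifying in place), and passes ignore_index=True to pd.concat. To stay self-contained, the sketch writes the sample rows from this thread to a temporary directory; that directory and the skipinitialspace option are assumptions made for illustration, not part of the original answer.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Sample rows copied from the thread, written to a temporary directory for illustration
samples = {
    2000: "253 XXX st, 150000\n2567 YYY st, 200000\n3896 ZZZ rd, 350000\n",
    2001: "456 XYZ ave, 650000\n",
    2002: "789 XYZ ave, 450000\n",
}
data_dir = Path(tempfile.mkdtemp())
for year, text in samples.items():
    (data_dir / f"newyork{year}.txt").write_text(text)

years = [2000, 2001, 2002]

# Read each file with named columns, then add the year as a new column
dfs = [
    pd.read_csv(
        data_dir / f"newyork{year}.txt",
        header=None,
        names=["address", "price"],
        skipinitialspace=True,  # drop the space that follows each comma
    ).assign(year=year)
    for year in years
]

# Concatenate with a fresh 0..n-1 index and move 'year' to the front
df = pd.concat(dfs, ignore_index=True)[["year", "address", "price"]]
print(df)
```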