
Operating on a large .csv file with pandas/dask in Python

I've got a large .csv file (5 GB) from the UK Land Registry. I need to find all real estate that has been bought/sold two or more times.

Each row of the table looks like this:

{F887F88E-7D15-4415-804E-52EAC2F10958},"70000","1995-07-07 00:00","MK15 9HP","D","N","F","31","","ALDRICH DRIVE","WILLEN","MILTON KEYNES","MILTON KEYNES","MILTON KEYNES","A","A"

I've never used pandas or any data science library. So far I've come up with this plan:

  1. Load the .csv file and add headers and column names

  2. Drop unnecessary columns

  3. Create a hashmap of the edited df and find duplicates

  4. Export the duplicates to a new .csv file

  5. From my research I found that pandas is bad with very big files, so I used dask:

import dask.dataframe as dd

df = dd.read_csv('pp-complete.csv', header=None, dtype={7: 'object', 8: 'object'}).astype(str)
df.columns = ['ID', 'Price', 'Date', 'ZIP', 'PropType', 'Old/new', 'Duration', 'Padress', 'Sadress', 'Str', 'Locality', 'Town', 'District', 'County', 'PPDType', 'Rec_Stat']
df.head()
After that I tried to delete the unnecessary columns:
df.drop('ID', axis=1).head()

I also tried:

indexes_to_remove = [0, 1, 2, 3, 4, 5, 6, 7, 14, 15, 16]
for index in indexes_to_remove:
    df.drop(df.index[index], axis=1)

Nothing worked.

The task is to show the properties that have been bought/sold two or more times. I decided to use only the address columns, because the data in every other column isn't consistent across sales (ID is a unique transaction code; Date, type of offer, etc.).

I need to do this task with minimum memory and CPU usage, which is why I went with a hashmap.
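Roughly, what I mean by the hashmap approach is something like this (just an untested sketch, keyed on the address columns by position):

import pandas as pd
from collections import Counter

# Address columns by position: Sadress, Str, Locality, Town, District, County
address_cols = [8, 9, 10, 11, 12, 13]

# First pass: count how often each address appears, reading the file in
# chunks so the whole 5GB never has to sit in memory at once.
counts = Counter()
for chunk in pd.read_csv('pp-complete.csv', header=None, usecols=address_cols,
                         dtype=str, keep_default_na=False, chunksize=1_000_000):
    counts.update(chunk.itertuples(index=False, name=None))

# Addresses seen two or more times; a second pass over the file could then
# select these rows and write them to a new .csv
repeated = {address for address, n in counts.items() if n >= 2}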

I don't know if there's another method to do this more easily or more efficiently.

Some minor suggestions:

  • If 5GB is the full dataset, it's best to use plain pandas. The strategy you outlined might involve communication across partitions, so it's going to be computationally more expensive (or will require some work to make it more efficient). With pandas all the data will be in memory, so the sorting/duplication check will be fast (a complete sketch is at the end of these suggestions).

  • In the code, make sure to assign the modified dataframe. Typically the modified result is assigned back to replace the existing dataframe:

# without "df = " part, the modification is not stored
df = df.drop(columns=['ID'])
  • If memory is a big constraint, then consider loading only the data you need (as opposed to loading everything and then dropping specific columns). For this we will need to provide the list of columns to the usecols kwarg of pd.read_csv. Here's the rough idea:
import pandas as pd

column_names = ['ID', 'Price', 'Date', 'ZIP', 'PropType', 'Old/new', 'Duration', 'Padress', 'Sadress', 'Str', 'Locality', 'Town', 'District', 'County', 'PPDType', 'Rec_Stat']
indexes_to_remove = [0, 1, 2, 3, 4, 5, 6, 7, 14, 15, 16]
indexes_to_keep = [i for i in range(len(column_names)) if i not in indexes_to_remove]
column_names_to_keep = [n for i, n in enumerate(column_names) if i in indexes_to_keep]

# header takes row numbers, not names; pass the labels via "names" and the
# column positions to load via "usecols"
df = pd.read_csv('some_file.csv', header=None, names=column_names_to_keep, usecols=indexes_to_keep)
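Putting these pieces together, a minimal end-to-end sketch with plain pandas might look like this (assuming the six address columns are enough to identify a property; file name and column names follow the question):

import pandas as pd

column_names = ['ID', 'Price', 'Date', 'ZIP', 'PropType', 'Old/new', 'Duration',
                'Padress', 'Sadress', 'Str', 'Locality', 'Town', 'District',
                'County', 'PPDType', 'Rec_Stat']
address_names = ['Sadress', 'Str', 'Locality', 'Town', 'District', 'County']

# Load only the address columns; keep empty fields as '' instead of NaN
df = pd.read_csv('pp-complete.csv', header=None, names=column_names,
                 usecols=address_names, dtype=str, keep_default_na=False)

# keep=False marks every occurrence of a repeated address, so the result
# contains all transactions for properties bought/sold two or more times
repeated = df[df.duplicated(keep=False)]
repeated.to_csv('repeated_properties.csv', index=False)

Since all of the data is in memory, the duplicated check itself is a single fast pass.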
