
Operating on a large .csv file with pandas/dask in Python

I've got a large .csv file (5 GB) from the UK Land Registry. I need to find all real estate that has been bought/sold two or more times.

Each row of the table looks like this:

{F887F88E-7D15-4415-804E-52EAC2F10958},"70000","1995-07-07 00:00","MK15 9HP","D","N","F","31","","ALDRICH DRIVE","WILLEN","MILTON KEYNES","MILTON KEYNES","MILTON KEYNES","A","A"

I've never used pandas or any data science library. So far I've come up with this plan:

  1. Load the .csv file and add headers and column names

  2. Drop unnecessary columns

  3. Create a hashmap of the edited df and find duplicates

  4. Export the duplicates to a new .csv file

  5. From my research I found that pandas is bad with very big files, so I used dask:

import dask.dataframe as dd

df = dd.read_csv('pp-complete.csv', header=None, dtype={7: 'object', 8: 'object'}).astype(str)
df.columns = ['ID', 'Price', 'Date', 'ZIP', 'PropType', 'Old/new', 'Duration', 'Padress', 'Sadress', 'Str', 'Locality', 'Town', 'District', 'County', 'PPDType', 'Rec_Stat']
df.head()
After that I tried to delete the unnecessary columns:
df.drop('ID', axis=1).head()

I also tried:

indexes_to_remove = [0, 1, 2, 3, 4, 5, 6, 7, 14, 15, 16]
for index in indexes_to_remove:
    df.drop(df.index[index], axis=1)

Nothing worked.

The task is to show the properties that have been bought/sold two or more times. I decided to use only the address columns, because the data in every other column isn't consistent across sales (ID is a unique transaction code; Date, type of offer, etc.).

I need to do this task with minimum memory and CPU usage, which is why I went with a hashmap.
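Roughly, what I mean by the hashmap approach is something like this (just an untested sketch, keyed on the address columns by position):

import pandas as pd
from collections import Counter

# Address columns by position: Sadress, Str, Locality, Town, District, County
address_cols = [8, 9, 10, 11, 12, 13]

# First pass: count how often each address appears, reading the file in
# chunks so the whole 5GB never has to sit in memory at once.
counts = Counter()
for chunk in pd.read_csv('pp-complete.csv', header=None, usecols=address_cols,
                         dtype=str, keep_default_na=False, chunksize=1_000_000):
    counts.update(chunk.itertuples(index=False, name=None))

# Addresses seen two or more times; a second pass over the file could then
# select these rows and write them to a new .csv
repeated = {address for address, n in counts.items() if n >= 2}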

I don't know if there's another method to do this more easily or more efficiently.

Some minor suggestions:

  • If 5GB is the full dataset, it's best to use plain pandas. The strategy you outlined might involve communication across partitions, so it's going to be computationally more expensive (or will require some work to make it more efficient). With pandas all the data will be in memory, so the sorting/duplication check will be fast (a complete sketch is at the end of these suggestions).

  • In the code, make sure to assign the modified dataframe. Typically the modified result is assigned back to replace the existing dataframe:

# without "df = " part, the modification is not stored
df = df.drop(columns=['ID'])
  • If memory is a big constraint, then consider loading only the data you need (as opposed to loading everything and then dropping specific columns). For this we will need to provide the list of columns to the usecols kwarg of pd.read_csv. Here's the rough idea:
import pandas as pd

column_names = ['ID', 'Price', 'Date', 'ZIP', 'PropType', 'Old/new', 'Duration', 'Padress', 'Sadress', 'Str', 'Locality', 'Town', 'District', 'County', 'PPDType', 'Rec_Stat']
indexes_to_remove = [0, 1, 2, 3, 4, 5, 6, 7, 14, 15, 16]
indexes_to_keep = [i for i in range(len(column_names)) if i not in indexes_to_remove]
column_names_to_keep = [n for i, n in enumerate(column_names) if i in indexes_to_keep]

# header takes row numbers, not names; pass the labels via "names" and the
# column positions to load via "usecols"
df = pd.read_csv('some_file.csv', header=None, names=column_names_to_keep, usecols=indexes_to_keep)
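Putting these pieces together, a minimal end-to-end sketch with plain pandas might look like this (assuming the six address columns are enough to identify a property; file name and column names follow the question):

import pandas as pd

column_names = ['ID', 'Price', 'Date', 'ZIP', 'PropType', 'Old/new', 'Duration',
                'Padress', 'Sadress', 'Str', 'Locality', 'Town', 'District',
                'County', 'PPDType', 'Rec_Stat']
address_names = ['Sadress', 'Str', 'Locality', 'Town', 'District', 'County']

# Load only the address columns; keep empty fields as '' instead of NaN
df = pd.read_csv('pp-complete.csv', header=None, names=column_names,
                 usecols=address_names, dtype=str, keep_default_na=False)

# keep=False marks every occurrence of a repeated address, so the result
# contains all transactions for properties bought/sold two or more times
repeated = df[df.duplicated(keep=False)]
repeated.to_csv('repeated_properties.csv', index=False)

Since all of the data is in memory, the duplicated check itself is a single fast pass.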
