How do I count identical rows across multiple CSV files in Pandas?

Question

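I merged three Netflow datasets stored as separate CSV files (D1, D2, D3) into one large dataset (df) and applied KMeans clustering to it. Because pd.concat raised a memory error, I merged the files from the Linux terminal instead.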

import pandas as pd

df = pd.read_csv('D.csv')
# D.csv was already created from the terminal on a Linux machine

........
KMeans Clustering
........

# After clustering, I split each cluster into its own DataFrame
# and wrote each one to a CSV file.
cluster_0 = df[df['clusters'] == 0]
cluster_1 = df[df['clusters'] == 1]
cluster_2 = df[df['clusters'] == 2]

cluster_0.to_csv('cluster_0.csv')
cluster_1.to_csv('cluster_1.csv')
cluster_2.to_csv('cluster_2.csv')

# My goal is to count how many rows each cluster has in common
# with D1, D2 and D3
D1 = pd.read_csv('D1.csv')
D2 = pd.read_csv('D2.csv')
D3 = pd.read_csv('D3.csv')

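All of these datasets have the same column names: 12 columns, all numeric.

Example of the expected result:

cluster_0 has xxxx identical rows from D1, xxxxx identical rows from D2, and xxxxx identical rows from D3?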

Answer 1

# An inner merge on the shared columns keeps only rows present in both frames
cluster0_D1 = pd.merge(D1, cluster_0, how='inner')
number_of_rows_D1 = len(cluster0_D1)

cluster0_D2 = pd.merge(D2, cluster_0, how='inner')
number_of_rows_D2 = len(cluster0_D2)

cluster0_D3 = pd.merge(D3, cluster_0, how='inner')
number_of_rows_D3 = len(cluster0_D3)

print("How many samples belong to D1, D2, D3 for cluster_0?")
print("D1: ",number_of_rows_D1)
print("D2: ",number_of_rows_D2)
print("D3: ",number_of_rows_D3)

I think this solves my problem.
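
One caveat worth checking, offered as a sketch rather than as part of the answer above: an inner merge on all shared columns pairs every matching row on the left with every matching row on the right. If a flow record appears more than once in D1 or in cluster_0, len() of the merged frame can therefore exceed the number of rows the two frames actually share. The toy frames and the helper count_shared_rows below are hypothetical; the helper counts each distinct row once, which may or may not be the definition of "same rows" you want.

```python
import pandas as pd

def count_shared_rows(source, cluster, ignore=("clusters",)):
    """Count distinct rows of `cluster` that also occur in `source`.

    Hypothetical helper: compares only the columns both frames share,
    skipping anything in `ignore` (e.g. the label column added after KMeans).
    """
    cols = [c for c in source.columns if c in cluster.columns and c not in ignore]
    left = source[cols].drop_duplicates()
    right = cluster[cols].drop_duplicates()
    return len(left.merge(right, on=cols, how="inner"))

# Toy data (made up): the row (1, 10) appears twice in both frames.
D1 = pd.DataFrame({"a": [1, 1, 2], "b": [10, 10, 20]})
cluster_0 = pd.DataFrame({"a": [1, 1, 3], "b": [10, 10, 30], "clusters": [0, 0, 0]})

# The plain inner merge produces 2 x 2 pairings of the duplicated row.
naive = len(pd.merge(D1, cluster_0, how="inner"))
distinct = count_shared_rows(D1, cluster_0)
print("naive merge:", naive, "distinct shared rows:", distinct)
```

If the two numbers differ on your real data, the files contain duplicate records and the plain merge count should be read with that in mind.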

Answer 2

I don't think the question has enough information to cover the edge cases, but if I understand it correctly, this should work.

import pandas as pd

# Read in the 3 files, and add a column called "file" so we know which file each row came from
D1 = pd.read_csv('D1.csv')
D1['file'] = 'D1.csv'
D2 = pd.read_csv('D2.csv')
D2['file'] = 'D2.csv'
D3 = pd.read_csv('D3.csv')
D3['file'] = 'D3.csv'

# Stack them row-wise into the single DF that the "awk" command was producing
# (axis=0 appends rows; axis=1 would place the files side by side as extra columns)
df = pd.concat([D1, D2, D3], axis=0, ignore_index=True)

# Save off the series showing which file each row belongs to
files = df['file']
# Drop it so that it doesn't get included in your analysis
df.drop('file', inplace=True, axis=1)

"""
There is no code in the question to show the KMeans clustering
"""

# Add the filename back
df['filename'] = files

We will avoid the awk command and use pd.concat instead.
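
To get from here to the per-cluster counts the question asks for, one option is to cross-tabulate the cluster label against the source-file column. This is a minimal sketch on made-up data: the tiny frames, the column names x and y, and the threshold rule standing in for KMeans are all assumptions, since the question does not show the clustering code.

```python
import pandas as pd

# Made-up stand-ins for D1.csv, D2.csv, D3.csv (the real files have 12 numeric columns).
D1 = pd.DataFrame({"x": [0.1, 0.2, 5.0], "y": [1.0, 1.1, 9.0]})
D2 = pd.DataFrame({"x": [0.3, 5.1], "y": [0.9, 9.2]})
D3 = pd.DataFrame({"x": [5.2, 5.3], "y": [9.1, 9.3]})

# Tag each row with its source file, then stack the frames row-wise.
frames = {"D1.csv": D1, "D2.csv": D2, "D3.csv": D3}
df = pd.concat(
    [frame.assign(filename=name) for name, frame in frames.items()],
    ignore_index=True,
)

# Placeholder for KMeans: any procedure that yields one label per row fits here.
features = df.drop(columns="filename")
df["clusters"] = (features["x"] > 1.0).astype(int)

# Rows = cluster label, columns = source file, values = number of rows.
counts = pd.crosstab(df["clusters"], df["filename"])
print(counts)
```

Because the source file is recorded before clustering, nothing has to be re-matched by value afterwards, so rows that happen to be identical across files are still attributed to the file they came from.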

Solution 1
1 (accepted) 2022-05-13 20:31:54

Solution 2
0 2022-05-13 19:49:58
