Shuffle rows of a large CSV

I want to shuffle this dataset to get a random sample. It has 1.6 million rows, but the first rows are all class 0 and the last rows are all class 4, so I need to pick samples randomly to get more than one class. The current code prints only class 0 (that is, just one class). I followed advice from this platform, but it doesn't work.

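import random
import pandas as pd

# Read lines from the original file and shuffle them in memory
fid = open("sentiment_train.csv", "r")
li = fid.readlines(16000000)
random.shuffle(li)

# Write the shuffled lines to a new file
fid2 = open("shuffled_train.csv", "w")
fid2.writelines(li)
fid2.close()
fid.close()

# Load the first 100,000 rows of the shuffled file, keeping columns 0 and 5
sentiment_onefourty_train = pd.read_csv('shuffled_train.csv', header=0, delimiter=",", usecols=[0, 5], nrows=100000)
sentiment_onefourty_train.columns = ['target', 'text']
print(sentiment_onefourty_train['target'].value_counts())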
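One possible reason the code in the question prints only class 0: the argument to readlines() is a size hint, not a row count. Reading stops once the total size of the lines read so far exceeds the hint, so readlines(16000000) returns only roughly the first 16 MB of the file. If the file is larger than that and sorted by class, every line read may belong to class 0. The sketch below illustrates this on a small stand-in file; the file name, row contents and the hint of 1000 are made up for the demonstration.

```python
import os
import random
import tempfile

# Toy stand-in for the real file: 500 rows of class 0 followed by 500 rows of class 4.
rows = [f"0,negative text {i}\n" for i in range(500)]
rows += [f"4,positive text {i}\n" for i in range(500)]
path = os.path.join(tempfile.mkdtemp(), "toy_train.csv")
with open(path, "w") as f:
    f.writelines(rows)

# readlines(hint) stops once the total size of the lines read exceeds the hint,
# so a small hint returns only the beginning of the file.
with open(path) as f:
    partial = f.readlines(1000)

# readlines() without an argument returns every line.
with open(path) as f:
    full = f.readlines()

random.seed(0)  # fixed seed only to make the demonstration repeatable
random.shuffle(full)

partial_classes = {line.split(",")[0] for line in partial}
full_classes = {line.split(",")[0] for line in full}
print(len(partial), partial_classes)
print(len(full), full_classes)
```

In this toy case the partial read contains only class 0, while the full read contains both classes. Assuming the same thing happens with the real file, calling readlines() with no argument would be the smallest change. If the file starts with a header row, that row would also need to be kept out of the shuffle.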

Because you read in your data using pandas, you can also do the randomisation in a different way, using DataFrame.sample:

df = pd.read_csv('sentiment_train.csv', header= 0, delimiter=",", usecols=[0,5])
df.columns=['target', 'text']
df1 = df.sample(n=100000)

If this fails, it is worth checking how many unique values there are and how frequently each appears. If the first 1,599,999 rows are 0 and only the last one is 4, then chances are you won't get any 4s.
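The check described above can be sketched as follows, using a small made-up DataFrame in place of the real data (the 990/10 split and the sample sizes are assumptions chosen for illustration). It counts the rows per class and then shows one way to draw from each class separately, so that a rare class cannot be missed. Sampling on a groupby is available in pandas 1.1 and later.

```python
import pandas as pd

# Toy stand-in for the real data, heavily imbalanced on purpose.
df = pd.DataFrame({
    "target": [0] * 990 + [4] * 10,
    "text": [f"tweet {i}" for i in range(1000)],
})

# How many rows does each class have?
counts = df["target"].value_counts()
print(counts)

# A plain random sample mirrors the imbalance, so class 4 may be rare or absent.
plain = df.sample(n=50, random_state=0)
print(plain["target"].value_counts())

# Sampling within each class guarantees that every class is represented.
per_class = df.groupby("target").sample(n=5, random_state=0)
print(per_class["target"].value_counts())
```

Per-class sampling changes the class proportions compared with the original data, so whether it is appropriate depends on what the sample is used for.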

