How to read and extract data from a .vec file in Python?
f = open("test.vec", "r")  # opens the file named "test.vec"
print(f.read())
f.close()
But I can't extract the information. I want the data in the test.vec file to be stored in individual arrays.
This is my dataset: https://www.kaggle.com/datasets/yekenot/fasttext-crawl-300d-2m
It is the Common Crawl 4.2 GB .vec file. Since the file is too big to display in an IDE, I read it line by line and export to CSV (17 MB).
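import io
import pandas as pd

def load_vectors(fname):
    all = []  # collects the first token (the word) of every line
    with io.open(fname, 'r', encoding='utf-8', newline='\n', errors='ignore') as fin:
        fin.readline()  # skip the header line: "<number of words> <dimension>"
        for line in fin:
            x = line.split()
            if x:
                all.append(x[0])  # x[0] is the word, x[1:] are the vector values
    df = pd.DataFrame(all)
    df.to_csv('.../output/ft.csv', index=False)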
Call the function here:
FASTTEXT_DATASET_PATH = '/content/drive/MyDrive/Colab Notebooks/pretrained/crawl-300d-2M.vec'
load_vectors(FASTTEXT_DATASET_PATH)
The data has dimensions (1999995, 300): 1,999,995 words, each with a 300-dimensional vector.
Here I print the first line: [',', '-0.0282', '-0.0557', ... '-0.0042']
In my case, I just want to export the first element of every list, so I append x[0] to a list named 'all'. Then I convert it to a DataFrame and export it to a CSV file.
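If you want the vectors too, and not only the first token of each line, here is a minimal sketch that keeps the words and their vectors in two parallel lists, which is closer to the "individual arrays" the question asks for. It assumes the usual fastText text layout (a header line with the word count and the dimension, then one word plus its values per line). The function name `load_vec`, the `limit` parameter, and the tiny sample file are made up for illustration; they are not part of the code above.

```python
import os
import tempfile
from array import array


def load_vec(path, limit=None):
    """Read a .vec text file into a list of words and a parallel list of vectors.

    Assumed layout: a header line "<count> <dim>", then one line per word
    holding the word followed by <dim> space-separated floats.
    `limit` (hypothetical) stops after that many words to save memory.
    """
    words, vectors = [], []
    with open(path, "r", encoding="utf-8", newline="\n", errors="ignore") as fin:
        count, dim = map(int, fin.readline().split())  # header line
        for line in fin:
            if limit is not None and len(words) >= limit:
                break
            tokens = line.rstrip().split(" ")
            if len(tokens) != dim + 1:
                continue  # skip lines that do not match the declared dimension
            words.append(tokens[0])
            # array("f") stores compact 32-bit floats instead of Python objects
            vectors.append(array("f", map(float, tokens[1:])))
    return words, vectors


# Tiny made-up sample in the same layout, so the sketch runs without the 4.2 GB file.
sample = "3 2\n, 0.5 -0.25\nthe 1.0 0.0\n. -0.5 0.75\n"
with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "sample.vec")
    with open(sample_path, "w", encoding="utf-8", newline="\n") as f:
        f.write(sample)
    words, vectors = load_vec(sample_path, limit=2)

print(words)             # [',', 'the']
print(list(vectors[0]))  # [0.5, -0.25]
```

For the real file you would call something like load_vec(FASTTEXT_DATASET_PATH, limit=100000). Loading all 2 million vectors at once needs a few GB of RAM even with compact arrays, so starting with a limit is probably the safer choice.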
For those interested in what the FastText pretrained dataset looks like, I've uploaded it to Kaggle. Details of the dataset: crawl-300d-2M.vec.zip: 2 million word vectors trained on Common Crawl (600B tokens), cased.
with open("file.txt", "r") as ins:
    array = []
    for line in ins:
        array.append(line)
Try this one. It is a bit complicated. Otherwise, try this simpler one:
with open('filename') as f:
    lines = f.readlines()
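Both snippets give you raw text lines, and readlines() pulls the whole file into memory, which may be a problem with a multi-gigabyte .vec file. As a rough sketch, assuming the file has a header line followed by "word value value ..." lines, you could parse lazily with a generator. The name `iter_vec` and the in-memory sample are hypothetical; with a real file you would pass the open file object instead of the StringIO.

```python
import io
from itertools import islice


def iter_vec(lines):
    """Yield (word, [floats]) pairs from an iterable of .vec text lines.

    Assumes the first line is the "<count> <dim>" header and skips it.
    """
    lines = iter(lines)
    next(lines, None)  # skip the header line
    for line in lines:
        tokens = line.rstrip().split(" ")
        if len(tokens) < 2:
            continue  # skip blank or malformed lines
        yield tokens[0], [float(t) for t in tokens[1:]]


# In-memory stand-in for open("test.vec", encoding="utf-8"); the values are made up.
fake_file = io.StringIO("2 3\nhello 0.1 0.2 0.3\nworld -0.4 0.5 0.6\n")

# islice takes only the first entry, so the rest of the file is never parsed.
first = list(islice(iter_vec(fake_file), 1))
print(first)  # [('hello', [0.1, 0.2, 0.3])]
```

Because the generator reads one line at a time, you can stop early or filter for the words you care about without holding the whole file in memory.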