
How to read and extract data from .vec file in python

How to read and extract data from a .vec file in Python?

with open("test.vec", "r") as f:  # opens the file named "test.vec"
    print(f.read())               # prints the whole file as one string

But I can't extract the information. I want the data in the test.vec file to be stored in individual arrays.

I think you can get some inspiration from this project here. The important part for you starts at line 131, i.e.,

import struct

...
with open(f, 'rb') as vecfile:
    content = vecfile.read()  # raw bytes; struct.unpack needs bytes in Python 3
    val = struct.unpack('<iihh', content[:12])  # first 12 bytes: two int32, two int16
...
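To make the header step concrete, here is a minimal sketch. It assumes the file is a binary .vec whose first 12 bytes hold two little-endian 32-bit integers followed by two 16-bit integers, which is all that the format string '<iihh' tells us. The function name read_vec_header and the field names are hypothetical labels, not taken from the linked project, and the demo bytes are made up.

```python
import struct

HEADER_FORMAT = '<iihh'                       # little-endian: int32, int32, int16, int16
HEADER_SIZE = struct.calcsize(HEADER_FORMAT)  # 4 + 4 + 2 + 2 bytes


def read_vec_header(raw):
    """Unpack the leading header bytes of a binary .vec file (field meaning assumed)."""
    if len(raw) < HEADER_SIZE:
        raise ValueError("need at least %d bytes, got %d" % (HEADER_SIZE, len(raw)))
    first_int, second_int, short_a, short_b = struct.unpack(
        HEADER_FORMAT, raw[:HEADER_SIZE]
    )
    return {
        "first_int": first_int,
        "second_int": second_int,
        "short_a": short_a,
        "short_b": short_b,
    }


# Made-up header bytes standing in for the start of a real file.
fake_bytes = struct.pack(HEADER_FORMAT, 10, 576, 0, 0) + b"\x00\x00\x00\x00"
header = read_vec_header(fake_bytes)
print(header)
```

With a real file, the same function could be called on the bytes returned by reading the file opened in 'rb' mode.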

This is my dataset: https://www.kaggle.com/datasets/yekenot/fasttext-crawl-300d-2m

It is the Common Crawl 4.2 GB .vec file.

Since the file is too big to display in an IDE, I read it line by line and exported it to CSV (17 MB).

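import io
import pandas as pd

def load_vectors(fname):
    all = []  # note: this name shadows the built-in all()
    # The with-block closes the file even if an error occurs.
    with io.open(fname, 'r', encoding='utf-8', newline='\n', errors='ignore') as fin:
        for line in fin:
            x = line.split()
            if not x:  # stop at the first empty line, as the original loop did
                break
            all.append(x[0])  # keep only the first token of each line
    df = pd.DataFrame(all)
    df.to_csv('.../output/ft.csv', index=False)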

Call the function here:

FASTTEXT_DATASET_PATH = '/content/drive/MyDrive/Colab Notebooks/pretrained/crawl-300d-2M.vec'
load_vectors(FASTTEXT_DATASET_PATH)

The dimension of x is (1999995, 300).

Here I print the first line: [',', '-0.0282', '-0.0557', ... '-0.0042']

In my case, I just want to export the first element of every list, so I append x[0] to a list named 'all'. Then I convert it to a DataFrame and export it to a CSV file.
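If the goal is to keep the numbers as well as the words, a sketch along these lines may help. It assumes a text layout in which the first line holds two integers (number of words and vector size) and every following line holds a word followed by its values, all separated by spaces. The names parse_vec and limit are hypothetical, and the tiny in-memory sample stands in for the real file.

```python
import io


def parse_vec(lines, limit=None):
    """Return (words, vectors) from an iterable of text lines in the assumed layout."""
    lines = iter(lines)
    _count, dim = map(int, next(lines).split())  # header: word count and vector size
    words, vectors = [], []
    for line in lines:
        if limit is not None and len(words) >= limit:
            break                                # stop early on large files
        tokens = line.rstrip().split(' ')
        if len(tokens) != dim + 1:
            continue                             # skip empty or malformed lines
        words.append(tokens[0])
        vectors.append([float(t) for t in tokens[1:]])
    return words, vectors


# Small in-memory sample; the values are illustrative only.
sample = io.StringIO(
    "3 2\n"
    ", -0.0282 -0.0557\n"
    "the 0.0231 0.0170\n"
    ". 0.0033 -0.0042\n"
)
words, vectors = parse_vec(sample, limit=2)
print(words)
print(vectors)
```

For a real file, an open text file object can be passed in place of the sample, and limit keeps memory use bounded while experimenting. The two lists can then be turned into a DataFrame or an array as needed.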

For those who are interested in seeing what the FastText pretrained dataset looks like, I've uploaded it to Kaggle. Details of the dataset: crawl-300d-2M.vec.zip: 2 million word vectors trained on Common Crawl (600B tokens) - Cased

with open("file.txt", "r") as ins:
    array = []
    for line in ins:
        array.append(line)

Try this one. It is a bit complicated. Otherwise, try this simpler one.

with open('filename') as f:
    lines = f.readlines()
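Note that readlines() loads every line into memory at once, which may be impractical for a multi-gigabyte file. A minimal sketch of reading only the first few lines with itertools.islice follows; the helper name head and the temporary demo file are assumptions made for illustration.

```python
import itertools
import os
import tempfile


def head(path, n=5):
    """Return the first n lines of a text file without reading the rest."""
    with open(path, 'r', encoding='utf-8', errors='ignore') as f:
        return [line.rstrip('\n') for line in itertools.islice(f, n)]


# Demo on a small temporary file standing in for a large .vec file.
with tempfile.NamedTemporaryFile(
    'w', suffix='.vec', delete=False, encoding='utf-8'
) as tmp:
    tmp.write("3 2\n, -0.0282 -0.0557\nthe 0.0231 0.0170\n. 0.0033 -0.0042\n")
    demo_path = tmp.name

first_two = head(demo_path, n=2)
print(first_two)
os.remove(demo_path)  # clean up the demo file
```

Because the file object is iterated lazily, only the requested lines are read from disk.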
