Append to dataframe from different file directories, reading from .tsv files
I am trying to read text_A and text_B from .tsv files and append them to a DataFrame. I am adapting code from a TensorFlow tutorial.
Here is my adapted code:
from absl import logging
import tensorflow as tf
import os
import pandas as pd
import csv

def load_directory_data(directory):
    data = {}
    data["text_A"] = []
    data["text_B"] = []
    for file_path in os.listdir(directory):
        with open(os.path.join(directory, file_path), "r", encoding='utf-8') as csvfile:
            texts = 0
            texts = csv.reader(csvfile, delimiter="\t", quotechar='"')
            for text in texts:
                print(text[0])
                # I want to append here
    return pd.DataFrame.from_dict(data)

# Merge examples, add similarity
def load_dataset(directory):
    sa_df = load_directory_data(os.path.join(directory, "sa"))
    s_df = load_directory_data(os.path.join(directory, "s"))
    ns_df = load_directory_data(os.path.join(directory, "ns"))
    sa_df["similarity"] = 2
    s_df["similarity"] = 1
    ns_df["similarity"] = 0
    return pd.concat([sa_df, s_df, ns_df]).sample(frac=1).reset_index(drop=True)

# Download and process the dataset files.
def download_and_load_datasets(force_download=False):
    dataset = tf.keras.utils.get_file(
        fname="tfm_dataset.tar.gz",
        origin="file:///mypath/tfm_dataset.tar.gz",
        extract=True)
    train_df = load_dataset(os.path.join(os.path.dirname(dataset),
                                         "tfm_dataset", "train"))
    test_df = load_dataset(os.path.join(os.path.dirname(dataset),
                                        "tfm_dataset", "test"))
    return train_df, test_df

# Reduce logging output.
logging.set_verbosity(logging.ERROR)
train_df, test_df = download_and_load_datasets()
train_df.head()
The directories I am reading from have the following structure:
test/sa:
01_02.tsv
03_04.tsv
.
.
11_12.tsv
test/s:
13_14.tsv
.
.
17_18.tsv
test/ns:
19_20.tsv
.
.
29_30.tsv
The train directory has a similar structure.
The .tsv files are formatted like the example below:
"Text A, it could be the story about a black dog." "Text B, it could be a story about a bee."
When I call print(text[0]), it prints all the texts in the "A" position, up to the last file in test/sa. Then I get this error:
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-49-f5de8939de2e> in <module>
43 logging.set_verbosity(logging.ERROR)
44
---> 45 train_df, test_df = download_and_load_datasets()
46 train_df.head(30)
<ipython-input-49-f5de8939de2e> in download_and_load_datasets(force_download)
34
35 train_df = load_dataset(os.path.join(os.path.dirname(dataset),
---> 36 "tfm_dataset", "train"))
37 test_df = load_dataset(os.path.join(os.path.dirname(dataset),
38 "tfm_dataset", "test"))
<ipython-input-49-f5de8939de2e> in load_dataset(directory)
18 # Merge positive and negative examples, add a polarity column and shuffle.
19 def load_dataset(directory):
---> 20 sa_df = load_directory_data(os.path.join(directory, "sa"))
21 s_df = load_directory_data(os.path.join(directory, "s"))
22 ns_df = load_directory_data(os.path.join(directory, "ns"))
<ipython-input-49-f5de8939de2e> in load_directory_data(directory)
11 texts = csv.reader(csvfile, delimiter="\t", quotechar='"')
12 for text in texts:
---> 13 print(text[0])
14
15
IndexError: list index out of range
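For context, one way text[0] can raise this error is a completely blank line: csv.reader yields an empty list for it, so there is no element at index 0. The following is a minimal sketch using a made-up in-memory sample (the `sample` string is illustrative and not taken from the dataset):

```python
import csv
import io

# Hypothetical sample: one valid tab-separated row followed by a blank line.
sample = '"Text A"\t"Text B"\n\n'

rows = list(csv.reader(io.StringIO(sample), delimiter="\t", quotechar='"'))

for row in rows:
    if not row:
        # An empty list has no index 0, so row[0] would raise IndexError here.
        print("blank row found")
    else:
        print(row[0])
```

Whether the real files fail for this reason depends on their contents, so this only shows the mechanism.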
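I need to iterate over all the files in the different directories without errors, so that I can append their texts and build a pandas DataFrame.
The line "# I want to append here" in my code will be replaced by these two statements: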
data["text_A"].append(text[0])
data["text_B"].append(text[1])
Any suggestions?
Thank you very much.
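One possible way to make the loop tolerant of blank or short rows is to check the row length before appending. This is only a sketch, assuming that rows with fewer than two fields are noise that can be skipped safely; the variable names `file_name`, `tsv_file`, and `row` are my own and not from the tutorial:

```python
import csv
import os

import pandas as pd


def load_directory_data(directory):
    """Read every file in `directory` into a DataFrame with two text columns."""
    data = {"text_A": [], "text_B": []}
    for file_name in sorted(os.listdir(directory)):
        path = os.path.join(directory, file_name)
        # newline="" is the mode the csv module documentation recommends.
        with open(path, "r", encoding="utf-8", newline="") as tsv_file:
            reader = csv.reader(tsv_file, delimiter="\t", quotechar='"')
            for row in reader:
                # Assumption: rows with fewer than two fields (for example
                # blank lines) carry no data and are skipped.
                if len(row) < 2:
                    continue
                data["text_A"].append(row[0])
                data["text_B"].append(row[1])
    return pd.DataFrame.from_dict(data)
```

Skipping rows silently can hide real data problems, so it may be preferable to log or count the skipped rows instead.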
I found that some of the files had trailing tab whitespace at the end. I used this code to locate them:
with open(os.path.join(directory, file_path), "r", encoding='utf-8') as csvfile:
    texts = 0
    texts = csv.reader(csvfile, delimiter="\t", quotechar='"')
    for text in texts:
        print(directory, file_path, text)
Then I fixed the bad files by deleting the extra whitespace.
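Rather than printing every row, the same check could be narrowed to report only the suspicious rows. The helper below is a hypothetical sketch (the name `find_bad_rows` and the `expected_fields` parameter are my own), assuming every valid row has exactly two fields:

```python
import csv
import os


def find_bad_rows(directory, expected_fields=2):
    """Return (file_name, line_number, row) for rows with an unexpected field count."""
    bad_rows = []
    for file_name in sorted(os.listdir(directory)):
        path = os.path.join(directory, file_name)
        with open(path, "r", encoding="utf-8", newline="") as tsv_file:
            reader = csv.reader(tsv_file, delimiter="\t", quotechar='"')
            for row in reader:
                if len(row) != expected_fields:
                    # reader.line_num counts the source lines read so far.
                    bad_rows.append((file_name, reader.line_num, row))
    return bad_rows
```

Calling it on each of the sa, s, and ns directories would list the files and line numbers that need cleaning before the DataFrame is built.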