简体   繁体   English

从 Github 在 Google Colab 中使用 Tensorflow 导入文本文件

[英]Importing text files with Tensorflow in Google Colab from Github

I am trying to load and preprocess text documents in Google Colab using Tensorflow and I seem to be having problems importing my text files.我正在尝试使用 Tensorflow 在 Google Colab 中加载和预处理文本文档,但我似乎在导入文本文件时遇到了问题。

Based on Example 2 of the tutorial, I ran基于本教程的示例 2 ,我运行

DIRECTORY_URL = 'Github URL'
FILE_NAMES = ['file1.txt', 'file2.txt', 'file3.txt', 'file4.txt']

for name in FILE_NAMES:
  text_dir = utils.get_file(name, origin=DIRECTORY_URL + name)

parent_dir = pathlib.Path(text_dir).parent
list(parent_dir.iterdir())

However, I'm getting these odd results但是,我得到了这些奇怪的结果

Sentence:  b'            name-with-owner="YWtlNzAwL1B5dGhvbg=="'
Label: 3
Sentence:  b'        </tr>'
Label: 0
Sentence:  b'        </tr>'
Label: 0

Am I correct in assuming that Tf is reading in the raw text files from GitHub?我是否正确假设 Tf 正在从 GitHub 读取原始文本文件? I went and tried the first example in the above tutorial by uploading a zipped folder of the text files and trying to unzip via Tf in Colab, then reading the files.我通过上传文本文件的压缩文件夹并尝试通过 Colab 中的 Tf 解压缩,然后读取文件,尝试了上述教程中的第一个示例。

data_url = 'GithubUrlWithZip.7z'

dataset_dir = utils.get_file(
    origin=data_url,
    extract=True,
    cache_dir=None,
    cache_subdir='')

dataset_dir = pathlib.Path(dataset_dir).parent
text_dir = dataset_dir/"datasets"
list(text_dir.iterdir())

but the text files did not read well when I checked.但是当我检查时,文本文件读得不好。

sample_file = text_dir/"file1.txt"

with open(sample_file) as f:
  print(f.read())

The files reads as such文件是这样读取的

<!DOCTYPE html>
<html lang="en" data-color-mode="auto" data-light-theme="light" data-dark-theme="dark" data-a11y-animated-images="system">
  <head>
    <meta charset="utf-8">
  <link rel="dns-prefetch" href="https://github.githubassets.com">
  <link rel="dns-prefetch" href="https://avatars.githubusercontent.com">
  <link rel="dns-prefetch" href="https://github-cloud.s3.amazonaws.com">
  <link rel="dns-prefetch" href="https://user-images.githubusercontent.com/">
  <link rel="preconnect" href="https://github.githubassets.com" crossorigin>
  <link rel="preconnect" href="https://avatars.githubusercontent.com">

  <link crossorigin="anonymous" media="all" integrity="sha512-ksfTgQOOnE+FFXf+yNfVjKSlEckJAdufFIYGK7ZjRhWcZgzAGcmZqqArTgMLpu90FwthqcCX4ldDgKXbmVMeuQ==" rel="stylesheet" href="https://github.githubassets.com/assets/light-92c7d381038e.css" /><link crossorigin="anonymous" media="all" integrity="sha512-1KkMNn8M/al/dtzBLupRwkIOgnA9MWkm8oxS+solP87jByEvY/g4BmoxLihRogKcX1obPnf4Yp7dI0ZTWO+ljg==" rel="stylesheet" href="https://github.githubassets.com/assets/dark-d4a90c367f0c.css" /><link data-color-theme="dark_dimmed" crossorigin="anonymous" media="all" integrity="sha512-cZa7DZqvMBwD236uzEunO/G1dvw8/QftyT2UtLWKQFEy0z0eq0R5WPwqVME+3NSZG1YaLJAaIqtU+m0zWf/6SQ==" rel="stylesheet" data-href="https://github.githubassets.com/assets/dark_dimmed-7196bb0d9aaf.css" /><link data-color-theme="dark_high_contrast" crossorigin="anonymous" media="all" integrity="sha512-WVoKqJ4y1nLsdNH4RkRT5qrM9+n9RFe1RHSiTnQkBf5TSZkJEc9GpLpTIS7T15EQaUQBJ8BwmKvwFPVqfpTEIQ==" rel="stylesheet" data-href="https://github.githubassets.com/assets/dark_high_contrast-595a0aa89e32.css" /><link data-color-theme="dark_colorblind" crossorigin="anonymous" media="all" integrity="sha512-XpAMBMSRZ6RTXgepS8LjKiOeNK3BilRbv8qEiA/M3m+Q4GoqxtHedOI5BAZRikCzfBL4KWYvVzYZSZ8Gp/UnUg==" rel="stylesheet" data-href="https://github.githubassets.com/assets/dark_colorblind-5e900c04c491.css" /><link data-color-theme="light_colorblind" crossorigin="anonymous" media="all" integrity="sha512-3HF2HZ4LgEIQm77yOzoeR20CX1n2cUQlcywscqF4s+5iplolajiHV7E5ranBwkX65jN9TNciHEVSYebQ+8xxEw==" rel="stylesheet" data-href="https://github.githubassets.com/assets/light_colorblind-dc71761d9e0b.css" /><link data-color-theme="light_high_contrast" crossorigin="anonymous" media="all" integrity="sha512-+J8j3T0kbK9/sL3zbkCfPtgYcRD4qQfRbT6xnfOrOTjvz4zhr0M7AXPuE642PpaxGhHs1t77cTtieW9hI2K6Gw==" rel="stylesheet" data-href="https://github.githubassets.com/assets/light_high_contrast-f89f23dd3d24.css" /><link data-color-theme="light_tritanopia" crossorigin="anonymous" media="all" integrity="sha512-AQeAx5wHQAXNf0DmkvVlHYwA3f6BkxunWTI0GGaRN57GqD+H9tW8RKIKlopLS0qGaC54seFsPc601GDlqIuuHg==" rel="stylesheet" data-href="https://github.githubassets.com/assets/light_tritanopia-010780c79c07.css" /><link data-color-theme="dark_tritanopia" crossorigin="anonymous" media="all" integrity="sha512-+u5pmgAE0T03d/yI6Ha0NWwz6Pk0W6S6WEfIt8veDVdK8NTjcMbZmQB9XUCkDlrBoAKkABva8HuGJ+SzEpV1Uw==" rel="stylesheet" data-href="https://github.githubassets.com/assets/dark_tritanopia-faee699a0004.css" />

I also tried having the local files into my Google Drive, but for some reason I had issues reading in the text files.我还尝试将本地文件放入我的 Google Drive,但由于某种原因,我在读取文本文件时遇到了问题。 In the future if I'm working with larger and more numerous text data, I presume I wouldn't be uploading them to Google Drive then reading the local files from Colab, so I want to properly learn how to import text files from Github or another source.将来,如果我要处理更大和更多的文本数据,我想我不会将它们上传到 Google Drive 然后从 Colab 读取本地文件,所以我想正确学习如何从 Github 导入文本文件或另一个来源。

The tutorials seem to have their files in Google API storage (I was using this Google Cloud link ), but when I tried uploading my files there, it was extremely slow.这些教程的文件似乎在 Google API 存储中(我使用的是这个Google Cloud 链接),但是当我尝试在那里上传文件时,速度非常慢。

Is there another method generally used for ML models like this or with text-based work?是否有另一种方法通常用于这样的 ML 模型或基于文本的工作?

Looks like you might be getting the webpage for GitHub instead of the actual "raw" file.看起来您可能正在获取 GitHub 的网页,而不是实际的“原始”文件。

You can get raw files using this kind of URL: https://raw.githubusercontent.com/mrdbourke/tensorflow-deep-learning/main/README.md您可以使用这种 URL 获取原始文件: https ://raw.githubusercontent.com/mrdbourke/tensorflow-deep-learning/main/README.md

Note the "raw" at the start of the URL.请注意 URL 开头的“原始”。

This is the original URL here: https://github.com/mrdbourke/tensorflow-deep-learning/blob/main/README.md这是此处的原始 URL: https ://github.com/mrdbourke/tensorflow-deep-learning/blob/main/README.md

Without the raw URL, the file won't download properly.如果没有原始 URL,文件将无法正确下载。

突出显示原始按钮的 GitHub 自述文件页面

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM