
Data cleaning file names in iBooks directory with Python

I'm trying to print a list of all the files in the specified directory that end in .pdf.

Once that is running, I want to expand it to print out the number of files that are named "unnamed document" or end in .pdf.pdf.pdf, which is a problem among the 1,200 or so books I've collected in iBooks.

After it prints out the .pdf files, I'm trying to get it to trim off the excess .pdf extensions and somehow prompt me to edit each file, which I'll have to do manually after reviewing the first few pages of each "unnamed document".

While I would love to have all the code spelled out for me, I would appreciate even more some hints or tips on how to go about learning to do this.

The directory location came from this page: https://www.idownloadblog.com/2018/05/24/ibooks-library-location-mac/

The script I started from came from here: Get list of pdf files in folder

When I run this currently I get EOF errors and type errors, so I'm asking for help on how to structure or revise this script as the start of a larger data-cleaning project.

While this can be done with regex, I'd prefer to do it in Python: Remove duplicate filename extensions

Thanks!

First version

#!/usr/bin/env python3

import os

all_files = []
# os.walk won't expand "~" on its own, so expand it explicitly;
# the inner "~" characters in the iCloud folder name are literal and untouched
for dirpath, dirnames, filenames in os.walk(os.path.expanduser("~/Library/Mobile Documents/iCloud~com~apple~iBooks/Documents")):
    for filename in [f for f in filenames if f.endswith(".pdf")]:
        all_files.append(os.path.join(dirpath, filename))

# print (files ending in .pdf.pdf.etc)

# trim file names with duplicate .pdf names

# print(files named "unnamed document")

End of the first version
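A minimal sketch of the three TODO comments above, reusing the all_files list the script builds (the "unnamed document" check and the y/N rename prompt are my assumptions about how the cleanup should behave, not something tested against a real iBooks library):

unnamed = []
dup_ext = []
for path in all_files:
    dirpath, base = os.path.split(path)
    trimmed = base
    while trimmed.endswith(".pdf.pdf"):  # collapse .pdf.pdf.pdf down to a single .pdf
        trimmed = trimmed[:-4]
    if trimmed != base:
        dup_ext.append(path)
        answer = input("Rename %r -> %r ? [y/N] " % (base, trimmed))
        if answer.lower() == "y":
            path = os.path.join(dirpath, trimmed)
            os.rename(os.path.join(dirpath, base), path)
    if trimmed.lower().startswith("unnamed document"):
        unnamed.append(path)

print("%d files with duplicated .pdf extensions" % len(dup_ext))
print("%d files named 'unnamed document'" % len(unnamed))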

Start of the second version

The second version: after reading a few other blogs, I've now realized that this is a relatively well-known and solved problem. So the script I've switched to was found online, from 2013; it uses hashes to compare the files, which it claims to do quite quickly. The script shown below just needs the name of a subdirectory passed to it in the terminal (where you will need to run it); then press Enter.

testmachine at testmachine-MacPro in ~sandbox/test 
$python3 dupFinder.py venv $testdirectory

results in

Duplicates Found:
The following files are identical. The name could differ, but the content is identical
___________________
        venv/bin/easy_install
        venv/bin/easy_install-3.6
___________________
        venv/bin/pip
        venv/bin/pip3
        venv/bin/pip3.6
___________________
        venv/bin/python
        venv/bin/python3
___________________
        venv/lib/python3.6/site-packages/six.py
        venv/lib/python3.6/site-packages/pip/_vendor/six.py
___________________
        venv/lib/python3.6/site-packages/wq-1.1.0-py3.6-nspkg.pth
        venv/lib/python3.6/site-packages/wq.app-1.1.0-py3.6-nspkg.pth
        venv/lib/python3.6/site-packages/wq.core-1.1.0-py3.6-nspkg.pth
        venv/lib/python3.6/site-packages/wq.db-1.1.2-py3.6-nspkg.pth
        venv/lib/python3.6/site-packages/wq.io-1.1.0-py3.6-nspkg.pth
        venv/lib/python3.6/site-packages/wq.start-1.1.1-py3.6-nspkg.pth

I really like the simplicity of this script, so I'm posting it here, with credit to [pythoncentral][1] for creating this baller code back in 2013. Six years on, it still runs and works flawlessly.

# dupFinder.py
import os, sys
import hashlib


def findDup(parentFolder):
    # Dups in format {hash:[names]}
    dups = {}
    for dirName, subdirs, fileList in os.walk(parentFolder):
        print('Scanning %s...' % dirName)
        for filename in fileList:
            # Get the path to the file
            path = os.path.join(dirName, filename)
            # Calculate hash
            file_hash = hashfile(path)
            # Add or append the file path
            if file_hash in dups:
                dups[file_hash].append(path)
            else:
                dups[file_hash] = [path]
    return dups


# Joins two dictionaries
def joinDicts(dict1, dict2):
    for key in dict2.keys():
        if key in dict1:
            dict1[key] = dict1[key] + dict2[key]
        else:
            dict1[key] = dict2[key]


def hashfile(path, blocksize=65536):
    # Hash in fixed-size blocks so large PDFs never have to fit in memory;
    # the with-statement guarantees the file is closed even if reading fails
    hasher = hashlib.md5()
    with open(path, 'rb') as afile:
        buf = afile.read(blocksize)
        while len(buf) > 0:
            hasher.update(buf)
            buf = afile.read(blocksize)
    return hasher.hexdigest()


def printResults(dict1):
    results = list(filter(lambda x: len(x) > 1, dict1.values()))
    if len(results) > 0:
        print('Duplicates Found:')
        print('The following files are identical. The name could differ, but the content is identical')
        print('___________________')
        for result in results:
            for subresult in result:
                print('\t\t%s' % subresult)
            print('___________________')

    else:
        print('No duplicate files found.')


if __name__ == '__main__':
    if len(sys.argv) > 1:
        dups = {}
        folders = sys.argv[1:]
        for i in folders:
            # Iterate the folders given
            if os.path.exists(i):
                # Find the duplicated files and append them to the dups
                joinDicts(dups, findDup(i))
            else:
                print('%s is not a valid path, please verify' % i)
                sys.exit()
        printResults(dups)
    else:
        print('Usage: python dupFinder.py folder or python dupFinder.py folder1 folder2 folder3')


  [1]: https://www.pythoncentral.io/finding-duplicate-files-with-python/
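To point it at the iBooks directory from the first version, something like this should work (the quoting matters, since the iCloud path contains spaces, and the embedded ~ characters are literal):

$ python3 dupFinder.py "$HOME/Library/Mobile Documents/iCloud~com~apple~iBooks/Documents"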

The third version and onward need to evolve a few things.

There is room for improvement in the UI, whether a terminal or headless GUI, and in saving the results to a log or CSV file. Eventually moving this to a Flask or Django app could prove beneficial. Since this is a PDF document scrubber, I could create a queue of files named "unnamed document"; the machine could log each file's hash, or create and save an index, so that the next run doesn't have to scan everything again and can just show me the "unnamed documents" that need work (see the sketch below). Work could be defined as scrubbing the name, deduping, finding the cover page, adding keywords, or even creating a queue file for the reader to actually read each document. Maybe there is an API for Goodreads? Any menu or GUI would need to include error handling, as well as some kind of cron job that saves results somewhere, plus the intelligence behind this so it starts to learn the steps you take over time.
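One way to sketch that saved-index idea (the file name dup_index.json and the {path: [mtime, hash]} layout are my own assumptions, not anything the script above defines): cache each file's hash keyed by its path and modification time, so the next run only re-hashes files that are new or have changed.

import json, os

INDEX_FILE = 'dup_index.json'   # hypothetical cache file from the previous run

def load_index():
    # Returns {path: [mtime, hash]} saved last time, or {} on the first run
    if os.path.exists(INDEX_FILE):
        with open(INDEX_FILE) as f:
            return json.load(f)
    return {}

def save_index(index):
    with open(INDEX_FILE, 'w') as f:
        json.dump(index, f, indent=2)

def hash_with_cache(path, index):
    # Re-hash only if the file is new or its mtime changed since the last run
    mtime = os.path.getmtime(path)
    entry = index.get(path)
    if entry and entry[0] == mtime:
        return entry[1]
    file_hash = hashfile(path)      # hashfile() as defined in dupFinder.py above
    index[path] = [mtime, file_hash]
    return file_hash

findDup() could then call hash_with_cache(path, index) instead of hashfile(path), and save_index(index) at the end of the run.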

Ideas?

I would suggest you have a look at the pathlib library.

  • To list out all the files with the .pdf extension:
from pathlib import Path

# Path.expanduser() is needed so the leading "~" resolves to the home directory
pdf_files = list(map(str, Path("~/Library/Mobile Documents/iCloud~com~apple~iBooks/Documents").expanduser().glob("**/*.pdf")))

print(pdf_files)
  • To trim an extra .pdf:
pdf_files = [x[:-4] if x.endswith('.pdf.pdf') else x for x in pdf_files]
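Note that the one-liner above strips only one duplicated extension, so book.pdf.pdf.pdf would become book.pdf.pdf. A small loop (my addition, not part of the original answer) keeps stripping until a single .pdf remains:

def trim_dup_pdf(name):
    # Keep removing 4 characters while the name still ends in a doubled ".pdf.pdf"
    while name.endswith('.pdf.pdf'):
        name = name[:-4]
    return name

pdf_files = [trim_dup_pdf(x) for x in pdf_files]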
