
Python: read all files from a folder and write the file name and other info into a txt file

I have 30911 html files. I need to scrape them and save the info into a txt file named index.txt. It should look like this:

filename1, title, t1, date, p1
filename2, title, t1, date, p1
filename3, title, t1, date, p2
and so on...

I only want the file name, but the output gives me path + filename.

You can use:

import glob
import os
import re

import bs4

path = 'C:/Users/.../.../output/'
index_path = 'C:/Users/.../output1/' + 'index.txt'


def find_participant(tag):
    # Match the <p> whose <strong> child is the participants heading
    return tag.name == 'p' and tag.find("strong", text=re.compile(r"Executives|Corporate Participants"))


# Open the index once; the with block closes it even if a file fails to parse
with open(index_path, 'a+') as indexFile:
    # read html files
    for filename in glob.glob(os.path.join(path, '*.html')):
        with open(filename) as html:
            soup = bs4.BeautifulSoup(html.read(), "lxml")
        title = soup.find('h1')
        ticker = soup.find('p')
        d_date = soup.find_all('div', {"id": "a-body"})[0].find_all("p")[2]

        participants = soup.find(find_participant)
        if participants is None:
            # No participants heading in this file
            parti_names = 'No participants'
        else:
            parti_names = ""
            for parti in participants.find_next_siblings("p"):
                # Stop at the "Operator" heading that follows the participant list
                if parti.find("strong", text=re.compile(r"(Operator)")):
                    break
                parti_names += parti.get_text(strip=True) + ","

        # os.path.basename strips the directory and leaves only the file name
        indexFile.write(os.path.basename(filename) + ', ' + title.get_text(strip=True) + ', ' + ticker.get_text(strip=True) + ', ' + d_date.get_text(strip=True) + ', ' + parti_names + '\n')

Your problem is that filename actually holds the full file path. To get just the file name, use the os module:

os.path.basename('filepath')
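
As a quick check of what basename returns, here is a minimal sketch. The path is a made-up example, not one from the question; it uses forward slashes so the result should be the same on Windows and on POSIX systems.

```python
import os

# Hypothetical path, for illustration only
filepath = 'C:/data/output/file1.html'

# basename drops everything up to and including the last separator
name = os.path.basename(filepath)
print(name)  # file1.html

# splitext additionally separates the extension, if only the stem is wanted
stem, ext = os.path.splitext(name)
print(stem, ext)  # file1 .html
```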

So, to write to the file:

indexFile.write(os.path.basename(filename)+ ', ' + title.get_text(strip=True) + ', '+ ticker.get_text(strip=True) + ', ' + d_date.get_text(strip=True) + ', ' + parti_names + '\n')
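
The long concatenation may be easier to read if the line is assembled by a small helper. The sketch below is one possible way to do it, not the original author's code: index_line is a hypothetical name, and the field values are placeholders rather than real scraped data.

```python
import os


def index_line(filepath, *fields):
    """Build one index.txt line: bare file name first, then the other fields."""
    # index_line is a hypothetical helper, not part of the original code
    return ', '.join([os.path.basename(filepath), *fields]) + '\n'


# Placeholder values standing in for the scraped title, ticker, date and participants
line = index_line('C:/data/output/file1.html', 'title', 't1', 'date', 'p1')
print(line, end='')  # file1.html, title, t1, date, p1
```

With a helper like this, the call inside the loop would shrink to indexFile.write(index_line(filename, ...)) with the four get_text results passed as arguments.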

ntpath is another module that can extract the base name from a path. It is the Windows implementation of os.path and can be imported on any platform.

>>> import ntpath
>>> ntpath.basename('C:/Users/.../output1/' + 'index.txt')
'index.txt'
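
The difference from os.path only matters when the path uses backslashes and the script runs on a non-Windows system. A small sketch with a made-up path, comparing ntpath with posixpath (the module behind os.path on Linux and macOS):

```python
import ntpath
import posixpath

# Hypothetical Windows-style path with backslashes
win_path = r'C:\data\output\file1.html'

# ntpath treats both \ and / as separators, on any platform
nt_name = ntpath.basename(win_path)
print(nt_name)  # file1.html

# posixpath only splits on /, so nothing is stripped here
posix_name = posixpath.basename(win_path)
print(posix_name)  # C:\data\output\file1.html
```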
