Python: read all files from a folder and write the filenames and other information to a txt file

Question

I have 30911 html files. I need to scrape them and save the information to a txt file named index.txt. It should look like

filename1, title, t1, date, p1
filename2, title, t1, date, p1
filename3, title, t1, date, p2
and so on...

I only want the filename, but the output gives me path + filename.

Answer 1

You can use:

import glob
import os
import re

import bs4

path = 'C:/Users/.../.../output/'
index_path = 'C:/Users/.../output1/' + 'index.txt'

def find_participant(tag):
    # Match the <p> holding the "Executives" / "Corporate Participants" heading.
    return tag.name == 'p' and tag.find("strong", string=re.compile(r"Executives|Corporate Participants"))

# Open the index once in append mode; the with block closes it even on error.
with open(index_path, 'a+') as indexFile:
    # read html files
    for filename in glob.glob(os.path.join(path, '*.html')):
        with open(filename) as html_file:
            soup = bs4.BeautifulSoup(html_file.read(), "lxml")
        title = soup.find('h1')
        ticker = soup.find('p')
        d_date = soup.find_all('div', {"id": "a-body"})[0].find_all("p")[2]

        participants = soup.find(find_participant)
        if participants is None:
            # No participants heading in this file.
            parti_names = 'No participants'
        else:
            parti_names = ""
            # Collect names until the "Operator" paragraph is reached.
            for parti in participants.find_next_siblings("p"):
                if parti.find("strong", string=re.compile(r"(Operator)")):
                    break
                parti_names += parti.get_text(strip=True) + ","

        # os.path.basename drops the directory part and keeps only the file name.
        indexFile.write(os.path.basename(filename) + ', ' + title.get_text(strip=True) + ', ' + ticker.get_text(strip=True) + ', ' + d_date.get_text(strip=True) + ', ' + parti_names + '\n')
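
As a side note, the line written to index.txt can be built in a small helper, which makes the basename handling easy to check without any HTML files. This is only a sketch: format_index_line and its parameter names are hypothetical and not part of the original answer, and it assumes the title, ticker, date and participants text has already been extracted.

```python
import os


def format_index_line(filepath, title, ticker, date, participants=""):
    """Build one comma-separated index.txt line from already-extracted text."""
    # Keep only the file name; the directory part of the path is dropped.
    name = os.path.basename(filepath)
    # Fall back to a placeholder when no participant names were collected.
    people = participants if participants else "No participants"
    return ", ".join([name, title, ticker, date, people]) + "\n"


# Hypothetical sample values, mirroring the layout shown in the question.
line = format_index_line("output/filename1.html", "title", "t1", "date", "p1,")
print(line, end="")
```

Inside the loop, the helper would be called with filename and the get_text(strip=True) results, and its return value passed to indexFile.write.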

Answer 2

Your problem is that filename is actually the file path. To get just the file name, you can use the os module:

os.path.basename('filepath')

So, to write to the file:

indexFile.write(os.path.basename(filename)+ ', ' + title.get_text(strip=True) + ', '+ ticker.get_text(strip=True) + ', ' + d_date.get_text(strip=True) + ', ' + parti_names + '\n')
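
To see why the path shows up in the first place: glob.glob returns each match joined to the folder you searched, so every entry carries its directory. A minimal sketch, using a temporary folder and made-up file names (html_basenames is a hypothetical helper, not something from the question):

```python
import glob
import os
import tempfile


def html_basenames(folder):
    """Return the sorted file names (no directory part) of *.html files in folder."""
    paths = glob.glob(os.path.join(folder, "*.html"))
    return sorted(os.path.basename(p) for p in paths)


# Work in a throwaway folder so the sketch does not touch real data.
with tempfile.TemporaryDirectory() as folder:
    for name in ("filename1.html", "filename2.html", "notes.txt"):
        with open(os.path.join(folder, name), "w") as fh:
            fh.write("<h1>title</h1>")
    # glob results include the folder; basename strips it again.
    full_paths = glob.glob(os.path.join(folder, "*.html"))
    names = html_basenames(folder)

print(names)
```

Each entry of full_paths starts with the temporary folder, while names holds only the file names, which is what the index file should contain.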

Answer 3

ntpath is another module for getting the base name from a path.

>>> import ntpath
>>> ntpath.basename('C:/Users/.../output1/' + 'index.txt')
'index.txt'
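
For context, ntpath is the Windows implementation of os.path, so it treats both / and \ as separators regardless of the platform the script runs on, whereas posixpath only splits on /. A small sketch of the difference (the path below is a made-up example):

```python
import ntpath
import posixpath

# Hypothetical Windows-style path written with backslashes.
windows_style = r"C:\Users\me\output1\index.txt"

# ntpath understands backslashes on any platform.
nt_name = ntpath.basename(windows_style)
# posixpath sees no '/' separator, so the whole string counts as the name.
posix_name = posixpath.basename(windows_style)

print(nt_name)
print(posix_name)
```

This only matters when the paths may contain backslashes and the script may run on a non-Windows machine; on Windows, os.path.basename already behaves like ntpath.basename.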

3 answers

Answer 1
1 2017-05-29 07:14:31

Answer 2
1 accepted 2017-05-29 07:17:16

Answer 3
0 2017-05-29 07:31:40
