BeautifulSoup - scraping a forum page

I'm trying to scrape a forum discussion and export it as a csv file, with rows like "thread title", "user", and "post", where the last one is each person's actual forum post.

I'm a beginner with Python and BeautifulSoup, so I'm really struggling with this!

My current problem is that in the csv file all the text gets split up with one character per row. Is there anyone out there who can help me? It would be fantastic if someone could lend me a hand!

Here's the code I've been using:

from bs4 import BeautifulSoup
import csv
import urllib2

f = urllib2.urlopen("https://silkroad5v7dywlc.onion.to/index.php?action=printpage;topic=28536.0")

soup = BeautifulSoup(f)

b = soup.get_text().encode("utf-8").strip() #the posts contain non-ascii words, so I had to do this

writer = csv.writer(open('silkroad.csv', 'w'))
writer.writerows(b)

Ok, here we go. Not really sure what I'm helping you do here, but hopefully you have good reasons to be analyzing Silk Road posts.

There are a few problems here, the biggest of which is that you aren't parsing the data at all. What you're essentially doing with .get_text() is going to the page, highlighting the whole thing, and copying and pasting the entire thing into a csv file.
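You can see the one-character-per-row behavior in a minimal sketch (Python 3 here, writing to an in-memory buffer instead of a file): csv.writer.writerows() expects an iterable of rows, so handing it a plain string makes every single character its own row.

```python
import csv
import io

buf = io.StringIO()
writer = csv.writer(buf)

# a string is an iterable of characters, so writerows()
# turns each character into its own one-field row
writer.writerows("post")

print(buf.getvalue())  # one character per line: p, o, s, t
```

This is exactly what happens to the big string that get_text() returns in the code above.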

So here's what you should be trying to do:

  1. Read the page source
  2. Use soup to break it into the sections you want
  3. Save the sections in parallel arrays for author, date, time, post, etc.
  4. Write the data to the csv file row by row
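The four steps above can be sketched on an inline snippet of HTML (the markup below is a made-up, trimmed-down stand-in for the forum's print-page layout, and Python 3 is used here):

```python
from bs4 import BeautifulSoup
import csv
import io

# made-up, trimmed-down stand-in for the forum's print-page markup
html = """
<dl>
  <dt>Title: Hello world Post by: alice on Jan 01, 2014</dt>
  <dd>First post body</dd>
</dl>
"""

# steps 1-2: read the source and break it into sections with soup
soup = BeautifulSoup(html, "html.parser")

# step 3: save the sections in parallel structures
rows = []
for dt, dd in zip(soup.find_all("dt"), soup.find_all("dd")):
    text = dt.get_text()
    title = text.split("Title:")[1].split("Post by:")[0].strip()
    author = text.split("Post by:")[1].split(" on ")[0].strip()
    rows.append([title, author, dd.get_text().strip()])

# step 4: write the data row by row
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["Title", "Author", "Post"])
writer.writerows(rows)
print(buf.getvalue())
```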

I wrote some code to show you what that would look like, and it should do the job:

from bs4 import BeautifulSoup
import csv
import urllib2

# get page source and create a BeautifulSoup object based on it
print "Reading page..."
page = urllib2.urlopen("https://silkroad5v7dywlc.onion.to/index.php?action=printpage;topic=28536.0")
soup = BeautifulSoup(page)

# if you look at the HTML all the titles, dates, 
# and authors are stored inside of <dt ...> tags
metaData = soup.find_all("dt")

# likewise the post data is stored
# under <dd ...>
postData = soup.find_all("dd")

# define where we will store info
titles = []
authors = []
times = []
posts = []

# now we iterate through the metaData and parse it
# into titles, authors, and dates
print "Parsing data..."
for html in metaData:
    text = BeautifulSoup(str(html).strip()).get_text().encode("utf-8").replace("\n", "") # convert the html to text
    titles.append(text.split("Title:")[1].split("Post by:")[0].strip()) # get Title:
    authors.append(text.split("Post by:")[1].split(" on ")[0].strip()) # get Post by:
    times.append(text.split(" on ")[1].strip()) # get date

# now we go through the actual post data and extract it
for post in postData:
    posts.append(BeautifulSoup(str(post)).get_text().encode("utf-8").strip())

# now we write data to csv file
# ***csv files MUST be opened with the 'b' flag***
csvfile = open('silkroad.csv', 'wb')
writer = csv.writer(csvfile)

# create template
writer.writerow(["Time", "Author", "Title", "Post"])

# iterate through and write all the data
for time, author, title, post in zip(times, authors, titles, posts):
    writer.writerow([time, author, title, post])


# close file
csvfile.close()

# done
print "Operation completed successfully."

EDIT: Included a solution that can read files from a directory and use the data from them

Ok, so you have your HTML files in a directory. You need to get a list of the files in the directory, iterate through them, and append to your csv file for each file in the directory.

This is the basic logic of our new program.

If we had a function called processData() that takes a file path as an argument and appends the data from the file to your csv file, here's what it would look like:

# the directory where we have all our HTML files
dir = "myDir"

# our csv file
csvFile = "silkroad.csv"

# insert the column titles to csv
csvfile = open(csvFile, 'wb')
writer = csv.writer(csvfile)
writer.writerow(["Time", "Author", "Title", "Post"])
csvfile.close()

# get a list of files in the directory
fileList = os.listdir(dir)

# define variables we need for status text
totalLen = len(fileList)
count = 1

# iterate through files and read all of them into the csv file
for htmlFile in fileList:
    path = os.path.join(dir, htmlFile) # get the file path
    processData(path) # process the data in the file
    print "Processed '" + path + "'(" + str(count) + "/" + str(totalLen) + ")..." # display status
    count = count + 1 # increment counter

It just so happens that our processData() function is more or less what we did before, with a few changes.

So it's very similar to our last program, with a few small changes:

  1. We write the column headers first
  2. Next we open the csv with the 'ab' flag to append
  3. We import os to get a list of files
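Changes 1 and 2 can be sketched on their own (Python 3 shown here; in Python 3 the old 'b' flag advice no longer applies, and text mode with newline="" takes its place):

```python
import csv
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "demo.csv")

# write the column headers once, in 'w' mode
with open(path, "w", newline="") as f:
    csv.writer(f).writerow(["Time", "Author", "Title", "Post"])

# each processed file then appends its rows in 'a' mode,
# so the header row is never clobbered
for row in [["t1", "a1", "s1", "p1"], ["t2", "a2", "s2", "p2"]]:
    with open(path, "a", newline="") as f:
        csv.writer(f).writerow(row)

with open(path) as f:
    print(f.read())
```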

Here's what that looks like:

from bs4 import BeautifulSoup
import csv
import urllib2
import os # added this import to process files/dirs

# ** define our data processing function
def processData( pageFile ):
    ''' take the data from an html file and append to our csv file '''
    f = open(pageFile, "r")
    page = f.read()
    f.close()
    soup = BeautifulSoup(page)

    # if you look at the HTML all the titles, dates, 
    # and authors are stored inside of <dt ...> tags
    metaData = soup.find_all("dt")

    # likewise the post data is stored
    # under <dd ...>
    postData = soup.find_all("dd")

    # define where we will store info
    titles = []
    authors = []
    times = []
    posts = []

    # now we iterate through the metaData and parse it
    # into titles, authors, and dates
    for html in metaData:
        text = BeautifulSoup(str(html).strip()).get_text().encode("utf-8").replace("\n", "") # convert the html to text
        titles.append(text.split("Title:")[1].split("Post by:")[0].strip()) # get Title:
        authors.append(text.split("Post by:")[1].split(" on ")[0].strip()) # get Post by:
        times.append(text.split(" on ")[1].strip()) # get date

    # now we go through the actual post data and extract it
    for post in postData:
        posts.append(BeautifulSoup(str(post)).get_text().encode("utf-8").strip())

    # now we write data to csv file
    # ***csv files MUST be opened with the 'b' flag***
    csvfile = open('silkroad.csv', 'ab')
    writer = csv.writer(csvfile)

    # iterate through and write all the data
    for time, author, title, post in zip(times, authors, titles, posts):
        writer.writerow([time, author, title, post])

    # close file
    csvfile.close()
# ** start our process of going through files

# the directory where we have all our HTML files
dir = "myDir"

# our csv file
csvFile = "silkroad.csv"

# insert the column titles to csv
csvfile = open(csvFile, 'wb')
writer = csv.writer(csvfile)
writer.writerow(["Time", "Author", "Title", "Post"])
csvfile.close()

# get a list of files in the directory
fileList = os.listdir(dir)

# define variables we need for status text
totalLen = len(fileList)
count = 1

# iterate through files and read all of them into the csv file
for htmlFile in fileList:
    path = os.path.join(dir, htmlFile) # get the file path
    processData(path) # process the data in the file
    print "Processed '" + path + "'(" + str(count) + "/" + str(totalLen) + ")..." # display status
    count = count + 1 # increment counter
