How do I write web-scraped text to a CSV file using Python?

Question

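I've been working on a practice web scraper that fetches written reviews and writes them to a CSV file, with each review on its own row. I keep running into trouble because:

I can't seem to strip out the HTML and get only the text (i.e. the written review and nothing else)
There is a lot of odd whitespace between and even within my review texts (e.g. a blank line between lines, etc.)

Thanks for your help!

The code is below: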

#! python3

import bs4, os, requests, csv

# Get URL of the page

URL = ('https://www.tripadvisor.com/Attraction_Review-g294265-d2149128-Reviews-Gardens_by_the_Bay-Singapore.html')

# Looping until the 5th page of reviews

pagecounter = 0
while pagecounter != 5:

    # Request get the first page
    res = requests.get(URL)
    res.raise_for_status

    # Download the html of the first page
    soup = bs4.BeautifulSoup(res.text, "html.parser")
    reviewElems = soup.select('.partial_entry')


    if reviewElems == []:
        print('Could not find clue.')

    else:
        #for i in range(len(reviewElems)):
            #print(reviewElems[i].getText())

        with open('GardensbytheBay.csv', 'a', newline='') as csvfile:

            for row in reviewElems:
                writer = csv.writer(csvfile, delimiter=' ', quoting=csv.QUOTE_ALL)
                writer.writerow(row)
            print('Writing page')

    # Find URL of next page and update URL
    if pagecounter == 0:
        nextLink = soup.select('a[data-offset]')[0]

    elif pagecounter != 0:
        nextLink = soup.select('a[data-offset]')[1]

    URL = 'http://www.tripadvisor.com' + nextLink.get('href')
    pagecounter += 1

print('Download complete')
csvfile.close()

Answer 1

You can use row.get_text(strip=True) to get the text from each selected p.partial_entry element. Try the following:

import bs4, os, requests, csv

# Get URL of the page
URL = ('https://www.tripadvisor.com/Attraction_Review-g294265-d2149128-Reviews-Gardens_by_the_Bay-Singapore.html')

with open('GardensbytheBay.csv', 'w', newline='', encoding='utf-8') as csvfile:
    writer = csv.writer(csvfile, delimiter=' ')

    # Loop over the first 5 pages of reviews
    for pagecounter in range(5):

        # Request the current page; raise an exception on an HTTP error status
        res = requests.get(URL)
        res.raise_for_status()

        # Parse the HTML of the current page
        soup = bs4.BeautifulSoup(res.text, "html.parser")
        reviewElems = soup.select('p.partial_entry')

        if reviewElems:
            for row in reviewElems:
                # The file is opened as UTF-8, so the text can be written as-is
                review_text = row.get_text(strip=True)
                writer.writerow([review_text])
            print('Writing page', pagecounter + 1)
        else:
            print('Could not find clue.')

        # Find URL of next page and update URL
        if pagecounter == 0:
            nextLink = soup.select('a[data-offset]')[0]
        else:
            nextLink = soup.select('a[data-offset]')[1]

        URL = 'http://www.tripadvisor.com' + nextLink.get('href')

print('Download complete')
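
Note that get_text(strip=True) only trims whitespace at the ends of each text node, so line breaks and repeated spaces inside a review can still end up in the file. A minimal sketch of one way to handle that, assuming the raw strings below stand in for what row.get_text() might return (the sample reviews and the normalize_whitespace helper are made up for illustration, not part of the original answer):

```python
import csv
import io

# Hypothetical strings standing in for what row.get_text() might return.
raw_reviews = [
    "\n  Lovely   gardens,\n\n  worth a visit.  ",
    "Go at night\tfor the light show.\n",
]


def normalize_whitespace(text):
    """Collapse every run of whitespace (spaces, tabs, newlines) to one space."""
    # str.split() with no argument splits on any whitespace and drops empty parts.
    return " ".join(text.split())


cleaned = [normalize_whitespace(review) for review in raw_reviews]

# An in-memory buffer keeps the sketch self-contained; a real run could use
# open('GardensbytheBay.csv', 'w', newline='', encoding='utf-8') instead.
buffer = io.StringIO()
writer = csv.writer(buffer)
for review in cleaned:
    writer.writerow([review])  # a one-element list gives one field per row

csv_text = buffer.getvalue()
```

Passing a one-element list to writerow matters here: writerow treats its argument as a sequence of fields, so the whole review lands in a single column and the csv module quotes it when it contains the delimiter.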

Solution 1
1 Accepted 2016-10-18 13:02:57
