Python Web Scraper + Cleanup

I'm trying to scrape a saved Twitter .html page with a web scraper I wrote using BeautifulSoup. The OUTPUT.csv file it produces is really messy. Here are my questions (my current .py file is below):

What steps can I take to clean up the output? My CSV has the tweets, but they're really messy and comma-separated. Is there a way to separate them with a newline instead? Also, in my cleanup() function, how can I extract only the part of the tweet that says "Bank Of America: Growth Is Back – Bank of America Corporation" (which I've surrounded with stars below)?

<div class="js-tweet-text-container">
<p class="TweetTextSize js-tweet-text tweet-text" data-aria-label-part="0" lang="en">*****Bank Of America: Growth Is Back – Bank of America Corporation***** (<strong>NYSE:BAC</strong>) <a class="twitter-timeline-link u-hidden" data-expanded-url="https://good-stockinvest.com/2017/11/29/bank-of-america-growth-is-back-bank-of-america-corporation-nysebac/" dir="ltr" href="" rel="nofollow noopener" target="_blank" title="https://good-stockinvest.com/2017/11/29/bank-of-america-growth-is-back-bank-of-america-corporation-nysebac/"><span class="tco-ellipsis"></span><span class="invisible">https://</span><span class="js-display-url">good-stockinvest.com/2017/11/29/ban</span><span class="invisible">k-of-america-growth-is-back-bank-of-america-corporation-nysebac/</span><span class="tco-ellipsis"><span class="invisible"> </span>…</span></a>
</p>
</div>

Below is my code:

from bs4 import BeautifulSoup
import csv


new = csv.writer(open("OUTPUT.csv", "w"))
new.writerow(["Tweets:"])
new.writerow([ ])       # allowing for a simple space

data = open("bac.html", "r").read()
soup = BeautifulSoup(data, "html.parser")

tweets = soup.find_all('div', class_="js-tweet-text-container")

def writetweets():
    for tweet in tweets:
        new.writerow(tweets)
        new.writerow([ ])   
    print "writetweets - open OUTPUT.csv for the tweet divs"

def cleanup():
    print "cleanup - nothing here for now"

def tests():
    print "tests - nothing here for now"

def demo():
    writetweets()
    cleanup()
    tests()

if __name__ == '__main__':
    demo()

A quick fix would be to use the split() function to acquire only the text between the asterisks. Is every tweet you acquire wrapped in asterisks, or just this specific one?
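If the markers are reliable, that split() idea is a one-liner (a Python 3 sketch; the sample text is taken from the question):

```python
# The sample tweet text from the question, with the ***** markers
text = ('*****Bank Of America: Growth Is Back – Bank of America '
        'Corporation***** (NYSE:BAC)')

# Splitting on the marker puts the wanted headline at index 1
headline = text.split("*****")[1]
print(headline)  # Bank Of America: Growth Is Back – Bank of America Corporation
```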

Another solution would be to search further for tags so you end up with a "cleaner" string as a result, i.e. use find_all again on each element of your tweets result.
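For example, you could grab the <p> inside each container and strip out the <a> link markup before reading the text (a sketch assuming the markup is shaped like the sample in the question; the HTML string here is a shortened, hypothetical version of it):

```python
from bs4 import BeautifulSoup

# Hypothetical snippet shaped like the markup in the question
html = '''<div class="js-tweet-text-container">
<p class="TweetTextSize js-tweet-text tweet-text" lang="en">Bank Of America: Growth Is Back (<strong>NYSE:BAC</strong>) <a href="#">good-stockinvest.com/2017/11/29/ban…</a></p>
</div>'''

soup = BeautifulSoup(html, "html.parser")
cleaned = []
for container in soup.find_all("div", class_="js-tweet-text-container"):
    p = container.find("p", class_="tweet-text")
    for a in p.find_all("a"):   # remove the shortened-link markup
        a.decompose()
    cleaned.append(p.get_text(strip=True))
print(cleaned)
```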

First, you have a couple of errors: you used a for loop to iterate over the tweets, but inside it you are writing tweets instead of tweet.

Also, if you want one tweet per line instead of comma-separated values, you could drop csv and write the file directly:

with open(file_name, 'w') as file_output:
    for tweet in tweets:
        file_output.write(tweet.get_text() + "\n")

That way it will be one line per tweet. Alternatively, without the with statement:

file_output = open(file_name, 'w')
for tweet in tweets:
    file_output.write(tweet.get_text() + "\n")
file_output.close()

It's up to you.

Building on the previous answers, but helping with the cleanup:

from bs4 import BeautifulSoup
import csv


data = open("bac.html", "r").read()
soup = BeautifulSoup(data, "html.parser")

#tweets = soup.find_all('div', class_="js-tweet-text-container")
tweets = soup.find_all("div", {"class": "js-tweet-text-container"})

def writetweets():
    with open("OUTPUT.txt", "w") as new:
        new.write("Tweets:\r\n")
        for tweet in tweets:
            new.write(tweet.getText() + "\r\n")
    print "writetweets - open OUTPUT.txt for the tweet divs"

def cleanup():
    print "cleanup - nothing here for now"

def tests():
    print "tests - nothing here for now"

def demo():
    writetweets()
    cleanup()
    tests()

if __name__ == '__main__':
    demo()

I get:

In [29]: tweet.getText()

Out[29]: '*****Bank Of America: Growth Is Back – Bank of America Corporation***** (NYSE:BAC) https://good-stockinvest.com/2017/11/29/bank-of-america-growth-is-back-bank-of-america-corporation-nysebac/ …'
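From that string, cleanup() could be filled in with the split() idea from the first answer (a sketch assuming every tweet wraps the wanted headline in ***** markers as in the sample; extract_headline is a hypothetical helper name, and tweets without markers are kept whole):

```python
def extract_headline(text):
    """Return the part between ***** markers, or the full text if absent."""
    parts = text.split("*****")
    # A properly marked tweet splits into at least three parts:
    # before-marker, headline, after-marker
    return parts[1].strip() if len(parts) >= 3 else text.strip()

print(extract_headline(
    '*****Bank Of America: Growth Is Back – Bank of America '
    'Corporation***** (NYSE:BAC)'
))  # Bank Of America: Growth Is Back – Bank of America Corporation
```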
