Python Web Scraper + Cleanup

So I'm currently trying to export a Twitter .html page, and I created this web scraper using BeautifulSoup. The OUTPUT.csv file is currently really messy, and here are my questions (the current .py file is below):

What are some steps I can take to clean up the output? My output CSV has the tweets, but they're really messy and separated by commas. Is there any way I can separate them with a new line instead? Also, how can I extract only the part of the tweet that says "Bank Of America: Growth Is Back – Bank of America Corporation" (which I surrounded with stars below) in my cleanup() function?

<div class="js-tweet-text-container">
<p class="TweetTextSize js-tweet-text tweet-text" data-aria-label-part="0" lang="en">*****Bank Of America: Growth Is Back – Bank of America Corporation***** (<strong>NYSE:BAC</strong>) <a class="twitter-timeline-link u-hidden" data-expanded-url="https://good-stockinvest.com/2017/11/29/bank-of-america-growth-is-back-bank-of-america-corporation-nysebac/" dir="ltr" href="" rel="nofollow noopener" target="_blank" title="https://good-stockinvest.com/2017/11/29/bank-of-america-growth-is-back-bank-of-america-corporation-nysebac/"><span class="tco-ellipsis"></span><span class="invisible">https://</span><span class="js-display-url">good-stockinvest.com/2017/11/29/ban</span><span class="invisible">k-of-america-growth-is-back-bank-of-america-corporation-nysebac/</span><span class="tco-ellipsis"><span class="invisible"> </span>…</span></a>
</p>
</div>

Below is my code:

from bs4 import BeautifulSoup
import csv


new = csv.writer(open("OUTPUT", "w"))
new.writerow(["Tweets:"])
new.writerow([ ])       # allowing for a simple space

data = open("bac.html", "r").read()
soup = BeautifulSoup(data, "html.parser")

tweets = soup.find_all('div', class_="js-tweet-text-container")

def writetweets():
    for tweet in tweets:
        new.writerow(tweets)
        new.writerow([ ])   
    print "writetweets - open OUTPUT.csv for the tweet divs"

def cleanup():
    print "cleanup - nothing here for now"

def tests():
    print "tests - nothing here for now"

def demo():
    writetweets()
    cleanup()
    tests()

if __name__ == '__main__':
    demo()

A quick fix could be to use the split() function to acquire only the text between the asterisks. Is every tweet you acquire wrapped in asterisks, or just this specific one?
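For example, a quick sketch of that split() idea (the sample string here is abbreviated from your output):

```python
# Splitting on the ***** marker leaves the headline as the second element.
text = "*****Bank Of America: Growth Is Back***** (NYSE:BAC) https://good-stockinvest.com/..."
title = text.split("*****")[1]
print(title)
# Bank Of America: Growth Is Back
```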

Another solution would be to search further for tags in order to end up with a "cleaner" string as a result, i.e. use find_all again on the elements inside your "tweets" result.
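A rough sketch of that approach (the HTML here is a trimmed stand-in for your real page, and dropping the trailing <a> link is one way to get a cleaner string):

```python
from bs4 import BeautifulSoup

html = ('<div class="js-tweet-text-container">'
        '<p class="tweet-text">*****Bank Of America: Growth Is Back***** '
        '(<strong>NYSE:BAC</strong>) <a href="#">good-stockinvest.com/...</a></p>'
        '</div>')
soup = BeautifulSoup(html, "html.parser")

for container in soup.find_all("div", class_="js-tweet-text-container"):
    p = container.find("p")   # drill down to the <p> that holds the tweet text
    p.find("a").decompose()   # drop the trailing link so the string is cleaner
    print(p.get_text().strip())
# *****Bank Of America: Growth Is Back***** (NYSE:BAC)
```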

First, you have a couple of errors: you used a for loop to iterate over the tweets, but you are writing tweets instead of tweet.

Also, if you want the output line by line instead of as comma-separated values, you could switch from the csv module to plain file writes:

with open(file_name, 'w') as file_output:
    for tweet in tweets:
        file_output.write(tweet.get_text() + '\n')

That way it will be one line per tweet. You could also open and close the file explicitly:

file_output = open(file_name, 'w')
for tweet in tweets:
    file_output.write(tweet.get_text() + '\n')
file_output.close()

It's up to you.

Building on the previous answers, but helping with the cleanup:

from bs4 import BeautifulSoup
import csv


data = open("bac.html", "r").read()
soup = BeautifulSoup(data, "html.parser")

#tweets = soup.find_all('div', class_="js-tweet-text-container")
tweets = soup.find_all("div", {"class": "js-tweet-text-container"})

def writetweets():
    with open("OUTPUT.txt", "w") as new:
        new.write("Tweets:\r\n")
        for tweet in tweets:
            new.write(tweet.getText() + "\r\n")
    print "writetweets - open OUTPUT.txt for the tweet divs"

def cleanup():
    print "cleanup - nothing here for now"

def tests():
    print "tests - nothing here for now"

def demo():
    writetweets()
    cleanup()
    tests()

if __name__ == '__main__':
    demo()

I get:

In [29]: tweet.getText()

Out[29]: '*****Bank Of America: Growth Is Back – Bank of America Corporation***** (NYSE:BAC) https://good-stockinvest.com/2017/11/29/bank-of-america-growth-is-back-bank-of-america-corporation-nysebac/ …'
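Building on that output, the cleanup() stub could be filled in along these lines (a sketch that assumes every tweet keeps the ***** markers; tweets without them are returned untouched):

```python
def cleanup(raw_text):
    # The headline sits between the first pair of ***** markers.
    parts = raw_text.split("*****")
    if len(parts) >= 3:
        return parts[1].strip()
    return raw_text.strip()  # no markers: fall back to the full text

raw = ("*****Bank Of America: Growth Is Back – Bank of America Corporation***** "
       "(NYSE:BAC) https://good-stockinvest.com/...")
print(cleanup(raw))
# Bank Of America: Growth Is Back – Bank of America Corporation
```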
