
I need to remove excess characters from string output of BeautifulSoup

I need to remove the [u' prefix and '] suffix that surround the data I care about. The values get written to a database, and from what I can see the extra characters are stored along with them. How can I remove them? I've tried calling .replace on the variable, but it raises an error.

import urllib
import mechanize
from bs4 import BeautifulSoup
import requests
import re
import MySQLdb
import time

db = MySQLdb.connect(
  host=" ",
  user=" ",
  passwd=" ",
  db=" ")

inc = 0

# while inc != 3289:
c = db.cursor()
c.execute("""SELECT `symbol` FROM `stocks` LIMIT %s,1""", (inc,))
result = c.fetchall()
result = str(result)

user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
br = mechanize.Browser()
br.set_handle_robots(False)
br.addHeaders = [('User-agent',user_agent)]

term = result.replace('((','').replace(',)','').replace("'",'')
url = "http://www.marketwatch.com/investing/stock/"+term
soup = BeautifulSoup(requests.get(url).text)
search = soup.find('p', attrs = {'class':'data bgLast'})
cur = search.findAll(text = True)
search2 = soup.find('span', attrs = {'class':'bgChange'})
diff = search2.findAll(text = True)
print term
print cur
print diff

c.execute("""UPDATE stocks SET cur = %s WHERE symbol = %s""", (cur,term))
c.execute("""UPDATE stocks SET diff = %s WHERE symbol = %s""", (diff,term))
db.commit()

No thanks to you @jonrsharpe, I found the answer. In the original code, .findAll was returning a result set rather than a string. All I had to do was convert it to a str, which let me call strip on it. The revised code is below:

import urllib
import mechanize
from bs4 import BeautifulSoup
import requests
import re
import MySQLdb
import time

db = MySQLdb.connect(
  host=" ",
  user=" ",
  passwd=" ",
  db=" ")

inc = 0

# while inc != 3289:
c = db.cursor()
c.execute("""SELECT `symbol` FROM `stocks` LIMIT %s,1""", (inc,))
result = c.fetchall()
result = str(result)

user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
br = mechanize.Browser()
br.set_handle_robots(False)
br.addHeaders = [('User-agent',user_agent)]

term = result.replace('((','').replace(',)','').replace("'",'')
url = "http://www.marketwatch.com/investing/stock/"+term
soup = BeautifulSoup(requests.get(url).text)
search = soup.find('p', attrs = {'class':'data bgLast'})
cur = str(search.findAll(text = True))
search2 = soup.find('span', attrs = {'class':'bgChange'})
diff = str(search2.findAll(text = True))
cur = cur.strip("'[]u")
diff = diff.strip("'[]u")
print term
print cur
print diff

c.execute("""UPDATE stocks SET cur = %s WHERE symbol = %s""", (cur,term))
c.execute("""UPDATE stocks SET diff = %s WHERE symbol = %s""", (diff,term))
db.commit()

result = str(result)
...
cur = str(search.findAll(text = True))

Stop doing this! There are datatypes other than strings!

result is a sequence of rows (MySQLdb's fetchall returns a tuple of tuples), and search.findAll(text = True) gives you a list of text nodes. You can get, for example, the symbol value of the first row with result[0][0]; you can get the text of an element with just search.getText().
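
As a rough sketch of what that looks like, the snippet below uses stand-in data instead of a live database and web page. The values "AAPL" and "101.50" are made up, rows has the shape that fetchall returns, and text_nodes has the shape that findAll(text = True) returns. Joining the nodes by hand gives roughly what search.getText() does for you.

```python
# Stand-in for cursor.fetchall(): one inner tuple per row, one item per column.
# "AAPL" is a made-up symbol used only for illustration.
rows = (("AAPL",),)
symbol = rows[0][0]  # first column of the first row, already a plain string

# Stand-in for search.findAll(text=True): a list of text nodes.
# The value "101.50" and the surrounding whitespace nodes are hypothetical.
text_nodes = [u"\n", u"101.50", u"\n"]

# Join the nodes and trim whitespace; no str() of the list is involved.
price = u"".join(text_nodes).strip()

print(symbol)
print(price)
```

Both values can then go straight into the parameter tuple of c.execute without any replace or strip calls on brackets and quotes.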

Serialising structured objects like lists into a flat string and then trying to pick the bits out of it is not a sensible approach.
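
To see why, here is a small sketch comparing the two approaches on made-up values. The helper names strip_repr and join_nodes are hypothetical and exist only for this illustration. Note that strip("'[]u") removes any of those characters from both ends, so it can also eat real data that happens to start or end with one of them.

```python
def strip_repr(nodes):
    """Stringify the whole list, then trim quote/bracket/'u' characters."""
    return str(nodes).strip("'[]u")


def join_nodes(nodes):
    """Work with the list directly: join its items and trim whitespace."""
    return "".join(nodes).strip()


# A value that starts with the letter 'u' loses it, because strip() treats
# its argument as a set of characters rather than as a fixed prefix.
one_node = ["up 1.5"]
print(repr(strip_repr(one_node)))
print(repr(join_nodes(one_node)))

# With more than one text node, the separators that str() put between the
# items are left behind in the middle of the string.
two_nodes = ["1.5", "%"]
print(repr(strip_repr(two_nodes)))
print(repr(join_nodes(two_nodes)))
```

The joined versions stay correct whatever the text contains, which the stringified versions cannot guarantee.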

The technical posts on this site are licensed under CC BY-SA 4.0. If you reprint them, please indicate the site URL or the original address. For any questions, please contact: yoyou2525@163.com.
