How can I split words and numbers after scraping a website with BeautifulSoup?

I am having trouble scraping data from a website where the data sits inside a table. I use BeautifulSoup and urllib2 from Python, and when I run the program the output looks like this: IndexAceh5.82Bali6.23Banten5.85Bengkulu4.81DKI6. How can I remove Index and split each word (such as Aceh) from its number (such as 5.82), to get something like this:

prov = ['Aceh', 'Bali']

number = [5.82, 6.23]

This is my code and the website link:

import urllib2
from bs4 import BeautifulSoup
quote_page = "MY LINK"
page = urllib2.urlopen(quote_page)
soup = BeautifulSoup(page, "html.parser")
pemerintah = soup.find("table", attrs={"cellspacing": "0"})  # the table with cellspacing="0"
name = pemerintah.text.strip()
print name

I found a similar case here, but when I tried that solution it did not work, because my numbers contain a decimal point (.). For example, with ade12.3 it gives me ade, 12 instead of ade, 12.3.

Use the th and td tags to search.

Example:

import urllib2
from bs4 import BeautifulSoup
quote_page = "http://www.kemitraan.or.id/igi/index.php/index.php?option=com_content&view=article&id=235"
page = urllib2.urlopen(quote_page)
soup = BeautifulSoup(page, "html.parser")
pemerintah = soup.find("table", attrs={"cellspacing": "0"})  # the table with cellspacing="0"
for i in pemerintah.find_all("tr"):
    if i.find("th"):
        print i.th.text, " = ", i.td.text

Output:

Aceh  =  5.82
Bali  =  6.23
Banten  =  5.85
Bengkulu  =  4.81
....
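
To get the two lists asked for in the question rather than printed lines, the same th/td lookup can append to prov and number. This is only a sketch written for Python 3: the inline html string is a made-up stand-in for the downloaded page, and the real table may be laid out differently.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page; the real table layout may differ.
html = """
<table cellspacing="0">
  <tr><td colspan="2">Index</td></tr>
  <tr><th>Aceh</th><td>5.82</td></tr>
  <tr><th>Bali</th><td>6.23</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", attrs={"cellspacing": "0"})

prov = []
number = []
for row in table.find_all("tr"):
    # Skip rows that lack either a name cell (th) or a value cell (td).
    if row.th is None or row.td is None:
        continue
    prov.append(row.th.get_text(strip=True))
    number.append(float(row.td.get_text(strip=True)))

print(prov)
print(number)
```

Checking for both th and td before reading them avoids an AttributeError on header or spacer rows such as the Index row assumed here.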

There are easier ways to get the values you want with BS4, but if you want to work with strings, you can use re.

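import re

y = 'IndexAceh5.82Bali6.23Banten5.85Bengkulu4.81'
# Split on (name)(number) pairs: a run of non-digits followed by a decimal number.
k = re.split(r'(\D+)(\d+\.\d+)', y.replace('Index', ''))
k = [i for i in k if i]  # removes the empty strings left by re.split
prov = [item for i, item in enumerate(k) if i % 2 == 0]
num = [float(item) for i, item in enumerate(k) if i % 2 != 0]  # convert to floats

del y, k  # optional cleanup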
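
As an alternative to splitting, re.findall can return the name/number pairs directly, which also copes with the decimal point that broke the ade12.3 attempt in the question. The sketch below is for Python 3 and assumes that names never contain digits and that every value has digits on both sides of the dot. A cut-off trailing fragment such as DKI6. is simply ignored.

```python
import re

text = "IndexAceh5.82Bali6.23Banten5.85Bengkulu4.81DKI6."

# Drop the leading "Index" label if present.
if text.startswith("Index"):
    text = text[len("Index"):]

# A name is a run of non-digit characters; a number needs digits on both sides of the dot.
pairs = re.findall(r"(\D+?)(\d+\.\d+)", text)

prov = [name.strip() for name, _ in pairs]
number = [float(value) for _, value in pairs]

print(prov)
print(number)
```

Working from the th and td cells is likely still the more robust option, since it does not depend on how the names and numbers happen to run together in the page text.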
