
How to extract elements from a script in an HTML page with BeautifulSoup

I'm new to Python programming and I'm using BeautifulSoup to scrape data from Chile's county electoral department. My problem is this: I need to extract specific strings from a script. After some cleaning, I obtain something like this:

<script type="text/javascript">
    document.writeln("<p align='left' class='cleleccion2008'>");
    document.writeln("&nbsp;&nbsp;&nbsp;&nbsp;<a href='geografico.htm'>&laquo;&nbsp;&nbsp;VOLVER MEN&Uacute;<\/a><br>");
    document.writeln("<\/p>");
    document.writeln("<div class='mapTitle'>REGI&Oacute;N<\/div>");
    document.writeln("<p align='left' class='cleleccion2008'>");
    document.writeln("  &nbsp;&nbsp;&bull;&nbsp;<a href='javascript:Consulta("+'101'+")' >Regi&oacute;n I<\/a><br>");
    document.writeln("<\/p>");
    document.writeln("<br>");
    document.writeln("<div class='mapTitle'>COMUNAS<\/div>");
    document.writeln("<p align='left' class='cleleccion2008'>"); 
    if ( parent.DIR_ANO >= "2004"){
        document.writeln("  &nbsp;&bull;&nbsp;<a href='javascript:Consulta("+'2307'+")' >Alto Hospicio<\/a> <br>");
    }
    document.writeln("  &nbsp;&nbsp;&bull;&nbsp;<a href='javascript:Consulta("+'2101'+")' >Arica<\/a><br>");
    document.writeln("  &nbsp;&nbsp;&bull;&nbsp;<a href='javascript:Consulta("+'2102'+")' >Camarones<\/a><br>");
    document.writeln("  &nbsp;&nbsp;&bull;&nbsp;<a href='javascript:Consulta("+'2303'+")' >Cami&ntilde;a<\/a><br>");
    document.writeln("  &nbsp;&nbsp;&bull;&nbsp;<a href='javascript:Consulta("+'2304'+")' >Colchane<\/a><br>");
    document.writeln("  &nbsp;&nbsp;&bull;&nbsp;<a href='javascript:Consulta("+'2202'+")' >General Lagos<\/a><br>");
    document.writeln("  &nbsp;&nbsp;&bull;&nbsp;<a href='javascript:Consulta("+'2302'+")' >Huara<\/a><br>");  
    document.writeln("  &nbsp;&nbsp;&bull;&nbsp;<a href='javascript:Consulta("+'2301'+")' >Iquique<\/a><br>");  
    document.writeln("  &nbsp;&nbsp;&bull;&nbsp;<a href='javascript:Consulta("+'2305'+")' >Pica<\/a><br>");
    document.writeln("  &nbsp;&nbsp;&bull;&nbsp;<a href='javascript:Consulta("+'2306'+")' >Pozo Almonte<\/a><br>");   
    document.writeln("  &nbsp;&nbsp;&bull;&nbsp;<a href='javascript:Consulta("+'2201'+")' >Putre<\/a><br>");                
    document.writeln("<\/p>"); 
    document.close();                                                               
}
</script>

From the last 12 lines of this script, I want to extract the county name and code to create something like:

Code, County
2101, Arica
2102, Camarones
...
2201, Putre

Any help would be really appreciated. Thanks for all your responses and reads.

BeautifulSoup has no JavaScript parser of its own, but this case can be handled easily with a regular expression.

import re

# Raw string: keeps the JavaScript "\/" sequences as-is and avoids invalid-escape warnings.
text = r'''
<script type="text/javascript">
    document.writeln("<p align='left' class='cleleccion2008'>");
    document.writeln("&nbsp;&nbsp;&nbsp;&nbsp;<a href='geografico.htm'>&laquo;&nbsp;&nbsp;VOLVER MEN&Uacute;<\/a><br>");
document.writeln("<\/p>");
document.writeln("<div class='mapTitle'>REGI&Oacute;N<\/div>");
document.writeln("<p align='left' class='cleleccion2008'>");
document.writeln("  &nbsp;&nbsp;&bull;&nbsp;<a href='javascript:Consulta("+'101'+")' >Regi&oacute;n I<\/a><br>");
document.writeln("<\/p>");
document.writeln("<br>");
document.writeln("<div class='mapTitle'>COMUNAS<\/div>");
document.writeln("<p align='left' class='cleleccion2008'>"); 
if ( parent.DIR_ANO >= "2004"){
    document.writeln("  &nbsp;&bull;&nbsp;<a href='javascript:Consulta("+'2307'+")' >Alto Hospicio<\/a> <br>");
}
document.writeln("  &nbsp;&nbsp;&bull;&nbsp;<a href='javascript:Consulta("+'2101'+")' >Arica<\/a><br>");
document.writeln("  &nbsp;&nbsp;&bull;&nbsp;<a href='javascript:Consulta("+'2102'+")' >Camarones<\/a><br>");
document.writeln("  &nbsp;&nbsp;&bull;&nbsp;<a href='javascript:Consulta("+'2303'+")' >Cami&ntilde;a<\/a><br>");
document.writeln("  &nbsp;&nbsp;&bull;&nbsp;<a href='javascript:Consulta("+'2304'+")' >Colchane<\/a><br>");
document.writeln("  &nbsp;&nbsp;&bull;&nbsp;<a href='javascript:Consulta("+'2202'+")' >General Lagos<\/a><br>");
document.writeln("  &nbsp;&nbsp;&bull;&nbsp;<a href='javascript:Consulta("+'2302'+")' >Huara<\/a><br>");  
document.writeln("  &nbsp;&nbsp;&bull;&nbsp;<a href='javascript:Consulta("+'2301'+")' >Iquique<\/a><br>");  
document.writeln("  &nbsp;&nbsp;&bull;&nbsp;<a href='javascript:Consulta("+'2305'+")' >Pica<\/a><br>");
document.writeln("  &nbsp;&nbsp;&bull;&nbsp;<a href='javascript:Consulta("+'2306'+")' >Pozo Almonte<\/a><br>");   
document.writeln("  &nbsp;&nbsp;&bull;&nbsp;<a href='javascript:Consulta("+'2201'+")' >Putre<\/a><br>");                
document.writeln("<\/p>"); 
document.close();                                                               
}
</script>
'''

# Codes appear as "+'2101'+" inside each document.writeln() call.
result_num = re.findall('"[+]\'(.*?)\'[+]"', text)
# Names sit between "' >" and the next "<"; drop empty matches.
result_county = [name for name in re.findall('\'[ ]>(.*?)<', text) if name != '']

# Skip the first two matches (Regi&oacute;n I and Alto Hospicio).
result_county = result_county[2:]
result_num = result_num[2:]

# Pair each county name with its code, in order of appearance.
result = [county + num for county, num in zip(result_county, result_num)]

print(result)

Output:

['Arica2101', 'Camarones2102', 'Cami&ntilde;a2303', 'Colchane2304', 'General Lagos2202', 'Huara2302', 'Iquique2301', 'Pica2305', 'Pozo Almonte2306', 'Putre2201']
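The question asked for `Code, County` rows rather than concatenated strings. As a rough sketch, one pattern can capture the code and the name from the same `Consulta(...)` link, which keeps each pair aligned without relying on list positions. The names `sample` and `pair_regex` are illustrative only, and the pattern assumes every link follows exactly the layout shown in the question.

```python
import re

# Two lines copied from the question's script; a raw string keeps "\/" intact.
sample = r'''
document.writeln("  &nbsp;&nbsp;&bull;&nbsp;<a href='javascript:Consulta("+'2101'+")' >Arica<\/a><br>");
document.writeln("  &nbsp;&nbsp;&bull;&nbsp;<a href='javascript:Consulta("+'2201'+")' >Putre<\/a><br>");
'''

# Group 1: the digits between "+' and '+"; group 2: the link text up to the next "<".
pair_regex = re.compile(r'''Consulta\("\+'(\d+)'\+"\)'\s*>\s*(.*?)\s*<''')

# findall returns one (code, county) tuple per matching link.
rows = pair_regex.findall(sample)

print("Code, County")
for code, county in rows:
    print(f"{code}, {county}")
```

Because the code group only accepts digits, a link with an unexpected argument is skipped instead of shifting every later pair by one.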

Jihan is partially right, in that BeautifulSoup does not include a JavaScript parser. You will still likely need bs4 to perform the initial parsing. Regular expressions can get you through the string parsing, but I would use a compiled regular expression applied line by line rather than a single re.findall() over the whole text. Running re.findall() over everything is likely to produce false positives that you then have to clean up. If you apply the regex line by line, you can be much more confident that you are grabbing the right data, and you can validate each match as you iterate. It also leads to cleaner code and more manageable output.

Instead, you can explicitly pull the <script> tag out of the page contents and call str.splitlines() on the script tag you want. This splits the entire tag into a list of strings. You might want to split on the ; character that terminates a JavaScript statement instead, so that it works even when you are dealing with "optimized" (minified or obfuscated) JavaScript code that has been smashed together onto one line.
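One caveat when splitting on `;`: the strings in this script contain HTML entities such as `&nbsp;` and `&ntilde;`, which also end in a semicolon, so a plain `split(';')` would cut a name like `Cami&ntilde;a` in half. A minimal sketch of one way around that, assuming every statement ends with `);` as in the question, is to split only on a `;` that directly follows `)`. The `minified` string is a made-up one-line version of two statements from the question.

```python
import re

# Hypothetical minified script: two statements from the question joined on one line.
minified = (
    r'''document.writeln("  &nbsp;&bull;&nbsp;<a href='javascript:Consulta("+'2303'+")' >Cami&ntilde;a<\/a><br>");'''
    r'''document.writeln("  &nbsp;&bull;&nbsp;<a href='javascript:Consulta("+'2304'+")' >Colchane<\/a><br>");'''
)

# Split only on a ";" that directly follows ")", so entities like "&nbsp;" stay whole.
statements = [s for s in re.split(r'(?<=\));', minified) if s.strip()]

# Same patterns as in the answers, compiled once.
code_regex = re.compile('"[+]\'(.*?)\'[+]"')
county_regex = re.compile('\'[ ]>(.*?)<')

pairs = []
for statement in statements:
    code = code_regex.search(statement)
    county = county_regex.search(statement)
    # Keep a statement only when both the code and the name were found in it.
    if code and county:
        pairs.append((code.group(1), county.group(1)))

print(pairs)
```

This is still a heuristic: a string literal that itself contains `);` would be split in the wrong place, so it is worth checking the result against a page you know.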

At that point, you can use a compiled regex (or a simple re.search()) on each line. That way you are certain you are getting a line-by-line match. Here is the code.

import argparse
import bs4
import re
import requests


def parse_county_codes(soup_object):
    for tag in soup_object:
        tag = str(tag)
        lines = tag.splitlines()
        code_regex = re.compile('"[+]\'(.*?)\'[+]"')
        county_regex = re.compile('\'[ ]>(.*?)<')

        for line in lines:
            county = county_regex.search(line)
            code = code_regex.search(line)
            if county and code:
                print(county.group(1), ':', code.group(1))

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('-i', '--input-file', dest='in_file', help='Input html')
    parser.add_argument('-u', '--url', dest='url', help='Some url\'s content you want to parse')
    args = parser.parse_args()

    if args.in_file:
        with open(args.in_file) as f:
            html_string = f.read()
            soup = bs4.BeautifulSoup(html_string, 'html.parser')
    elif args.url:
        try:
            # Remember to handle any possible url handling exceptions
            response = requests.get(args.url)
        except Exception as e:
            print("The following exception occurred while requesting the url\n{0}".format(args.url))
            print(e)
            return

        soup = bs4.BeautifulSoup(response.content, 'html.parser')
    else:
        print("Input missing. Please provide -i or -u")
        return

    script_tags = soup.find_all('script')
    parse_county_codes(script_tags)

if __name__ == '__main__':
    main()

The output of this code is as follows:

Regi&oacute;n I : 101
Alto Hospicio : 2307
Arica : 2101
Camarones : 2102
Cami&ntilde;a : 2303
Colchane : 2304
General Lagos : 2202
Huara : 2302
Iquique : 2301
Pica : 2305
Pozo Almonte : 2306
Putre : 2201

Note that the strings still contain some characters and escape sequences for special characters (HTML entities such as &oacute; and &ntilde;) that look out of place, but the regular expressions Jihan provided are valid in their current form. If you want to clean up the output, you know best how you want it to look, so I'll leave that up to you. Be aware that your mileage may vary when using regular expressions, and depending on the rest of the page's contents you may run into other problems.
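For the cleanup step, one possible approach is the standard library's `html.unescape()`, which converts entities such as `&ntilde;` into the actual characters, combined with the `csv` module to produce the `Code, County` layout from the question. The sketch below uses a short hand-copied subset of the output above as `pairs`, and an in-memory `io.StringIO` buffer as a stand-in for a real output file; both are assumptions for illustration.

```python
import csv
import html
import io

# (name, code) pairs as printed by the script above; a short hand-copied subset.
pairs = [("Regi&oacute;n I", "101"), ("Cami&ntilde;a", "2303"), ("Putre", "2201")]

buffer = io.StringIO()  # stand-in for a file opened with newline=""
writer = csv.writer(buffer)
writer.writerow(["Code", "County"])
for name, code in pairs:
    # html.unescape turns entities such as "&ntilde;" into the real character.
    writer.writerow([code, html.unescape(name)])

print(buffer.getvalue())
```

To write to disk instead, the buffer could be replaced with `open("counties.csv", "w", newline="", encoding="utf-8")`, where the file name is arbitrary.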


 