
Beautiful Soup code fails to extract string content from <h> Tag

I am learning Beautiful Soup and Python, and in this context I am doing the "Baby names" exercise from the Google tutorial on regex, using the set of HTML files that contain popular baby names for different years (e.g. baby1990.html). If you are interested, you can find this dataset here: https://developers.google.com/edu/python/exercises/baby-names

The HTML files contain a table that stores the popular baby names; its HTML code is the following:

<table width="100%" border="0" cellspacing="0" cellpadding="4" summary="formatting">
<tr valign="top"><td width="25%" class="greycell">
<a href="../OACT/babynames/background.html">Background information</a>
<p><br />
&nbsp; Select another <label for="yob">year of birth</label>?<br />      
<form method="post" action="/cgi-bin/popularnames.cgi">
&nbsp; <input type="text" name="year" id="yob" size="4" value="1990">
<input type="hidden" name="top" value="1000">
<input type="hidden" name="number" value="">
&nbsp; <input type="submit" value="   Go  "></form>
</td><td>
<h3 align="center">Popularity in 1990</h3>

I want to loop through all the HTML files in the folder and extract the year stored between the <h3> tags at the end (in some files these are <h2> tags).

I have written the following code:

    Years = []  # Initializes an empty list where the Years will be stored
    f = files(path) # Calls the function files() defined earlier
    pattern = re.compile(r'.+(\d\d\d\d)')  # Establishes a regex pattern to extract the Year string from each file
    for file in f:  # loops through the files
        try:
            with open(file,"r") as f: soup = bs(f, 'lxml')  # opens and reads each file in turn from the files list
            h = soup.find_all(re.compile("(h2)|(h3)"))  # Extracts and stores <h3> and <h2> Tags to h ResultSet object
            string = h[0].get_text()  #Passes the first element of the ResultSet to a string variable (only one <h> Tag exists)
            Years.append(pattern.match(string).group(1))   # Extracts the first match (i.e. Year) and appends it to the list
        except:
            Years.append('NaN')
            continue
    Years  # Returns the year

Instead of a list of years, this code returns a list that contains only the string 'NaN'.

The function files() called by the code is the following:

def files(path):
    # This function returns a list with the full paths (including the file name) of all the files that are stored in a directory
    # and whose names match a regex pattern. The function takes the path of the target directory as its argument.

    files = [f for f in os.listdir(path)
             if re.match(r'.+\.html', f)]  # extracts all the filenames matching the pattern and stores them in a list
    files = [path + s for s in files]  # Concatenates the path string with the name of each file
    return files

Can you see what is wrong with the code?

Your advice will be appreciated.

Most of your code works. I just removed the function that finds the HTML files, and it seems to work for me. Change "path\\to\\file" to your folder and try this:

from bs4 import BeautifulSoup as bs
import glob, os
import re

pattern = re.compile(r'.+(\d\d\d\d)')  # captures the four-digit year in the heading text
os.chdir("path\\to\\file")  # change to the folder that holds the HTML files
for htmlfile in glob.glob("*.html"):
    print(htmlfile)
    with open(htmlfile, "r") as f:
        soup = bs(f, 'lxml')
    header = soup.find_all(re.compile("(h2)|(h3)"))  # all <h2> and <h3> tags
    string = header[0].get_text()  # text of the first heading
    print(pattern.match(string).group(1))
Output:

baby1990.html
1990
baby1992.html
1992
baby1994.html
1994
baby1996.html
1996
baby1998.html
1998
baby2000.html
2000
baby2002.html
2002
baby2004.html
2004
baby2006.html
2006
baby2008.html
2008
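
A possible explanation for the list of 'NaN' values, offered as a guess rather than a confirmed diagnosis: files() builds each path with path + s, so if path has no trailing separator the result points to a file that does not exist, open() raises an error, and the bare except turns every failure into 'NaN'. The sketch below uses only the standard library to illustrate the difference. The names html_files and year_from_heading are hypothetical helpers, and the temporary folder stands in for your data directory.

```python
import os
import re
import tempfile


def html_files(path):
    """Return the full paths of the .html files in `path` (hypothetical helper)."""
    return sorted(
        os.path.join(path, name)  # adds the separator if `path` lacks one
        for name in os.listdir(path)
        if name.endswith(".html")
    )


def year_from_heading(text):
    """Return the first run of four digits in `text`, or None if there is none."""
    match = re.search(r"\d{4}", text)
    return match.group(0) if match else None


# Demonstration in a throwaway folder (assumed layout, not the real dataset)
with tempfile.TemporaryDirectory() as folder:
    for name in ("baby1990.html", "baby1992.html", "notes.txt"):
        with open(os.path.join(folder, name), "w") as fh:
            fh.write('<h3 align="center">Popularity in 1990</h3>')

    # What `path + s` builds when `path` has no trailing separator
    naive = folder + "baby1990.html"
    print(os.path.exists(naive))                      # the file is not found
    print(os.path.exists(os.path.join(folder, "baby1990.html")))
    print([os.path.basename(p) for p in html_files(folder)])

print(year_from_heading("Popularity in 1990"))
```

While debugging, it may also help to replace the bare except with something like `except Exception as e: print(file, e)`, so the actual error is shown instead of being silently replaced by 'NaN'.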
