
Python regex ignore new line

I have a web page that looks like this:

<td valign="top">

    <table width="100%" border="0" cellspacing="2" cellpadding="1" class="main_tb3">
        <tr>
            <td colspan="2">
                <div align="center">
                <a href="/title/name.php" target="_blank">
                <img src="./movie/image.jpg" alt="TitleName" border="0" height="100" width="225" />
                </a>
                </div>
            </td>
        </tr>
        <tr>
            <td colspan="2"><h1 align="center"><a href="./title.php?titleid=12">Title - secondname</a></h1></td>
        </tr>
        <tr>
            <td><span class="style10">Cat1 :</span></td>
            <td>1st name</td>
        </tr>
        <tr>
            <td width="32%"><span class="style10">Cat2 :</span></td>
            <td width="68%"><b><i><a href="./secondname.php" target="_blank">secondname</a></i></b></td>
        </tr>
        <tr>
            <td><span class="style10">cat4 :</span></td>
            <td>Bla bla</td>
        </tr>
        <tr>
            <td><span class="style10">Cat3 :</span></td>
            <td>thirdName2</td>
        </tr>
    </table>

</td>
<td valign="top">

    <table width="100%" border="0" cellspacing="2" cellpadding="1" class="main_tb3">
        <tr>
            <td colspan="2">
                <div align="center">
                <a href="/title/name.php" target="_blank">
                <img src="./movie/image.jpg" alt="TitleName" border="0" height="100" width="225" />
                </a>
                </div>
            </td>
        </tr>
        <tr>
            <td colspan="2"><h1 align="center"><a href="./title.php?titleid=12">Title - secondname</a></h1></td>
        </tr>
        <tr>
            <td><span class="style10">Cat1 :</span></td>
            <td>1st name</td>
        </tr>
        <tr>
            <td width="32%"><span class="style10">Cat2 :</span></td>
            <td width="68%"><b><i><a href="./secondname.php" target="_blank">secondname</a></i></b></td>
        </tr>
        <tr>
            <td><span class="style10">cat4 :</span></td>
            <td>Bla bla</td>
        </tr>
        <tr>
            <td><span class="style10">Cat3 :</span></td>
            <td>thirdName2</td>
        </tr>
    </table>

</td>

I would like to extract certain values from this page using a Python regex. From the block that follows <div align="center">, I want the href value "/title/name.php" and the img src "./movie/image.jpg". I also want the text "Title - secondname" from <h1 align="center"><a href="./title.php?titleid=12">Title - secondname</a></h1>.

I have tried this:

regex = 'class="main_tb3"*\\n<a href="(.+?)" target="_blank">\\n<img src="(.+?)"'

Please help me.

You can use the regexes below:

For the href value: <a href="(.*?)"

For the image src: <img src="(.*?)"

For the title: titleid=12">(.*?)<
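As a minimal sketch of how these three patterns behave, the snippet below runs them with re.findall over a shortened copy of the question's HTML (one table only). The variable names and the combined pattern at the end are illustrative additions, not part of the answer above. The combined pattern assumes the tags always appear in this order; it uses \s* and re.DOTALL to match across the line breaks and indentation between tags.

```python
import re

# Shortened sample of the page from the question (one table only).
html = """
<table width="100%" border="0" cellspacing="2" cellpadding="1" class="main_tb3">
    <tr>
        <td colspan="2">
            <div align="center">
            <a href="/title/name.php" target="_blank">
            <img src="./movie/image.jpg" alt="TitleName" border="0" height="100" width="225" />
            </a>
            </div>
        </td>
    </tr>
    <tr>
        <td colspan="2"><h1 align="center"><a href="./title.php?titleid=12">Title - secondname</a></h1></td>
    </tr>
</table>
"""

# The three patterns from the answer, applied independently of each other.
hrefs = re.findall(r'<a href="(.*?)"', html)
srcs = re.findall(r'<img src="(.*?)"', html)
titles = re.findall(r'titleid=12">(.*?)<', html)

print(hrefs)
print(srcs)
print(titles)

# Hypothetical combined pattern: \s* absorbs the newline and indentation
# between adjacent tags, and re.DOTALL lets .*? span lines up to the <h1>.
combined = re.compile(
    r'<div align="center">\s*'
    r'<a href="(?P<href>[^"]+)"[^>]*>\s*'
    r'<img src="(?P<src>[^"]+)"'
    r'.*?<h1[^>]*>\s*<a [^>]*>(?P<title>[^<]+)</a>',
    re.DOTALL,
)
records = [m.groupdict() for m in combined.finditer(html)]
print(records)
```

Note that the first pattern returns every link in the table, so the link inside the h1 shows up in hrefs as well. The title pattern is tied to titleid=12 and would need loosening for other ids.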

You will find it a lot simpler to install something like BeautifulSoup to do this:

from bs4 import BeautifulSoup

html = """
<td valign="top">

    <table width="100%" border="0" cellspacing="2" cellpadding="1" class="main_tb3">
        <tr>
            <td colspan="2">
                <div align="center">
                <a href="/title/name.php" target="_blank">
                <img src="./movie/image.jpg" alt="TitleName" border="0" height="100" width="225" />
                </a>
                </div>
            </td>
        </tr>
        <tr>
            <td colspan="2"><h1 align="center"><a href="./title.php?titleid=12">Title - secondname</a></h1></td>
        </tr>
        <tr>
            <td><span class="style10">Cat1 :</span></td>
            <td>1st name</td>
        </tr>
        <tr>
            <td width="32%"><span class="style10">Cat2 :</span></td>
            <td width="68%"><b><i><a href="./secondname.php" target="_blank">secondname</a></i></b></td>
        </tr>
        <tr>
            <td><span class="style10">cat4 :</span></td>
            <td>Bla bla</td>
        </tr>
        <tr>
            <td><span class="style10">Cat3 :</span></td>
            <td>thirdName2</td>
        </tr>
    </table>

</td>
<td valign="top">

    <table width="100%" border="0" cellspacing="2" cellpadding="1" class="main_tb3">
        <tr>
            <td colspan="2">
                <div align="center">
                <a href="/title/name.php" target="_blank">
                <img src="./movie/image.jpg" alt="TitleName" border="0" height="100" width="225" />
                </a>
                </div>
            </td>
        </tr>
        <tr>
            <td colspan="2"><h1 align="center"><a href="./title.php?titleid=12">Title - secondname</a></h1></td>
        </tr>
        <tr>
            <td><span class="style10">Cat1 :</span></td>
            <td>1st name</td>
        </tr>
        <tr>
            <td width="32%"><span class="style10">Cat2 :</span></td>
            <td width="68%"><b><i><a href="./secondname.php" target="_blank">secondname</a></i></b></td>
        </tr>
        <tr>
            <td><span class="style10">cat4 :</span></td>
            <td>Bla bla</td>
        </tr>
        <tr>
            <td><span class="style10">Cat3 :</span></td>
            <td>thirdName2</td>
        </tr>
    </table>

</td>"""

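# Name the parser explicitly so the result does not depend on which parsers are installed.
soup = BeautifulSoup(html, "html.parser")

# Each entry sits in its own <table class="main_tb3">.
for table in soup.find_all("table", class_="main_tb3"):
    print(table.find('a').get('href'))  # first link in the table
    print(table.find('h1').text)        # title text inside the <h1>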

For the HTML you have given, this will print the following:

/title/name.php
Title - secondname
/title/name.php
Title - secondname
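The question also asked for the img src, which the loop above does not print. If installing a third-party package is not an option, the same parser-based idea can be sketched with the standard library's html.parser instead of BeautifulSoup. The TableScraper class and its attribute names are hypothetical, and the sketch assumes main_tb3 tables are not nested inside each other.

```python
from html.parser import HTMLParser


class TableScraper(HTMLParser):
    """Collect the first link, first image and <h1> text of each main_tb3 table."""

    def __init__(self):
        super().__init__()
        self.records = []    # one dict per matching table
        self.current = None  # record being filled, or None outside a table
        self.in_h1 = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "table" and attrs.get("class") == "main_tb3":
            self.current = {"href": None, "src": None, "title": ""}
        elif self.current is None:
            return  # ignore everything outside a main_tb3 table
        elif tag == "a" and self.current["href"] is None:
            self.current["href"] = attrs.get("href")
        elif tag == "img" and self.current["src"] is None:
            self.current["src"] = attrs.get("src")
        elif tag == "h1":
            self.in_h1 = True

    def handle_endtag(self, tag):
        if tag == "h1":
            self.in_h1 = False
        elif tag == "table" and self.current is not None:
            self.current["title"] = self.current["title"].strip()
            self.records.append(self.current)
            self.current = None

    def handle_data(self, data):
        # Accumulate text only while inside an <h1> of a matching table.
        if self.in_h1 and self.current is not None:
            self.current["title"] += data


# Shortened sample of the page from the question (one table only).
sample = """
<table width="100%" border="0" cellspacing="2" cellpadding="1" class="main_tb3">
    <tr>
        <td colspan="2">
            <div align="center">
            <a href="/title/name.php" target="_blank">
            <img src="./movie/image.jpg" alt="TitleName" border="0" height="100" width="225" />
            </a>
            </div>
        </td>
    </tr>
    <tr>
        <td colspan="2"><h1 align="center"><a href="./title.php?titleid=12">Title - secondname</a></h1></td>
    </tr>
</table>
"""

scraper = TableScraper()
scraper.feed(sample)
for record in scraper.records:
    print(record)
```

Because the parser tracks tags rather than raw text, line breaks and indentation between tags do not matter here.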
