Find all matching words in a string from a URL

I'm trying to write my first Python script. I want to write a program that will get information out of a website.

I managed to open the website, read all the data and transform the data from bytes to a string.

import urllib.request

response = urllib.request.urlopen('http://www.imdb.com/title/tt0413573/episodes?season=10')
website = response.read()
response.close()

html = website.decode("utf-8")

print(type(html))
print(html)

The string is massive, and I don't know whether I should transform it into a list and iterate over the list or just keep it as a string.

What I would like to do is find every occurrence of the keyword airdate and then get the next line in the string.

When I scroll through the string, these are the relevant bits:

<meta itemprop="episodeNumber" content="10"/>
<div class="airdate">
  Nov. 21, 2013
</div>

This happens many times inside the string. What I'm trying to do is loop through the string and return this result:

"episodeNumber" = some number
"airdate" = what ever date

I want this for every time it happens in the string. I tried:

keywords = ["airdate","episodeNumber"]
for i in keywords:
    if i in html:
        print (something)

I hope I'm explaining myself in the right way. I will edit the question if needed.

When dealing with structured text such as HTML or XML, it is a good idea to use existing tools that leverage this structure. Compared with regular expressions or searching by hand, this gives a much more reliable and readable solution. In this case, I suggest installing lxml to parse the HTML.

Applying this principle to your problem, try the following (I assume that you use Python 3 because you imported urllib.request):

import lxml.html as html
import urllib.request

resp = urllib.request.urlopen('http://www.imdb.com/title/tt0413573/episodes?season=10')

fragment = html.fromstring(resp.read())

for info in fragment.find_class('info'):
    print('"episodeNumber" = ', info.find('meta').attrib['content'])
    print('"airdate" =', info.find_class('airdate')[0].text_content().strip())

To make sure that each episode number is paired with its own airdate, I search for the surrounding element (a div with class 'info') and then extract the data you want.

I'm sure the code can be made prettier with a fancier selection of elements, but this should get you started.


[Added more information on the solution concerning the structure in the HTML.]

The string containing the data of one episode looks as follows:

<div class="info" itemprop="episodes" itemscope itemtype="...">
  <meta itemprop="episodeNumber" content="1"/>
  <div class="airdate">Sep. 26, 2013</div> <!-- already stripped whitespace -->
  <strong>
    <a href="/title/tt2911802/" title="Seal Our Fate" itemprop="name">...</a>
  </strong>
  <div class="item_description" itemprop="description">...</div>
  <div class="popoverContainer"></div>
  <div class="popoverContainer"></div>
</div>

You first select the div containing all the data for one episode by its class 'info'. The first piece of information you want is in a child of the div.info element, the meta element, stored in its attribute 'content'.

Next, you want the information stored in the div.airdate element; this time it is stored inside the element as text. To get rid of the whitespace around it, I then used the strip() method.
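
The selection steps described above can also be sketched with only the standard library, which may help if installing lxml is not an option. This is a minimal sketch, not the answer's method: the class name EpisodeParser and the SAMPLE markup are made up for illustration (SAMPLE is modelled on the episode block shown above), and it assumes the airdate div contains only text and no nested div.

```python
from html.parser import HTMLParser

# Hypothetical sample modelled on the episode block above (two episodes).
SAMPLE = """
<div class="info" itemprop="episodes">
  <meta itemprop="episodeNumber" content="1"/>
  <div class="airdate">
    Sep. 26, 2013
  </div>
</div>
<div class="info" itemprop="episodes">
  <meta itemprop="episodeNumber" content="10"/>
  <div class="airdate">
    Nov. 21, 2013
  </div>
</div>
"""


class EpisodeParser(HTMLParser):
    """Collect (episode number, air date) pairs in document order."""

    def __init__(self):
        super().__init__()
        self.episodes = []        # finished (number, date) pairs
        self._number = None       # content of the last episodeNumber meta tag
        self._in_airdate = False  # True while inside <div class="airdate">
        self._text = []           # text chunks seen inside the airdate div

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("itemprop") == "episodeNumber":
            self._number = attrs.get("content")
        elif tag == "div" and "airdate" in (attrs.get("class") or "").split():
            self._in_airdate = True
            self._text = []

    def handle_data(self, data):
        if self._in_airdate:
            self._text.append(data)

    def handle_endtag(self, tag):
        # Assumption: no nested div inside the airdate div.
        if tag == "div" and self._in_airdate:
            self._in_airdate = False
            date = "".join(self._text).strip()
            self.episodes.append((self._number, date))


parser = EpisodeParser()
parser.feed(SAMPLE)
for number, date in parser.episodes:
    print('"episodeNumber" =', number)
    print('"airdate" =', date)
```

On the real page you would feed the decoded HTML string to parser.feed() instead of SAMPLE; whether the live markup still matches this structure is not something the sketch can guarantee.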

Would that work?

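# Split the decoded string (html), not the raw bytes (website): in Python 3,
# testing a str keyword against a bytes line raises a TypeError.
lines = html.splitlines()
lines.append('')  # guard so that index + 1 is valid on the last line
for index, line in enumerate(lines):
    for keyword in ["airdate", "episodeNumber"]:
        if keyword in line:
            print(lines[index + 1])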

It prints the next line if the keyword is found in the line.

If that is your first Python script, it is really impressive to see how far you have come.

You should use a proper parser to help you with the parsing.

Check out BeautifulSoup4:
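
To see what the next-line rule returns on the snippet from the question, here is a small self-contained sketch. The helper name lines_after_keyword and the SAMPLE string are hypothetical, introduced only for this illustration.

```python
def lines_after_keyword(text, keyword):
    """Return the stripped line that follows each line containing keyword."""
    lines = text.splitlines()
    found = []
    # Skip the last line: it has no following line to return.
    for index, line in enumerate(lines[:-1]):
        if keyword in line:
            found.append(lines[index + 1].strip())
    return found


# The relevant bits quoted in the question.
SAMPLE = '''<meta itemprop="episodeNumber" content="10"/>
<div class="airdate">
  Nov. 21, 2013
</div>'''

print(lines_after_keyword(SAMPLE, "airdate"))
print(lines_after_keyword(SAMPLE, "episodeNumber"))
```

For "airdate" this gives the date, but for "episodeNumber" it gives the opening airdate tag, because the episode number sits on the same line as the keyword (in the content attribute) rather than on the next one. That is one reason the line-based approach is fragile for this markup.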

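# intellectual property belongs to imdb
# Note: this is Python 2 code (urllib2 and the print statement).
# On Python 3, use urllib.request.urlopen and print(result) instead.
import urllib2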
from bs4 import BeautifulSoup

# get the SOUP: tree structure out of the HTML page
soup = BeautifulSoup(urllib2.urlopen("http://www.imdb.com/title/tt0413573/episodes?season=10"))

result = {}
for div in soup.find_all("div", {"class":"airdate"}):
    # get the date and number and store in a dictionary
    date = div.text.encode('utf-8').strip()
    number = div.find_previous_sibling()['content']
    result[number] = date

print result

Output:

{'10': 'Nov. 21, 2013', '1': 'Sep. 26, 2013', '3': 'Oct. 3, 2013', '2': 'Sep. 26, 2013', '5': 'Oct. 17, 2013', '4': 'Oct. 10, 2013', '7': 'Oct. 31, 2013', '6': 'Oct. 24, 2013', '9': 'Nov. 14, 2013', '8': 'Nov. 7, 2013'}
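
The dictionary above comes out in no particular order, and its keys are strings, so a plain sort would put '10' before '2'. A minimal sketch of listing the episodes in numeric order, using the output above as literal data (the variable name ordered is just for illustration):

```python
# The dictionary printed above, copied as literal data.
result = {'10': 'Nov. 21, 2013', '1': 'Sep. 26, 2013', '3': 'Oct. 3, 2013',
          '2': 'Sep. 26, 2013', '5': 'Oct. 17, 2013', '4': 'Oct. 10, 2013',
          '7': 'Oct. 31, 2013', '6': 'Oct. 24, 2013', '9': 'Nov. 14, 2013',
          '8': 'Nov. 7, 2013'}

# Sort by the episode number converted to int, so '10' comes after '9'.
ordered = sorted(result.items(), key=lambda item: int(item[0]))

for number, date in ordered:
    print('"episodeNumber" =', number)
    print('"airdate" =', date)
```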

Let me know if I understood and answered your question correctly.

