
Scraping a list of words split into multiple webpages with Python

I'd like to make a program that can calculate which word or words would get the player the most points in Scrabble. For this, however, I need a list of accepted words. I have found the SOWPODS list for English, but I'd like the program to also work in French, which is my native language.

I have found this website that provides such a list, but it's split into 918 webpages, which would make copying and pasting everything rather tedious...

I had tried to use a Python library (I don't remember which one) for web scraping to get the words, but as I didn't really know how to use it, it seemed very hard. I could get the whole text of the website and then go character by character to select only the list of words, but since the number of characters on each page differs, this couldn't be automated very easily.

I've thought about using a regex to select only the words in capital letters (as they appear on the website), but if there are other words or characters in capital letters, such as titles on the website, then my list of words won't be correct, as the sketch below illustrates.
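For illustration, the regex idea would look something like this sketch (the character class for accented French capitals is my own assumption), and it shows exactly the problem: capitalized page titles get captured along with the words.

import re

# Sketch of the regex idea: grab every run of two or more capital letters.
# The accented character class is an assumption about the site's words.
page_text = "TOUS LES MOTS AA AB ABACA ABACAS"  # hypothetical page text that starts with a title
pattern = re.compile(r"\b[A-ZÀÂÄÇÉÈÊËÎÏÔÖÙÛÜŒ]{2,}\b")
print(pattern.findall(page_text))
# ['TOUS', 'LES', 'MOTS', 'AA', 'AB', 'ABACA', 'ABACAS'] - the title words are included too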

How could I get all the words without having to change my code for each page?

It looks like the website has a consistent structure across its pages. The words are stored in a span tag with a class named mot. The library you mentioned could be BeautifulSoup, which makes the automation really easy. You just need to request each page, select the mot span and extract its inner HTML. Splitting the contents by the space character (" ") will give you an array with all the words you need from that particular page.

Let me know if this helps you and if you need more help with the coding part.

Edit: I have included some code below.

import requests
from bs4 import BeautifulSoup

def getWordsFromPage(pageNumber):
    if pageNumber == 1:
        # the first page has no number in its URL
        url = 'https://www.listesdemots.net/touslesmots.htm'
    else:
        # pages 2 to 918 follow the pattern touslesmotspageN.htm
        url = 'https://www.listesdemots.net/touslesmotspage{pageNumber}.htm'
        url = url.format(pageNumber=pageNumber)

    response = requests.get(url)
    html_soup = BeautifulSoup(response.text, 'html.parser')
    # the word list lives in a single <span class="mot"> element
    span = html_soup.find("span", "mot")

    # the words are separated by single spaces inside the span
    words = span.text.split(" ")

    return words

print("Page 1")
print(getWordsFromPage(1))
print("Page 24")
print(getWordsFromPage(24))
print("Page 918")
print(getWordsFromPage(918))
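To answer the original question directly, you can then loop the function above over all 918 pages and write the result to a file. Here is a minimal sketch; the one-second delay and the output file name are my own choices, not something the site requires.

import time

all_words = []
for page in range(1, 919):  # pages 1 to 918
    all_words.extend(getWordsFromPage(page))
    time.sleep(1)  # assumed polite delay between requests

# mots_francais.txt is a hypothetical output file name
with open('mots_francais.txt', 'w', encoding='utf-8') as f:
    f.write('\n'.join(all_words))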

All pages share the same structure and the words are in a span tag. Using requests and beautifulsoup makes it quite easy:

import time
import requests
from bs4 import BeautifulSoup

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 YaBrowser/19.10.3.281 Yowser/2.5 Safari/537.36'
}

pages = ['https://www.listesdemots.net/touslesmots.htm']  # adds the first page
for p in range(2, 919):  # adding the remaining pages
    pages.append('https://www.listesdemots.net/touslesmotspage%s.htm' % p)

for p in pages:
    data = None
    try:
        # download each page
        r = requests.get(p, headers=headers)
        data = r.text
    except requests.RequestException:
        # minimal error handling: just log the failure and move on
        print("Failed to download page: %s" % p)

    if data:
        # instantiate the html parser and search for span.mot
        soup = BeautifulSoup(data, 'html.parser')
        wordTag = soup.find("span", {"class": "mot"})

        words = []
        if wordTag:
            # if a span.mot tag is found, split its contents on single spaces
            words = wordTag.contents[0].split(' ')

        print('%s got %s words' % (p, len(words)))
    time.sleep(5)

Output:

https://www.listesdemots.net/touslesmots.htm got 468 words
https://www.listesdemots.net/touslesmotspage2.htm got 496 words
https://www.listesdemots.net/touslesmotspage3.htm got 484 words
https://www.listesdemots.net/touslesmotspage4.htm got 468 words
....
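As a follow-up to the original goal: once the full list is saved, scoring a word is straightforward. Below is a minimal sketch, not part of the answers above; the letter values are the standard French Scrabble tile values to the best of my knowledge, so double-check them against an official source, and handling accented letters is left as an assumption.

# Minimal scoring sketch. The values below are believed to be the standard
# French Scrabble tile values; verify them before relying on the scores.
FRENCH_VALUES = {}
for letters, value in [('EAINORSTUL', 1), ('DGM', 2), ('BCP', 3),
                       ('FHV', 4), ('JQ', 8), ('KWXYZ', 10)]:
    for letter in letters:
        FRENCH_VALUES[letter] = value

def score(word):
    # French tiles carry no accents, so accented letters should be
    # normalised first; unknown characters score 0 here.
    return sum(FRENCH_VALUES.get(letter, 0) for letter in word.upper())

print(score('PYTHON'))  # 3 + 10 + 1 + 4 + 1 + 1 = 20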
