Creating a Python web scraper to get metadata for Google Play Store apps

I am very new to Python and very interested in learning more. A course I am currently taking has given me a task…

I suggest you use BeautifulSoup. To start, use this code:

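import requests
from bs4 import BeautifulSoup

r = requests.get("url")  # replace "url" with the page you want to scrape
# optionally check the status code here, e.g. r.raise_for_status()
soup = BeautifulSoup(r.text, "html.parser")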

With the soup object, you can use selectors to extract elements from the page.

Read more here: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
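
As a minimal sketch of what extracting elements with selectors can look like, the snippet below parses a small made-up HTML string instead of a downloaded page. The markup, class names, and URLs are assumptions for illustration only and do not reflect the real Google Play page.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for a downloaded page (not real Google Play markup).
html = """
<html><body>
  <h1 class="title">Example App</h1>
  <a class="dev" href="https://example.com/dev">Example Dev</a>
  <img class="shot" src="https://example.com/1.png">
  <img class="shot" src="https://example.com/2.png">
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# select_one() returns the first element matching a CSS selector (or None).
title = soup.select_one("h1.title").get_text(strip=True)
developer_url = soup.select_one("a.dev")["href"]

# select() returns a list of every matching element.
screenshots = [img["src"] for img in soup.select("img.shot")]

print(title, developer_url, screenshots)
```

select_one() is convenient when exactly one element is expected, while select() fits repeated elements such as screenshots.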

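To parse the icon, title, description, and especially the screenshots, you have to use regular expressions to extract them from the inline JSON.

This is more reliable than parsing with CSS selectors. With selectors you would have to render the page in order to scrape it, which you can do with browser automation, but that is slower.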
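
To make the idea of pulling data out of inline JSON concrete, here is a small sketch that runs offline on a made-up page fragment. The fragment, the nonce value, and the slightly more permissive pattern are assumptions; a real page embeds a much larger JSON object.

```python
import json
import re

# Hypothetical page fragment with an inline JSON-LD <script> block.
html = (
    '<script nonce="abc123" type="application/ld+json">'
    '{"@type": "SoftwareApplication", "name": "Example App", '
    '"aggregateRating": {"ratingValue": "4.287856", "ratingCount": "1200"}}'
    '</script>'
)

# Capture everything between the opening and closing tags of the JSON-LD script.
pattern = r'<script[^>]*type="application/ld\+json"[^>]*>(.*?)</script>'
match = re.search(pattern, html, re.DOTALL)
if match is None:
    raise ValueError("inline JSON block not found")

# json.loads() turns the captured JSON string into a Python dictionary.
info = json.loads(match.group(1))

name = info["name"]
rating = round(float(info["aggregateRating"]["ratingValue"]), 1)
print(name, rating)
```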


Code and full example in the online IDE, using requests, beautifulsoup, lxml, and regular expressions:

from bs4 import BeautifulSoup
import requests, lxml, re, json


def scrape_google_play_app(appname: str) -> list:

    headers = {
        "user-agent":
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36"
    }

    params = {
        "id": appname,
        "gl": "us"      # country
        # other search parameters
    }

    html = requests.get("https://play.google.com/store/apps/details", params=params, headers=headers, timeout=10)
    soup = BeautifulSoup(html.text, "lxml")

    # where all app data will be stored
    app_data = []

    # The position of this <script> tag does not change, which is why index [12] is used. The positions of the other <script> tags do change.
    # Index [12] holds the basic app information.
    # https://regex101.com/r/DrK0ih/1
    basic_app_info = json.loads(re.findall(r"<script nonce=\".*\" type=\"application/ld\+json\">(.*?)</script>",
                                            str(soup.select("script")[12]), re.DOTALL)[0])

    app_name = basic_app_info["name"]
    app_type = basic_app_info["@type"]
    app_url = basic_app_info["url"]
    app_description = basic_app_info["description"].replace("\n", "")  # remove newline characters
    app_category = basic_app_info["applicationCategory"]
    app_operating_system = basic_app_info["operatingSystem"]
    app_main_thumbnail = basic_app_info["image"]

    app_content_rating = basic_app_info["contentRating"]
    app_rating = round(float(basic_app_info["aggregateRating"]["ratingValue"]), 1)  # 4.287856 -> 4.3
    app_reviews = basic_app_info["aggregateRating"]["ratingCount"]

    app_author = basic_app_info["author"]["name"]
    app_author_url = basic_app_info["author"]["url"]

    # https://regex101.com/r/VX8E7U/1
    app_images_data = re.findall(r",\[\d{3,4},\d{3,4}\],.*?(https.*?)\"", str(soup.select("script")))
    # keep only URLs that appear exactly once (a URL that repeats is dropped entirely)
    app_images = [item for item in app_images_data if app_images_data.count(item) == 1]

    app_data.append({
        "app_name": app_name,
        "app_type": app_type,
        "app_url": app_url,
        "app_main_thumbnail": app_main_thumbnail,
        "app_description": app_description,
        "app_content_rating": app_content_rating,
        "app_category": app_category,
        "app_operating_system": app_operating_system,
        "app_rating": app_rating,
        "app_reviews": app_reviews,
        "app_author": app_author,
        "app_author_url": app_author_url,
        "app_screenshots": app_images
    })

    return app_data

print(json.dumps(scrape_google_play_app(appname="com.nintendo.zara"), indent=2))

Define a function:

def scrape_google_play_app(appname: str) -> list:
    ...  # function body goes here
  • appname should be a string (str).
  • The -> list annotation says the function returns a list.

Create the request headers and the search query parameters:

headers = {
    "user-agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36"
}

params = {
    "id": appname,  # app package ID, e.g. com.nintendo.zara
    "gl": "US"      # country
}
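
For simple string values like these, the query parameters end up in the request URL as an ordinary query string. The standard-library sketch below shows roughly what that URL looks like; it is an illustration only and is not how requests is implemented internally.

```python
from urllib.parse import urlencode

base_url = "https://play.google.com/store/apps/details"
params = {
    "id": "com.nintendo.zara",  # app package ID
    "gl": "US",                 # country
}

# urlencode() joins the key/value pairs into "key=value&key=value".
full_url = f"{base_url}?{urlencode(params)}"
print(full_url)
```

Passing a params dictionary keeps the code readable and avoids building the query string by hand.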

Pass the headers and parameters, make the request, and create a BeautifulSoup object, where all the HTML processing will happen:

html = requests.get("https://play.google.com/store/apps/details", params=params, headers=headers, timeout=10)
soup = BeautifulSoup(html.text, "lxml")
  • timeout tells requests to stop waiting for a response after 10 seconds.
  • lxml is an XML/HTML parser.

Create a temporary list that will hold all the app data, and use a regular expression to match the app information in the inline JSON:

app_data = []

# https://regex101.com/r/DrK0ih/1
basic_app_info = json.loads(re.findall(r"<script nonce=\".*\" type=\"application/ld\+json\">(.*?)</script>",
                                        str(soup.select("script")[12]), re.DOTALL)[0])
  • json.loads() converts a JSON string into a Python dictionary.
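
Relying on index [12] can break if the page layout changes. As a possible alternative, not the approach used in this answer, the script tag can be selected by its type attribute. The HTML below is a made-up stand-in for the real page.

```python
import json
from bs4 import BeautifulSoup

# Hypothetical page with an unrelated script before the JSON-LD block.
html = """
<html><head>
<script>var unrelated = 1;</script>
<script nonce="abc123" type="application/ld+json">{"name": "Example App", "@type": "SoftwareApplication"}</script>
</head></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Select the script by its type attribute instead of by position.
tag = soup.select_one('script[type="application/ld+json"]')

# tag.string holds the raw JSON text inside the <script> element.
basic_app_info = json.loads(tag.string) if tag and tag.string else {}
print(basic_app_info.get("name"))
```

This sketch assumes the page contains one JSON-LD script. If there were several, soup.select() would return all of them.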

Get the data from the parsed JSON:

app_name = basic_app_info["name"]
app_type = basic_app_info["@type"]
app_url = basic_app_info["url"]
app_description = basic_app_info["description"].replace("\n", "")  # remove newline characters
app_category = basic_app_info["applicationCategory"]
app_operating_system = basic_app_info["operatingSystem"]
app_main_thumbnail = basic_app_info["image"]

app_content_rating = basic_app_info["contentRating"]
app_rating = round(float(basic_app_info["aggregateRating"]["ratingValue"]), 1)  # 4.287856 -> 4.3
app_reviews = basic_app_info["aggregateRating"]["ratingCount"]

app_author = basic_app_info["author"]["name"]
app_author_url = basic_app_info["author"]["url"]
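
Direct indexing raises a KeyError when a key is missing. The sketch below is a defensive variant for the rating fields. It assumes, without this having been verified here, that some listings may lack an aggregateRating block. The helper name extract_rating is hypothetical.

```python
def extract_rating(info: dict):
    """Return (rating, review_count), or (None, None) when the data is absent."""
    aggregate = info.get("aggregateRating") or {}
    value = aggregate.get("ratingValue")
    count = aggregate.get("ratingCount")
    rating = round(float(value), 1) if value is not None else None
    reviews = int(count) if count is not None else None
    return rating, reviews


rated = extract_rating(
    {"aggregateRating": {"ratingValue": "4.287856", "ratingCount": "1619972"}}
)
unrated = extract_rating({"name": "Unrated App"})
print(rated, unrated)
```

Converting ratingCount to int is a choice made in this sketch. The code in this answer keeps it as the string found in the JSON.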

Match the screenshot data with a regular expression and filter out the repeated URLs:

# https://regex101.com/r/VX8E7U/1
app_images_data = re.findall(r",\[\d{3,4},\d{3,4}\],.*?(https.*?)\"", str(soup.select("script")))
# keep only URLs that appear exactly once (a URL that repeats is dropped entirely)
app_images = [item for item in app_images_data if app_images_data.count(item) == 1]
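
Note that the count(item) == 1 filter removes every copy of a repeated URL and keeps none of them. If the intent is to keep one copy of each URL, an order-preserving alternative is dict.fromkeys(). The sketch below compares both on a made-up list.

```python
# Hypothetical list of matched URLs, with one URL appearing twice.
urls = [
    "https://example.com/a",
    "https://example.com/b",
    "https://example.com/a",
    "https://example.com/c",
]

# Filter used in this answer: a repeated URL disappears completely.
only_unique = [u for u in urls if urls.count(u) == 1]

# Alternative: keep the first occurrence of each URL, in original order.
deduplicated = list(dict.fromkeys(urls))

print(only_unique)
print(deduplicated)
```

dict.fromkeys() also avoids the repeated list scans that count() performs for every item.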

Append the data to the temporary list and return it:

app_data.append({
    "app_name": app_name,
    "app_type": app_type,
    "app_url": app_url,
    "app_main_thumbnail": app_main_thumbnail,
    "app_description": app_description,
    "app_content_rating": app_content_rating,
    "app_category": app_category,
    "app_operating_system": app_operating_system,
    "app_rating": app_rating,
    "app_reviews": app_reviews,
    "app_author": app_author,
    "app_author_url": app_author_url,
    "app_screenshots": app_images
})

return app_data

Print the data:

print(json.dumps(scrape_google_play_app(appname="com.nintendo.zara"), indent=2))

Full output:

[
  {
    "app_name": "Super Mario Run",
    "app_type": "SoftwareApplication",
    "app_url": "https://play.google.com/store/apps/details/Super_Mario_Run?id=com.nintendo.zara&hl=en_US&gl=US",
    "app_main_thumbnail": "https://play-lh.googleusercontent.com/5LIMaa7WTNy34bzdFhBETa2MRj7mFJZWb8gCn_uyxQkUvFx_uOFCeQjcK16c6WpBA3E",
    "app_description": "A new kind of Mario game that you can play with one hand.You control Mario by tapping as he constantly runs forward. You time your taps to pull off stylish jumps, midair spins, and wall jumps to gather coins and reach the goal!Super Mario Run can be downloaded for free and after you purchase the game, you will be able to play all the modes with no additional payment required. You can try out all four modes before purchase: World Tour, Toad Rally, Remix 10, and Kingdom Builder.\u25a0World TourRun and jump with style to rescue Princess Peach from Bowser\u2019s clutches! Travel through plains, caverns, ghost houses, airships, castles, and more.Clear the 24 exciting courses to rescue Princess Peach from Bowser, waiting in his castle at the end. There are many ways to enjoy the courses, such as collecting the 3 different types of colored coins or by competing for the highest score against your friends. You can try courses 1-1 to 1-4 for free.After rescuing Princess Peach, a nine-course special world, World Star, will appear.\u25a0Remix 10Some of the shortest Super Mario Run courses you'll ever play!This mode is Super Mario Run in bite-sized bursts! You'll play through 10 short courses one after the other, with the courses changing each time you play. Daisy is lost somewhere in Remix 10, so try to clear as many courses as you can to find her!\u25a0Toad RallyShow off Mario\u2019s stylish moves, compete against your friends, and challenge people from all over the world.In this challenge mode, the competition differs each time you play.Compete against the stylish moves of other players for the highest score as you gather coins and get cheered on by a crowd of Toads. Fill the gauge with stylish moves to enter Coin Rush Mode to get more coins. If you win the rally, the cheering Toads will come live in your kingdom, and your kingdom will grow. 
\u25a0Kingdom BuilderGather coins and Toads to build your very own kingdom.Combine different buildings and decorations to create your own unique kingdom. There are over 100 kinds of items in Kingdom Builder mode. If you get more Toads in Toad Rally, the number of buildings and decorations available will increase. With the help of the friendly Toads you can gradually build up your kingdom.\u25a0What You Can Do After Purchasing All Worlds\u30fb All courses in World Tour are playableWhy not try out the bigger challenges and thrills available in all courses?\u30fb Easier to get Rally TicketsIt's easier to get Rally Tickets that are needed to play Remix 10 and Toad Rally. You can collect them in Kingdom Builder through Bonus Game Houses and ? Blocks, by collecting colored coins in World Tour, and more.\u30fb More playable charactersIf you rescue Princess Peach by completing course 6-4 and build homes for Luigi, Yoshi, and Toadette in Kingdom Builder mode, you can get them to join your adventures as playable characters. They play differently than Mario, so why not put their special characteristics to good use in World Tour and Toad Rally?\u30fb More courses in Toad RallyThe types of courses available in Toad Rally will increase to seven different types of courses, expanding the fun! Along with the new additions, Purple and Yellow Toads may also come to cheer for you.\u30fb More buildings and decorations in Kingdom BuilderThe types of buildings available will increase, so you'll be able to make your kingdom even more lively. You can also place Rainbow Bridges to expand your kingdom.\u30fb Play Remix 10 without having to waitYou can play Remix 10 continuously, without having to wait between each game.*Internet connectivity required to play. Data charges may apply. May contain advertisements.",
    "app_content_rating": "Everyone",
    "app_category": "GAME_ACTION",
    "app_operating_system": "ANDROID",
    "app_rating": 4.0,
    "app_reviews": "1619972",
    "app_author": "Nintendo Co., Ltd.",
    "app_author_url": "https://supermariorun.com/",
    "app_screenshots": [
      "https://play-lh.googleusercontent.com/dcv6Z-pr3MsSvxYh_UiwvJem8fktDUsvvkPREnPaHYienbhT31bZ2nUqHqGpM1jdal8",
      "https://play-lh.googleusercontent.com/SVYZCU-xg-nvaBeJ-rz6rHSSDp20AK-5AQPfYwI38nV8hPzFHEqIgFpc3LET-Dmu-Q",
      "https://play-lh.googleusercontent.com/Nne-dalTl8DJ9iius5oOLmFe-4DnvZocgf92l8LTV0ldr9JVQ2BgeW_Bbjb5nkVngrQ",
      "https://play-lh.googleusercontent.com/yIqljB_Jph_T_ITmVFTpmDV0LKXVHWmsyLOVyEuSjL2794nAhTBaoeZDpTZZLahyRsE",
      "https://play-lh.googleusercontent.com/5HdGRlNsBvHTNLo-vIsmRLR8Tr9degRfFtungX59APFaz8OwxTnR_gnHOkHfAjhLse7e",
      "https://play-lh.googleusercontent.com/bPhRpYiSMGKwO9jkjJk1raR7cJjMgPcUFeHyTg_I8rM7_6GYIO9bQm6xRcS4Q2qr6mRx",
      "https://play-lh.googleusercontent.com/7DOCBRsIE5KncQ0AzSA9nSnnBh0u0u804NAgux992BhJllLKGNXkMbVFWH5pwRwHUg",
      "https://play-lh.googleusercontent.com/PCaFxQba_CvC2pi2N9Wuu814srQOUmrW42mh-ZPCbk_xSDw3ubBX7vOQeY6qh3Id3YE",
      "https://play-lh.googleusercontent.com/fQne-6_Le-sWScYDSRL9QdG-I2hWxMbe2QbDOzEsyu3xbEsAb_f5raRrc6GUNAHBoQ",
      "https://play-lh.googleusercontent.com/ql7LENlEZaTq2NaPuB-esEPDXM2hs1knlLa2rWOI3uNuQ77hnC1lLKNJrZi9XKZFb4I",
      "https://play-lh.googleusercontent.com/UIHgekhfttfNCkd5qCJNaz2_hPn67fOkv40_5rDjf5xot-QhsDCo2AInl9036huUtCwf",
      "https://play-lh.googleusercontent.com/7iH7-GjfS_8JOoO7Q33JhOMnFMK-O8k7jP0MUI75mYALK0kQsMsHpHtIJidBZR46sfU",
      "https://play-lh.googleusercontent.com/czt-uL-Xx4fUgzj_JbNA--RJ3xsXtjAxMK7Q_wFZdoMM6nL_g-4S5bxxX3Di3QTCwgw",
      "https://play-lh.googleusercontent.com/e5HMIP0FW9MCoAEGYzji9JsrvyovpZ3StHiIANughp3dovUxdv_eHiYT5bMz38bowOI",
      "https://play-lh.googleusercontent.com/nv2BP1glvMWX11mHC8GWlh_UPa096_DFOKwLZW4DlQQsrek55pY2lHr29tGwf2FEXHM",
      "https://play-lh.googleusercontent.com/xwWDr_Ib6dcOr0H0OTZkHupwSrpBoNFM6AXNzNO27_RpX_BRoZtKIULKEkigX8ETOKI",
      "https://play-lh.googleusercontent.com/AxHkW996UZvDE21HTkGtQPU8JiQLzNxp7yLoQiSCN29Y54kZYvf9aWoR6EzAlnoACQ",
      "https://play-lh.googleusercontent.com/xFouF73v1_c5kS-mnvQdhKwl_6v3oEaLebsZ2inlJqIeF2eenXjUrUPJsjSdeAd41w",
      "https://play-lh.googleusercontent.com/a1pta2nnq6f_b9uV0adiD9Z1VVQrxSfX315fIQqgKDcy8Ji0BRC1H7z8iGnvZZaeg80",
      "https://play-lh.googleusercontent.com/SDAFLzC8i4skDJ2EcsEkXidcAJCql5YCZI76eQB15fVaD0j-ojxyxea00klquLVtNAw",
      "https://play-lh.googleusercontent.com/H7BcVUoygPu8f7oIs2dm7g5_vVt9N9878f-rGd0ACd-muaDEOK2774okryFfsXv9FaI",
      "https://play-lh.googleusercontent.com/5LIMaa7WTNy34bzdFhBETa2MRj7mFJZWb8gCn_uyxQkUvFx_uOFCeQjcK16c6WpBA3E",
      "https://play-lh.googleusercontent.com/DGQjTn_Hp32i88g2YrbjrCwl0mqCPCzDjTwMkECh3wXyTv4y6zECR5VNbAH_At89jGgSJDQuSKsPSB-wVQ",
      "https://play-lh.googleusercontent.com/pzvdI66OFjncahvxJN714Tu5pHUJ_nJK--vg0tv5cpgaGNvjfwsxC-SKxoQh9_n_wEcCdSQF9FeuZeI"
    ]
  }
]

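If you would like more explanation about scraping a Google Play Store app, have a look at my blog post "Scrape Google Play Store App in Python" at SerpApi. An easier approach is to use the Python google-play-search-scraper, which does all the work for you.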

