How can I fix the regular expression for scraped URL data in Python?

Question

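I am trying to clean my URL data using a regular expression. I have gotten most of the way there, but I have one last problem that I don't know how to solve.

This is data I scraped from a news hub. Each URL consists of a topic part and a source part.

I need to extract the source pattern from each URL and omit the topic part, so that I can put the result in a NumPy array for further analysis.

The URLs I scraped look like this: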

/video/36225009-report-cnbc-russian-sanctions-ukraine/

/health/36139780-cancer-rates-factors-of-stomach/

/business/36187789-in-EU-IMF-reports-about-world-economic-environment/

/video/35930625-30stm-in-last-tour-tv-album-o-llfl-/?smi2=1

/head/36214416-GB-brexit-may-stops-process-by/

/cis/36189830-kiev-arrested-property-in-crymea/

/incidents/36173928-traffic-collapse-by-trucks-incident/

..............................................................

I have tried the following code to solve this, but it does not work: it returns the whole string instead of only the part I need.

import numpy as np
import pandas as pd
import re

regex = r"^/(\b(\w*)\b)"

pattern_two = regex
prog_two = re.compile( pattern_two )

with open('urls.txt', 'r') as f:

    for line in f:
        line = line.strip()
    
    if prog_two.match( line ):
          print( line )

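I also checked regular expressions (on regex101.com) such as regex = r"^/(\\b(\\w*)\\b)" and regex = r"^/[az]{0,9}./", but they did not work properly either. I don't have strong regex skills, so maybe I'm doing something wrong?

The final result I expect is as follows: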

video
health
business
video
head
cis
incidents  
...........

Thank you very much for your help!

Answer 1

Change it to the following approach:

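import re

# Capture everything between the leading "/" and the next "/".
regex = r"^/([^/]+)"
pat = re.compile(regex)

with open('urls.txt', 'r') as f:
    for line in f:
        line = line.strip()
        m = pat.search(line)
        if m:
            # group(1) is the captured first segment, not the whole line
            print(m.group(1))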
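To try this on the sample URLs without needing urls.txt, here is a minimal sketch. The list `sample_urls` and the helper name `first_segment` are made up for illustration and are not part of the original answer. It also suggests that the pattern from the question does match these lines; printing `line` instead of the captured group is what produced the whole string.

```python
import re

# Sample lines copied from the question (a stand-in for urls.txt).
sample_urls = [
    "/video/36225009-report-cnbc-russian-sanctions-ukraine/",
    "/health/36139780-cancer-rates-factors-of-stomach/",
    "/video/35930625-30stm-in-last-tour-tv-album-o-llfl-/?smi2=1",
    "/cis/36189830-kiev-arrested-property-in-crymea/",
]

pat = re.compile(r"^/([^/]+)")               # pattern from this answer
pat_question = re.compile(r"^/(\b(\w*)\b)")  # pattern from the question


def first_segment(url, pattern=pat):
    """Return the first path segment of url, or None if there is no match."""
    m = pattern.search(url)
    return m.group(1) if m else None


sections = [first_segment(u) for u in sample_urls]
print(sections)

# The question's pattern captures the same text once group(1) is used.
sections_q = [first_segment(u, pat_question) for u in sample_urls]
print(sections == sections_q)
```

If an array is needed for the later analysis, the resulting list can be passed to numpy.array.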

Or, without a regular expression, use the built-in string functions:

...
for line in f:
    line = line.strip()
    if line.startswith('/'):
        # the line starts with '/', so element 0 is '' and element 1 is the section
        print(line.split('/', 2)[1])
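Because every line starts with '/', the first element of the split result is an empty string and the section name is the second element. A small illustration, using one sample line from the question (the variable names are only for this example):

```python
line = "/video/36225009-report-cnbc-russian-sanctions-ukraine/"

# maxsplit=2 stops after two splits, so the rest of the path stays in one piece.
parts = line.split("/", 2)
print(parts)     # ['', 'video', '36225009-report-cnbc-russian-sanctions-ukraine/']
print(parts[1])  # video
```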

Answer 2

You may be able to use split() here:

with open('urls.txt', 'r') as f:
    for line in f:
        line = line.strip()   # this might be optional
        if line.startswith('/'):
            print(line.split("/")[1])

In general, when it is possible to avoid invoking the regex engine and use only basic string functions, we should choose the latter option.
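As a rough sketch of how the string-only approach might feed the further analysis mentioned in the question, the snippet below skips lines that are not paths and counts the sections. The helper name `section_of`, the inline `lines` list, and the use of collections.Counter are assumptions for illustration, not something from the original posts.

```python
from collections import Counter

# Inline stand-in for the contents of urls.txt.
lines = [
    "/video/36225009-report-cnbc-russian-sanctions-ukraine/",
    "/health/36139780-cancer-rates-factors-of-stomach/",
    "/video/35930625-30stm-in-last-tour-tv-album-o-llfl-/?smi2=1",
    "",            # blank line
    "..........",  # separator line, not a URL path
]


def section_of(line):
    """Return the first path segment, or None for lines that are not paths."""
    line = line.strip()
    if not line.startswith("/"):
        return None
    return line.split("/", 2)[1]


# Drop None results and empty segments before counting.
sections = [s for s in map(section_of, lines) if s]
counts = Counter(sections)
print(counts.most_common())
```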

Answer 1
0 2019-06-10 09:49:00

Answer 2
0 Accepted 2019-06-10 09:50:12
