How can I fix a regular expression for scraped URL data in Python?

Question

I am trying to clean my URL data using regular expressions. I have already cleaned most of it, but there is one last problem that I don't know how to solve.

The data was scraped from a news hub, and each URL consists of a theme part and a source part.

I need to extract the theme part (the first path segment) from each URL and drop the rest, so that I can put it into a numpy array for further analysis.

My scraped URLs look like this:

/video/36225009-report-cnbc-russian-sanctions-ukraine/

/health/36139780-cancer-rates-factors-of-stomach/

/business/36187789-in-EU-IMF-reports-about-world-economic-environment/

/video/35930625-30stm-in-last-tour-tv-album-o-llfl-/?smi2=1

/head/36214416-GB-brexit-may-stops-process-by/

/cis/36189830-kiev-arrested-property-in-crymea/

/incidents/36173928-traffic-collapse-by-trucks-incident/

..............................................................

I tried the following code to solve this problem, but it doesn't work: it returns the whole string instead of just the theme part.

import numpy as np
import pandas as pd
import re

regex = r"^/(\b(\w*)\b)"

pattern_two = regex
prog_two = re.compile( pattern_two )

with open('urls.txt', 'r') as f:

    for line in f:
        line = line.strip()
    
    if prog_two.match( line ):
          print( line )

I also tested the regular expressions regex = r"^/(\\b(\\w*)\\b)" and regex = r"^/[az]{0,9}./" on regex101.com, but they don't work properly either. I don't have strong regex skills, so maybe I am doing something wrong?

The final result I expect is the following:

video
health
business
video
head
cis
incidents  
...........

Thank you very much for your help!

Answer 1

Change to the following approach:

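import re

# Capture everything between the leading slash and the next slash.
regex = r"^/([^/]+)"
pat = re.compile(regex)

with open('urls.txt', 'r') as f:
    for line in f:
        line = line.strip()
        m = pat.search(line)
        if m:
            # group(1) is the captured theme, not the whole line
            print(m.group(1))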
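To try the pattern without a urls.txt file, the same idea can be run on a few of the sample URLs from the question held in a list. This is only a minimal sketch: the `urls` list and the `first_segment` helper are illustrative names, not part of the original code.

```python
import re

# A few sample URLs copied from the question, kept in memory
# instead of being read from urls.txt.
urls = [
    "/video/36225009-report-cnbc-russian-sanctions-ukraine/",
    "/health/36139780-cancer-rates-factors-of-stomach/",
    "/video/35930625-30stm-in-last-tour-tv-album-o-llfl-/?smi2=1",
]

pat = re.compile(r"^/([^/]+)")

def first_segment(url):
    """Return the first path segment of url, or None if there is none."""
    m = pat.match(url)
    # Return the captured group, not the matched line itself.
    return m.group(1) if m else None

themes = [first_segment(u) for u in urls]
print(themes)  # ['video', 'health', 'video']
```

The pattern from the question, r"^/(\b(\w*)\b)", should capture the same theme in group 1 for these URLs. The likely cause of the whole string being printed is that the code prints `line` rather than the captured group.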

Or without regex, using built-in string functions:

...
for line in f:
    line = line.strip()
    if line.startswith('/'):
        # '/video/...'.split('/', 2) -> ['', 'video', '...'], so the theme is at index 1
        print(line.split('/', 2)[1])

Answer 2

You might be able to just use split() here:

with open('urls.txt', 'r') as f:
    for line in f:
        line = line.strip()   # this might be optional
        if line.startswith('/'):
            print(line.split("/")[1])

In general, if we can avoid invoking a regex engine and use plain string functions instead, we should prefer the latter.
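As a rough sketch of that string-only route, the split can be wrapped in a small helper that also copes with lines lacking a theme. The name `theme_of` is hypothetical and chosen only for illustration.

```python
def theme_of(url):
    """Return the first path segment using only string methods."""
    if not url.startswith("/"):
        return None
    # maxsplit=2 stops splitting once the theme has been isolated:
    # "/video/x/" -> ["", "video", "x/"]
    parts = url.split("/", 2)
    # An empty segment (e.g. a bare "/") counts as "no theme".
    return parts[1] or None

print(theme_of("/incidents/36173928-traffic-collapse-by-trucks-incident/"))  # incidents
```

Collecting the results in a list, for example `[theme_of(line.strip()) for line in f]`, gives a sequence that could then be passed on to numpy for the further analysis mentioned in the question.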

2 answers

Answer 1: 0 · 2019-06-10 09:49:00
Answer 2: 0 · accepted · 2019-06-10 09:50:12
