How to string split, match, and output a specific pattern?
I'm trying to solve a problem that I have already solved in PHP, but I'm not sure how to do it in Python.
In the following three rows, I would like to match based on these two rules:
only vine.co and twitter.com URLs (other domains should be ignored)
only URLs that come before a comma (the last URL in each row should be ignored)
Row 1: https://vine.co/v/5W2Dg3XPX7a,https://vine.co/v/5W2Dg3XPX7a
Row 2: https://twitter.com/dog_rates/status/836677758902222849/photo/1,https://twitter.com/dog_rates/status/836677758902222849/photo/1
Row 3: https://www.gofundme.com/lolas-life-saving-surgery-funds,https://twitter.com/dog_rates/status/835264098648616962/photo/1,https://twitter.com/dog_rates/status/835264098648616962/photo/1
The output should be a list in Python (the output below was produced by the PHP version):
array(3) {
[0]=>
string(30) "https://vine.co/v/5W2Dg3XPX7a
"
[1]=>
string(64) "https://twitter.com/dog_rates/status/836677758902222849/photo/1
"
[2]=>
string(63) "https://twitter.com/dog_rates/status/835264098648616962/photo/1"
}
$input = 'Row 1: https://vine.co/v/5W2Dg3XPX7a,https://vine.co/v/5W2Dg3XPX7a
Row 2: https://twitter.com/dog_rates/status/836677758902222849/photo/1,https://twitter.com/dog_rates/status/836677758902222849/photo/1
Row 3: https://www.gofundme.com/lolas-life-saving-surgery-funds,https://twitter.com/dog_rates/status/835264098648616962/photo/1,https://twitter.com/dog_rates/status/835264098648616962/photo/1';
$array = preg_split('/Row\s\d:\s/s', $input);
$output = array();
foreach ($array as $key => $value) {
if (strlen($value) > 1) {
$URL_arrays = explode(',', $value);
foreach ($URL_arrays as $key => $value) {
if ($key = sizeof($URL_arrays) - 1) {
unset($URL_arrays[sizeof($URL_arrays) - 1]);
} else {
$match = preg_match('/twitter\.com|vine\.co/s', $value);
if ($match) {
array_push($output, $value);
}
}
}
}
}
var_dump($output);
This question is based on this RegEx problem; feel free to answer either one.
You can use this regex to capture all URLs on the vine.co or twitter.com domains that are immediately followed by a comma:
https:\/\/(?:www\.)?(?:vine\.co|twitter\.com)[^,\s]*(?=,)
The key point is the positive lookahead (?=,), which ensures that the URL is immediately followed by a comma without making the comma part of the match.
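To see what the lookahead contributes, here is a minimal sketch that runs the same pattern with and without (?=,) on a single row. The row and its URLs are made up for illustration and are not part of the question's data.

```python
import re

# Hypothetical row: two URLs followed by a comma, plus a final URL with no comma
row = "https://vine.co/v/abc,https://twitter.com/x/1,https://vine.co/v/last"

base = r"https://(?:www\.)?(?:vine\.co|twitter\.com)[^,\s]*"

# Without the lookahead, every vine.co / twitter.com URL matches
without_lookahead = re.findall(base, row)

# With the lookahead, a URL only matches if a comma follows it,
# so the last URL in the row is skipped
with_lookahead = re.findall(base + r"(?=,)", row)

print(without_lookahead)
print(with_lookahead)
```

Because the lookahead is zero-width, the comma itself never appears in the returned strings.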
Python code extracting the URLs using re.findall:
import re
s = '''Row 1: https://vine.co/v/5W2Dg3XPX7a,https://vine.co/v/5W2Dg3XPX7a
Row 2: https://twitter.com/dog_rates/status/836677758902222849/photo/1,https://twitter.com/dog_rates/status/836677758902222849/photo/1
Row 3: https://www.gofundme.com/lolas-life-saving-surgery-funds,https://twitter.com/dog_rates/status/835264098648616962/photo/1,https://twitter.com/dog_rates/status/835264098648616962/photo/1'''
print(re.findall(r'https:\/\/(?:www\.)?(?:vine\.co|twitter\.com)[^,\s]*(?=,)', s))
Output:
['https://vine.co/v/5W2Dg3XPX7a', 'https://twitter.com/dog_rates/status/836677758902222849/photo/1', 'https://twitter.com/dog_rates/status/835264098648616962/photo/1']
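If you would rather mirror the structure of the PHP version (split into rows, split on commas, drop the last URL, then filter by domain), something along these lines should work. This is only a sketch: the helper names is_allowed and urls_before_last are my own, and checking the hostname with urllib.parse is an assumption about how strict you want the domain test to be.

```python
import re
from urllib.parse import urlparse

ALLOWED = ("vine.co", "twitter.com")

def is_allowed(url):
    """True if the URL's host is an allowed domain or a subdomain of one."""
    host = urlparse(url).hostname or ""
    return any(host == d or host.endswith("." + d) for d in ALLOWED)

def urls_before_last(text):
    """Collect allowed URLs from each row, ignoring the last URL of every row."""
    found = []
    for line in text.splitlines():
        # Remove the leading "Row N: " label if present
        line = re.sub(r"^Row\s\d+:\s", "", line.strip())
        if not line:
            continue
        candidates = line.split(",")[:-1]  # drop the last URL in the row
        found.extend(u for u in candidates if is_allowed(u))
    return found

s = '''Row 1: https://vine.co/v/5W2Dg3XPX7a,https://vine.co/v/5W2Dg3XPX7a
Row 2: https://twitter.com/dog_rates/status/836677758902222849/photo/1,https://twitter.com/dog_rates/status/836677758902222849/photo/1
Row 3: https://www.gofundme.com/lolas-life-saving-surgery-funds,https://twitter.com/dog_rates/status/835264098648616962/photo/1,https://twitter.com/dog_rates/status/835264098648616962/photo/1'''

print(urls_before_last(s))
```

On the sample data this returns the same three URLs as the re.findall version, in row order.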
Because you don't need to keep duplicates, I would suggest using a set instead of a list (note that a set does not preserve order). This does not skip the last URL of each row; it relies on that URL being a duplicate of an earlier one, as it is in your data:
{url for x in s.split('\n') for url in x.split(': ')[1].split(',') if 'vine.co' in url or 'twitter.com' in url}
Code:
s = '''Row 1: https://vine.co/v/5W2Dg3XPX7a,https://vine.co/v/5W2Dg3XPX7a
Row 2: https://twitter.com/dog_rates/status/836677758902222849/photo/1,https://twitter.com/dog_rates/status/836677758902222849/photo/1
Row 3: https://www.gofundme.com/lolas-life-saving-surgery-funds,https://twitter.com/dog_rates/status/835264098648616962/photo/1,https://twitter.com/dog_rates/status/835264098648616962/photo/1'''
print({url for x in s.split('\n') for url in x.split(': ')[1].split(',') if 'vine.co' in url or 'twitter.com' in url})
# {'https://twitter.com/dog_rates/status/835264098648616962/photo/1',
# 'https://twitter.com/dog_rates/status/836677758902222849/photo/1',
# 'https://vine.co/v/5W2Dg3XPX7a'}
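If the original order matters, one possible variation is to collect the matches in a list and de-duplicate with dict.fromkeys, which keeps the first occurrence of each key in insertion order (Python 3.7+). The variable names matches and unique_in_order are just illustrative.

```python
s = '''Row 1: https://vine.co/v/5W2Dg3XPX7a,https://vine.co/v/5W2Dg3XPX7a
Row 2: https://twitter.com/dog_rates/status/836677758902222849/photo/1,https://twitter.com/dog_rates/status/836677758902222849/photo/1
Row 3: https://www.gofundme.com/lolas-life-saving-surgery-funds,https://twitter.com/dog_rates/status/835264098648616962/photo/1,https://twitter.com/dog_rates/status/835264098648616962/photo/1'''

# Same filter as the set comprehension, but kept as a list so order survives
matches = [
    url
    for row in s.split('\n')
    for url in row.split(': ')[1].split(',')
    if 'vine.co' in url or 'twitter.com' in url
]

# dict keys are unique and remember insertion order
unique_in_order = list(dict.fromkeys(matches))
print(unique_in_order)
```

Like the set version, this still depends on the last URL of each row being a duplicate rather than explicitly dropping it.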