How to string split, match, and output a specific pattern?
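I'm trying to solve a problem that I've already solved in PHP, and I'm not sure how to do it in Python.
In the following three rows, we want to match based on these two rules:
Only vine.co and twitter.com URLs (other domains should be ignored)
Only URLs that come before a comma (a row with a single URL should be ignored)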
Row 1: https://vine.co/v/5W2Dg3XPX7a,https://vine.co/v/5W2Dg3XPX7a
Row 2: https://twitter.com/dog_rates/status/836677758902222849/photo/1,https://twitter.com/dog_rates/status/836677758902222849/photo/1
Row 3: https://www.gofundme.com/lolas-life-saving-surgery-funds,https://twitter.com/dog_rates/status/835264098648616962/photo/1,https://twitter.com/dog_rates/status/835264098648616962/photo/1
The output should be an array in Python (the output below comes from PHP):
array(3) {
[0]=>
string(30) "https://vine.co/v/5W2Dg3XPX7a
"
[1]=>
string(64) "https://twitter.com/dog_rates/status/836677758902222849/photo/1
"
[2]=>
string(63) "https://twitter.com/dog_rates/status/835264098648616962/photo/1"
}
$input = 'Row 1: https://vine.co/v/5W2Dg3XPX7a,https://vine.co/v/5W2Dg3XPX7a
Row 2: https://twitter.com/dog_rates/status/836677758902222849/photo/1,https://twitter.com/dog_rates/status/836677758902222849/photo/1
Row 3: https://www.gofundme.com/lolas-life-saving-surgery-funds,https://twitter.com/dog_rates/status/835264098648616962/photo/1,https://twitter.com/dog_rates/status/835264098648616962/photo/1';
$array = preg_split('/Row\s\d:\s/s', $input);
$output = array();
foreach ($array as $key => $value) {
    if (strlen($value) > 1) {
        $URL_arrays = explode(',', $value);
        foreach ($URL_arrays as $key => $value) {
            if ($key = sizeof($URL_arrays) - 1) {
                unset($URL_arrays[sizeof($URL_arrays) - 1]);
            } else {
                $match = preg_match('/twitter\.com|vine\.co/s', $value);
                if ($match) {
                    array_push($output, $value);
                }
            }
        }
    }
}
var_dump($output);
This question is based on this RegEx question; you may answer either one.
You can use this regex to capture all URLs with a vine.co or twitter.com domain that are followed by a comma:
https:\/\/(?:www\.)?(?:vine\.co|twitter\.com)[^,\s]*(?=,)
As you wanted, the key here is the positive lookahead (?=,), which ensures that the URL is immediately followed by a comma.
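To see what the lookahead contributes, here is a small sketch that runs the pattern with and without (?=,) on the first row of the sample data. The variable names `base`, `without_lookahead`, and `with_lookahead` are illustrative assumptions, and the unnecessary `\/` escapes are dropped for readability.

```python
import re

# First row of the question's sample data.
row = "Row 1: https://vine.co/v/5W2Dg3XPX7a,https://vine.co/v/5W2Dg3XPX7a"

# Same pattern as above, minus the trailing lookahead.
base = r"https://(?:www\.)?(?:vine\.co|twitter\.com)[^,\s]*"

without_lookahead = re.findall(base, row)
with_lookahead = re.findall(base + r"(?=,)", row)

print(without_lookahead)  # both URLs on the row
print(with_lookahead)     # only the URL that is followed by a comma
```

Because a lookahead consumes no characters, the comma itself never ends up in the matched text.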
Python code that extracts the URLs using re.findall:
import re
s = '''Row 1: https://vine.co/v/5W2Dg3XPX7a,https://vine.co/v/5W2Dg3XPX7a
Row 2: https://twitter.com/dog_rates/status/836677758902222849/photo/1,https://twitter.com/dog_rates/status/836677758902222849/photo/1
Row 3: https://www.gofundme.com/lolas-life-saving-surgery-funds,https://twitter.com/dog_rates/status/835264098648616962/photo/1,https://twitter.com/dog_rates/status/835264098648616962/photo/1'''
print(re.findall(r'https:\/\/(?:www\.)?(?:vine\.co|twitter\.com)[^,\s]*(?=,)', s))
Output:
['https://vine.co/v/5W2Dg3XPX7a', 'https://twitter.com/dog_rates/status/836677758902222849/photo/1', 'https://twitter.com/dog_rates/status/835264098648616962/photo/1']
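The same rule can probably also be written without a regular expression, closer in spirit to the PHP loop in the question: split each row on commas, skip the last URL, and keep the others only when the host is vine.co or twitter.com. This is a minimal sketch rather than the answerer's code. The function name `urls_before_comma`, the `ALLOWED_HOSTS` set, and the decision to compare the parsed hostname exactly (instead of searching for a substring) are all assumptions.

```python
from urllib.parse import urlsplit

ALLOWED_HOSTS = {"vine.co", "twitter.com"}  # hypothetical allow-list


def urls_before_comma(text):
    """Return allowed URLs that are followed by a comma on their row."""
    result = []
    for line in text.splitlines():
        # Drop the "Row N: " label when the row has one.
        _, sep, rest = line.partition(": ")
        urls = (rest if sep else line).split(",")
        # Every URL except the last one on a row is followed by a comma.
        for url in urls[:-1]:
            url = url.strip()
            host = urlsplit(url).hostname or ""
            if host.startswith("www."):
                host = host[4:]
            if host in ALLOWED_HOSTS:
                result.append(url)
    return result


s = '''Row 1: https://vine.co/v/5W2Dg3XPX7a,https://vine.co/v/5W2Dg3XPX7a
Row 2: https://twitter.com/dog_rates/status/836677758902222849/photo/1,https://twitter.com/dog_rates/status/836677758902222849/photo/1
Row 3: https://www.gofundme.com/lolas-life-saving-surgery-funds,https://twitter.com/dog_rates/status/835264098648616962/photo/1,https://twitter.com/dog_rates/status/835264098648616962/photo/1'''

found = urls_before_comma(s)
print(found)
```

On the sample rows this yields the same three URLs as the regex. Comparing the hostname exactly is stricter than a substring test, so a URL that merely contains "twitter.com" in its path would not be kept.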
Since you don't need to keep duplicates, I'd suggest using a set instead of a list (note that the order will change):
{url for x in s.split('\n') for url in x.split(': ')[1].split(',') if 'vine.co' in url or 'twitter.com' in url}
Code:
s = '''Row 1: https://vine.co/v/5W2Dg3XPX7a,https://vine.co/v/5W2Dg3XPX7a
Row 2: https://twitter.com/dog_rates/status/836677758902222849/photo/1,https://twitter.com/dog_rates/status/836677758902222849/photo/1
Row 3: https://www.gofundme.com/lolas-life-saving-surgery-funds,https://twitter.com/dog_rates/status/835264098648616962/photo/1,https://twitter.com/dog_rates/status/835264098648616962/photo/1'''
print({url for x in s.split('\n') for url in x.split(': ')[1].split(',') if 'vine.co' in url or 'twitter.com' in url})
# {'https://twitter.com/dog_rates/status/835264098648616962/photo/1',
# 'https://twitter.com/dog_rates/status/836677758902222849/photo/1',
# 'https://vine.co/v/5W2Dg3XPX7a'}
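If the original order matters, one option is `dict.fromkeys`, which drops duplicates while keeping first-seen order (dicts preserve insertion order in Python 3.7+). The sketch below is a variant of the comprehension above, not the answerer's code, and the names `candidates` and `unique_in_order` are illustrative. Like the set version, it filters by domain only and does not enforce the "before a comma" rule.

```python
s = '''Row 1: https://vine.co/v/5W2Dg3XPX7a,https://vine.co/v/5W2Dg3XPX7a
Row 2: https://twitter.com/dog_rates/status/836677758902222849/photo/1,https://twitter.com/dog_rates/status/836677758902222849/photo/1
Row 3: https://www.gofundme.com/lolas-life-saving-surgery-funds,https://twitter.com/dog_rates/status/835264098648616962/photo/1,https://twitter.com/dog_rates/status/835264098648616962/photo/1'''

# Collect every matching URL in reading order, duplicates included.
candidates = [
    url
    for line in s.split('\n')
    for url in line.split(': ')[1].split(',')
    if 'vine.co' in url or 'twitter.com' in url
]

# dict keys are unique and keep insertion order, so this de-duplicates
# without reordering.
unique_in_order = list(dict.fromkeys(candidates))
print(unique_in_order)
```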