Remove special characters in Python 3.7

I've been testing a script that rips URLs using Python, and I get the result as a str.

Content of itdUrlforrip.txt: http://itdmusic.in/category/new-releases/page/4

The complete code:

#!/usr/bin/python
import requests
import re
import regex
from pyquery import PyQuery

#get each
link1 = open('/Users/R/Downloads/itdUrlforrip.txt','r').read()
list1 = link1.split('\n')
list2 = []
for eachlink1 in list1:
    linkSub1 = requests.get(eachlink1).text
    splitContent = linkSub1.split("Facebook")
    splitContent1 = splitContent[0]
    list2.append(splitContent1)

list2GLStr = ("\n".join(list2))
urlAll = regex.findall('itdmusic\.in\/\d\d\/.+\.html', list2GLStr)
allUrlrmDup1 = list(dict.fromkeys(urlAll))

#get list of url from input
allUrlrmDup1Ah = regex.sub('itdmusic', 'http://itdmusic', str(allUrlrmDup1))
allUrlrmDup1Ah2 = regex.sub('\'', '', str(allUrlrmDup1Ah))
allUrlrmDup1Ah3 = regex.sub('\[', '', str(allUrlrmDup1Ah2))
allUrlrmDup1Ah4 = regex.sub('\]', '', str(allUrlrmDup1Ah3))
allUrlrmDup1AhGL = ("\n".join(list(allUrlrmDup1Ah4.split(', '))))
allUrlrmDup1AhList = allUrlrmDup1AhGL.split('\n')

list3 = []
list4 = []
for eachlink2 in allUrlrmDup1AhList:
    linkSub2 = requests.get(eachlink2).text
    urlGdr = regex.findall('drive\.google\.com\/.{41}', linkSub2)
    urlOth = regex.findall('https\:\/\/www\d\d\d\.zippyshare\.com\/v.{19}|https\:\/\/www\d\d\.zippyshare\.com\/v.{19}|https\:\/\/www\d\.zippyshare\.com\/v.{19}|https?:\/\/douploads\.com\/.{12}|https?:\/\/www\.mirrored\.to\/.{14}|https?:\/\/mir\.cr\/.{8}|https?:\/\/hexupload\.net\/.{12}|https?:\/\/intoupload\.net\/.{12}|https?:\/\/www\.dropbox\.com\/s\/.{15}|https?:\/\/dbree\.org\/v\/.{6}|https?:\/\/dropapk\.to\/.{12}|https?:\/\/www\.sendspace\.com\/file\/.{6}|https?:\/\/gestyy\.com\/.{6}|https?:\/\/ouo\.io\/\w{6}|https?:\/\/mega\.nz.{55}|https?:\/\/bit\.ly.{8}', linkSub2)
    urlska = regex.findall('https?\:\/\/itdmusic\.in\/skipads\/.+\/\'', linkSub2)
    urlskaStr = str(urlska)
    urlska2 = regex.sub('\/\'', '', urlskaStr)
    list3.append(urlGdr)
    list3.append(urlOth)
    list4.append(urlska2)

Then I run:

print(list4)

and the result is:

'[]', '[]', '[]', '[]', '[]', '[]', '[]', '[]', '["http://itdmusic.in/skipads/2020/03/12/luke-bryan-one-margarita-pre-single"]', '["http://itdmusic.in/skipads/2020/03/12/kota-banks-italiana-single"]'

The run took 32 s.

So is there a way to get rid of the '[]' entries and keep just the URLs? I tried a bunch of things and still cannot figure it out using regex or re. I'm also a little confused about how to use for xxx in xxx.

The thing is that regex.findall() returns a list, and you are appending it to another list, which is why you are getting the '[]' entries.
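The difference can be seen in a small sketch. It uses the standard-library re module instead of regex so that it runs without extra packages; the sample pages and the example.com pattern are made up for illustration and are not from the question.

```python
import re

# Hypothetical sample pages; the second one contains no match.
pages = ["see http://example.com/a/ here", "nothing to see"]

appended = []
extended = []
for page in pages:
    # findall always returns a list, empty when nothing matches.
    matches = re.findall(r"https?://example\.com/\w+", page)
    appended.append(str(matches))  # stores the list's text form, e.g. '[]'
    extended.extend(matches)       # adds each match; an empty list adds nothing

print(appended)
print(extended)
```

With append, every page contributes one entry even when nothing matched; with extend, only actual matches end up in the list.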

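You should use "list4.extend(urlska2)" instead of "list4.append(urlska2)", which would give you what you want. Note that the value passed to extend must still be a list at that point (skip the str() conversion), because extending with a string adds its characters one by one.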
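Applied to the skipads step, this might look like the sketch below. It is only an illustration under a few assumptions: the helper name skipads_urls and the sample HTML are hypothetical, the standard-library re module stands in for regex, and a capturing group replaces the later regex.sub call that strips the trailing slash and quote.

```python
import re

# The group captures the URL; the trailing "/'" is matched but not returned.
SKIPADS = re.compile(r"(https?://itdmusic\.in/skipads/[^']+)/'")

def skipads_urls(html):
    """Return the skipads URLs found in html as a list of strings."""
    return SKIPADS.findall(html)

# Hypothetical page contents standing in for requests.get(...).text.
pages = [
    "<p>no links here</p>",
    "<a href='http://itdmusic.in/skipads/2020/03/12/kota-banks-italiana-single/'>x</a>",
]

list4 = []
for html in pages:
    list4.extend(skipads_urls(html))  # pages without a match add nothing

print(list4)
```

Because the matches stay a list until they are merged, there is no need to convert them with str() and clean up brackets and quotes afterwards.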
