python: extract items of different lists and put them in one set
I have a file like this:
93.93.203.11|["['vmit.it', 'umbertominnella.it', 'studioguizzardi.it', 'telestreet.it', 'maurominnella.com']"]
168.144.9.16|["['iipmalumni.com','webdesignhostingindia.com', 'iipmstudents.in', 'iipmclubs.in']"]
195.211.72.88|["['tcmpraktijk-jingshen.nl', 'ellen-siemer.nl'']"]
129.35.210.118|["['israelinnovation.co.il', 'watec-peru.com', 'bsacimeeting.org', 'wsava2015.com', 'picsmeeting.com']"]
I want to extract the domains in all the lists and add them to one set. Ultimately, I would like to have a file with each unique domain on its own line. Here is the code I have written:
import json

set_d = set()
f = open(file,'r')
for line in f:
    line = line.strip('\n')
    ip,list = line.split('|')
    l = json.loads(list)
    for e in l:
        domain = e.split(',')
        set_d.add(domain)
print set_d
But it gives the error below:
set_d.add(domain)
TypeError: unhashable type: 'list'
Can anybody help me out?
You should call update instead of add:
set_d.update(domain)
Example:
>>> set_d = {'a', 'b', 'c'}
>>> set_d.update(['c', 'd', 'e'])
>>> print set_d
{'a', 'b', 'c', 'd', 'e'}
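The difference between the two methods can be sketched as follows. This is a minimal illustration in Python 3 syntax, and the names domains and parts are made up for the example: add treats its argument as one element (so the argument must be hashable), while update iterates over its argument and adds each item.

```python
# Minimal sketch (Python 3): set.add() vs set.update().
# The names "domains" and "parts" are illustrative only.
domains = set()

parts = "vmit.it,telestreet.it".split(",")  # str.split() returns a list

try:
    domains.add(parts)        # a list is unhashable, so this raises TypeError
except TypeError as exc:
    error_message = str(exc)

domains.update(parts)         # adds each element of the list individually
domains.update(["vmit.it"])   # duplicates are silently ignored

print(error_message)
print(sorted(domains))
```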
Use str.translate to clean the text and add to the set using update:
set_d = set()
with open(file,'r') as f:
    for line in f:
        # take the part after "|", delete quotes and brackets, then split on commas
        lst = (x.strip() for x in line.split("|")[1].translate(None,"\"'[]").split(","))
        set_d.update(lst)
This outputs a set of unique individual domains:
set(['vmit.it', 'tcmpraktijk-jingshen.nl', 'umbertominnella.it', 'studioguizzardi.it', 'telestreet.it', 'watec-peru.com', 'bsacimeeting.org', 'webdesignhostingindia.com', 'wsava2015.com', 'iipmstudents.in', 'maurominnella.com', 'ellen-siemer.nl', 'picsmeeting.com', 'iipmalumni.com', 'iipmclubs.in', 'israelinnovation.co.il'])
You can then write it to a new file:
set_d = set()
with open(file,'r') as f,open("out.txt","w") as out:
    for line in f:
        lst = (x.strip() for x in line.split("|")[1].translate(None,"\"'[]").split(","))
        set_d.update(lst)
    # one unique domain per line
    for line in set_d:
        out.write("{}\n".format(line))
The output:
$ cat out.txt
vmit.it
tcmpraktijk-jingshen.nl
umbertominnella.it
studioguizzardi.it
telestreet.it
watec-peru.com
bsacimeeting.org
webdesignhostingindia.com
wsava2015.com
iipmstudents.in
maurominnella.com
ellen-siemer.nl
picsmeeting.com
iipmalumni.com
iipmclubs.in
israelinnovation.co.il
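The two-argument form translate(None, chars) used above only exists on Python 2 byte strings. On Python 3, a rough equivalent builds a deletion table with str.maketrans. The sketch below assumes the same "ip|[...]" line layout; the helper name collect_domains and the in-memory sample standing in for the real file are illustrative, not part of the original answer.

```python
# Sketch for Python 3: str.translate needs a table built by str.maketrans.
import io

# Third argument of maketrans lists the characters to delete:
# double quote, single quote and square brackets.
STRIP_TABLE = str.maketrans("", "", "\"'[]")

def collect_domains(lines):
    """Return the set of unique domains found in 'ip|[...]' lines."""
    domains = set()
    for line in lines:
        if "|" not in line:
            continue  # skip blank or malformed lines
        raw = line.split("|", 1)[1].translate(STRIP_TABLE)
        domains.update(part.strip() for part in raw.split(",") if part.strip())
    return domains

# Stand-in for the real file, using two lines from the question.
sample = io.StringIO(
    "195.211.72.88|[\"['tcmpraktijk-jingshen.nl', 'ellen-siemer.nl'']\"]\n"
    "168.144.9.16|[\"['iipmalumni.com','webdesignhostingindia.com']\"]\n"
)
result = collect_domains(sample)
print("\n".join(sorted(result)))
```

Because every quote character is deleted before splitting, the doubled quote in the third sample line does no harm with this approach.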
Your code will not separate the text into individual domains, and your json call does not really do anything to help. Changing your code to use update will output something like the following:
{" 'maurominnella.com']", " 'wsava2015.com'", "'webdesignhostingindia.com'", " 'iipmclubs.in']", " 'ellen-siemer.nl'']", " 'umbertominnella.it'", " 'picsmeeting.com']", "['israelinnovation.co.il'", "['vmit.it'", " 'iipmstudents.in'", "['tcmpraktijk-jingshen.nl'", " 'studioguizzardi.it'", "['iipmalumni.com'", " 'watec-peru.com'", " 'bsacimeeting.org'", " 'telestreet.it'"}
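To see why json does not help here, it may be useful to look at what json.loads returns for one line. A small sketch in Python 3 syntax, using a shortened version of the first sample line; the variable names are illustrative:

```python
import json

# Shortened version of the first line of the sample file.
line = "93.93.203.11|[\"['vmit.it', 'telestreet.it']\"]"
ip, payload = line.split("|")

parsed = json.loads(payload)
# parsed is a list holding ONE string, not a list of domains.
print(type(parsed), len(parsed))
print(parsed[0])

pieces = parsed[0].split(",")
# The pieces still carry the brackets, quotes and leading spaces.
print(pieces)
```

The JSON document is an array with a single string in it, so splitting that string on commas leaves the Python-style brackets and quotes attached to each piece.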
Also, don't use list as a variable name; it shadows the Python built-in list.
Since the result of the split function is a list (domain = e.split(',')) and lists are unhashable, you can't add it to a set. Instead, you can add those elements to your set with set.update(). But you don't need json, as it doesn't separate your domains and doesn't give you the desired result; instead you can use ast.literal_eval to split your list:
import ast

set_d = set()
f = open(file,'r')
for line in f:
    line = line.strip('\n')
    ip,li = line.split('|')
    l = ast.literal_eval(ast.literal_eval(li)[0])
    for e in l:
        domain = e.split(',')
        set_d.update(domain)
print set_d
Note: don't use the names of Python built-in functions or types as your variable names!
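One caveat with the literal_eval approach: it only works when the inner string is a valid Python literal. The third sample line, with its doubled quote after ellen-siemer.nl, appears not to be, so it would raise an exception. A hedged Python 3 sketch that skips such lines is shown below; the helper name parse_domains and the choice to return an empty list for bad lines are assumptions made for illustration.

```python
import ast

def parse_domains(line):
    """Parse one 'ip|[...]' line; return its domains, or [] if malformed."""
    try:
        _, payload = line.rstrip("\n").split("|", 1)
        outer = ast.literal_eval(payload)    # a list holding one string
        inner = ast.literal_eval(outer[0])   # the actual list of domains
    except (ValueError, SyntaxError, IndexError):
        return []                            # assumption: silently skip bad lines
    return [d.strip() for d in inner]

good = "168.144.9.16|[\"['iipmalumni.com','iipmclubs.in']\"]"
bad = "195.211.72.88|[\"['tcmpraktijk-jingshen.nl', 'ellen-siemer.nl'']\"]"

domains = set()
for line in (good, bad):
    domains.update(parse_domains(line))
print(sorted(domains))
```

Whether to skip, log, or repair malformed lines depends on the data; skipping is just the simplest option to show.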
As a more efficient approach, you can just use a regex to grab your domains:
import re

f = open(file,'r').read()
print set(re.findall(r'[a-zA-Z\-]+\.[a-zA-Z]+',f))
Result:
set(['vmit.it', 'tcmpraktijk-jingshen.nl', 'umbertominnella.it', 'studioguizzardi.it', 'telestreet.it', 'israelinnovation.co', 'bsacimeeting.org', 'webdesignhostingindia.com', 'iipmstudents.in', 'maurominnella.com', 'ellen-siemer.nl', 'picsmeeting.com', 'watec-peru.com', 'iipmalumni.com', 'iipmclubs.in'])
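Note that the result above contains israelinnovation.co rather than israelinnovation.co.il and has no entry for wsava2015.com: the pattern allows only one dot and no digits. A possible adjustment is sketched below in Python 3 syntax. The pattern is an assumption rather than a general domain validator: it requires the last label to be letters only, which keeps the IP addresses out but would not cover every valid domain.

```python
import re

# One or more dot-separated labels of letters, digits or hyphens,
# ending in a label made of letters only (so plain IPs do not match).
DOMAIN_RE = re.compile(r"[a-zA-Z0-9-]+(?:\.[a-zA-Z0-9-]+)*\.[a-zA-Z]{2,}")

# Shortened version of the fourth sample line.
text = (
    "129.35.210.118|[\"['israelinnovation.co.il', 'watec-peru.com', "
    "'wsava2015.com']\"]\n"
)
found = set(DOMAIN_RE.findall(text))
print(sorted(found))
```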