How to know whether my cmd command is still working in Python

I have collected a large body of text from an online newspaper website by scraping it with the Scrapy framework, and I have stored it in the file 'nahidd.txt'. The file is almost 240 MB.

This file contains many repeated words. For example, the word 'love' may appear on multiple lines, but I need only one occurrence of it.

I have used the following code to remove the repeated words from my large 'nahidd.txt' file:

file_object = open("nahidd.txt", "r", encoding='utf-8-sig')
file_object_all_text = file_object.read().split()
file_object_redundancy_removed = " ".join(sorted(set(file_object_all_text), key=file_object_all_text.index))
file_object = open("nahidd_pure.txt", "w", encoding='utf-8-sig')
file_object.write(file_object_redundancy_removed)

The problem appears whenever I run this command in cmd:

scrapy runspider nahidBot.py

It works perfectly well, but it takes forever (since the file is large), and all I see is a blinking cursor for hours. It's difficult to tell whether my command is still working or has hung. I just need to show some kind of text in cmd, such as 'line 1 processed', 'line 2 processed', or the percentage of the background work done, so that anyone can see how much work is left, or at least that the command is still running.

Thanks in advance. Nahid

This line performs a sort:

file_object_redundancy_removed = " ".join(sorted(set(file_object_all_text), key=file_object_all_text.index))

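but it uses a linear search (file_object_all_text.index) in the key, which is very bad for performance: the key is evaluated once per distinct word, and each evaluation scans the word list from the start.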
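To make the cost concrete, here is a minimal sketch that keeps the sort but replaces the linear search with a dictionary lookup. The sample sentence and the name `first_pos` are illustrative assumptions, not part of the original code.

```python
words = "love me love you me".split()

# One pass records the first position of every word.
# setdefault() only stores a position the first time a word is seen.
first_pos = {}
for i, w in enumerate(words):
    first_pos.setdefault(w, i)

# Original approach: each key call scans `words` from the beginning.
slow = sorted(set(words), key=words.index)

# Same ordering, but each key call is a single dictionary lookup.
fast = sorted(set(words), key=first_pos.__getitem__)
```

Both lists come out in order of first occurrence; only the amount of work per key call differs.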

If you don't need to preserve the order, just do:

file_object_redundancy_removed = " ".join(sorted(set(file_object_all_text)))

If you need to preserve the order of first occurrence, which is what you're trying to emulate with your sort, a faster way is to store the words you have already encountered in an auxiliary set:

marker = set()
file_object_redundancy_removed = []
for w in file_object_all_text:
   if w not in marker:
      marker.add(w)
      file_object_redundancy_removed.append(w)

You now have a list with the repeated words removed and the order of first occurrence preserved. Since it is a list, join it with " ".join(file_object_redundancy_removed) before writing it to the output file.
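The same set-based idea can also print the progress messages the question asks for. The following is only a sketch under stated assumptions: the function name `dedupe_with_progress` and the parameters `report_every` and `out` are hypothetical, the input is assumed to be UTF-8 with many lines, and the percentage is estimated from the bytes read so far against the file size.

```python
import os
import sys


def dedupe_with_progress(src_path, dst_path, report_every=100_000, out=sys.stderr):
    """Write each distinct word of src_path to dst_path once, in first-seen order."""
    total_bytes = os.path.getsize(src_path) or 1  # guard against an empty file
    seen = set()
    bytes_read = 0
    first = True
    # Read in binary so the byte count is exact; a newline byte never occurs
    # inside a multi-byte UTF-8 character, so splitting on lines is safe.
    with open(src_path, "rb") as src, \
            open(dst_path, "w", encoding="utf-8-sig") as dst:
        for line_no, raw in enumerate(src, start=1):
            bytes_read += len(raw)
            # Only the first line can carry a byte-order mark.
            text = raw.decode("utf-8-sig" if line_no == 1 else "utf-8")
            for word in text.split():
                if word not in seen:
                    seen.add(word)
                    dst.write(word if first else " " + word)
                    first = False
            if line_no % report_every == 0:
                percent = 100 * bytes_read / total_bytes
                print(f"line {line_no} processed ({percent:.1f}%)",
                      file=out, flush=True)
    print(f"done: {len(seen)} unique words", file=out, flush=True)
    return len(seen)
```

With the file names from the question, this would be called as dedupe_with_progress("nahidd.txt", "nahidd_pure.txt"). Because it reads line by line, it never holds the whole 240 MB file in memory, only the set of distinct words. If the file has no line breaks, no intermediate messages appear, and reading in fixed-size chunks would be needed instead.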
