
Exception in newsplease commoncrawl.py file

I am using the newsplease library, which I cloned from https://github.com/fhamborg/news-please. I want to use newsplease to get news articles from the Common Crawl news datasets. I am running the commoncrawl.py file as instructed there, using the command below:

python -m newsplease.examples.commoncrawl

On executing this command I get the following errors:

my_local_download_dir_warc=./cc_download_warc/
my_local_download_dir_article=./cc_download_articles/
delete_warc_after_extraction=False
my_number_of_extraction_processes=1
INFO:newsplease.crawler.commoncrawl_crawler:executing: aws s3 ls --recursive s3://commoncrawl/crawl-data/CC-NEWS/ --no-sign-request > .tmpaws.txt && awk '{ print $4 }' .tmpaws.txt && rm .tmpaws.txt
INFO:newsplease.crawler.commoncrawl_crawler:found 2 files at commoncrawl.org
INFO:newsplease.crawler.commoncrawl_crawler:creating extraction process pool with 1 processes
INFO:newsplease.crawler.commoncrawl_extractor:found local file ./cc_download_warc/https%3A%2F%2Fcommoncrawl.s3.amazonaws.com%2F, not downloading again due to configuration
Traceback (most recent call last):
  File "/home/prateek/.local/lib/python3.6/site-packages/warcio/recordloader.py", line 236, in _detect_type_load_headers
    rec_headers = self.arc_parser.parse(stream, statusline)
  File "/home/prateek/.local/lib/python3.6/site-packages/warcio/recordloader.py", line 312, in parse
    raise StatusAndHeadersParserException(msg, parts)
warcio.statusandheaders.StatusAndHeadersParserException: Wrong # of headers, expected arc headers ['uri', 'ip-address', 'archive-date', 'content-type', 'length'], Found ['<?xml', 'version="1.0"', 'encoding="UTF-8"?>']

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/prateek/.local/lib/python3.6/site-packages/newsplease/examples/commoncrawl.py", line 172, in <module>
    main()
  File "/home/prateek/.local/lib/python3.6/site-packages/newsplease/examples/commoncrawl.py", line 168, in main
    continue_process=True)
  File "/home/prateek/.local/lib/python3.6/site-packages/newsplease/crawler/commoncrawl_crawler.py", line 320, in crawl_from_commoncrawl
    log_pathname_fully_extracted_warcs=__log_pathname_fully_extracted_warcs)
  File "/home/prateek/.local/lib/python3.6/site-packages/newsplease/crawler/commoncrawl_crawler.py", line 230, in __start_commoncrawl_extractor
    log_pathname_fully_extracted_warcs=__log_pathname_fully_extracted_warcs)
  File "/home/prateek/.local/lib/python3.6/site-packages/newsplease/crawler/commoncrawl_extractor.py", line 338, in extract_from_commoncrawl
    self.__run()
  File "/home/prateek/.local/lib/python3.6/site-packages/newsplease/crawler/commoncrawl_extractor.py", line 292, in __run
    self.__process_warc_gz_file(local_path_name)
  File "/home/prateek/.local/lib/python3.6/site-packages/newsplease/crawler/commoncrawl_extractor.py", line 231, in __process_warc_gz_file
    for record in ArchiveIterator(stream):
  File "/home/prateek/.local/lib/python3.6/site-packages/warcio/archiveiterator.py", line 110, in _iterate_records
    self.record = self._next_record(self.next_line)
  File "/home/prateek/.local/lib/python3.6/site-packages/warcio/archiveiterator.py", line 262, in _next_record
    self.check_digests)
  File "/home/prateek/.local/lib/python3.6/site-packages/warcio/recordloader.py", line 88, in parse_record_stream
    known_format))
  File "/home/prateek/.local/lib/python3.6/site-packages/warcio/recordloader.py", line 243, in _detect_type_load_headers
    raise ArchiveLoadFailed(msg + str(se.statusline))
warcio.exceptions.ArchiveLoadFailed: Unknown archive format, first line: ['<?xml', 'version="1.0"', 'encoding="UTF-8"?>']

What is the error here, and how can I resolve it?
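The last line of the traceback shows warcio reading `<?xml` as the first line of the local file, which suggests that the file in ./cc_download_warc/ is an XML document rather than a gzip-compressed WARC archive. A minimal sketch for checking this by looking at the first bytes of the file follows; the helper name `sniff_archive` is made up for illustration, and the path is the one reported in the log.

```python
import os


def sniff_archive(path):
    """Classify a file by its first bytes: 'gzip', 'xml', 'warc' or 'unknown'."""
    with open(path, "rb") as fh:
        head = fh.read(8)
    if head.startswith(b"\x1f\x8b"):   # gzip magic number, expected for .warc.gz
        return "gzip"
    if head.startswith(b"<?xml"):      # XML prolog, as seen in the traceback
        return "xml"
    if head.startswith(b"WARC/"):      # uncompressed WARC record header
        return "warc"
    return "unknown"


if __name__ == "__main__":
    # Path copied from the "found local file" log line.
    local_file = "./cc_download_warc/https%3A%2F%2Fcommoncrawl.s3.amazonaws.com%2F"
    if os.path.exists(local_file):
        print(local_file, "->", sniff_archive(local_file))
```

If this reports `xml`, the downloaded file is not an archive at all, so the file may need to be deleted and downloaded again before extraction can succeed.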

https://github.com/fhamborg/news-please says to adapt the config section in newsplease/examples/commoncrawl.py. What does this mean?
I copied the configuration from this file and pasted it into config.cfg, which is in the newsplease/config directory. Is this what they instructed, or have I made a mistake here?
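The names printed at the start of the log (`my_local_download_dir_warc` and so on) suggest that the "config section" consists of plain Python variables near the top of commoncrawl.py itself, meant to be edited in place. The sketch below assumes that reading; the names and values are copied from the log lines, and the real file may contain more settings or slightly different names.

```python
# Assumed excerpt of the config section in newsplease/examples/commoncrawl.py.
# Names and values are taken from the startup lines of the log above;
# they are an illustration, not a copy of the actual file.

# Directory where the downloaded WARC files are stored.
my_local_download_dir_warc = "./cc_download_warc/"
# Directory where the extracted articles are written.
my_local_download_dir_article = "./cc_download_articles/"
# Whether a WARC file is deleted once its articles have been extracted.
delete_warc_after_extraction = False
# Number of parallel extraction processes.
my_number_of_extraction_processes = 1
```

Under this reading, the values are changed directly in the Python file before running `python -m newsplease.examples.commoncrawl`.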

I am using Python 3.6. It is the only Python installed on my machine.

This error is caused by the libraries that newsplease depends on. The mistake happens when you install each library manually without paying attention to the package versions. The version of every library is given in the setup.py file, so install exactly the versions listed there. There may also be problems when executing setup.py.

So use this command:

python3 setup.py install
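After installing, it can help to confirm that the installed versions really match the ones in setup.py. The following is a minimal sketch, not part of newsplease: the helper `check_pins` is hypothetical, and the package names and version numbers in the usage example are placeholders to be replaced with the values from your copy of setup.py. `importlib.metadata` requires Python 3.8 or later; on Python 3.6 the `importlib_metadata` backport offers the same functions.

```python
try:
    from importlib.metadata import PackageNotFoundError, version
except ImportError:  # Python < 3.8: fall back to the importlib_metadata backport
    from importlib_metadata import PackageNotFoundError, version


def check_pins(pins):
    """Return {package: (wanted, installed)} for every package whose installed
    version differs from the wanted one; installed is None if it is missing."""
    mismatches = {}
    for name, wanted in pins.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None
        if installed != wanted:
            mismatches[name] = (wanted, installed)
    return mismatches


if __name__ == "__main__":
    # Placeholder pins: replace with the versions listed in setup.py.
    pins = {"warcio": "0.0.0", "requests": "0.0.0"}
    for name, (wanted, installed) in check_pins(pins).items():
        print("{}: wanted {}, installed {}".format(name, wanted, installed))
```

An empty result means every listed package is installed at the requested version.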

If you need to uninstall all previously installed versions of the packages, run:

pip3 freeze --user | xargs pip3 uninstall -y

There are other ways to do this as well.
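Because the uninstall pipeline above removes every user-level package, it may be worth listing what it would remove before running it. The sketch below only lists package names and uninstalls nothing; the helpers `parse_freeze` and `user_packages` are made up for illustration, and the script assumes pip is available for the interpreter running it.

```python
import subprocess
import sys


def parse_freeze(text):
    """Extract package names from `pip freeze` output."""
    names = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(("#", "-e")):
            continue  # skip blank lines, comments and editable installs
        # Lines look like "name==1.2.3" or "name @ file:///path".
        names.append(line.split("==")[0].split(" @ ")[0].strip())
    return names


def user_packages():
    """Return the names of packages installed in the user site directory."""
    result = subprocess.run(
        [sys.executable, "-m", "pip", "freeze", "--user"],
        stdout=subprocess.PIPE,
        universal_newlines=True,
        check=True,
    )
    return parse_freeze(result.stdout)


if __name__ == "__main__":
    for name in user_packages():
        print(name)
```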
