
Pandas function taking too long

I am trying to extract the top level URLs and ignore the paths. I am using the code below:

from urllib.parse import urlparse

for row in Mexico['Page URL']:
    parsed_uri = urlparse( 'http://www.one.com.mx/furl/Conteúdo Raiz/Meu' )
    Mexico['SubDomain'] = '{uri.scheme}://{uri.netloc}/'.format(uri=parsed_uri)

This script has been running for the past hour. When I ran it, it gave the following warning:

/anaconda/lib/python3.6/site-packages/ipykernel_launcher.py:3: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  This is separate from the ipykernel package so we can avoid doing imports until

I would appreciate it if anyone could advise on a quicker way, perhaps with pointers on the method the warning suggests.

Calling a Python function once for each row of a Series can be very slow if the Series is very long. The key to speeding this up is replacing the multiple function calls with (ideally) one vectorized function call.

When using Pandas, that means rewriting the Python function (e.g. urlparse) in terms of vectorized string functions.

Since urlparse is a fairly complicated function, rewriting it would be pretty hard. However, in your case we have the advantage of knowing that all the URLs we care about begin with https:// or http://. So we don't need urlparse in its full-blown generality. We can perhaps make do with a much simpler rule: the netloc is whatever characters follow https:// or http:// until the end of the string or the next /, whichever comes first. If that is true, then

Mexico['Page URL'].str.extract('(https?://[^/]+)', expand=False)

can extract all the netlocs from the entire Series Mexico['Page URL'] without looping and without multiple urlparse function calls. This will be much faster when len(Mexico) is big.
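
To put a rough number on that claim, here is a quick benchmark sketch of my own (the Series size and timing code are illustrative, not part of the original question); it compares a per-row urlparse call against a single str.extract call:

import time
from urllib.parse import urlparse

import pandas as pd

# A synthetic Series of repeated URLs; the size is an arbitrary choice for the benchmark.
urls = pd.Series(['http://www.one.com.mx/furl/Conteúdo Raiz/Meu'] * 100_000)

# Per-row approach: one Python-level urlparse call per element.
start = time.perf_counter()
looped = urls.apply(lambda u: '{0.scheme}://{0.netloc}'.format(urlparse(u)))
print('apply + urlparse:', time.perf_counter() - start)

# Vectorized approach: one str.extract call over the whole Series.
start = time.perf_counter()
extracted = urls.str.extract('(https?://[^/]+)', expand=False)
print('str.extract:     ', time.perf_counter() - start)

# Both approaches should agree on URLs of this shape.
assert looped.equals(extracted)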


For example,

import pandas as pd

Mexico = pd.DataFrame({'Page URL':['http://www.one.com.mx/furl/Conteúdo Raiz/Meu',
                                   'https://www.one.com.mx/furl/Conteúdo Raiz/Meu']})

Mexico['SubDomain'] = Mexico['Page URL'].str.extract('(https?://[^/]+)', expand=False)
print(Mexico)

yields

                                        Page URL               SubDomain
0   http://www.one.com.mx/furl/Conteúdo Raiz/Meu   http://www.one.com.mx
1  https://www.one.com.mx/furl/Conteúdo Raiz/Meu  https://www.one.com.mx
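
As for the SettingWithCopyWarning mentioned in the question: it typically appears because Mexico is itself a slice of another DataFrame, so writing a new column into it may modify (or silently fail to modify) the wrong object. Here is a minimal sketch, assuming a hypothetical parent frame all_data from which Mexico was filtered; either take an explicit copy before adding the column, or assign through .loc on the parent, which is the pattern the warning message recommends:

import pandas as pd

all_data = pd.DataFrame({
    'Country': ['Mexico', 'Mexico', 'Brazil'],
    'Page URL': ['http://www.one.com.mx/furl/Conteúdo Raiz/Meu',
                 'https://www.one.com.mx/furl/Conteúdo Raiz/Meu',
                 'https://www.one.com.br/algo'],
})

# Option 1: make Mexico an independent copy, so adding a column
# no longer writes into a slice of all_data.
Mexico = all_data[all_data['Country'] == 'Mexico'].copy()
Mexico['SubDomain'] = Mexico['Page URL'].str.extract('(https?://[^/]+)', expand=False)

# Option 2: write the new column directly into the parent frame with .loc.
mask = all_data['Country'] == 'Mexico'
all_data.loc[mask, 'SubDomain'] = all_data.loc[mask, 'Page URL'].str.extract('(https?://[^/]+)', expand=False)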
