Python awswrangler failing to index dataframe

I'm using the awswrangler package to pass a pandas dataframe to AWS OpenSearch. This mostly works, but it seems to fail on rows where the column content is very large.

I've essentially extracted the contents of thousands of documents (PDF, CSV, TXT, etc.) and I'm trying to make that content searchable.

My main issue is that I don't get an error. I assume it is timing out, but I haven't had much luck with the documentation.

Has anyone got a suggestion or alternative method?

code:

import pandas as pd
import awswrangler as wr


# Connect to OpenSearch
client = wr.opensearch.connect(
    host='back end url',
    username='username',
    password='password'
)


# Read the dataframe from a pickle file
df = pd.read_pickle("outputs/final_pics/df.pkl")

# Create the index
wr.opensearch.create_index(
    client=client,
    index="indexname",
)


wr.opensearch.index_df(
    client,
    df=df,
    index="indexname",
    id_keys=["index_no"],
    # max_retries=3,  # doesn't seem to help
    bulk_size=1000,  # number of documents
    chunk_size=500
)

output:

Indexing:   0% (0/8536)|                                 |Elapsed Time: 0:00:00
Indexing:  11% (1000/8536)|###                           |Elapsed Time: 0:00:30
Indexing:  23% (2000/8536)|#######                       |Elapsed Time: 0:01:16
Indexing:  35% (3000/8536)|##########                    |Elapsed Time: 0:01:31
Indexing:  46% (4000/8536)|##############                |Elapsed Time: 0:01:47
Indexing:  58% (5000/8536)|#################             |Elapsed Time: 0:02:06
{'success': 5000, 'errors': []}

EDIT: I think the issue is the content being uploaded. If I replace the document contents with a simple string, it works. I'm still trying to narrow it down.
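
One way to narrow this down might be to measure how large each row is once it is serialized. The following is only a minimal sketch, assuming the rows are available as a list of dicts (for example via df.to_dict("records")). The helper name largest_records, the "content" column and the sample rows are made up for illustration; only index_no comes from the code above.

```python
import json


def largest_records(records, id_key="index_no", top=5):
    """Return (id, approx_size_in_bytes) pairs for the `top` largest records."""
    sizes = [
        (
            rec[id_key],
            # Approximate payload size: the record as UTF-8 encoded JSON.
            len(json.dumps(rec, ensure_ascii=False, default=str).encode("utf-8")),
        )
        for rec in records
    ]
    return sorted(sizes, key=lambda pair: pair[1], reverse=True)[:top]


# Made-up sample rows; "content" is a hypothetical column name.
rows = [
    {"index_no": 1, "content": "short text"},
    {"index_no": 2, "content": "x" * 10_000},
    {"index_no": 3, "content": "medium " * 50},
]

for doc_id, size in largest_records(rows, top=2):
    print(doc_id, size)
```

If the rows that never get indexed line up with the largest sizes, that would support the idea that document size, rather than a timeout alone, is what stops the upload.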

After a bit of searching, the fix came down to stripping or splitting out some of the content and changing the bulk_size parameter.

Splitting the larger documents by page, or into chunks of X words, seemed to do the trick and also made OpenSearch run quicker.
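
The word-based split could look roughly like the sketch below. It is not the exact code I used: the helper names (split_words, explode_records), the "content" column, the chunk_id and part fields, and the 500-word limit are all assumptions for illustration.

```python
def split_words(text, max_words):
    """Split text into pieces of at most `max_words` whitespace-separated words."""
    words = text.split()
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]


def explode_records(records, text_key="content", id_key="index_no", max_words=500):
    """Turn each record into one record per chunk, with a unique chunk_id."""
    out = []
    for rec in records:
        # Keep one (empty) chunk for records with no text so no row is lost.
        chunks = split_words(rec[text_key], max_words) or [""]
        for part, chunk in enumerate(chunks):
            new_rec = dict(rec)  # copy the other columns unchanged
            new_rec[text_key] = chunk
            new_rec["part"] = part
            new_rec["chunk_id"] = f"{rec[id_key]}-{part}"
            out.append(new_rec)
    return out


# Made-up sample row to show the shape of the result.
rows = [{"index_no": 7, "content": "a b c d e"}]
for rec in explode_records(rows, max_words=2):
    print(rec["chunk_id"], repr(rec["content"]))
```

With pandas, df.to_dict("records") gives the input and pd.DataFrame(...) turns the result back into a dataframe, which can then be passed to wr.opensearch.index_df with id_keys=["chunk_id"] and a smaller bulk_size. Keeping the original index_no on every chunk makes it possible to map search hits back to the source document.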

