How to read an HTML table in Pandas efficiently?
Small HTML tables read into pandas just fine, but a large file (about 10 MB, roughly 10,000 rows/records) leaves me waiting ten minutes with no progress, while the same data in CSV parses quickly.
Please help me speed up reading the HTML table in pandas, or convert it to CSV.
import pandas as pd

file = 'testfile.html'
dfdefault = pd.read_html(file, header=0, match='Client Inventory Details')
#print(dfdefault)
df = dfdefault[0]
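Regarding the "or convert it to CSV" route: one way to sidestep read_html entirely is to stream the table through Python's standard-library html.parser and write rows straight to CSV, never building a full document tree. This is a minimal sketch with hypothetical class and variable names; it assumes a single simple table (plain tr/td/th cells, no nested tables or colspans).

```python
import csv
import io
from html.parser import HTMLParser

class TableToCSV(HTMLParser):
    """Stream an HTML table's rows into a csv.writer."""
    def __init__(self, writer):
        super().__init__()
        self.writer = writer
        self.row = None          # current row's cells, or None outside <tr>
        self.in_cell = False
        self.cell_text = []

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.row = []
        elif tag in ("td", "th"):
            self.in_cell = True
            self.cell_text = []

    def handle_data(self, data):
        if self.in_cell:
            self.cell_text.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self.row.append("".join(self.cell_text).strip())
            self.in_cell = False
        elif tag == "tr" and self.row is not None:
            self.writer.writerow(self.row)
            self.row = None

# Tiny demonstration table; for a real 10 MB file, feed() it in chunks
# and pass a writer backed by an open CSV file instead of a StringIO.
html = "<table><tr><th>a</th><th>b</th></tr><tr><td>1</td><td>2</td></tr></table>"
buf = io.StringIO()
parser = TableToCSV(csv.writer(buf))
parser.feed(html)
print(buf.getvalue().splitlines())
```

Once the data is in CSV, pd.read_csv handles it quickly, as the question already observed.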
An HTML dataset is still a dataset. To read large datasets into Pandas faster, you can choose among several strategies, and they apply to read_html as well:
1. Sampling
2. Chunking
3. Optimizing pandas dtypes
import pandas as pd
import random

filename = "data.csv"
n = sum(1 for line in open(filename)) - 1  # number of data rows in the file
s = n // 10                                # desired sample size: 10% of the rows
# Skip n-s randomly chosen rows; range starts at 1 (n+1 compensates for the
# header) so row 0, the header, is never skipped.
skip = sorted(random.sample(range(1, n + 1), n - s))
df = pd.read_csv(filename, skiprows=skip)
import pandas as pd
from numpy import mean
from sklearn.linear_model import LogisticRegression

datafile = "data.csv"
chunksize = 100000
models = []
for chunk in pd.read_csv(datafile, chunksize=chunksize):
    chunk = pre_process_and_feature_engineer(chunk)  # a function to clean the data and create the features
    model = LogisticRegression()
    model.fit(chunk[features], chunk['label'])
    models.append(model)

df = pd.read_csv("data_to_score.csv")
df = pre_process_and_feature_engineer(df)
predictions = mean([model.predict(df[features]) for model in models], axis=0)
Another way to greatly reduce the size of a Pandas DataFrame is to convert columns of dtype object to category.
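The effect of the object-to-category conversion is easy to measure. A small sketch (the column name and data are made up for illustration): a low-cardinality string column stored as object keeps one Python string per cell, while category stores each distinct value once plus small integer codes.

```python
import pandas as pd

# Hypothetical DataFrame with a repetitive string column stored as object.
df = pd.DataFrame({"status": ["open", "closed", "pending"] * 100_000})
before = df["status"].memory_usage(deep=True)

# Convert the object column to the category dtype.
df["status"] = df["status"].astype("category")
after = df["status"].memory_usage(deep=True)

print(f"object: {before} bytes, category: {after} bytes")
```

The conversion pays off only when the column has few distinct values relative to its length; a column of unique strings gains nothing.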