javascript: how to most efficiently manage data import across two files
I have two API endpoints that I am polling via a node.js / coffeescript script: an /addresses
endpoint that returns a list of home addresses in a given city, and a /homevalue
endpoint that returns the value of a home at a given address.
I am polling each endpoint in series for a given city, let's say Buffalo. For auditing purposes, I am saving the content of each in local directories, at .../addresses/addresses.txt
and .../homeValues/homeValues.txt
. The script runs through all of the homes in a city, saves these to the addresses directory, then polls the /homevalue
endpoint and saves the results in a text file in the homeValues directory.
I then do some transformative work to convert both addresses and home values into a canonicalized format, saving each of these into a separate directory, .../canonicalAddresses
and .../canonicalHomeValues
. I then merge the canonical addresses and home values into a text file at .../unifiedAddresses/unifiedAddresses.txt
I cannot save these files as a single JSON document; I have to save them in a text file as a series of JSON objects, one per line. I am also doing this synchronously rather than async because I want to maintain an audit trail.
The canonicalized address file is a series of lines like:
{id: 12345, address: {...}}
{id: XYZAB, address: {...}}
The home values list is historical by year, and each line is a JSON array like:
[{id: 12345, homevalue: {year: 1990, ...}}, {id: 12345, homevalue: {year: 1991, ...}}, ...]
[{id: XYZAB, homevalue: {year: 1990, ...}}, {id: 12346, homevalue: {year: 1991, ...}}, ...]
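So each home-value line parses into an array of {id, homevalue} entries, and the entries for one address can be pulled out of a parsed line like this (a sketch; the function name is mine, and it assumes each line is valid JSON with a lowercase id field as in the samples above):

```javascript
// Each line of the home-values file is a JSON array of {id, homevalue}
// entries; return only the entries belonging to the given address id.
function valuesForAddress(line, addressID) {
  return JSON.parse(line).filter((entry) => entry.id === addressID);
}
```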
This is my greatly simplified pseudocode for that merge, which requires that I read both .../addresses/addresses.txt
and .../homeValues/homeValues.txt
from disk:
fs = require 'fs'

canonicalizedHomeValuesFile = "..."
canonicalizedAddressesFile = "..."
unifiedAddressFile = "..."

getHomeValue = (addressID) ->
  fs.readFileSync(canonicalizedHomeValuesFile).toString().split('\n').forEach (homevalue) =>
    << return the canonicalized home value if homevalue.ID is addressID >>

fs.readFileSync(canonicalizedAddressesFile).toString().split('\n').forEach (address) =>
  address.value = getHomeValue(address.ID)
  fs.appendFileSync(unifiedAddressFile, JSON.stringify(address) + "\n")
This approach works fine for small numbers of houses but is insanely slow when unifying large numbers of addresses. For about 2000 houses, this approach takes upwards of 4 minutes per house.
It seems to me the real bottleneck is the getHomeValue()
function. What is a more efficient way to approach that lookup?
If the data is large enough, it might be worth preloading the objects and then matching them using a binary search. It looks like you are loading the file from disk every time you get a home value, and you are also writing to disk on every iteration of the outer loop. If that is the case, consider writing the file once after all transactions are complete. I would minimize drive access as much as possible by batching the load and the save.