BizTalk server problem

We have a BizTalk server (a single virtual one!) at our company, and a SQL server where the data is kept. We have a lot of data traffic; I'm talking about hundreds of thousands of messages. So I'm actually not even sure one server is safe enough, but our company is not that easy to convince.

Recently we have been having a lot of problems.

Allow me to describe the situation in detail, so I don't miss anything:

Our server has 5 applications:

  • One with 3 orchestrations, 12 send ports, 16 receive locations.
  • One with 4 orchestrations, 32 send ports, 20 receive locations.
  • One with 4 orchestrations, 24 send ports, 20 receive locations.
  • One with 47 (yes, 47) orchestrations, 37 send ports, 6 receive locations.
  • One common application with a couple of shared resources.

Our problems have occurred since we deployed the application with the 47 orchestrations. A lot of these orchestrations use message assignment shapes with C# code to do the mapping. This is because we use HL7 extensions, which are a special case: many of these schemas look alike, so doing the mapping with C# code and XPath was a lot easier. The C# reads in XmlNodes retrieved through XPath and returns XmlNodes, which are then assigned back to BizTalk messages. I'm not sure if this could be the cause, but I thought I'd mention it.
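For context, the pattern looks roughly like the sketch below; the helper class, schema, and segment names are hypothetical, made up for illustration:

    // Illustrative sketch only: the kind of XPath-based helper we call from
    // a message assignment shape (class and node names are hypothetical).
    using System.Xml;

    public static class Hl7MappingHelper
    {
        // Extracts a segment from the source document so the orchestration
        // can assign it into an outgoing BizTalk message.
        public static XmlNode MapPatientSegment(XmlDocument source)
        {
            XmlNode pid = source.SelectSingleNode(
                "//*[local-name()='PID_PatientIdentification']");
            XmlDocument result = new XmlDocument();
            result.AppendChild(result.ImportNode(pid, true));
            return result.DocumentElement;
        }
    }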

The send and receive ports have a lot of different types: File, MQSeries, SQL, MLLP, FTP. Each of these types has its own host instance, to balance out the load. Our orchestrations use the BiztalkApplication host.

On this server a couple of scripts are also running, mostly FTP upload scripts and a zipper script, which every half hour zips files into a daily zip and deletes the zip files after a month. We use this zip script on our backup files (we back up a lot, and the backups are also on this server). We did this because the server had problems sending files to a location that contained a lot (A LOT) of files; after the files were reduced to zips it went better.

The problems we are having recently are mainly two:

  • Our most important problem is the following. We kept a receive location with a lot of messages on a queue for testing. After we start this receive location, which uses the 47 orchestrations, the number of running service instances starts to skyrocket. OK, this is pretty normal. Let's say it reaches about 10,000, and then we stop the receive location to see how BizTalk handles these 10,000 instances. Normally they would go down pretty fast, and sometimes they do, but after a while processing starts to "throttle": the instances just stop being processed and the count stays at the same number. For example, in 30 seconds it goes down from 10,000 to 4,000, then it stays at 4,000 and decreases very, very slowly, like 30 instances in 5 minutes. This means that all the service instances of the other applications are also stuck here, and they are not processed either.

We noticed that after restarting our host instances the instance count went down fast again. So we tried selectively restarting different host instances to locate the problem. We noticed that restarting the file send/receive host instance would eventually do the trick, so we thought file sends were the problem, considering that we make a lot of backups. So we replaced the file-type backups with MQSeries backups. The same problem occurred, and funnily enough, restarting the file send/receive host still fixes the problem.

No errors can be found in the event viewer either.

  • A second problem we're having: sometimes, at around 6 AM, all or some of the host instances are stopped.

In the event viewer we noticed the following errors (there is more than one of each):

The receive location "MdnBericht SQL" with URL "SQL://ZNACDBPEG/mdnd0001/" is shutting down. Details: "The error threshold has been exceeded. The receive location is shutting down.".

The Messaging Engine failed to add a receive location "M2m Othello Export Start Bestand" with URL "\\m2mservices\Othello_import$\DataFilter Start\*.xml" to the adapter "FILE". Reason: "The FILE adapter cannot access the folder \\m2mservices\Othello_import$\DataFilter Start. Verify this folder exists. Error: Logon failure: unknown user name or bad password.".

The FILE adapter cannot access the folder \\m2mservices\Othello_import$\DataFilter Start. Verify this folder exists. Error: Logon failure: unknown user name or bad password.

An attempt to connect to "BizTalkMsgBoxDb" SQL Server database on server "ZNACDBBTS" failed. Error: "Login failed for user ''. The user is not associated with a trusted SQL Server connection."

It would seem that there's a login failure at this time, and that because of it other services are also experiencing problems and are eventually shut down.

The thing is, our user is an admin, and it's impossible that its password is wrong "sometimes". We have considered that the problem could be due to an infrastructure issue, but that's not really our department.

I know it's a long post, but we're not sure anymore what to do. Would adding another server and balancing the load solve our problems? Is there a way to measure our load and know where to start splitting? What are normal load numbers, etc.?

I appreciate any answers because these issues are getting worse and we're also on a deadline.

Thanks a lot for replies!

Your immediate problem is the BizTalk throttling feature. It's supposed to help BizTalk survive temporary overload conditions. One of its many problems is that you can see throttling kick in only in Performance Monitor, not in the event log.

What you should do:

  1. Separate the new application into a different host from the rest of the applications. Throttling is done at the host level, so the problematic application won't affect the rest of the applications.
  2. Read about how to disable throttling at the link above.
  3. What we have done is implement an external throttling service that feeds the BizTalk receive location in small, digestible packets (see the sketch after this list). It's ugly, but the problem is ugly.
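To illustrate that last point, here is a minimal sketch of such a drip feeder, assuming a file receive location; the folder paths, batch size, and polling interval are made up:

    // Minimal drip-feeder sketch (paths and numbers are hypothetical):
    // moves pending files into the BizTalk receive folder in small batches
    // so the host never sees the whole backlog at once.
    using System;
    using System.IO;
    using System.Linq;
    using System.Threading;

    class DripFeeder
    {
        static void Main()
        {
            const string staging = @"C:\Staging";          // where producers drop files
            const string receive = @"C:\BizTalk\Receive";  // the file receive location
            const int batchSize = 50;

            while (true)
            {
                // Only top up once BizTalk has drained the previous batch.
                if (Directory.GetFiles(receive).Length == 0)
                {
                    foreach (string file in Directory.GetFiles(staging).Take(batchSize))
                        File.Move(file, Path.Combine(receive, Path.GetFileName(file)));
                }
                Thread.Sleep(TimeSpan.FromSeconds(30));
            }
        }
    }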

Update to comment: You have enough host instances, so ignore that advice. You may redistribute the applications between the instances, but there are no clear guidelines for doing that, so it's just shuffling and guessing.
About the safety of disabling throttling: this feature doesn't make much sense in many scenarios, so you have to study it. Check which of the throttling parameters you are hitting (this can be seen in Performance Monitor) and decide how to change the thresholds.
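For example, the current throttling state can be read from the same counters Performance Monitor shows. A small sketch, assuming the standard BizTalk performance counters are installed; the host name "BiztalkApplication" is taken from the question:

    // Reads the throttling-state counters for one host instance.
    // A non-zero value means the host is currently throttling, and the
    // value indicates which threshold was hit.
    using System;
    using System.Diagnostics;

    class ThrottlingCheck
    {
        static void Main()
        {
            string host = "BiztalkApplication";
            var delivery = new PerformanceCounter(
                "BizTalk:Message Agent", "Message delivery throttling state", host);
            var publishing = new PerformanceCounter(
                "BizTalk:Message Agent", "Message publishing throttling state", host);

            Console.WriteLine("Delivery throttling state:   " + delivery.NextValue());
            Console.WriteLine("Publishing throttling state: " + publishing.NextValue());
        }
    }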

How many host instances do you have?

From the line:

The send and receive ports have a lot of different types: File, MQSeries, SQL, MLLP, FTP. Each of these types has its own host instance, to balance out the load. Our orchestrations use the BiztalkApplication host.

It sounds like you have a lot. I recently did an audit of a system where BizTalk was self-throttling, and the issue was partly due to too many host instances. Each host instance places its own load on the BizTalk MessageBox, as well as chewing up a minimum of 200 MB of memory.

Reading your comment, you have 20 - this is too many and would be a big part of your problems.

A good starting host setup would be:

  • A dedicated tracking host
  • One host that contains all receive handlers for adapters
  • One host that contains all orchestrations
  • One host that contains all send handlers for adapters
  • One host for adapters that need to be clustered (like FTP and MSMQ)

You can then also consider things like introducing "real time" hosts and batched hosts, so you can tune the real time hosts for low latency.

You can also have hosts for specific applications if they are known to be unstable, but in general this should not be done.

I run a BizTalk system that has similar problems and can empathize with what you are seeing. I don't know if it's the same issue, but I thought I'd share my experience in case it is.

In the same manner, restarting the send/receive host seems to fix the problem for us. In my case I found a direct correlation with memory usage by the host processes. I used performance counters to see when a given host was throttled for memory. By creating extra hosts and moving orchestrations and ports between them, I was able to narrow down which business sets were causing the problem. Basically, in my case restarting the hosts was the equivalent of the ultimate "garbage collection" to free up memory. This worked, of course, only until enough instances came through to gobble it up again.

I'm afraid I have not solved the issue yet, but here are a few things I found that alleviate it:

  1. Raise the memory threshold for a given process so that throttling does not occur, or occurs later.
  2. Each host instance, while informative, does add overhead. Try combining hosts that are not your problem children to reduce the memory footprint.
  3. Throw hardware at the problem; RAM is cheap.
  4. I measure the following every few minutes in perfmon so I can diagnose where the problem is (a collection sketch follows the list):

    BizTalk:MessageAgent(*)\Process memory usage (MB)

    BizTalk:MessageAgent(*)\Process memory usage threshold

    Memory\Available MBytes
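If it helps, those counters can also be sampled to a CSV without babysitting perfmon. A minimal sketch; the host instance name and output path are assumptions, and the counter category appears in perfmon as "BizTalk:Message Agent":

    // Sketch: append the memory counters to a CSV every two minutes
    // (host instance name and output path are hypothetical).
    using System;
    using System.Diagnostics;
    using System.IO;
    using System.Threading;

    class MemorySampler
    {
        static void Main()
        {
            var usage = new PerformanceCounter("BizTalk:Message Agent",
                "Process memory usage (MB)", "BiztalkApplication");
            var threshold = new PerformanceCounter("BizTalk:Message Agent",
                "Process memory usage threshold", "BiztalkApplication");
            var available = new PerformanceCounter("Memory", "Available MBytes");

            while (true)
            {
                File.AppendAllText(@"C:\PerfLogs\biztalk-memory.csv",
                    string.Format("{0:u},{1},{2},{3}\r\n", DateTime.Now,
                        usage.NextValue(), threshold.NextValue(), available.NextValue()));
                Thread.Sleep(TimeSpan.FromMinutes(2));
            }
        }
    }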

A few other things to take a look at. Make sure any custom pipelines use good BizTalk memory practices (i.e., no XML DOM manipulation hiding somewhere, etc.). Also, theoretically, reducing the number of threads for a given host should lower the amount of memory it can seize at one time; I did not seem to have much luck with this one. Maybe the BizTalk throttling overrode it, as others have mentioned, I don't know. On a final note, if you dump the perfmon results to a CSV, you can make some pretty memory usage graphs with Excel. These might be useful for talking to management about buying more hardware. That's assuming your issue fits this scenario as well.
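To make the DOM point concrete: loading the body stream into an XmlDocument materializes the entire message in memory for every service instance, while a streaming XmlReader keeps the footprint flat. A simplified sketch, leaving out the full pipeline-component plumbing:

    using System.IO;
    using System.Xml;

    static class PipelineMemoryPractice
    {
        // Avoid: new XmlDocument().Load(body) inside a pipeline component -
        // it holds the whole message in memory per service instance.
        //
        // Prefer: stream through the message with XmlReader.
        public static bool ContainsSegment(Stream body, string localName)
        {
            using (XmlReader reader = XmlReader.Create(body))
            {
                while (reader.Read())
                {
                    if (reader.NodeType == XmlNodeType.Element &&
                        reader.LocalName == localName)
                        return true;
                }
            }
            return false;
        }
    }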

We fixed the problem temporarily thanks to a combination of all your answers.

We set the process memory usage throttling parameters of some hosts higher.

We rebalanced the host instances after I analyzed the memory usage of all hosts, thanks to performance counters and a tool called MsgBoxViewer.

And now we're trying to get more physical memory, and hopefully also an extra server or a 64-bit server.

Thanks for all replies!

We recently installed a 64-bit server in a cluster with our older server. Thanks to this we can balance the memory even better, which solved a lot of problems.

The 64-bit server didn't give us much improvement otherwise (except for a bit more memory), since it can't use 64 bits for IBM MQ, MLLP, HL7 pipelines, etc...

The other answers are helpful for run-time performance tuning, but I would recommend a design change as well.

You say that you do a lot of message manipulation in the orchestrations, in the message assignment shapes.

I would recommend moving that code to dedicated transforms. They are much more lightweight and can be executed faster. You can combine custom XSLT and C# in these maps to do the hard work. Orchestrations cost more in development, design, and testing, and a whole lot more in run-time performance.
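As an illustration, a BizTalk map can carry that logic as custom XSLT with inline C# in an msxsl:script block. A minimal sketch; the namespaces, element names, and helper function are hypothetical:

    <?xml version="1.0" encoding="utf-8"?>
    <xsl:stylesheet version="1.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
        xmlns:msxsl="urn:schemas-microsoft-com:xslt"
        xmlns:user="urn:my-scripts"
        exclude-result-prefixes="msxsl user">

      <!-- Inline C# doing the fiddly formatting work instead of an assign shape -->
      <msxsl:script language="C#" implements-prefix="user">
        <![CDATA[
        public string FormatPatientId(string raw)
        {
            return raw.Trim().PadLeft(10, '0');
        }
        ]]>
      </msxsl:script>

      <xsl:template match="/">
        <PatientRecord>
          <Id>
            <xsl:value-of select="user:FormatPatientId(string(//PID/PID.3))"/>
          </Id>
        </PatientRecord>
      </xsl:template>
    </xsl:stylesheet>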

You can then use transforms for message transformation, and leave the orchestrating (what is left of it after moving the message assignment code) to the orchestrations.

The added benefit of using transforms over orchestrations is that they are much more testable.
