Yet Another Non-Deterministic BizTalk Zombie Pattern

Victor Fehlberg’s article on Zombies has prompted me to blog about a similar problem I recently encountered with instances that have ‘completed without consuming all of their messages’, and to propose yet another BizTalk Zombie Pattern.

The Technet article on Zombies in BizTalk 2006 details three types of Zombie messages:

  • Terminate control messages;
  • Parallel listen receives; and
  • Sequential convoys with non-deterministic endpoints.

Victor’s article deals with the common pattern that falls into the third category: a while loop surrounding a listen shape, with a receive in one branch and, in the other, a delay followed by a shape that sets a variable telling the while loop to stop. This is non-deterministic because the delay could be triggered at the very moment a message is delivered.
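To make the race concrete, here is a minimal C# simulation of that pattern. This is not XLANG/s and nothing like BizTalk’s actual messaging engine; it just models a receive branch racing a delay branch, with all names illustrative:

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

class ListenDelayRaceDemo
{
    static void Main()
    {
        var inbox = new BlockingCollection<string>();

        // Simulated publisher: the next convoy message lands just after the delay fires.
        Task.Run(() =>
        {
            Thread.Sleep(1100);
            inbox.Add("late convoy message");
        });

        bool listening = true;
        while (listening)
        {
            // The "listen" shape: a receive branch racing a one-second delay branch.
            if (inbox.TryTake(out string msg, TimeSpan.FromSeconds(1)))
            {
                Console.WriteLine($"Received: {msg}");
            }
            else
            {
                // Delay branch won: set the variable that ends the while loop.
                listening = false;
            }
        }

        Thread.Sleep(500); // give the in-flight message time to arrive

        // In BizTalk the instance has completed by now, so this message is a zombie.
        Console.WriteLine($"Unconsumed messages: {inbox.Count}");
    }
}
```

Whether the loop exits cleanly or strands a message depends entirely on timing, which is exactly the non-determinism Victor describes.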

[Image: Non-Deterministic Orchestration - Creates Zombie]

But I believe I’ve discovered yet another non-deterministic pattern that will always produce zombies. It doesn’t quite fit the pattern described above or on Technet, but it does use sequential convoys and, in my opinion, is also non-deterministic.

The branching non-deterministic zombie pattern

Consider the orchestration in the image above: a sequential convoy (albeit without the while loop, but you get the idea) that can either collect the next message in the convoy (down the ELSE branch) or terminate the orchestration (down the TERMINATE branch).

If a second message is received and correlated to this particular orchestration, but the workflow logic has determined that we will go down the TERMINATE branch, the second message will never be received by the orchestration, causing a zombie. Every. Single. Time. In my opinion, this is non-deterministic in the truest sense: at the start of the orchestration you cannot determine whether the ELSE or the TERMINATE branch will be traversed!
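A minimal C# sketch of why this variant is a zombie factory follows. Again this is a simulation, not BizTalk code: the point is that the branch decision is computed from orchestration state alone, so any correlated message already in flight is stranded deterministically:

```csharp
using System;
using System.Collections.Concurrent;

class TerminateBranchDemo
{
    static void Main()
    {
        var correlatedQueue = new BlockingCollection<string>();

        // A second message is already in flight, matched to this instance's subscription.
        correlatedQueue.Add("second correlated message");

        // The decide shape evaluates state that has nothing to do with message arrival.
        bool takeTerminateBranch = true;

        if (takeTerminateBranch)
        {
            // TERMINATE branch: the instance ends without a receive, so the queued
            // message can never be consumed: a guaranteed zombie, every single time.
            Console.WriteLine($"Terminated with {correlatedQueue.Count} unconsumed message(s).");
        }
        else
        {
            // ELSE branch: the next convoy message is received and processed normally.
            Console.WriteLine($"Received: {correlatedQueue.Take()}");
        }
    }
}
```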

Is there a good reason for using this sort of pattern?

I’ve been trying to think why a sequential convoy would require this kind of branching functionality and I honestly can’t think of a good reason, unless the messages being delivered to the orchestration were controlled in some way. 

More to the point, I’m surprised that the XLANG/S compiler even lets you create an orchestration with the sort of potential damage that this pattern presents!

If you are using this kind of pattern, I’d be interested to know how (and why)?

And in case you’re wondering, a very similar pattern was noticed by our testing team, who encountered missing messages in our UAT environment; on closer inspection of the live environment, we had just over 10,500 suspended zombie instances… Ouch.

Debugging the problem took half a day and made me realise that we really do need a good tool for searching subscriptions by the context properties in a message, rather than the out-of-the-box Admin Console subscription viewer.
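Counting the damage, at least, can be scripted. Below is a hedged C# sketch using the BizTalk WMI provider’s MSBTS_ServiceInstance class to count suspended instances; the status values (4 for suspended-resumable, 32 for suspended-not-resumable) are from memory and worth verifying against your BizTalk version’s documentation:

```csharp
using System;
using System.Management;

class SuspendedInstanceCounter
{
    static void Main()
    {
        // The BizTalk WMI namespace on a machine with the admin tools installed.
        var scope = new ManagementScope(@"root\MicrosoftBizTalkServer");

        // 4 = Suspended (resumable), 32 = Suspended (not resumable); zombies show
        // up among these as instances that completed without consuming messages.
        var query = new ObjectQuery(
            "SELECT * FROM MSBTS_ServiceInstance WHERE ServiceStatus = 4 OR ServiceStatus = 32");

        using (var searcher = new ManagementObjectSearcher(scope, query))
        {
            Console.WriteLine($"Suspended instances: {searcher.Get().Count}");
        }
    }
}
```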

Sample project

If you’re interested in trying out the pattern for yourself, download the sample project and test messages.


2 thoughts on “Yet Another Non-Deterministic BizTalk Zombie Pattern”

  1. Joe Duffy, who is an expert on all things multi-threading and locking, and regularly writes for MSDN Magazine, wrote this once:

    13. A race condition or deadlock in library code is always a bug.

    http://www.bluebytesoftware.com/blog/PermaLink,guid,f8404ab3-e3e6-4933-a5bc-b69348deedba.aspx

    BizTalk is the library in this case, and this problem should have been better addressed by MS.

    A workaround for certain scenarios, much like Victor’s: disable the receive location from code in the orchestration, wait twice the cache interval (I forget why exactly), and asynchronously start an orchestration that re-enables the RL (a sketch of this call appears after the comments).

    Another workaround that I used for a singleton: a service window around midnight for the RL. Inside the delay branch of the listen shape you can check the time, and if you are inside the service window (take a margin here of plus or minus 1 or 2 minutes) you can more or less safely recycle. That solved the problem for me. I guess it is not bulletproof either, but it worked for us.

    Gregory

  2. Hello there.
    We have this kind of scenario in our orchestration.

    I agree, it looks awkward, but this is what our business process is.
    We are actually “lucky” enough to have this pattern twice in one orchestration:
    1. A loop in which we determine whether all the required data is available. If yes, we go further; if not, we send a request to the user and wait for an answer or a timeout. A user response starts the check cycle again.
    2. Another loop where the orchestration tries to send a message to an external system and waits for a response or times out. If there is a response, we exit the orchestration. If it times out (weird business logic), we keep waiting for an answer from the external system and, at the same time, for a command from the user to retry, or for the final timeout, which exits the orchestration.

    I can imagine that it is possible to decouple this orchestration and replace this non-deterministic logic with a series of publish-subscribe chains, but this orchestration is so complex and has so many variables/messages/operations/exception handlers aside from this weird pattern that I can’t figure it out…
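As a footnote to Gregory’s first workaround above, here is a minimal sketch of the receive-location toggle using the BizTalk WMI provider’s MSBTS_ReceiveLocation class, which exposes Enable and Disable methods. The location name is hypothetical, and his caveat about waiting out the subscription cache interval before re-enabling still applies:

```csharp
using System;
using System.Management;

class ReceiveLocationToggle
{
    static void SetEnabled(string locationName, bool enable)
    {
        var scope = new ManagementScope(@"root\MicrosoftBizTalkServer");

        // Find the receive location by name via the BizTalk WMI provider.
        var query = new ObjectQuery(
            $"SELECT * FROM MSBTS_ReceiveLocation WHERE Name = '{locationName}'");

        using (var searcher = new ManagementObjectSearcher(scope, query))
        {
            foreach (ManagementObject location in searcher.Get())
            {
                // MSBTS_ReceiveLocation exposes Enable() and Disable() WMI methods.
                location.InvokeMethod(enable ? "Enable" : "Disable", null);
            }
        }
    }

    static void Main()
    {
        // Hypothetical location name; disable it before recycling the singleton.
        SetEnabled("MyConvoyReceiveLocation", false);
    }
}
```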
