[Nsi-wg] Call Tomorrow and Agenda
John MacAuley
john.macauley at surfnet.nl
Wed Feb 8 09:10:16 EST 2012
Slides for today.
On 2012-02-08, at 2:59 AM, Inder Monga wrote:
>
> Hi all,
>
> The following is dial-in information for Wednesday's NSI call, time: 7:00 PDT 10:00 EDT, 15:00 GMT, 16:00 CET, 24:00 JST
>
> 1. Dial Toll-Free Number: 866-740-1260 (U.S. & Canada)
> 2. International participants dial: Toll Number: 303-248-0285, or International Toll-Free Number: http://www.readytalk.com/intl
> 3. Enter 7-digit access code 8937606, followed by “#”
>
> Agenda:
>
> 1. Firewall issues: John Macauley
> 2. Error Handling: Henrik
> 3. Other topics
>
> Thanks
> Inder
>
>
> Henrik's email attached:
>
> --
>
>
> Failure scenarios and recovery for the NSI protocol version 1.0 and 1.1
>
> == Introduction ==
>
> The main focus will be on the control plane interaction, and how to deal with
> message loss, crashes, and how to recover from them.
>
> With the exception of the forcedEnd primitives, all NSI control plane
> message interactions happen like this:
>
>   Requester NSA                     Provider NSA
>
>     operation              ->
>                            <-   operation received ack
>                            <-   operation result
>     operation result ack   ->
>
> The main idea behind separating the operation from the operation result is
> that a significant amount of time may pass between them. This is especially
> true of the provision operation, where the operation and its result can be
> days or months apart.
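
The four-message exchange above can be sketched from the requester's side as a small state machine. This is only an illustration in Python, not part of the NSI specification: the state and event names are invented here, and the shortcut from waiting for the ack straight to done reflects the note under scenario A that a lost ack can be ignored if the result arrives in time.

```python
from enum import Enum, auto


class RequesterState(Enum):
    IDLE = auto()
    AWAITING_ACK = auto()     # operation sent, no "operation received ack" yet
    AWAITING_RESULT = auto()  # ack seen; the result may be days or months away
    DONE = auto()             # result received (and the result ack sent)


# Allowed transitions, keyed by (current state, event). Event names are hypothetical.
TRANSITIONS = {
    (RequesterState.IDLE, "send_operation"): RequesterState.AWAITING_ACK,
    (RequesterState.AWAITING_ACK, "recv_received_ack"): RequesterState.AWAITING_RESULT,
    # The result can arrive even though the ack was lost.
    (RequesterState.AWAITING_ACK, "recv_result"): RequesterState.DONE,
    (RequesterState.AWAITING_RESULT, "recv_result"): RequesterState.DONE,
}


def step(state, event):
    """Return the next requester state, or raise ValueError for an unexpected event."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"event {event!r} not expected in state {state.name}")
```

Timeouts are deliberately left out of this sketch; they belong to the failure handling discussed further down.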
>
> For failure scenarios, the loss of any of the four messages should be
> considered along with crashes of one or both of the NSAs at any point in time.
> These failure scenarios can be generalized into the availability of an NSA,
> i.e., it does not matter whether it is the network or the NSA that is down;
> the distinction is whether the NSA received the message or not.
>
> In general the problem is to ensure that the (intended) state of a connection
> is kept in sync. There are two significant problems in the current protocol:
>
> * No clear semantics for the operation received ack
> * No clear division of responsibility between requester and provider
>
> Both of these are semantic issues (i.e., behavior), and hence solving them
> should not require any changes to the wire protocol.
>
> From a theoretical point of view, and assuming an asynchronous network model
> (note that async means something else in distributed systems than in
> networks), the problem is impossible to solve. Taking a slightly less
> pessimistic view (i.e., a partially synchronous network model), it becomes
> possible to recover from some failures. Taking a pragmatic approach, most
> errors are recoverable, given that the network and NSAs become functional
> again at some point in time.
>
>
> == Control Plane Failure Scenarios & Recovery ==
>
> The following goes through a range of failure scenarios and describes how to
> recover from them. Note that some of the scenarios can be solved in multiple
> ways. I've taken the approach that it is the responsibility of the requester to
> ensure that the connection at the provider is in the required state.
>
> A: Requester NSA did not receive the operation received ack.
>
> Note: This failure is equivalent to not being able to dispatch the message
> (here the failure just occurs earlier).
>
> Note: If the operation result message is received within the timeout, this case
> can be ignored.
>
> Potential causes: Message loss, network outage, provider NSA is down
>
> If the requester NSA has not received the operation received ack after a
> certain amount of time, it must assume that the connection cannot be created
> or its state changed. This can be dealt with in multiple ways:
>
> 1. Do nothing (hope it comes up again).
> 2. Find an alternative circuit.
> 3. Tear down the connection and send operation failure up the tree.
>
> Which strategy to choose here is policy dependent and is up to the individual
> implementation and organization. OpenNSA currently does 3.
>
> To prevent stale connections, the requester can keep a list of "dead"
> connections. The status of these connections can then be checked at certain
> intervals via query, and a control primitive for fixing the status sent if
> needed.
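
The "dead" connection list could look roughly like the following. This is a minimal sketch under assumptions, not how OpenNSA does it: `query` and `fix` are hypothetical callables standing in for the NSI query and for whichever control primitive restores the wanted state, and `query` is assumed to return None while the provider is still unreachable.

```python
class DeadConnectionTracker:
    """Requester-side list of connections whose state at the provider is unknown."""

    def __init__(self, query, fix):
        self._query = query  # hypothetical: connection_id -> state, or None if unreachable
        self._fix = fix      # hypothetical: (connection_id, desired_state) -> None
        self._dead = {}      # connection_id -> state the requester wants

    def mark_dead(self, connection_id, desired_state):
        self._dead[connection_id] = desired_state

    def pending(self):
        return list(self._dead)

    def reconcile(self):
        """Run one check pass; return the ids confirmed to be in the desired state."""
        resolved = []
        for connection_id, desired in list(self._dead.items()):
            actual = self._query(connection_id)
            if actual is None:
                continue  # provider still unreachable, try again next interval
            if actual != desired:
                # Send the fixing primitive, but keep the entry until a later
                # query confirms that the state really changed.
                self._fix(connection_id, desired)
                continue
            del self._dead[connection_id]
            resolved.append(connection_id)
        return resolved
```

Calling `reconcile()` from a periodic timer gives the "checked at certain intervals" behavior; the interval itself is a policy choice.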
>
>
> B: Provider NSA could not deliver the operation received ack
>
> This situation is a special case of scenario A, but seen from the provider
> point of view.
>
> Repeated delivery attempts can be tried, but this is only an incremental
> improvement/optimization and does not affect the end result.
>
> The provider should not try to change the state of the connection beyond what
> the latest received primitive from the requester asks for (do the least
> surprising thing). It is up to the requester to discover the current state
> (via query) and change it if needed.
>
> Since it is the responsibility of the requester to discover the state, there is
> no need for the provider to perform a "reverse query". In fact, using the
> reverse query for connection state updates may cause more harm than good:
> having the provider change the connection status automatically may not be what
> the requester wants (he might have compensated somehow), does not follow the
> principle of least surprise, and leaves control of the connection with two
> parties.
>
> Alternatively, a "Hi, I'm alive; sorry for the downtime" primitive could be
> introduced from provider to requester, which the requester can then use to fire
> off any controlling primitives. This is, however, just an optimization.
>
>
> C: Provider NSA could not deliver the operation result message
>
> This case should be handled as described in scenario B.
>
>
> D: Requester NSA did not receive the operation result message.
>
> This case should be handled as described in scenario A.
>
>
> E: Operation result ack was not received.
>
> This case should be handled as described in scenario B.
>
>
> == Data Plane Failure Scenarios & Recovery ==
>
> Data plane failures are somewhat different from control plane failures. I am
> not well-versed in networking and NRMs, but will try to come up with a
> strategy:
>
> In general, I see two sorts of failures:
>
> 1. The failure is happening in my local domain.
> 2. The failure is happening outside my local domain.
>
> This might be an overly simplistic view of things.
>
> We assume that any fail-over, etc. has also failed, so the failure cannot be
> corrected (if it can be corrected quickly, it probably should be).
>
> The further handling of a data plane failure will probably be policy dependent.
> For some users the network might be completely unusable after a failure,
> whereas others would like to try to have it repaired. However, trying to
> decide / figure out where and how this policy should be enforced is a rather
> tricky process, and probably out of scope of NSI for now.
>
> Instead I would suggest sending terminate messages downwards and forcedEnd
> upwards. Once this propagates to the initial requester a policy-correct action
> can be taken. I.e., convert a data-plane failure into a control-plane issue.
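
To make the suggested propagation concrete, here is a small sketch over a tree of NSAs. It rests on assumptions the text does not settle: the `Nsa` class is made up for illustration, terminate is assumed to travel down the whole subtree below the NSA that saw the failure, and what ancestors do with their other branches is left out, since that is the policy decision for the initial requester.

```python
class Nsa:
    """Hypothetical node in the request tree: parent is the requester, children are providers."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.children = []
        if parent is not None:
            parent.children.append(self)


def propagate_failure(origin):
    """Return the (sender, message, receiver) tuples sent for a failure seen at origin."""
    sent = []
    # terminate goes downwards through the subtree below the failing NSA
    stack = [origin]
    while stack:
        node = stack.pop()
        for child in node.children:
            sent.append((node.name, "terminate", child.name))
            stack.append(child)
    # forcedEnd goes upwards until it reaches the initial requester
    node = origin
    while node.parent is not None:
        sent.append((node.name, "forcedEnd", node.parent.name))
        node = node.parent
    return sent
```

For a failure seen at an aggregator with two providers below it and one requester above it, this yields two terminate messages followed by one forcedEnd.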
>
>
> == Recommendation / Action items ==
>
> * Make the exact semantics of the operation received ack clear
>
> Recommendation:
> - The message has been received (duh)
> - The request is sane
> - The request has been serialized (crash safe).
> - The specified connection exists (for provision, release, terminate)
> - The request was authorized
>
> This has the following implication:
> - Once the operation received ack has been received by the requester,
> the connection should show up in a query result. If we cannot expect
> the connection to show up after receipt of the ack, the primitive should
> be removed, as it has no semantic value.
> - Failing early will save message exchanges and time.
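
On the provider side, the recommended ack semantics amount to a handful of checks before the ack goes out. The following is a rough sketch under assumptions: requests are plain dicts, `is_authorized` is a hypothetical callable, the "reserve" operation name and the "reserving" state are placeholders, and writing a JSON line to a journal stands in for real crash-safe storage (an actual implementation would at least fsync the file).

```python
import json

# Operations that require the connection to exist already (from the list above).
EXISTING_CONNECTION_OPS = {"provision", "release", "terminate"}
# "reserve" is assumed here as the operation that creates a connection.
KNOWN_OPS = {"reserve"} | EXISTING_CONNECTION_OPS


def receive_operation(request, connections, journal, is_authorized):
    """Return (True, None) if the operation received ack may be sent, else (False, reason)."""
    operation = request.get("operation")
    connection_id = request.get("connection_id")
    if operation not in KNOWN_OPS or not connection_id:
        return False, "request is not sane"
    if not is_authorized(request):
        return False, "not authorized"
    if operation in EXISTING_CONNECTION_OPS and connection_id not in connections:
        return False, "unknown connection"
    # Record the request before acking, so it survives a crash of the provider.
    journal.write(json.dumps(request) + "\n")
    journal.flush()
    if operation == "reserve":
        # From here on the connection must show up in a query result.
        connections[connection_id] = "reserving"
    return True, None
```

Every failed check is reported instead of an ack, which is the "failing early" saving mentioned above.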
>
> * Make it clear which of the NSAs has the responsibility for what
>
> Recommendation:
> - The provider is the authority for connection status (duh)
> - Keeping connection state synchronized is the responsibility of the requester
>
> This has the following implication:
> - Any (non-scheduled) connection state change must only be done at the
> initiative of the requester
> - The requester query interface is not needed.
>
> --
>
> --
> Inder Monga
> 510-486-6531
> http://www.es.net
> Follow us on Twitter: ESnetUpdates/Twitter
> Visit our blog: ESnetUpdates Blog
>
> _______________________________________________
> nsi-wg mailing list
> nsi-wg at ogf.org
> https://www.ogf.org/mailman/listinfo/nsi-wg
-------------- next part --------------
A non-text attachment was scrubbed...
Name: NSI-firewall issues-v1.pdf
Type: application/pdf
Size: 1414400 bytes
Desc: not available
URL: <http://www.ogf.org/pipermail/nsi-wg/attachments/20120208/329dc2cf/attachment-0001.pdf>