Frustrating network instability solved

Discussion in 'C-Bus Toolkit and C-Gate Software' started by ssaunders, Jan 13, 2009.

  1. ssaunders

    ssaunders

    Joined:
    Dec 17, 2008
    Messages:
    242
    Likes Received:
    35
    Location:
    Melbourne
    Thought I would share an experience in case it saves someone else from slowly going mad... Check your MMI rates.

    I have a small network (1 x CNI, 1 x AUX, 1 x L5508D1P, 1 x DLT, 3 x 4-gang Saturn, 2 x 2-gang Saturn, dedicated PC with HomeGate / Toolkit, and ~80 metres of cable), which will be the start of something bigger. This install has gone into a partial renovation, so choosing a cabling route that avoided proximity to other wiring was tough.

    The problem: Quite high error rates in the Diagnostic Utility (our experts here report that anything more than an error every three seconds will give trouble; it does, btw). I had an error rate of just under one every three seconds at times. HomeGate was frequently dropping the network offline, usually three or four times an hour, and the number of drop-offs increased as I increased the frequency of network syncs in C-Gate.

    To resolve this I tried the following:

    First: Split off parts of the network to identify cable issues.
    This worked, but didn't solve anything. I would segregate a network portion, see the error rates drop to zero, re-do all connections on that cable run, then put everything back together. The high error rate would return. Strangely, when I would split off a second network portion, the same thing would happen, indicating that I had more than one problem.

    Second: Replaced and re-routed some cable runs.
    There were some runs that passed parallel to 240V, in some cases with wires actually resting alongside. I fixed these whilst monitoring error rate changes in the diagnostic utility traffic analyser. Error rates were clean during the re-cable effort, but it didn't solve anything. I would change the cable, re-connect the wires in readiness to re-attach the switch, check error rates - none - so hook the switch back up. Errors would return.

    Third: Thought I had a dodgy component somewhere.
    Set about systematically removing components to see when the errors stopped. It worked, but again didn't solve anything. Taking some of the wall switches out dropped the error rates to zero. But when one of these switches was substituted for another switch elsewhere on the network, the nil error rate remained. And when a second switch was substituted, errors returned. Not the switches.

    Fourth: Opened my eyes when using the diagnostic utility.
    Steady state for my network, with no artificial traffic and no wall switches pressed, produced the one initial error on analyser start, then no more errors. The only traffic was MMI packets. As soon as a wall switch button was pressed, an error occurred almost every time. As soon as traffic was generated with the diagnostic utility, errors always occurred.

    What I was missing was that MMI packets were being transmitted at an average rate of 1.0 per second, even though, based on the settings for Status Reports on the Global tab, all wall switches were set to a 3-second report.

    So why the high MMI rate?

    My theory is that even though the Status Reports were all set to 3 seconds, some of the switches - and it may have been just one - were ignoring this. As I understand it, all switches adopt the lowest status report value on any given network, so if one was weird, all would be so inclined.
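    As a toy model of that theory (the unit addresses and settings below are made up; this is not a Toolkit API), the effective network-wide MMI interval would be the minimum configured interval, so one rogue unit drags everyone down:

    # Hypothetical model: the effective status-report (MMI) interval is the
    # minimum configured interval across units, so a single unit stuck at
    # 1 second sets the pace for the whole network.
    configured = {21: 1, 30: 3, 31: 3, 32: 3}  # unit address -> seconds (made-up)

    effective = min(configured.values())
    culprits = [u for u, s in configured.items() if s == effective]
    print(f"Effective MMI interval: {effective}s, set by unit(s) {culprits}")
    # -> Effective MMI interval: 1s, set by unit(s) [21]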

    Strangely, a bulk change of the status report interval in Toolkit did not work. Why, I don't know. In the traffic analyser, the unit listed for the MMI packets was 21, so I opened Toolkit, set the interval for that unit to 5 seconds, saved, then set it back to 3 seconds.

    All errors stopped.

    I then did the same thing in Toolkit to all the other wall switches, to be sure.

    So far, HomeGate has been running for 11 hours with no issue.

    Looking at what was happening in the level 9 C-Gate log, when the dimmer was re-synchronising it was going slowly, sending packets at an interval corresponding to 'response-delay' in the C-Gate config. There were more packets transmitted during an error event, so I can only surmise these were retries or some other form of error recovery. I had fiddled with 'response-delay' in the config, setting it to 1000, 2000 and the default of 10000 during my diagnosing frenzy (with no positive effect). So maybe, because MMIs were sent every second, a response delay on a one-second boundary simply let C-Gate blunder into another MMI 'collision' on retry. I have no way of verifying this theory now, as the status report cannot be set to one second in Toolkit (min 3). I also don't know whether setting it to some odd value (for example, 1233) would have helped to avoid the condition.
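    A toy timing model of that collision theory (all numbers are illustrative, and I have no way to confirm C-Gate actually schedules retries this way): if MMIs arrive every 1000ms and retries are spaced by 'response-delay', a delay that is a multiple of 1000ms lands every retry on another MMI, while an odd value like 1233 walks off the boundary.

    MMI_PERIOD_MS = 1000  # observed MMI rate of one per second

    def colliding_retries(response_delay_ms, retries=3, window_ms=50):
        """Count retries landing within window_ms of an MMI transmission."""
        hits = 0
        for n in range(1, retries + 1):
            t = (n * response_delay_ms) % MMI_PERIOD_MS
            if min(t, MMI_PERIOD_MS - t) <= window_ms:
                hits += 1
        return hits

    for delay in (1000, 2000, 10000, 1233):
        print(delay, colliding_retries(delay))
    # 1000, 2000 and 10000 collide on all 3 retries; 1233 collides on none.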

    Hope this helps someone. Or if this sort of thing is a usual occurrence, then you installers definitely earn your money.
     
    ssaunders, Jan 13, 2009
    #1
  2. Don

    Don

    Joined:
    Aug 4, 2004
    Messages:
    429
    Likes Received:
    0
    Location:
    Townsville, Australia
    Thanks for your detailed report!

    I think there is something fishy going on here - I'm not convinced the status reports are your real problem. Reason: Status reports are always issued (by key or sensor units) at the lowest possible message priority. This means that if any other message (lighting command or scene trigger) is issued, the worst-case latency on a moderately busy network will be on the order of 130ms, but the command will always issue. This should not be interpreted by a diagnostic tool as an error.

    About the repetition rate of your status reports - the status reporting interval is adjustable on a PER PRIMARY APPLICATION basis. If you are using multiple applications for 'lighting' control (could be any load type, but using standard key / sensor inputs issuing on / off or ramp-to-level commands), the status reports will be more frequent. For example - if you have three groups of key input units operating in three different primary applications (secondary applications will not issue status reports) and have every key / sensor set for a status report repetition rate of 3 seconds, then you will see one status report (MMI) every second (one for each of the three applications every 3 seconds). If you are seeing more frequent status reports and have only a single primary application, then you are suffering from something I've never heard of before.
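    Don's arithmetic as a minimal sketch (the function name and example values are illustrative only): each primary application issues its own MMI every interval, so the aggregate rate is applications divided by interval.

    def mmi_rate(primary_apps, interval_s=3.0):
        # One MMI per primary application per reporting interval.
        return primary_apps / interval_s  # MMIs per second, network-wide

    print(mmi_rate(1))  # 0.33/s -- one MMI every 3 seconds
    print(mmi_rate(3))  # 1.0/s  -- looks like 'an MMI every second'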

    Your problems are not typical. Hopefully we can work out what's going on here.
     
    Don, Jan 13, 2009
    #2
  3. ssaunders

    ssaunders

    Joined:
    Dec 17, 2008
    Messages:
    242
    Likes Received:
    35
    Location:
    Melbourne
    Fishy MMI

    Thanks, Don.

    I had a doctor tell me the same thing about my problems not being typical, once ;)

    Two primary applications are used in the network. Trigger control (on 202) for the AUX, and Lighting (on 56 only) for dimmers and wall switches. I am definitely now seeing an average of one MMI every three seconds after changing the MMI interval on switch unit 21 (a 5082NL with 1.8.00 firmware) to 5 seconds then back to 3 seconds. So something was clearly wrong with that unit.

    Don't know if a C-Gate log would help to see what was up. But if you (or anyone else) are interested, I've got a 43MB file with eight days of level 9 log chock full of fishy error goodness just waiting to be pondered. I'm also still collecting at level 9 with everything running 'smoothly'.
     
    Last edited by a moderator: Jan 13, 2009
    ssaunders, Jan 13, 2009
    #3
  4. ssaunders

    ssaunders

    Joined:
    Dec 17, 2008
    Messages:
    242
    Likes Received:
    35
    Location:
    Melbourne
    24 hours later...

    Approx 24 hours after the MMI traffic was reduced, there has been only one error event recorded in the C-Gate log, and it did not affect HomeGate availability. :) Log below. Happy HomeGate = Happy Wife. Happy Wife = Happy Life.

    Being somewhat conservative (paranoid?), I have backed off the MMI traffic to a rate of one status update every four seconds from the default of three.

    Tonight, I stopped HomeGate, etc., and ran another traffic analysis with continuous toggle @ 100ms interval. Image below. After 20 minutes, and >15,000 messages, the only error logged was the usual initial one. So far so good.

    Still waiting for the 'gotcha' which will pop my bubble of contentment. Which Don may deliver when I report that the only time I see errors on the network is when any wall switch is pressed (ramp or toggle, but toggle produces more errors). :( Image also below. But this is only in conjunction with high load diagnostic traffic. I struggle to think what the difference is between on/off messages produced at a rate of over ten per second by the diagnostic utility and pressing a wall switch. :confused:

    Interestingly, when no high rate diagnostic traffic is in progress, analysis produces no errors when any of the wall switches is pressed. Image also below. By the way, it doesn't matter which wall switch type (DLT, Saturn, or classic) is used, or whether the switching event is actually switching something 'real': one of my still-to-be-installed switches toggles a group address associated with a yet-to-be-purchased dimmer, and this still produces an error with the diagnostic traffic running.

    Inexperience says this may be normal behaviour, but the engineer in me says: "What the?" Anyone else see the same thing?

    20090113-143107 753 //IRIS/254 ebaa2300-87e3-102b-a9ed-8eb3dabcd4e9 Net Sync: synchronizing unit 1 of 10 at address 0
    20090113-143107 735 //IRIS/254 ebaa2300-87e3-102b-a9ed-8eb3dabcd4e9 sent cmd (fastpci): \4600002104g
    20090113-143107 765 //IRIS/254 ebaa2300-87e3-102b-a9ed-8eb3dabcd4e9 got packet confirm: g.
    20090113-143107 734 //IRIS/254 ebaa2300-87e3-102b-a9ed-8eb3dabcd4e9 response: 860000008D04FFFFFF7D83187EC3E7BB0000F1
    20090113-143107 735 //IRIS/254 ebaa2300-87e3-102b-a9ed-8eb3dabcd4e9 sent cmd (fastpci): 1A2A06h
    20090113-143107 765 //IRIS/254 ebaa2300-87e3-102b-a9ed-8eb3dabcd4e9 got packet confirm: h.
    20090113-143107 734 //IRIS/254 ebaa2300-87e3-102b-a9ed-8eb3dabcd4e9 response: 86000000872ABE2A2DCE4C5E3C
    20090113-143107 735 //IRIS/254 ebaa2300-87e3-102b-a9ed-8eb3dabcd4e9 sent cmd (fastpci): 1A2306i
    20090113-143107 765 //IRIS/254 ebaa2300-87e3-102b-a9ed-8eb3dabcd4e9 got packet confirm: i.
    20090113-143107 734 //IRIS/254 ebaa2300-87e3-102b-a9ed-8eb3dabcd4e9 response: 860000008723A31A3279E79EE3
    20090113-143107 763 //IRIS/254 ebaa2300-87e3-102b-a9ed-8eb3dabcd4e9 C-Bus PCI V3 or later detected at unit: 0
    20090113-143107 735 //IRIS/254 ebaa2300-87e3-102b-a9ed-8eb3dabcd4e9 sent cmd (fastpci): 1AF201j
    20090113-143107 765 //IRIS/254 ebaa2300-87e3-102b-a9ed-8eb3dabcd4e9 got packet confirm: j.
    20090113-143107 734 //IRIS/254 ebaa2300-87e3-102b-a9ed-8eb3dabcd4e9 response: 8600000082F2FF07
    20090113-143107 753 //IRIS/254 ebaa2300-87e3-102b-a9ed-8eb3dabcd4e9 Net Sync: address 0 type PCLOCAL4 version 4.6.00 catalog 5500PC
    20090113-143107 735 //IRIS/254 ebaa2300-87e3-102b-a9ed-8eb3dabcd4e9 sent cmd (fastpci): 2110k
    20090113-143107 765 //IRIS/254 ebaa2300-87e3-102b-a9ed-8eb3dabcd4e9 got packet confirm: k.
    20090113-143107 734 //IRIS/254 ebaa2300-87e3-102b-a9ed-8eb3dabcd4e9 response: 86000000821002E6
    20090113-143107 735 //IRIS/254 ebaa2300-87e3-102b-a9ed-8eb3dabcd4e9 sent cmd (fastpci): 2104l
    20090113-143108 765 //IRIS/254 ebaa2300-87e3-102b-a9ed-8eb3dabcd4e9 got packet confirm: l.
    20090113-143108 734 //IRIS/254 ebaa2300-87e3-102b-a9ed-8eb3dabcd4e9 response: 860000008D04FFFFFF7D83187EC3E7BB0000F1
    20090113-143108 735 //IRIS/254 ebaa2300-87e3-102b-a9ed-8eb3dabcd4e9 sent cmd (fastpci): \4601002104m
    20090113-143108 765 //IRIS/254 ebaa2300-87e3-102b-a9ed-8eb3dabcd4e9 got packet confirm: m%
    20090113-143108 827 //IRIS/254 ebaa2300-87e3-102b-a9ed-8eb3dabcd4e9 C-Bus Send Error: (%)(retrying)
    20090113-143108 735 //IRIS/254 ebaa2300-87e3-102b-a9ed-8eb3dabcd4e9 sent cmd (fastpci): \4601002104n
    20090113-143108 765 //IRIS/254 ebaa2300-87e3-102b-a9ed-8eb3dabcd4e9 got packet confirm: n%
    20090113-143108 827 //IRIS/254 ebaa2300-87e3-102b-a9ed-8eb3dabcd4e9 C-Bus Send Error: (%)(retrying)
    20090113-143108 735 //IRIS/254 ebaa2300-87e3-102b-a9ed-8eb3dabcd4e9 sent cmd (fastpci): \4601002104o
    20090113-143108 765 //IRIS/254 ebaa2300-87e3-102b-a9ed-8eb3dabcd4e9 got packet confirm: o%
    20090113-143108 827 //IRIS/254 ebaa2300-87e3-102b-a9ed-8eb3dabcd4e9 C-Bus Send Error: (%)(retrying)
    20090113-143108 735 //IRIS/254 ebaa2300-87e3-102b-a9ed-8eb3dabcd4e9 sent cmd (fastpci): \4601002104p
    20090113-143108 765 //IRIS/254 ebaa2300-87e3-102b-a9ed-8eb3dabcd4e9 got packet confirm: p%
    20090113-143110 785 //IRIS/254/p/1 5f963150-87e4-102b-aa0c-8eb3dabcd4e9 Bad extendo-diagnostic summary (null) for unit: 1
    20090113-143110 735 //IRIS/254 ebaa2300-87e3-102b-a9ed-8eb3dabcd4e9 sent cmd (fastpci): \4602002104q
    20090113-143110 765 //IRIS/254 ebaa2300-87e3-102b-a9ed-8eb3dabcd4e9 got packet confirm: q.
    20090113-143110 734 //IRIS/254 ebaa2300-87e3-102b-a9ed-8eb3dabcd4e9 response: 860200008D0438FFFF7D83187FA1D1C20328BB
    20090113-143110 735 //IRIS/254 ebaa2300-87e3-102b-a9ed-8eb3dabcd4e9 sent cmd (fastpci): 2110r
    20090113-143110 765 //IRIS/254 ebaa2300-87e3-102b-a9ed-8eb3dabcd4e9 got packet confirm: r.
    20090113-143110 734 //IRIS/254 ebaa2300-87e3-102b-a9ed-8eb3dabcd4e9 response: 8602000085100BFFFF04D6
    20090113-143110 735 //IRIS/254 ebaa2300-87e3-102b-a9ed-8eb3dabcd4e9 sent cmd (fastpci): 2106s
    20090113-143110 765 //IRIS/254 ebaa2300-87e3-102b-a9ed-8eb3dabcd4e9 got packet confirm: s.
     

    Attached Files:

    ssaunders, Jan 13, 2009
    #4
  5. ssaunders

    ssaunders

    Joined:
    Dec 17, 2008
    Messages:
    242
    Likes Received:
    35
    Location:
    Melbourne
    Just when I thought I'd achieved stability

    Image attached of more spurious, if occasional, errors whilst doing nothing at all on the network (apart from hammering it with 100ms diagnostic traffic, that is ;)). But at least HomeGate appears unaffected.

    In the image attached, I'm also seeing eight messages of type 'Lighting (48) (48)' logged during traffic analysis, but I'm not using application 48. Weird. Again, could be noob inexperience...
     

    Attached Files:

    ssaunders, Jan 13, 2009
    #5
  6. Don

    Don

    Joined:
    Aug 4, 2004
    Messages:
    429
    Likes Received:
    0
    Location:
    Townsville, Australia
    Interesting

    Note that the reported error symbol (%) indicates a timing error. This occurs when the PC interface cannot recognise a C-Bus network synchronising clock pulse (it's either too narrow or too wide to be accepted as valid). Possibly there is still a source of noise on your network - burst noise might occasionally cause this - not sure. But this error type has nothing to do with status reports or malformed messages; it is an indication of the health of the physical medium.

    For your information - the other possible error symbols are:
    . = OK
    # = too many transmissions (usually no unit at the destination application or address)
    & = PC interface timeout (result of a 'busy' NACK from destination unit(s) too many times)
    ' (single quote) = PC interface character overflow - too many characters sent to the PC interface between carriage returns.
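    Don's symbol table lends itself to a lookup when eyeballing C-Gate logs like the one above (the trailing character after each 'got packet confirm'). A minimal sketch; the mapping is from this post, the helper itself is made up:

    CONFIRM_CODES = {
        ".": "OK",
        "%": "timing error: clock pulse too narrow/wide (physical medium health)",
        "#": "too many transmissions (usually no unit at destination)",
        "&": "PC interface timeout ('busy' NACK from destination too many times)",
        "'": "PC interface character overflow between carriage returns",
    }

    def explain_confirm(symbol):
        return CONFIRM_CODES.get(symbol, f"unknown confirm symbol: {symbol!r}")

    print(explain_confirm("%"))  # the symbol seen in the retry storm above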
     
    Don, Jan 14, 2009
    #6
  7. Don

    Don

    Joined:
    Aug 4, 2004
    Messages:
    429
    Likes Received:
    0
    Location:
    Townsville, Australia
    Another possibility:

    If the network voltage is below 16V, output units which are generating the C-Bus system communication clock waveform will stop generating the clock. If the network voltage is hovering close to this value, then it is possible that the clock is occasionally dropping out, causing your troubles. Most units draw a steady current from the bus irrespective of communication, but unit current is affected by temperature to a small extent.
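    As a trivial sketch of that check (the 16V threshold is from this post; the 2V warning band is an arbitrary assumption, and the readings are examples):

    CLOCK_DROPOUT_V = 16.0  # below this, output units stop generating the clock

    def clock_margin(measured_v):
        margin = measured_v - CLOCK_DROPOUT_V
        if margin <= 0:
            return f"{measured_v}V: below threshold, clock generation stops"
        if margin < 2.0:  # arbitrary warning band
            return f"{measured_v}V: only {margin:.1f}V of margin, clock may drop out"
        return f"{measured_v}V: {margin:.1f}V of margin, OK"

    for v in (15.5, 16.8, 30.0):
        print(clock_margin(v))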

    .. just another bit of possibly useful information for you.
     
    Don, Jan 14, 2009
    #7
  8. ashleigh

    ashleigh Moderator

    Joined:
    Aug 4, 2004
    Messages:
    2,397
    Likes Received:
    26
    Location:
    Adelaide, South Australia
    The other thing is that a small number of errors (about one per status report / MMI interval) is normal and to be expected. It's an artifact of the way the analyser works and nothing to be concerned about.

    So, for example, if you are really seeing a status report / MMI about every 3 seconds (the normal setting), and you see an error every 3 seconds, then that's all fine. MORE THAN THAT and there is value in digging deeper.

    [Don: you can contact me privately about why this happens, if you like, it's a long and tedious explanation...]
     
    ashleigh, Jan 14, 2009
    #8
  9. ssaunders

    ssaunders

    Joined:
    Dec 17, 2008
    Messages:
    242
    Likes Received:
    35
    Location:
    Melbourne
    Love your work

    Love your work, Ashleigh.

    For giggles (hah!), I ran the analyser again tonight, and came across this very thing. For almost every MMI there was an error recorded when running a group toggle at 100ms interval. Fewer errors as the traffic rate was reduced.

    Interestingly, when I power cycled the whole thing (before I read your post), the errors dropped to a rate of 7 in 4000 messages, with a total of 100 MMI in that period.

    No doubt part of the long and tedious explanation for Don's delight.

    In the analyser, the MMI count is attributed to a given source unit. Not sure whether it is of consequence, and also not sure whether I am imagining it (no hard evidence yet), but could an increase in the analyser-reported error rate be associated with which source unit the analyser attributes the MMIs to? For example, before the total restart (when errors were occurring), unit 21 was noted. After the restart, unit 30 is shown, and the error rate is almost insignificant. Interestingly, it was key unit 21 that I found to be generating MMIs at a frantic rate of one per second some days ago, hence the whole reason for my initial post.


    Thanks.
     
    Last edited by a moderator: Jan 14, 2009
    ssaunders, Jan 14, 2009
    #9
  10. ssaunders

    ssaunders

    Joined:
    Dec 17, 2008
    Messages:
    242
    Likes Received:
    35
    Location:
    Melbourne
    Voltage

    Thanks for the heads up, Don.

    Network voltage is hanging steady at around 30 volts, so no issue there.

    And guys, should the Diagnostic Utility have an apoplexy from time to time? It seems that with some actions (like running a long high rate analysis, then stopping, then trying to get a network voltage report :rolleyes:), everything goes non-communicative until a disconnect/reconnect to C-Bus.
     
    ssaunders, Jan 14, 2009
    #10
  11. ashleigh

    ashleigh Moderator

    Joined:
    Aug 4, 2004
    Messages:
    2,397
    Likes Received:
    26
    Location:
    Adelaide, South Australia
    The MMIs are attributed to the unit that kicked them off, and this unit is selected using a magical arbitration process at the time the network is powered up.

    Often you'll find that the same unit becomes the "MMI Master" - that is, the guy who says "OK folks **it's MMI time**" (imagine a circus ringmaster, hollering away). That is... when the intervals are set the same. If there is a unit with a shorter interval than all the others, it will always win the election.

    So.. if your units all have the same interval set, then most likely the same unit will win each time, but there is no guarantee of this, and if different units win and become ringmaster, that's normal and no big deal. And in fact, if you were to go to a running network and disconnect that ringmaster unit, you will find one of the others auto-magically takes over the ringmaster role.
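    A minimal sketch of that election as described (the tie-break by lowest unit address is purely my assumption for illustration; the real power-up arbitration is not specified here):

    def elect_master(units):
        # units: dict of unit address -> status report interval in seconds.
        # Shortest interval wins; ties broken arbitrarily (assumption).
        return min(units, key=lambda u: (units[u], u))

    units = {21: 3, 30: 3, 31: 3}
    master = elect_master(units)
    print(f"MMI ringmaster: unit {master}")

    del units[master]  # disconnect the ringmaster from the network...
    print(f"New ringmaster: unit {elect_master(units)}")  # ...another takes over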

    It's all very clever...

    By the way, those ringmaster roles can only be taken on by key input units. In some circumstances, some touchscreens may also take it on, but this is out of the ordinary.
     
    ashleigh, Jan 14, 2009
    #11
  12. ssaunders

    ssaunders

    Joined:
    Dec 17, 2008
    Messages:
    242
    Likes Received:
    35
    Location:
    Melbourne
    Roll up!

    Thanks Ashleigh. Now I can't get images of clowns out of my head. Brilliant.
     
    ssaunders, Jan 15, 2009
    #12
  13. ssaunders

    ssaunders

    Joined:
    Dec 17, 2008
    Messages:
    242
    Likes Received:
    35
    Location:
    Melbourne
    HomeGate still happy

    For the record, HomeGate is still happy since the MMI rate was backed off. The only errors present in the log since are all expected.

    I am still not 100% convinced that the MMI rate of one per second was entirely to blame, and maybe the jiggling with it righted some other anomaly with the key input in question. Still haven't worked out how the darn thing got set to a one-second report in the first place, and can't set it back to verify. Sigh.
     
    ssaunders, Jan 15, 2009
    #13
  14. Don

    Don

    Joined:
    Aug 4, 2004
    Messages:
    429
    Likes Received:
    0
    Location:
    Townsville, Australia
    I am aware of one shortcoming of the "Traffic Analyser" tool - not sure if it is the same as the "Diagnostic Utility" you are using. The problem is that the Analyser doesn't completely model the behaviour of C-Bus, and if there is a (normal) re-transmission for any reason, without any loss of data, it will still report a communication error (which it shouldn't).

    C-Bus implements the lower three layers of a fairly complete "protocol stack" to achieve a high quality of communication. This includes up to three automatic re-transmissions in the event of negative acknowledgements from any communication partner. A diagnostic tool should use re-transmissions only as a measure of network quality, but should not count them as errors if the message does eventually get through.
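    How a diagnostic tool might separate the two, per Don's description (the event records below are invented for illustration):

    # Each record: how many attempts a message took and whether it arrived.
    events = [
        {"msg": 1, "attempts": 1, "delivered": True},
        {"msg": 2, "attempts": 3, "delivered": True},   # retried, got through
        {"msg": 3, "attempts": 4, "delivered": False},  # retries exhausted
    ]

    retransmissions = sum(e["attempts"] - 1 for e in events if e["delivered"])
    true_errors = sum(1 for e in events if not e["delivered"])

    print(f"re-transmissions (network quality metric): {retransmissions}")
    print(f"errors (delivery actually failed): {true_errors}")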
     
    Don, Jan 15, 2009
    #14
  15. ssaunders

    ssaunders

    Joined:
    Dec 17, 2008
    Messages:
    242
    Likes Received:
    35
    Location:
    Melbourne
    Great info

    Great info, Don and Ashleigh, and thank you sincerely for your comments and involvement on the forum.
     
    ssaunders, Jan 15, 2009
    #15
  16. Wonkey

    Wonkey

    Joined:
    Aug 3, 2004
    Messages:
    418
    Likes Received:
    39
    Location:
    Adelaide
    Having read through this interesting post, it brought to mind a problem I have seen which turned out to be something really simple: a C-Bus cable fault down to earth.
    Check your wiring to the building earth. If you use a meter on the DC range, you should see that earth floats somewhere between the two C-Bus lines - roughly half of your ~30V with respect to both C-Bus +ve and C-Bus -ve.
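    Wonkey's test as a quick calculation (a sketch only; the 25% tolerance is an arbitrary assumption, and the readings are examples):

    def check_earth_float(bus_v, pos_to_earth_v, neg_to_earth_v, tolerance=0.25):
        # With a healthy network, earth floats midway between the rails, so
        # each rail should read about half the bus voltage wrt earth.
        expected = bus_v / 2.0
        for rail, reading in (("+ve", pos_to_earth_v), ("-ve", abs(neg_to_earth_v))):
            if abs(reading - expected) > tolerance * expected:
                print(f"C-Bus {rail}: {reading}V wrt earth (expected ~{expected}V)"
                      " - possible cable fault to earth")
            else:
                print(f"C-Bus {rail}: {reading}V wrt earth - OK")

    check_earth_float(bus_v=30.0, pos_to_earth_v=15.0, neg_to_earth_v=-15.0)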
     
    Wonkey, Jan 16, 2009
    #16
  17. ssaunders

    ssaunders

    Joined:
    Dec 17, 2008
    Messages:
    242
    Likes Received:
    35
    Location:
    Melbourne
    Interesting - CBus voltage wrt earth

    Interesting post, Wonkey, and thanks for the comment. Took me a while to check, but here's what I found.

    With the network fully connected (~21V at the moment), I see around 10V between CBus +ve and earth, and -10V between CBus -ve and earth. Makes sense.

    Taking the same measurements with just one dimmer only (with inbuilt power supply, no other connected cables) directly from the CBus connections at the unit results in a CBus voltage of just over 30V, with +/- ~15V from each of the CBus lines wrt earth. Again, makes sense.

    What I'm seeing is what I am expecting.

    With the fault you experienced, was one or the other CBus lines deviating from the expected half CBus voltage?
     
    ssaunders, Feb 8, 2009
    #17