Stability for DC v3.03 and Skylark v5.5

wbrown · April 16, 2019, 12:26pm

I experienced another hard lock-up last night.

Last time I measured, the board had +5.1 VDC on the test points. The power brick is a dedicated power brick, with a permanent cable going out to a micro-USB connector, and essentially zip-cord (maybe 20-22awg by eye). It is made to power a Raspberry Pi 3.

The only thing I had going on that was perhaps out of the ordinary is that an external computer was hitting the API interface once every 10 seconds, and another was hitting the filescan point every 10 minutes. I also left an SSH shell connected, but it was just sitting at the command prompt.

I will try letting the DC operate as intended, with no external API calls going into it, and no SSH shells left open.

If it provides a datapoint, when it locks up only two LEDs are lit solidly; all the others are dark. I have to hold the power button down for about 5 seconds before it will power off.

If it locks up again I will try re-flashing.

w

kb3cs · April 16, 2019, 12:57pm

fwiw, leaving a ssh session going does not appear to tickle any misbehavior. neither does having the second SDcard installed. 1.03e6 more packets processed without incident.

ps: may we please have ‘htop’ in the next firmware release?

Syed · April 16, 2019, 3:15pm

@kb3cs It’s best to add this to a feature-request thread, so it doesn’t get buried.

donde · April 16, 2019, 4:50pm

Has anyone reported using a quality lab power supply 24/7 with absolutely no lockups outside of mains failure? Also maybe test with a large-value (uF) filter capacitor just before the micro-USB connector on the board. Use a memory oscilloscope to record momentary transients on the 5 volt rail when using normal 2 amp power supplies.

wbrown · April 17, 2019, 2:28am

I may resort to doing something with a 'scope if issues continue.

I also noted this in the Diagnostics page tonight:
Last few h/w reports:

   Apr 16 19:25:25 spi_master spi32764: spi32764.0: timeout transferring 64 bytes@10000000Hz for 110(100)ms
   Apr 16 19:25:26 spi_master spi32764: spi32764.0: timeout transferring 64 bytes@10000000Hz for 110(100)ms
   Apr 16 19:25:27 spi_master spi32764: spi32764.0: timeout transferring 64 bytes@10000000Hz for 110(100)ms
   Apr 16 19:25:29 spi_master spi32764: spi32764.0: timeout transferring 64 bytes@10000000Hz for 110(100)ms
   Apr 16 21:04:42 spi_master spi32764: spi32764.0: timeout transferring 64 bytes@10000000Hz for 110(100)ms

any comments on these?

w

kb3cs · April 17, 2019, 11:17am

i think that’s related to the audio output.
ps has: /usr/bin/aplay -B 10000000 -T 0 -t raw -r 48000 -c 1 -f s16_le -
my rate of spi timeouts is about 1 per 2 hours.
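
For anyone who wants to compare timeout rates between boards, here is a rough Python sketch that pulls the timestamps out of lines like the ones wbrown pasted and reports how many there were and how far apart. The regex is an assumption based only on those five sample lines, the function names are made up for illustration, and the year has to be supplied because syslog timestamps do not carry one.

```python
import re
from datetime import datetime

# Pattern inferred from the five sample lines in this thread (an assumption,
# not taken from any kernel documentation).
SPI_TIMEOUT = re.compile(
    r"^\s*(?P<stamp>\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2}) spi_master .*timeout transferring"
)

SAMPLE = [
    "   Apr 16 19:25:25 spi_master spi32764: spi32764.0: timeout transferring 64 bytes@10000000Hz for 110(100)ms",
    "   Apr 16 19:25:26 spi_master spi32764: spi32764.0: timeout transferring 64 bytes@10000000Hz for 110(100)ms",
    "   Apr 16 19:25:27 spi_master spi32764: spi32764.0: timeout transferring 64 bytes@10000000Hz for 110(100)ms",
    "   Apr 16 19:25:29 spi_master spi32764: spi32764.0: timeout transferring 64 bytes@10000000Hz for 110(100)ms",
    "   Apr 16 21:04:42 spi_master spi32764: spi32764.0: timeout transferring 64 bytes@10000000Hz for 110(100)ms",
]


def spi_timeout_times(lines, year=2019):
    """Return the timestamps of all SPI timeout lines, in input order."""
    times = []
    for line in lines:
        match = SPI_TIMEOUT.match(line)
        if match:
            # Syslog stamps have no year, so prepend the one the caller gave.
            stamp = f"{year} {match.group('stamp')}"
            times.append(datetime.strptime(stamp, "%Y %b %d %H:%M:%S"))
    return times


def mean_gap_seconds(times):
    """Average spacing between consecutive events, or None with fewer than two."""
    if len(times) < 2:
        return None
    span = (times[-1] - times[0]).total_seconds()
    return span / (len(times) - 1)


if __name__ == "__main__":
    found = spi_timeout_times(SAMPLE)
    print(len(found), "timeouts, mean gap", mean_gap_seconds(found), "seconds")
```

The timeouts in that sample arrive in a burst followed by one straggler, so the mean gap on such a short window says little; a day or more of log would be needed before comparing against a figure like 1 per 2 hours.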

wbrown · April 17, 2019, 12:11pm

OK thanks. I saw previous references to this spi timeout in other threads on stability.

kb3cs · April 19, 2019, 9:32am

can scratch my previous reports of reset switch being unresponsive. i’m thinking now it only seemed unresponsive because, i suspect, i did not hold it down long enough.

DC appears to be quite stable as long as i avoid long-term monitoring of the tuner stats (by web or by LCD). not sure if that’s because of reading tuner chip data or because of the constantly updating dynamic readout/display.

wbrown · April 21, 2019, 8:45pm

I have been slowly working through setup permutations, and am still not getting more than 48hrs up-time. The most recent post by kb3cs gives me a hint, as I have been externally logging tuner stats since I began.

Forgoing listing all the permutations here, the most recent lockup occurred with a +5V USB power brick of adequate power plugged into a UPS, with a 'scope set to trigger on a falling edge at +4.8V (monitoring the Vusb test point on the DC board). Steady-state voltage is about +5.1VDC. The scope never triggered, and the unit locked up. So now I have ceased external tuner stats logging, and will leave the DC (with screen-blanking enabled) on the main screen (not the tuner stats screen) and see where that gets me.

A thought for the next revisions, or the Lantern proper, is to put a hardware watchdog into the system. I understand the DC is interim hardware, still on final approach for a consumer-oriented Lantern device. But if there is some edge case that can cause the unit to lock up, it would seem that a hopefully simple and cheap ($1-2 in parts?) hardware watchdog timer that resets everything in case of a timeout would be beneficial. Since the file transmission scheme has redundancy built in, I would imagine that, other than to someone reviewing a boot log or similar, having the watchdog kick in and reset the system when appropriate would probably go entirely unnoticed.

Will

sv_sigint · April 24, 2019, 9:22pm

Logging statistics isn’t a universal dreamcatcher killer - I’ve been polling tuner stats at 5 second intervals for the last two weeks.

One thing I’m curious about is overheating. If you do sed -Ee 's/(.+?)(...)/\1.\2/' < /sys/class/thermal/thermal_zone0/temp you get back a number that looks suspiciously like processor temperature. My dreamcatcher is sitting outside in an opaque box and it currently reads ‘52.400’… which seems pretty reasonable for an electronic thing sitting in the sun. lnbstatus.sh doesn’t report any overheating though. You could probably patch diags.sh to also include a temperature output to see if that gives you any leads.
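
If you would rather log the temperature from a script than reformat it with sed, something like the Python sketch below should do it. It assumes the usual Linux sysfs convention that thermal_zone0/temp holds millidegrees Celsius as a plain integer, which matches the 52.400 and 28.000 readings in this thread; the function names are mine, not anything shipped in Skylark.

```python
from pathlib import Path

# Same sysfs node the sed one-liner reads.
THERMAL_ZONE = Path("/sys/class/thermal/thermal_zone0/temp")


def millidegrees_to_celsius(raw: str) -> float:
    """Convert the raw sysfs string (assumed millidegrees C) to degrees C."""
    return int(raw.strip()) / 1000.0


def read_cpu_temp(path: Path = THERMAL_ZONE) -> float:
    """Read and convert the current temperature from the given sysfs node."""
    return millidegrees_to_celsius(path.read_text())


if __name__ == "__main__":
    # Only try the read where the node actually exists.
    if THERMAL_ZONE.exists():
        print(f"{read_cpu_temp():.3f}")
```

Doing the division numerically also copes with readings below 1000 (under 1 degree), where a pattern that expects at least four digits would leave the text unchanged.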

kb3cs · April 24, 2019, 10:31pm

not using any GUI, by any chance?

$ sed -Ee 's/(.+?)(...)/\1.\2/' < /sys/class/thermal/thermal_zone0/temp
28.000
$ sudo lnbstatus.sh
Password: 
Bias-T Config: 0x8b

[ ok ] Bias-T is configured on
[ ok ] Bias-T voltage is set to 14.2V

LNB Status: 0x23

[ ok ] LNB power is configured on
[ ok ] LNB detected, normal current flow
[ ok ] Bias-T Voltage normal

wbrown · April 24, 2019, 10:38pm

I’ve stopped all logging and still saw lockups. I pulled the 2nd SD card the other day. It’s a waiting game to see if this fixes it. I’ve read about SD card issues. It always recognizes the 2nd card, but has never copied any files over, except on reboot.

sv_sigint · April 24, 2019, 10:56pm

Nope, just a little python API I wrote to dump statistics to a CSV file that gets plotted with RRDtool until I figure out a better long term statistics mechanism. I’m not even sure what screen my dreamcatcher is on but knowing me, it’s on the signal/lock graph screen.

28C seems like a lovely temperature, heck my idle raspberry pis don’t run that cool.

Hm. It looks like the dreamcatcher already has a watchdog… using an in-kernel watchdogd? So if the kernel is alive enough to keep feeding the watchdog it won’t reboot even if userland is totally borked.

[Skylark][othernet@othernet:/etc]$ dmesg | grep watchdog
[    2.144345] sunxi-wdt 1c20c90.watchdog: Watchdog enabled (timeout=16 sec, nowayout=0)
[Skylark][othernet@othernet:/etc]$ ps ax | grep watchdog
   20 root     [watchdogd]
[Skylark][othernet@othernet:/etc]$ sudo ls -l /proc/20/exe
ls: /proc/20/exe: cannot read link: No such file or directory
lrwxrwxrwx    1 root     root             0 Jan  1 00:00 /proc/20/exe

Time to read up on the sunxi watchdog driver to see if it’s possible to write a shell script to feed the watchdog, and maybe that can be used to reboot on failure.
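
A minimal sketch of the idea, in Python rather than shell, and untested on a Dreamcatcher. It relies on the generic Linux watchdog interface: writing any byte to /dev/watchdog feeds the timer, and writing the character V just before closing disarms it when nowayout=0. Whether the sunxi driver and the in-kernel watchdogd on Skylark behave this way once userland opens the device is an assumption I have not verified. The health check and the function names are hypothetical placeholders.

```python
import subprocess
import time

WATCHDOG_DEV = "/dev/watchdog"  # generic Linux watchdog device node


def can_spawn_process():
    """Hypothetical userland health check: can we still fork and exec?"""
    try:
        return subprocess.run(["true"], timeout=5).returncode == 0
    except (OSError, subprocess.TimeoutExpired):
        return False


def feed_watchdog(is_healthy, device=WATCHDOG_DEV, interval=5.0, max_feeds=None):
    """Feed the watchdog for as long as is_healthy() keeps returning True.

    Returns the number of feeds written. If the health check fails, the
    function returns WITHOUT the magic close, so an armed hardware timer
    should run out and reset the board. If max_feeds is reached, it writes
    'V' first so the timer is disarmed cleanly (needs nowayout=0).
    """
    feeds = 0
    with open(device, "wb", buffering=0) as wd:
        while max_feeds is None or feeds < max_feeds:
            if not is_healthy():
                return feeds  # stop feeding; no magic close on purpose
            wd.write(b"\0")  # any byte counts as a feed
            feeds += 1
            time.sleep(interval)
        wd.write(b"V")  # magic close: ask the driver to stop the timer
    return feeds


# Intended use on the board (as root), assuming the 16 second timeout
# reported by dmesg above:
#     feed_watchdog(can_spawn_process, interval=5.0)
```

The point of routing the feed through a userland check is that a kernel which is still alive but has a wedged userland would stop feeding, which is exactly the case the in-kernel watchdogd does not cover. A 5 second interval leaves a comfortable margin under a 16 second timeout.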

I guess the next thing I’d try is a combination of three things and wait for the next lock-up:

ssh in and run top
ssh in (again) and run tail -f /var/log/messages
attach a serial console and see if you get any kernel messages that might indicate a panic

kb3cs · April 24, 2019, 11:53pm

not doing anything special. DC is sitting on a counter indoors. no direct sunlight on it, either.

wbrown · April 25, 2019, 1:04pm

Had a lockup with the 2nd SD Card removed. Had the SSH tasks running as sv_sigint suggested (but not the serial console). Output looked normal until it froze and disconnected. Just did a factory reset using the GUI, and will just let it run as it comes, with no external connections at all. (the unit will be in AP mode, and I will not even connect to it for this next test)

sv_sigint · April 25, 2019, 3:52pm

Booo. I was hoping that there might be some indication from top of a process gone awry, or a message sent to syslog right before it dies.

Does the dreamcatcher respond to ping? If it does, I would guess that the watchdogd (in kernel?) is keeping the watchdog from rebooting. I’ve not yet looked into exercising that functionality to see if the watchdog will actually reboot the board, or if there’s a way to make the health check depend on actual working userland.

wbrown · April 26, 2019, 1:09pm

I forgot to try ping… but the last lockup had a frozen-on LED on the WiFi dongle. Running now with the WiFi dongle removed (and no other network adapter in its place).

kenbarbi · April 26, 2019, 1:31pm

My 2 Cents

I have been experiencing occasional problems with my Dreamcatcher rebooting itself (such as after a power failure in the house) where it comes up fine, but does not connect to my network and receive a local IP address. I know it is working fine, because after I finally get it connected to the network (by doing reboots - - sometimes 3 or 4 are needed), my What’s New Tab shows content received all during the time when the Dreamcatcher was not connected to my router.

Any thoughts?? Ken

wbrown · April 26, 2019, 7:56pm

With no slight to the Othernet project, I am guessing the WiFi dongle provided with the kit is most likely one of the cheapest parts available. And my experience with the cheapest parts available is that while 80% may work fine, 20% do not. And any price premium paid for a higher cost part is usually to move that 20% down to under 1%. Also my 2 cents.

w

maxboysdad · April 27, 2019, 12:24am

While we are in the 2 cent comment area, I think you will find that the Othernet/Outernet engineers tested other WiFi dongles with the DreamCatcher 3.02 and did not see any better results from the expensive ones. As one person stated, “it either works or it doesn’t”. The goal of “the project” is to put together an effective receiver at as low a cost as possible, which includes researching both the least expensive and more expensive parts prior to producing a development board for production. When there is no need to spend more money to get the best result, we get the desired cost reduction. I think you will find that, comparing cost to performance, EDUP is higher quality than most units of this size and functionality. The challenge is then in our laps to produce what “works better” if we can, and then report on how it was done, as you see many of our peers reporting on here from day to day.