Splitting and recombining larger files

sam_uk · November 2, 2016, 8:01pm

@syed @Abhishek I’ve been struggling to split the RSS html files into tiny bits, and this is going to be a bigger problem. You don’t want to read a manual or book in 30 parts…

If you are somehow able to specify which folder a file ends up in, I was thinking maybe you could set chip to run

$ for d in ./*/ ; do (cd "$d" && sh recombine.sh); done

Every 5 mins or so

Then before sending you could

$ md5sum espanol-26-1-2016.htm > espanol-26-1-2016.htm.md5
$ split -b 1000 espanol-26-1-2016.htm

Send the files, including the md5sum >>>>>

And then send a recombine script as the final file.

—recombine.sh—

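#!/bin/sh
# Reassemble the chunks written by split (xaa, xab, ...) in name order.
cat x* > espanol-26-1-2016.htm
md5sum espanol-26-1-2016.htm
# Compare against the checksum file that was sent with the chunks.
if md5sum -c espanol-26-1-2016.htm.md5; then
    echo "The MD5 sum matched"
    #rm -- "$0"
    # Only clean up if the copy to the final location succeeded.
    cp espanol-26-1-2016.htm /final-location && rm -rf /path/to/directory/*
else
    echo "The MD5 sum didn't match"
fi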

There’s almost certainly a better way of doing this, but some version of it would sort out larger files?

Abhishek · November 3, 2016, 8:54am

Welcome to my world. It’s not that simple. The issue is that some of these parts may be received and others may not be. Some may be received later than others. Parts may be received completely out of order.

So split and recombine sounds easy but is quite complicated to successfully do in practice, especially if one wants any reasonable end-user experience.

For example - In theory one can just wait for all pieces to show up - but what if just one piece never gets received?

If the probability that a particular 10KB file is successfully received is, let’s say, 95%, then the probability that 10 parts, each of 10KB, are all successfully received is less than 60% (0.95^10 ≈ 0.60)! And this still assumes infinite time available, which is not a reasonable assumption.
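As a rough check on that figure, here is a minimal sketch. It assumes every part is received independently with the same probability, which is a simplification (real losses are probably correlated), and the function name is made up for illustration.

```shell
#!/bin/sh
# Probability that ALL n parts arrive, assuming each part arrives
# independently with probability p (an illustrative assumption).
all_parts_received() {
    # $1 = per-part success probability, $2 = number of parts
    awk -v p="$1" -v n="$2" 'BEGIN { printf "%.4f\n", p ^ n }'
}

all_parts_received 0.95 10   # prints 0.5987
```

Under the same assumption, 30 parts would do considerably worse, since the per-part probability is multiplied in once for every part.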

Hence the Wikipedia files that were converted to “Singlefile” form for transmission. It is essential that a file be usable by itself without relying on the presence of another file. Hence the complicated way we do the weather data as well. A “file” if not atomic, will cause problems for the end user.

That doesn’t mean it’s impossible to do. Splitting large files is definitely on my agenda and one of the priority items. It’s just going to take some effort to get right.

But especially with news items, where recency matters, it is critical that we send down news as very small individual files that are usable independent of other files. So the split-and-recombine won’t help us with the news part of things. That’s the whole reason for the small file size Syed recommended. You may have yourself noticed that some files from Wikipedia, the larger ones like “Hillary Clinton”, never finish and appear to “be stuck”. The best way to avoid it is to have content go out in small, independently usable units.

sam_uk · November 3, 2016, 9:13am

Ah fair enough... I’ll carry on trying to hack the news then…

I’m trying to get it so you get:

A list of headlines News-%date.html with clickable links to local files
A set of small .html files one per news item.
Some theming so the above files are pleasant to read.

sam_uk · November 3, 2016, 9:58am

Looking at split you can give it a file name and numeric suffix?

So to give an example of ‘Where-there-is-no-doctor’ (Assuming all files are html)

split --numeric-suffixes=1 -b 10000 --additional-suffix=.html where-there-is-no-doctor where-there-is-no-doctor-

where-there-is-no-doctor-01.html (10kb)
where-there-is-no-doctor-02.html (10kb)
where-there-is-no-doctor-03.html (10kb)

So this would seem to be the best of both worlds?

In scenario 1) where all the parts come down, the recombine script runs successfully and the user gets it all in one easy-to-navigate file (200kb), maybe copied to a ‘library’ folder or similar.

In scenario 2) where one part is always missing then at least they can read all the other parts?

We could even script adding a Next> tag to the bottom of each chunk for easier reading? (This would need to be stripped out again by the recombine.sh script)
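A rough sketch of how that could look. The marker comment, the function names and the file naming are all invented for illustration. It also assumes every chunk ends with a newline, for example because the file was split on line boundaries with something like `split -C` rather than `split -b`; otherwise stripping the link line would also eat the tail of the chunk.

```shell
#!/bin/sh
# Hypothetical marker so the navigation line can be found and removed again.
MARKER='<!-- chunk-nav -->'

add_next_links() {
    # $1 = chunk prefix, e.g. where-there-is-no-doctor-
    # Appends a link to the following chunk to every chunk except the last.
    prev=''
    for f in "$1"*.html; do
        [ -e "$f" ] || continue
        if [ -n "$prev" ]; then
            printf '%s<a href="%s">Next&gt;</a>\n' "$MARKER" "$(basename "$f")" >> "$prev"
        fi
        prev=$f
    done
}

strip_next_links() {
    # Print chunk $1 on stdout without any navigation lines,
    # so a recombine script can cat the cleaned chunks together.
    grep -v -F -e "$MARKER" "$1"
}
```

A recombine script could then run `strip_next_links` over each chunk in order instead of a plain `cat`, so the MD5 check still sees the original bytes.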

Seasalt · November 3, 2016, 10:42am

Abhishek,

This is a very interesting topic.

Let’s talk real numbers.

Are you able to see diagnostics on successful downloads from the Library units that are connected to the internet and giving telemetry?

In my opinion Outernet reception needs to be 100% received packets (including FEC), otherwise a bigger/stronger signal or antenna is needed.

The other issue is: how are you multiplexing the data files coming down?

Are you reinventing the wheel, or has someone already done this kind of “one way” send of information and some kind of “one way send meta data markup methodology” so that it can easily be sorted into the correct location at the other end?

I think the military users of the world would already have bumped into this problem whilst sending out background information. (but that is a guess)

Abhishek · November 3, 2016, 10:48am

The ideal, in our current context of news, would be to always present a “next” to the next available article.

Once the article has been seen - changing it around later into a larger file is going to confuse some users, irritate others, and just be bad UX overall.

As the UI is also going to change dramatically, it would be best to just send individual parts and provide navigation at the other end specific to the content.

sam_uk · November 3, 2016, 10:52am

In the context of news I agree. I wasn’t proposing it for news, more for complete books, such as Where There Is No Doctor, where 10kb chunks are bad UX.

Abhishek · November 3, 2016, 11:06am

No, but I do see percentage of frame lock for various receivers.

The important point to understand is there is no “100%” in this kinda setup. At all. I have three receivers set up. One is at 100% lock, the second one remains at around 97% lock, and the third one, due to h/w issues, is only at 50% lock.

Some users have good instincts about positioning, interference, etc. They get high SNR and high lock %. Others don’t, and sometimes don’t get the same results. Some live in areas with high local interference.

A bigger antenna is not an option. We tried a bigger antenna and it’s definitely a problem for most users.

A stronger signal is a possibility, but at this stage, this is what we have. We live in a world where even servers in datacenters are built to not be very reliable, and we make up the slack in software. This was done because it was realized that throwing more money at hardware reliability was an inefficient use of money. The same applies here.

While throwing more power at it is one option, it’s also very expensive, and we need to choose between more power behind fewer bits or less power behind more bits. If we can recover enough data using software erasure coding, we might be able to deliver higher bitrates even with lower signal levels. So it’s not a simple “100% received packets” kinda scenario at all. Even your local FM station doesn’t manage to deliver all “frames”. And it’s just a couple of miles away, while our satellite is 36,000 km away.

Militaries of the world have unlimited money and resources, including their own satellites. We definitely are not in that bracket (yet).

Broadcasting is done by many, and it’s a solved problem. But remember that broadcasting is of media content, which is loss-tolerant by nature, because even if our satellite TV channel glitches, we just keep watching the next scene. If you have DTH or Satellite Radio, I am sure you have noticed how frequently it glitches, and those are extreme glitches that even the human senses notice. Notice that those channels also have very high link budgets!

Filecasting like we do is a whole different ball game, and to the best of my knowledge, no one has reached as far as we have in operationalizing it - and definitely not with our kinda inexpensive receiver and small antenna.

FEC can only do so much - FEC was again designed with broadcasting in mind.

So no matter what - file reception will remain at less than 100% probability of success. And our system design reflects that.

Abhishek · November 3, 2016, 11:14am

It might be possible to generate a sort of “Table of Contents” at the receiver side for books. The ToC could update with newer chapters as they are received. Depending upon the book, it might even leave placeholders saying “this part not available yet” or something.
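Something along these lines, as a sketch only. The PREFIX-NN.html naming scheme, the function name, and the idea that the receiver knows the total number of parts in advance are all assumptions made for this example, not how the receiver actually works.

```shell
#!/bin/sh
# Build a simple HTML table of contents for a book sent as numbered parts.
# Parts that have not arrived yet get a placeholder instead of a link.
make_toc() {
    # $1 = directory, $2 = file prefix, $3 = expected number of parts
    i=1
    echo '<ul>'
    while [ "$i" -le "$3" ]; do
        part=$(printf '%s-%02d.html' "$2" "$i")
        if [ -f "$1/$part" ]; then
            printf '<li><a href="%s">Part %d</a></li>\n' "$part" "$i"
        else
            printf '<li>Part %d: this part not available yet</li>\n' "$i"
        fi
        i=$((i + 1))
    done
    echo '</ul>'
}
```

Re-running `make_toc` whenever a new file lands would keep the list up to date without ever touching the parts the user has already read.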

So what I am saying is, I’d still prefer - even in the case of books - to have small chunks that are at least to some extent self contained.

sam_uk · November 3, 2016, 11:17am

This sounds good. Some chapters will be greater than 10kb though? So there might be a place for the chunking I describe above?

Abhishek · November 3, 2016, 11:20am

How much bigger? As I said, the chunking is going to happen, but right now I am in the middle of the Great UI revolution of November, 2016 ™. So it’s going to be some time before I can implement chunking.

But meanwhile if news can start going out, that’d be a big additional utility for users.

Seasalt · November 3, 2016, 11:20am

I completely agree you guys are the Gold Standard on community broadcasting and as such are going where no man has gone before on Datacasting with small L-Band receivers.

But I think the point we are both agreeing on is that lossy data, i.e. music, video, images, has some ability to lose data before it is visibly noticeable to the end user, i.e. a glitch in a satellite movie or satellite TV is expected and accepted.

But a verbatim data file, i.e. a text file, PDF or .exe file, is totally reliant on being received in a 100%, as-broadcast state.

Hence I say all files that are verbatim files have to be received as transmitted 100% of the time.

Abhishek · November 3, 2016, 11:28am

Sure - that’d be ideal. But practically, it is not achievable - RF simply doesn’t work that way. So the end-to-end setup needs to be aware of that and be designed to still provide the most utility despite the losses.

Given infinite iterations of the carousel - every file will be received 100%, but this is not a realistic assumption. So the real probability of complete file reception will always be less than 100%. This probability is a function mainly of file size - and larger files will fail to a higher degree.
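A toy model of that relationship, purely for illustration. It assumes a file is made of equally sized blocks that are each received independently with the same probability, and that a file only counts as received if all of its blocks arrive within a single carousel pass, with nothing carried over between passes. The real receiver and its erasure coding may well behave differently, and the function name is invented.

```shell
#!/bin/sh
# Toy model: chance that a file has been received after k carousel passes.
reception_probability() {
    # $1 = per-block success probability
    # $2 = number of blocks in the file (grows with file size)
    # $3 = number of carousel passes
    awk -v p="$1" -v n="$2" -v k="$3" 'BEGIN {
        one_pass = p ^ n                      # all blocks arrive in one pass
        printf "%.4f\n", 1 - (1 - one_pass) ^ k
    }'
}

reception_probability 0.99 10 1    # small file, one pass
reception_probability 0.99 100 1   # ten times larger, one pass
reception_probability 0.99 100 3   # same large file, three passes
```

In this model the result only approaches 1 as the number of passes grows, and larger files need many more passes to get there.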

sam_uk · November 3, 2016, 11:40am

Sorry I think I was unclear, I was speaking in a more general way about books etc.

In relation to the (hacky) script I wrote the file size varies. The largest today is 37kb

Is that acceptable for now?

Abhishek · November 3, 2016, 11:42am

It’s worth trying. Should be ok for now. Maybe we can clamp the max size to 50KB instead of 10KB.

Are these sizes post-compression?
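A quick way to check both at once might look like the sketch below. gzip is only a stand-in for whatever compression is really applied before transmission, 50KB is taken to mean 50 * 1024 bytes, and the function names are made up.

```shell
#!/bin/sh
# Flag files that are still over the size limit after compression.
LIMIT=$((50 * 1024))

compressed_size() {
    # Size in bytes of file $1 after gzip (stand-in compressor).
    gzip -c "$1" | wc -c | tr -d ' '
}

oversized() {
    # Print the name of every given file that is too big once compressed.
    for f in "$@"; do
        size=$(compressed_size "$f")
        if [ "$size" -gt "$LIMIT" ]; then
            echo "$f"
        fi
    done
    return 0
}
```

Calling `oversized *.html` in the output folder would then list only the items that need trimming or splitting.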

sam_uk · November 3, 2016, 11:54am

No compression on these yet. (The CSS is inline which I realise is not optimised at all, but does make the file portable)

The largest from today compressed as a zip comes out at 11.4kb.

Abhishek · November 3, 2016, 11:56am

should be ok then.

inline css is fine - it makes the file standalone.

please see if you can remove other fluff though - scripts etc. and possibly minify the css?

sam_uk · November 3, 2016, 12:03pm

I think I’ve done all I can for now.

I don’t have much time for it at the moment. I’m also learning bash scripting as I go, so progress is slower than you might expect!

I might have time to play with it in a couple of weeks, but are you up for having a go with it in the meantime?

Thanks

Sam

Abhishek · November 3, 2016, 1:57pm

Absolutely. It will take me a couple days to switch gears. So best thing would be to bundle up what you have at the end of Friday your time and drop it to me in an email. Make sure you have a license included (MIT preferably) for everything that you wrote so we know we can use it. abhishek @ our domain.

I will try and have things going out over the weekend.

and Thanks!

Jherman · December 4, 2016, 2:12pm

With Wikipedia articles and other known-format content, why are you not wanting to separate out styling? IMO it seems that you could pre-seed a style sheet on the device (it would most likely remain static, so maybe at install) and then only send the content and markers. This would create smaller files and ensure a higher reception rate.