Friday, April 26, 2019

Bootstrapping your freedom-free/open source web of trust with Cygwin and GnuPG

[I've put together a better discussion of trust, in relation to downloading computer software, here: https://defining-computers.blogspot.com/2019/05/analyzing-mechanized-trust.html.

I've also put together a better discussion of using checksums, here: https://joels-programming-fun.blogspot.com/2021/11/getting-freelibre-software-gimp-and.html.]

[I was confused when I wrote this. I expected certain files for download to be something they are not, and I wrote this based on those assumptions. I need to replace it with a bit different discussion of how you establish a technological chain of trust using Linux or BSD operating systems, and a bit different discussion of the uses of cygwin, two separate posts.]

Why install Cygwin?

If you are stuck running a Microsoft OS, Cygwin gives you access to the basic GNU toolset, and a wide variety of tools from the freedom-free/open source world:

  • standard programming language compilers free from Microsoft distortion field wrappers
    • gcc (C, ForTran, Ada, et. al.)
    • clang (alternative to gcc)
    • BASIC (Gambas, etc.)
    • et. al.
  • and interpreters
    • Perl, Ruby, Python, et. al.
    • Clisp, Scheme (missing?), et. al.
    • GForth (Am I going to have to fix this?)
    • bc (somewhat dated arbitrary precision "Basic Calculator")
    • et. al.
  • office software
    • office suite (abiword, gnumeric, gnucash, et. al.)
    • mail (mutt, Evolution, etc., servers too)
    • text editing (gedit, geany, joe, et. al.)
    • typesetting (Tex and company)
    • et. al.
  • graphics
    • the GIMP (image/photo)
    • Inkscape (line/vector)
    • Krita (pixel free-hand)
    • ImageMagick (mass processing)
    • fonts and font creation tools
    • et. al.
  • audio
    • audio editors such as Audacious
    • players, mixers
    • midi tools
    • et. al.
  • database 
    • SQL servers (PostGreSQL, Mysql/Maria, others)
    • other kinds of database 
    • analysis tools
    • etc.
  • privacy/security software
    • gnupg GNU Privacy Guard (implements Pretty Good Privacy)
    • other encryption tools and libraries
    • et. al.
  • etc.
    • web servers
    • data analysis tools
    • astronomy and other scientific tools
    • Lots and lots of stuff
(Yeah that summary is a little much. The full list can be overwhelming.)

If you're like me, you'll look at that list and just give Microsoft's confining view of the world the boot -- Load a Linux or BSD distribution and don't look back. But even if you do that, you need some way to bootstrap your trust relationships in this larger, unfamiliar world.

(Cygwin is something of a McGuffin here. These principles mostly work for any other attempt to establish a chain of trust.)

I used a process similar to what I'll describe below when I bootstrapped my trust relationships in the larger world. I just used openBSD and freeBSD rather than Cygwin. Ubuntu or another Linux OS would serve as well, and I sometimes re-boot a web of trust for my workplace workstations using basically the same set of steps.

It's a fairly lengthy process, and it should be a fairly lengthy process. The time and effort you put into going through the process are a major part of the value of the process. If you don't take the time and go through the steps, you won't have properly bootstrapped your chain of trust when you start using it. Plan on it taking several weeks, at least.

The construction of a real chain/web of trust takes years, but you can't wait that long. If you wait forever, you never get started. A few weeks or a month or two should be a reasonable amount of time to get started.

(I know it seems like a downer, but this gives you a big clue as to why Microsoft software is so vulnerable. Too many people use it without thinking. It's worth the patience and the effort, and, if you don't give it both patience and effort, it will not be worth nearly as much to you. That's what experience costs and that's what it's worth. You cannot really trust without experience. If you make it a vicious cycle, it's vicious. Make it a virtuous cycle, instead. Dig in.)

Now, there are freedom-free and open source projects that Cygwin doesn't give you packages for. You noticed, perhaps, my comments on GForth and Scheme above? I asked on the general discussion list, and the problem there is simply a matter of getting somebody with a little free time and interest to maintain the packages at this time. Hopefully, within a few months, the situation will change relative to gforth.

(Interest is important. If you aren't interested, you don't do it right.)

You'll note, also, that they don't mirror all the Apache projects, or all of Oracle/Sun's Java. And, while the GIMP is closely tracked under Cygwin, that is because the GIMP project people have chosen to provide packages.

In other words, Cygwin should not be viewed as your source for all these. It can be your gateway to the larger world. (Nor should any OS distribution you use -- Ubuntu, Debian, Red Hat/Fedora, Cent, Scientific, Devuan, .... All should be considered gateway points, not sources. This is where Microsoft and Apple fail, really, in trying to be your God. Lately, Google is treading over the line on this, too.)

Hopefully, you noticed that I put a little emphasis on gnupg.

If you have Pretty Good Privacy or GNU Privacy Guard from some trusted source, or something similar, you have a technological root or basis for your trust relationships. You can download Cygwin from their website at https://www.cygwin.com, check the signature as shown in the install instructions at https://www.cygwin.com/install.html, and start the install process. (I'd move the discussion of the verification process up to the top of the page, but they put it down in the Q/A. Personal taste and strategy, I guess. But they do put the link to the signature file right next to the installer file, which is good. Don't overlook it.) (Oh, and there is a reason I'm not enabling linking from the URLs here. I'll explain below.)

But if you are like most people, you probably think of digital signatures as some sort of graphic picture of a signature, and don't have any idea why they would help verify a downloaded file using PGP or gpg. Microsoft is like that. They don't tell you about these things. They want you to trust them without knowing why, so you keep going back to them with your business. (That's their main product, you know, an unreasoning trust in their software. They want to be your God.) So they don't tell you how to get yourself loose from them. And they try to avoid even telling you how the trust relationship works with them, as well. You aren't supposed to think about such things.

If you are using Mac OS X or other BSD, or a Linux OS, you may have gpg available. If not, you probably have somewhat stand-alone commands like sha512sum (sha256sum, md5sum, et. al.) available. These will help, even if you can't directly use whatever packaging system the OS provides.

More importantly, with a Linux or BSD OS, you should be able to find gpg in your distribution's packages and just install it from there. But you really should know what's happening, so let's dig in.

Step 1: Go to the Cygwin main site. 


The URL is http://www.cygwin.com. But don't trust me without verifying. Go to your favorite search engine, google, Duck-duck-go, yahoo, bing, etc., and type "cygwin" into the search form. Look at the URL reported and compare it to the one I am giving you. They should match (for the foreseeable future).

(And this goes more or less the same for Ubuntu or openBSD. Search the web. Read some of what you find. Check the domain names.)

Now you have two witnesses, but you can search on other search engines and in other browsers, just to be safe. The more witnesses, the more eyes, the broader the view, the better you can see.

And you need to know what you are looking at.


"HTTP" (or "http", it doesn't matter in the current Internet) is the name of the application level protocol -- the rules under which communication takes place. It is not part of the identity of the site, and these day it seems to be the universal set of rules but there are others, for example, "ftp" for basic file uploading and downloading, and "smtp" for simple exchange of e-mail messages. Look up "OSI Application layer" for more information.

"HTTPS" ("https") is a somewhat secured form of http. People depend way too much on it, but it does have some advantages.


"Www.cygwin.com" is the domain name above. Again, capitalization (i.e., upper-case and lower-case) doesn't matter. But there are three parts to this one.

"com" is the top-level domain ("TLD"). It was originally supposed to be for international commercial entities like IBM, HP, the Mitsui Group, Dow Chemical, BP, and such, or, rather, for the sub-domains of the Internet that they would have assigned to them. But a lot of USA companies (which should have registered their subdomains under "co.us", but that's a long story) jumped into that domain as well. ("Apple.com", for instance, had international ambitions. "Ford.com" was already international.)

Another TLD of concern here is "org", for international non-profit organizations like, well the Free Software Foundation ("fsf.org"), the International Red Cross ("icrc.org"), and many international churches, like the Church of Jesus Christ of Latter-day Saints (originally "lds.org").  Again, a lot of USA organizations jumped into this, especially family and family history organizations. (I am not involved in "rees.org", if you are curious. At least, not when I write this.)

New TLDs like "biz" and "name" have altered the landscape, but this should give us enough to bootstrap our understanding of TLDs and second-level domain names.

"Cygwin.com" is the full domain name to the second level that we are primarily concerned with. "Sourceware.org" is another domain we will be interested in.

Anything within the cygwin.com domain will be addressed with a period on the left, and a third level domain to the left of that. "Www.cygwin.com" is the subdomain within Cygwin's domain which serves as the front door. "

Looking at the URL "https://www.cygwin.com/install for a moment, "Www.cygwin.com" is a functional subdomain within cygwin.com. Www.cygwin.com/install is a document subdomain, kind of like a file-system folder, within the www.cygwin.com domain.



One last thought, when connecting with the http protocol, everything you do is plainly visible to the servers and lines your data passes through. The server admin on some connecting point can tap your datastream and eavesdrop. When connecting on https, two things happen -- well, three.

One is that everything is encrypted. Without a password-like decryption key of some sort, the admin on any connecting point can't read it.

The other, and more important, thing is that your computer and the server on the other end dance this little dance of words that allows them to trade those keys without the important parts of the keys themselves becoming visible to the in-between server admins and others who might tap the lines -- in theory. So far, the theory holds well enough.

The encryption uses what is called asymmetric encryption, where there are two keys, and one key is used to encode, the other to decode, and the two keys can't be computed from each other. And within the dance, these things called encrypted certificates are used. If your browser has the other guy's certificate, he has to provide the encrypted certificate to prove he owns the full domain he is serving to you.

It's this last thing, the certificate, that means that (again, theoretically) when the server for www.cygwin.com provides you the webpage https://www.cygwin.com/install to you, you can be reasonably confident that it is what the cygwin organization put up there for you.

(I'm not explaining everything. You can read more elsewhere. It's not a perfect protocol, but it's way better than no reassurance at all.)


Write the URL down on a piece of paper and bookmark this blog post and shut your browser down. Hit the close button on every window, so you don't have any open browser windows keeping unseen code alive and running. Maybe log out and back in, or even reboot the computer to kill off all background jobs.

When it boots back up and you are logged in as a non-admin user --

Oh. Okay. Back up. Step zero. 


If you don't have a a non-admin user account on your computer to do your browsing and downloading with, stop now and set one up. Write down the URL of this blog before you go, so you can find it again:
https://joels-programming-fun.blogspot.com/2019/04/bootstrapping-your-freedom.html

You'll type that into the URL blank in your browser window when you've booted back up and logged back in on the non-admin account, and your browser is up.

Yeah, it'll take a bit of time, an hour or so, but it's worth the trouble, so go do it.

Why do all this? It is very easy for sneaky bad guys to run programs in your browser, where you don't see them running, and change just about anything that you see, including displayed URLs, etc. But if you haven't gone to a website they've owned since booting up, logging in, and launching your browser, it's much harder for them to do so.

And if you aren't on the admin account, at least the sneaky stuff can't start playing with your application and operating system software behind your back (without finding a new hole to sneak through). It's not perfect, but it stops the amateurs.

And if the sneaky bad guys dump eavesdropping tools or worse in your non-admin account, if worst comes to worst and you can't clean the mess up with a good anti-malware tool (bleaugh), you can log out, go to the admin account, and just erase the non-admin account and start over. If you can't clean it up properly, you may lose some stuff, but not everything.

So you really should not be working in your admin account, ever, even if your company thinks they don't want you to set up a separate login. Too many things can go wrong, including accidentally going to a dangerous website (via one of those ads, etc.) and getting a sneaky app running in the background, monitoring what you type when you type passwords, etc.

Yeah, it means you have to log out and back in when you load new software or change system-wide configurations, etc. But consider the time and resources lost to one bit of malware. Look at what happened to Omega Engineering (the time bomb set by Tim Lloyd) when they failed to properly administer their stuff. And consider that similar stories have played out repeatedly on every operating system in common use, including the one you are using now, when systems are not properly administered.

I'll be happy to tell you how bad Microsoft software, including the OS, is, but it would not be nearly so bad if the users simply rejected Microsoft's fair winks and implied sweet promises, and did sensible things like not doing ordinary jobs when logged in with admin privileges. Any other OS is going to turn into a sieve if you let it, too.

(Yes, improper system administration was one of the reasons it was so easy for Lloyd to do so much damage. In a perfect world, Omega's management would have been tried for mismanagement of their computer systems and would have been required to pay a significant part of the government's costs in prosecuting the case. In a perfect world.)

(If you can't convince your company to let you use separate login accounts for your separate purposes, at least keep your on-line habits clean. And do the downloading and checksum checking with a freshly rebooted session.)

Okay, now you're back and running in a non-admin account. Hopefully.

Back to step one:


You've already done a couple of websearches for Cygwin and confirmed the URL. Type it into your browser URL bar:
http://www.cygwin.com
Read through that page.

There are a couple of links to installation instructions, over in the index on the left, at the top, and also down in the Installing Cygwin paragraph. Either one will take you to something like
https://www.cygwin.com/install.html
(Hopefully, that won't move.) Read through that page, especially the "How do I verify the signature?" paragraph in the Q&A section. It tells you how to use gpg (GNU Privacy Guard, or gnupg) to verify the download against its developers' fingerprint and signature. If you have gpg or Pretty Good Privacy, those instructions will be very useful, although you may still want to read the rest of steps here.

And, for instance, if you can download the setup file on a system where you do have Gnupg or Pretty Good Privacy, check it there, and move it from there to your MSWindows box, that's a pretty good way to get things rolling, too.

The next question on that page is "What's the hash of setup?" And answer takes you to something that looks like this:
https://cygwin.com/sha512.sum 
where there are a couple of long gobbledygook-looking strings of code-looking stuff called checksums, one for setup-x86.exe, and one for setup-x86_64.exe. (That one is likely to move when SHA512 is no longer too hard to crack, hopefully at least five years from when I write this.) I'm not going to repeat the checksums here because they will change, and I won't be able to post the new one after every change. (Checksums are not keys, by the way, and these checksums are not encrypted. That's just the way they look -- these are really, really long hexadecimal numbers.)

Take the time now to review the licensing terms. There is a link to those in the index on the left. At this point in time they're at
https://cygwin.com/licensing.html
You need to have some level of understanding of the licensing terms there and the legal implications. (This is true of all software and all user licenses, and the GPL is one of the easier licenses to understand and to conform to.) Follow the link to the GPL and read through it. Don't trust my summary that it means you can use the stuff freely as long as you respect everyone else's freedom with it, at least not without reading it once or twice yourself.

(When you start understanding the GPL is when you start understanding Free-as-in-freedom software.)

Okay, that probably took several days to get through. Good. We're making progress.

Now we are ready for step three.


Look for the link to the mailing lists in the index on the left. The URL should look something like
https://cygwin.com/lists.html
Browse through the list archives. You should probably register for the general discussion list, or you may prefer to just read a few of the interesting topics in the archives.

Here is how to subscribe to the discussions list:


Follow the link to the mailing list FAQ under "Notes". The URL should look something like
https://sourceware.org/lists.html#faq

Full stop!


The top-level domain is not the same as the cygwin top-level domain. That tells you lots of things, but I'm not going to explain all of them. Just one.

If you go back and forth, you'll see that the TLD is, in fact, different between sites, and that the cygwin.com site does link you to the sourceware.org site in many places.

Unless someone has been able to sabotage the cygwin.com site, we should be able to trust that the sourceware.org is officially related, since the link to it is on Cygwin's page, in the cygwin.com domain.

That's a big if, and it would also help to have some external confirmation.

My mention here is one form of confirmation, but not to be completely trusted.

One way to gain confirmation is to take the time to read through the archives a bit. Another is to subscribe and ask questions, but you may want to hold off on questions until you've done research. Anyway, hold off on subscribing until a day or two from the day you first read this paragraph, maybe even a week later.

Get information from other places. (This is called out-of-band information. I'll have to explain that some other time, but keep the term in mind. Out-of-band information is essential to establishing and keeping a chain of trust.)

See what changes occur to the site over the course of a week.

By this time, you will have noticed the subscription form on the page mentioned above. Use that form, or, if you want to be a geek, you can send the subscription request yourself, by interpolating the address from the information given on unsubscribing. (And your first guess will probably be wrong, as you'll see after you register.)

Just use the form:
Mailing list name: cygwin
Your e-mail address: you.know.it@yourprovider.whatever
[ ] Digest version  subscribe   [Send request]
Click the digest version only if you want each day's conversations in a single bundle. It may help you mute the noise. I use mail filters, instead, myself.

Note the announcements archives. The URL looks something like this:
https://cygwin.com/ml/cygwin-announce/
At the time I write this, I'm surprised that they don't announce the checksums for setup-YYY.exe in this announcement mail list, in the announcements for the new versions of setup. I hope they will change that policy, because every place the checksum is posted after it changes is one more place you can confirm the current checksum hash value, one more place the sneaky bad guy has to get his hands into to keep the illusion intact.

These mailing lists actually provide a significant piece of the trust puzzle for many reasons. Unless they are staged (and, yes, they could be staged, but that's a lot of work to go through), they provide proof that there are real humans behind this operation, real humans putting their heart into things and putting their reputations on the line.


This is the ultimate baseline of every chain of trust, 
if not the bottom line of meaning itself: 
Effort and experience,
records and community, 
and taking risks. 
Consequences.


(I need to write up a little analysis of that or three sometime.)


Now we are ready for step four:


Hopefully, you've noticed the link to something called mirrors by now, up there in the Cygwin pages index on the left. If you've wandered into those, through a URL something like this:
https://cygwin.com/mirrors.html
you may have found file directory lists, and in those lists you may have found checksums. For instance, if we look at the mirrors in Japan, we find iij:
ftp://ftp.iij.ad.jp/pub/cygwin/
(I activate the link there because I think it's okay for you to link into mirrors from this blog post.)
In the iij mirror for Cygwin, you see at the root something like this:
md5.sum       45 B    2018/01/31 9:00:00
noarch/                    2018/09/25 9:00:00
sha512.sum  138 B   2018/01/31 9:00:00
unsupported/           2015/02/05 9:00:00
x86/                        2019/04/23 9:44:00
x86_64/                   2019/04/23 9:44:00
If you look at that sha512sum, it is not what you've seen before. Why not? It's the checksum for something else, for this directory, to be specific.

If you click on x86_64 here, you'll find something odd. There are a bunch of setup files, but none are the setup-YYY.exe that you can download from the cygwin site.

(I'm going to try to convince them to find some way to get a checksum for setup-YYY.exe in that part of the mirror sites, so that the mirrors can become a more effective part of the chain of trust, but at this point they are not there.)

This is a little awkward for me, for this blog post, but I think we can overcome this problem. You probably want to install the x86_64 version of setup, so let's get the two checksum files from the x86_64 directory,

  • md5.sum
  • sha512.sum

Make a directory (new folder) inside you downloads folder and call it "cygwinstuff". Save these two files, or change their names after downloading, to
  • md5-iij64.sum
  • sha512-iij64.sum
inside your download\cygwinstuff folder. Don't download other files just yet.

Now go to three or more other mirror sites at random and get the same checksum files from them, renaming them so you know which is from where. Right-click and Save-as, changing the names as you save can speed the process up a bit. You'll end up with something like this:
  • md5-iij64.sum
  • sha512-iij64.sum
  • md5-hkk64.sum
  • sha512-hkk64.sum
  • md5-cymru.sum
  • sha512-cymru.sum
But don't pick these exact three, because they are the three I picked, and then you are no longer picking at random, or at least not sufficiently at random.

Now, use the command-line to compare them. Underneath the infamous start menu in MSWindows, you'll find the folder of accessory programs. In that folder, you'll find the command-line shell or prompt. Open one of those terminal windows and then open up a file manager prompt. Change directories ("cd") to the download folder with the gygwinstuff:
C:>cd C:\Users\me\Downloads\cygwinstuff 
You can drag the folder to the terminal window to save some typing on the path name. Check you are in the right place with dir. Mine shows the following:
C:>dir
...
2019/04/26  23:39    <DIR>          .
2019/04/26  23:39    <DIR>          ..
2019/04/26  23:39               367 md5-cymru64.sum
2019/04/26  23:35               367 md5-hkk64.sum
2019/04/26  23:31               367 md5-iij64.sum
2019/04/26  23:39             1,132 sha512-cymru64.sum
2019/04/26  23:35             1,132 sha512-hkk64.sum
2019/04/26  23:31             1,132 sha512-iij64.sum
...
Now do the file compare command like this:
fc sha512-cymru64.sum sha512-hkk64.sum
I also compared iij with hkk, but I could have compared cymru with hkk. You should make sure each file of the same information is compared with one other of the same information. Do the same for the md5 files, just to be thorough. It should tell you there is no difference in the three sha512.sum files and no difference in the three md5.sum files.

If it doesn't it's probably that the mirrors are in the middle of updating the files, and one is behind the others. Check the file dates for clues. If the dates are the same day you're trying to check, try again the next day. If they are still different after that, you'll need to ask on the general discussion list if anyone knows why. Please ask, as it could indicate tampering.

Why do I suggest this, and why pick, at random, three sites that are physically far apart? The idea is that it will be harder for more than one of the mirrors to be compromised, and even for a man-in-the-middle to capture and successfully spoof the responses for your requests to all three sites. 

It's going to cost a lot of time and other resources to do the spoofs and/or compromise the servers, and time is money, and if you are worth that much to someone with that kind of money, well, you should be taking extra precautions. For example, you can compare files from two more mirrors, and connect physically from some other location to do so. And avoid wireless connections, at home, anyway.

This checking information from multiple mirrors
is the principle of
multiple witnesses

It isn't perfect proof, but, combined with the above, it gets you pretty close, relative to what can be done in this world.

We are almost there.

Step five:


Windows does not include the tool sha512sum.exe by default, but it does include a utility called "certutil". You use it for this purpose like this:
C:> certutil -hashfile sha512-cymru64.sum MD5
So we should probably be able to check the checksum files we downloaded. Or, at least one of them. It appears that the SHA512 checksums are calculated first, so the entry in that file for md5.sum is different. The rest should match up.

Here's how to check the MD5 checksum for the sha512.sum file. Open the md5.sum file in a text editor. (Start->all applications->accessories->{ either notepad or wordpad}.) With notepad, you can just drag the file into an open notepad window, but then you'll have to add line endings, and that can be tricky.

With wordpad, open an empty document, then use "Open file" from the file menu to navigate to the downloaded files. Wordpad will get the line endings right.

Go back to the command-line shell window and use right-click to select and copy:

  1. mouse right-click to start select mode, 
  2. select the text with the left button and drag, 
  3. right-click to capture 

(on current versions of MSWindows). Then paste it into the open text editor window, underneath the line for the sha512.sum file. On mine, it currently looks like this:
[... other files ...]
19a3348f30d4b718dc887f6dec0dd716  sha512.sum
19 a3 34 8f 30 d4 b7 18 dc 88 7f 6d ec 0d d7 16
Microsoft puts spaces in where you don't need them, but that's okay. Start at the left and delete the spaces, and watch the numbers line up. As you go, you'll watch them fall into place, and you'll know if they match or not:
[... other files ...]
19a3348f30d4b718dc887f6dec0dd716  sha512.sum
19a3348f30d4b718dc887f6dec0dd716
(Note that the actual values of checksums that you see above were valid on 27 April 2019, around 11:00 am JST, and will shortly change and not be valid.)

If you don't want to copy from the command-line window, or can't make it work, you can redirect the output:
C:>certutil -hashfile sha512-cymru64.sum MD5 > md5hash_sha512_sum.txt
(Keep it on one line: hit space, not return, after the ">", then hit return after "..._sum.txt".) Then you can open both files and to copy-paste from file to file.

open a text editor window and paste the checksums in.

What do we know from comparing the MD5 checksums for the sha512.sum file with the value we downloaded in the md5.sum file? We aren't absolutely 100% positive, but we are better than 99% sure that the sha512.sum file is identical to the file that the Cygwin folks generated of the checksums for the download files here.

Why?

One reason is that the files are identical across mirrors. There would really have to be some serious collusion, or a very coordinated and skillful attack, for the posted checksums to have been altered. Or, if deceiving you (or perhaps someone near you) is worth tens or hundreds of thousand of dollars to someone, there would have to be a very skillful man-in-the-middle attack on the route from where you are to the servers.

The other reason is that you have just matched the checksums. (MD5 is weak, but not on a file this small.) Yeah, if someone had pulled off an attack on you, they would have been smart enough to alter all the files together, but that also increases the odds of discovery, and makes the attack that much harder to pull off successfully.

You should be able to download the file you check from any server you've been to, or pick another mirror. Just make sure the dates are the same. Pick, for instance, setup.bz2.sig, making sure to save it in the cygwinstuff folder underneath Downloads, or to move it there after downloading.

[This is where my confusion comes in. I was expecting setup.bz2 and setup.xz to be the installer. But they are not the installer, they are configuration files, which are not really useful. So, this whole line of reasoning is undone.

If the good folks at Cygwin had published the installers, setup-x86.exe and setup-x86_64.exe in the files for mirroring, you could build a temporary case for trust like this, but that is not what they did. Perhaps they don't think it's as trustable as I think it is. Anyway, while I can borrow pieces of this post for other posts I want to put up, this post becomes specious reasoning at this point. Not a complete waste of time but definitely a wrong path.

Mea culpa.]

Let's do the checksums, redirecting the output:
C:>certutil -hashfile setup.bz2.sig MD5 > checksums.txt
C:>certutil -hashfile setup.bz2.sig SHA512 >> checksums.txt
Yes, the first is ">" and the second is ">>". This appends the second checksum into the first file, because I don't like extra files lying around. Copy the checksums between files and see how they line up. If they look good, let's go after one more file.

If you can decompress .xz format archives on your computer, you will prefer setup.xz. But go ahead and get the .sig file, too. We want it after we install Gnupg. If you can't decompress either .xz or .bz2 files, we'll be stuck with setup-YYY.exe.

I really wish the folks at Cygwin would make the setup-YYY.exe file they have on their main page available in the mirrors. And that they would post the contents of the sha512.sum and md5.sum files on their main site, as they have posted the checksums for setup-YYY.exe. Why? Although it is not 100% sure, it would give us two more places to check, and it would make the connections that much more clear. And it would make the setup-YYY.exe files that much less susceptible to man-in-the-middle attacks.

However, with all the out-of-band information we have been able to gather, we have a reasonably high level of confidence in the setup.bz2 or setup.xz installers from the mirrors. If we do the last check.

Download setup.bz2 or setup.xz and let's check it.
C:>certutil -hashfile setup.bz2 MD5 >> checksums.txt
C:>certutil -hashfile setup.bz2 SHA512 >> checksums.txt
Or setup.xz if that is what you are downloading.

Note something very important: if you don't already have tools for decompression .bz2 or .xz files on your computer, it won't help to load those tools from some random website out there. All that will do is shift your vulnerable points to some place even more vulnerable.

Winzip, 7zip, wherever, without some way to check full signatures, even publishing the checksums doesn't really help. All it tells you is that the file wasn't zapped by network problems or file system problems. The checksum by itself doesn't help you be sure that what you got is what the developer put up for downloading. Man-in-the-middle attacks are going to become more common in the future, so we need more than what they give.

That's why I am recommending cygwin if you can't go with a well known Linux or BSD distribution. There are mirrors, and the mirrors can give you quite a bit of reassurance.

You'd have more reassurance with the checksums for setup-YYY.exe in the mirrors and in the announcement lists. (Sorry to be so incessant on this, but the point must be understood. Checksums are not fingerprints.)

We'll use append mode for both this time, and keep the checksums.txt file around when we are done, just because we can. Pull the output into your text editor, delete the spaces, and watch the checksums line up. If they don't match, ask about it on the Cygwin discussions mailing list, please. If they do, you can now run the installer with a pretty high degree of confidence.

Not as high a level of confidence as using the fingerprints, but a reasonably high level of confidence at the time I write this. Until man-in-the-middle attacks become more common, it's a fairly reasonable level of confidence, even without the externally published checksums.

How do you get a higher level of confidence? Go to a Linux or BSD users' group meeting, exchange keys with ... wait. You don't have keys to exchange. If you didn't bootstrap your chains of trust when you were in college, you've still got the chicken-and-the-egg problem, and, if you did, hopefully, you aren't having to do this all over again. Other than something like what I have described here, the best approach you can take is to have a friend whom you trust burn you a copy of Linux or BSD distribution media, or perhaps the Plan 9 distribution or some such, and install a properly free-as-in-freedom OS.

Or buy a Mac. That'll work pretty well, too, just remember to get the developer tools when you do so.

Before you install Cygwin, save these files away, outside of downloads, say in C:\my\installers\cygwinstuff or something like that. You want the hashes for later reference. Copy the whole directory over, and, if you want to be as sure as possible, you can re-run the checksum checks after copying, to make sure there are no file system problems in your computer. Or use one of the MSWindows file comparison tools.

From here, follow the instructions on the Cygwin install page.

Launch the setup-YYY.exe the usual way, walk through the install guide, install the base system first. Once those are installed, start again from the front, walk through the initial screens, then when the packages screen comes up, view by category, type "gnupg" in the search field. You'll find gnupg in utilities. I've installed both 1.4 and 2.2.

This gives you tools to check checksums and digital fingerprints and such.

I should write a post about using gnupg, too.

Anyway, once you have gnupg installed, you can check out packages I've listed above for whatever tickles your fancy. And you can also download other stuff that isn't in Cygwin, like libreoffice, and you have the tools to check their fingerprints.



Saturday, April 20, 2019

Q&D Prime Sieve in Color Computer BASIC?

Okay. So the post on writing a C program to check the time required to count N primes wasn't my best offering on the subject of primes.

This will be even more bizarre, although closer in spirit to the post in the Vintage {computers | microprocessors | microcontrollers } FB group on a Fortran program for counting primes.

Let's count primes less than 256 in Tandy/Radio Shack Extended Color Computer BASIC!





1 REM A PROGRAM TO COUNT PRIMES IN COLOR COMPUTER BASIC
2 REM JOEL MATTHEW REES, APRIL 2019, AMAGASAKI, JAPAN
3 REM COPYRIGHT 2019, JOEL MATTHEW REES
4 REM MAY BE FREELY USED FOR NON-COMMERCIAL PURPOSES.
10 MAX=25520 DIM PRIMES(MAX)
30 FINAL = MAX/2
40 PRIMES(0)=0
50 PRIMES(1)=0
60 FOR I=2 TO MAX: PRIMES(I)=1: NEXT
70 FOR I=2 TO FINAL
80 FOR J=I+I TO MAX STEP I
90 PRIMES(J)=0
100 NEXT J
110 NEXT I
115 COUNT=0
120 FOR I=0 TO MAX
130 PRINT I;" IS ";
140 IF PRIMES(I)=0 THEN PRINT "NOT "; ELSE COUNT=COUNT+1
150 PRINT "PRIME."
160 NEXT I
170 PRINT COUNT;" PRIMES LESS THAN "; MAX; "."



(This is essentially the same algorithm as in this post on a small sieve, which also points to some implementations of the same in Ada and Forth.)

(You can download this as a file on my OSDN Japan account, here. I was not successful in using imgtool to put it back in a DECB disk, however.)

And let's see how long it takes. (Counting seconds in my head.)


Constructing the sieve itself takes about ten seconds, printing the sieve (Not very meaningful when the list mostly ends up scrolling off the screen.) takes another nine.

54 primes less than 255. Yep. That other program says the same.

255 numbers in 10 seconds is about 25 numbers per second, average, although average here does not mean what it seems to mean.

Now this program is very easy to modify to count the first thousand or more primes up to something like 15,000 on a 32K Coco.

First, check your RAM:

    PRINT MEM


 If it says something over 22,000, there should be plenty of room. Let's try 8,192, a nice round number. Edit line 10:

    EDIT 10
    10 MAX=8192
    RUN



Woops!

Byte arrays in BASIC aren't done that way. In fact, they aren't hardly done at all. We would have to go get BASIC09 to do byte arrays in a BASIC on the Color Computer, I think. (Or we could do it in BIF. Or assembler.)

(Would be faster. Maybe I need an excuse to fire up OS-9. Not Mac OS-9, Microware OS-9, which well predated the Macintosh itself. Tandy/Radio Shack missed serious opportunies on that. Tandy's stupid mistake of not letting the company's products compete with each other, as if there were only one prize to be won. Making money linear is the best way to lose. Put down the mike, I know, I know.)

Let's try again with only 1,024.


About 45 seconds to calculate the sieve, and another 35 to scroll the list off the screen. 


Says 172 primes less than 1024. That checks with that other program, too.

Checking 1024 numbers in 45 seconds averages 1024/45 numbers per second, or about 23 per second, which seems nearly linear to the earlier time.

And it looks like we have some memory left. In fact, it looks like our array only costs about 4 bytes per integer, which means we should easily be able to get away with 4,096. And time may be linear.


But I'm going to use the clock instead of counting for more than 3 minutes.

Expect, if the time is no better than linear, something like 4096/23, or 178 seconds, about three minutes.

Executing it. Yep. About three minutes to calculate.

Why should the time be nearly linear? The print loop is just a single loop, not nested, no branching. That makes sense.

But the sieve itself is nested, scanning the array and blocking out non-primes once for each prime it sees. But, first, the scan is only from the prime's double, skipping by the prime, so the later loops become really quick. And the loop ignores scanning on the non-primes, because those have all been picked up by the first loops. So the scans really don't take nearly as much time as we might think.

The scans after initialization will be half the count of the initialization, then a third, then a fifth, then a seventh, ...

Printing takes a rather long time. But that's always true. Almost three minutes to print the count.


564 primes less than 4,096. Check that? Yep. That other program agrees.

If we could extrapolate to 100,000,000 (which we can't, at least not this way), we would expect the Color Computer to take (100000000/23)/(60*60*24) days, or about 50 days. But the Color Computer would need about 500 megabytes of RAM, quite possible on modern desktop computers with multiple gigabytes of RAM, but not on the Color Computer.

Maybe a sieve array that big could be simulated on hard disk, but that would incur more time costs.

Left over memory with just 4,096 entries -- 1,850. If we want to do more on a Coco, we need a more sophisticated method.

Also, since the list scrolls off the screen, it would make sense to, instead of printing the list, write a loop that would allow the user to type in a beginning and ending number and list primes between the two numbers.

Those are left as exercises for the reader, for now. (There is a way to check whether a number is prime after this program ends. Do you know how?)

Sunday, February 24, 2019

Taking the Sieve Too Far

A post on a Fortran program for counting primes in the Vintage {computers | microprocessors | microcontrollers } FB group inspired me to dig an old post on small primes back up and refurbish it a bit.

(If you don't know how this works, refer to the post on small primes.)


/* Archetypical implementation of the sieve of eratosthenes in C
** By Joel Rees, Amagasaki, Japan, 2015, 2018, 2020
** All rights reserved.
** Permission granted by the author to use this code
** for any purpose,
** on condition that substantial use
** shall retain this copyright and permission notice.
*/

#include <stdio.h>
#include <stdlib.h>

#define MAXSIEVE 100000000L

#define FINALPASS ( ( MAXSIEVE + 1 ) / 2 )

char * sieve;

int main( int argc, char * argv[] )
{
    unsigned long index, prime;
    unsigned long count = 0;
    /* BUG! unsigned long maxsieve, finalpass; JMR20200101 */
    unsigned long maxsieve = MAXSIEVE;
    unsigned long finalpass = FINALPASS;
    char * unscanned;

    if ( argc > 1 )
    {
       maxsieve = strtoul( argv[ 1 ], &unscanned, 0 );
       if ( ! ( unscanned > argv[ 1 ] ) || ( maxsieve > MAXSIEVE ) )
       {  maxsieve = MAXSIEVE;
       }
    }
    finalpass = ( maxsieve + 1 ) / 2;
    sieve = malloc( maxsieve );
    if ( sieve == NULL )
    {
       printf( "Can't allocate %lu byte array, quitting.\n", maxsieve );
       return EXIT_FAILURE;
    }

    sieve[ 0 ] = 0;
    sieve[ 1 ] = 0;
    for ( index = 2; index < maxsieve; index = index + 1 )
    {   sieve[ index ] = 1;
    }

    for ( prime = 2; prime < finalpass; prime = prime + 1)
    {
        if ( sieve[ prime ] == 1 )
        {   for ( index = prime + prime; index < maxsieve; index = index + prime )
            {   sieve[ index ] = 0;
            }
        }
    }

    for ( index = 2; index < maxsieve; index = index + 1 )
    {
        if ( sieve[ index ]  != 0 )
        {
            ++count;
        }
    }
    printf( "%lu primes less than %lu\n", count, maxsieve );
    return EXIT_SUCCESS;
}



This version only counts primes, and it alows you to specify the largest number.

Execution time on this old 1.2 GHz single processor:

$ time ./primecount 100000000
5761455 primes less than 100000000

real    0m14.008s
user    0m13.733s
sys    0m0.196s

It uses the same sieve-in-array approach as the small primes example, so you probably should not run it at a hundred million if you don't have at least a gigabyte of RAM.

I hope there's no bug and that's the correct answer.

A note:

I had the array statically allocated at one point, and the compile actually took several seconds to write that big empty array. But the execute time was not quite a second shorter. The extra execution time for the dynamic array may have simply been the system thrashing after the previous version (because this box only has one gigabyte of RAM and a 100 Meg of a Gig is enough to push things around in virtual memory).

[JMR-2018-02-28
And here's how it runs under Cygwin on a much more recent 2.5 GHz dual core Intel CPU:
$ time ./primecount 100000000
5761455 primes less than 100000000

real    0m11.349s
user    0m11.107s
sys     0m0.062s

It's only using one core, but why doesn't the double clock speed halve the execution time? I dunno. You tell me.

I could use the second core by forking the process and having the second process lag behind the first one. The algorithm becomes semi-non-obvious:

First, the first processor sets the flags in the lower half of the array, and the second process sets the flags in the upper half. Maybe bus contention would not eat up the savings in halving the load.

We'll start the parent process on a single pass at multiples of 2, and it will quickly try to do this to the first part of the sieve array:

0123456789 10111213141516171819 20212223242526272829
0011010101 0101010101 0101010101

We'll start the child process on a single pass at multiples of 3, and it will quickly try to do this to the first part of the sieve array:

0123456789 10111213141516171819 20212223242526272829
0011110110 1101101101 1011011011

Note that the effects are independent of each other.

Now, if either the parent or child finishes its first single pass before the other finishes even starts its first, the only damage is an unneeded extra pass on 4, and maybe even 6, 8, and 9. For small arrays, trying to use both processors could end up being (slightly) slower. For large arrays, we can be pretty sure that there won't be more than a few unnecessary passes.

Ideally, by the time one or the other finishes its first pass, the other will have already made its way through the first 15 of its pass, and it will have

0123456789 10111213141516171819 20212223242526272829
0011010100 0101000101 0001010001

and this is good enough to let the process avoid wasting a pass on 4 and start 5.

But the next process to finish will also see 5. (Check for yourself why probing the second or third in the pass isn't sufficient.)

So it looks like we need a shared variable for each to store its current pass on. And there will be some non-zero probability of read-interleaved write, so we should probably protect it with some semaphore or other mutual-exclusion technique.

This shared variable will help at the end, as well, when passes become very quick.

Lack of such mutual exclusion won't make it give the wrong result, it will just result in wasted passes.

Maybe I'll do the code for multiple processes later on.
]