Why You Need an Open Source Software Governance & Provenance Framework

10-Jul-2013

First, lest there be any doubt: open source and the Open Source movement are good. Even if you take the most conservative route and do not use open source at all, the code tends to be well written, well documented, and well factored, and simply studying it can serve as a best-practices guide for your own software development efforts.

The issue is that an alarming amount of open source usage does not involve the source at all but rather pre-built binaries, typically (but not always) downloaded from the same site that hosts the source. Platform-independent binaries such as Java jar files make it tantalizingly easy to grab a jar from the internet, drop it into your classpath, and begin to use it.

You probably wouldn't download somethingYouFound.exe from someplaceOnTheNet.co.br and just run it on your PC.

Why would you do it for Java?
Because the Java Duke character is cute?
Because Java and the Open Source movement are hip?
Because the Open Source community is beyond malfeasance?
You think Java is immune to viruses and malware?

Remember that the argument "everyone else is using it, so it must be OK" is flawed for a number of reasons, not least that "everyone else is using it" is not acceptable corporate governance.

Perl, Python, and Ruby environments also tend to run afoul of this problem, especially Perl, with the thousands of modules hosted in the CPAN environment. Although there is typically more compile activity in using open source from these environments than with Java, the maturity of the highly lubricated download/build/install frameworks in these environments leads to the same sort of "blindly capture, process, and install" behavior.

In The Old Days, code was either built in-house or purchased from a vendor. In both cases, the software development lifecycle (SDLC), including design review, testing of all kinds (functional, performance, security/robustness), and the build/release process, was well known. Yes, the design and utility of some of that code was poor. But the governance and certainly the provenance of the code was clear. Increasingly, internal and external audit functions are requiring a higher standard of code stewardship and management for a number of reasons, not the least of which is an increased level of cyber-borne threats. It is very likely that nearly 100% of Java applications today contain significant (>10%) amounts of open source-originating software for which no clear provenance exists, e.g. exactly where did that jxpath.jar come from, and how are you sure it was built cleanly from the proper source code? In some cases, the unprovenanced percentage is much higher. And ironically, in a large development organization, diligent groups that embrace the tenets and value of open source management will end up duplicating effort by building and managing the same open source component in multiple locations.

It is therefore vital that, if you use open source, you create a comprehensive enterprise-wide process to categorize and manage it.

Steps to Take to Provide Open Source Governance & Provenance

First (and this is probably the hardest nut to crack), modify your software practices policy to forbid the downloading of any binary executable content that does not pass through an authorized channel intended to provide some sort of front door protection/inspection of the content. At the absolute minimum, the activity needs to be logged and reviewed on a weekly basis as a reactive control.
Create a component repository that is independent of both consumption and build tooling. This can be a filesystem or a blob-oriented database. It is not Maven; Maven is a Java-oriented ecosystem built around builds and build dependencies, involving a repo component, a client-side build utility (mvn), and specific handling of dependency resolution. The purpose of the repository is to carry source, build instructions, build output, and governance/provenance info (more on this in a moment). The repository should be granular to the version of source and support holding multiple builds of that source. There is a nuance here: although you can use an SCM (git, ClearCase, CVS, etc.), in theory you don't need to, because you're not doing comprehensive work on the source. In fact, it's more about associating a single, unchanging image of source with a compiled output, and that goes beyond the traditional boundary of SCM.
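As a rough illustration of the shape such a repository might take, here is a minimal in-memory sketch in Python. Every class, field, and label in it (SourceImage, BuildOutput, ComponentRepository, the build ids) is invented for this example and is not taken from any existing tool; a real repository would sit on a filesystem or blob store.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class SourceImage:
    """A single, unchanging image of the source for one component version."""
    component: str            # e.g. "jxpath"
    version: str              # version of the source
    source_digest: str        # digest of the stored source archive
    build_instructions: str   # how to build it (free text or a script path)


@dataclass(frozen=True)
class BuildOutput:
    """One compiled output produced from a specific source image."""
    build_id: str             # hypothetical label chosen by the builder
    artifact_name: str
    artifact_digest: str
    provenance_notes: str     # governance/provenance info for this build


@dataclass
class ComponentRepository:
    """In-memory stand-in for a filesystem or blob-oriented repository."""
    sources: Dict[Tuple[str, str], SourceImage] = field(default_factory=dict)
    builds: Dict[Tuple[str, str], List[BuildOutput]] = field(default_factory=dict)

    def add_source(self, image: SourceImage) -> None:
        key = (image.component, image.version)
        # The source image for a given version must never change.
        if key in self.sources and self.sources[key] != image:
            raise ValueError(f"source image for {key} is immutable")
        self.sources[key] = image
        self.builds.setdefault(key, [])

    def add_build(self, component: str, version: str, build: BuildOutput) -> None:
        key = (component, version)
        # A build output is only meaningful when tied to a known source image.
        if key not in self.sources:
            raise KeyError(f"no source image registered for {key}")
        self.builds[key].append(build)

    def builds_for(self, component: str, version: str) -> List[BuildOutput]:
        return list(self.builds.get((component, version), []))
```

The point of the sketch is the association: one immutable source image per version, and any number of build outputs hanging off it.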
As a stretch goal, the component repository should declare the component versions of dependencies used at both compile-time and link-time (test drivers and other actual executables). This is similar to Maven functionality but, importantly, the repository has no built-in logic or assumptions about how to perform transitive closure on the dependency graph, and especially none about how to resolve the common problem of dependency version mismatches.
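To make the split between "declare" and "resolve" concrete, here is a small Python sketch. The repository side is nothing more than a table of declarations; the graph walk and mismatch report live in a separate utility that only reports conflicts and does not pick a winner. All component names, versions, and function names are made up for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

ComponentId = Tuple[str, str]  # (name, version)

# Declarations only: what the repository would store per component version.
# These components and versions are hypothetical.
DECLARED: Dict[ComponentId, Dict[str, List[ComponentId]]] = {
    ("app-lib", "2.0"): {
        "compile": [("jxpath", "1.3"), ("logging-api", "1.1")],
        "link": [("test-harness", "4.0")],
    },
    ("jxpath", "1.3"): {"compile": [("logging-api", "1.0")], "link": []},
    ("logging-api", "1.1"): {"compile": [], "link": []},
    ("logging-api", "1.0"): {"compile": [], "link": []},
    ("test-harness", "4.0"): {"compile": [], "link": []},
}


def reachable(root: ComponentId,
              declared: Dict[ComponentId, Dict[str, List[ComponentId]]]) -> Set[ComponentId]:
    """Walk the declarations from root; this logic lives outside the repository."""
    seen: Set[ComponentId] = set()
    stack = [root]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        entry = declared.get(current, {})
        for kind in ("compile", "link"):
            stack.extend(entry.get(kind, []))
    return seen


def version_mismatches(root: ComponentId,
                       declared: Dict[ComponentId, Dict[str, List[ComponentId]]]) -> Dict[str, List[str]]:
    """Report components that appear at more than one version. No resolution is attempted."""
    versions = defaultdict(set)
    for name, version in reachable(root, declared):
        versions[name].add(version)
    return {name: sorted(found) for name, found in versions.items() if len(found) > 1}
```

Calling version_mismatches(("app-lib", "2.0"), DECLARED) on the sample data flags logging-api as present at both 1.0 and 1.1; deciding what to do about that is left to a person or a build tool.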
Develop a provenance model that objectively captures aspects of the code base such as
- Compiled in house
- Declared external dependencies match same revs used to compile internally
- Not compiled in house but SHA1/MD5 digest captured and matches a reputable source for the component
- License type rationalization (GPL, MIT, etc.)
- Test driver coverage / quality
- Static / lexical analysis performed
- Dynamic / path / mutation analysis performed
Together, these objective inputs can be used to form a subjective interpretation of the "level of security." Different subjective models are permitted ("one man's red flag is another man's yellow") but the objective inputs are the same. A key feature of the model is that it accommodates components that have not been compiled in house but rather have been imported directly as binaries, which is undesirable but nevertheless transparent and manageable in this model. This enables the entire footprint to be managed consistently and offers plenty of runway for a "Continuous Open Source Improvement" program wherein, periodically, the set of "less secure" components is assessed for sensitivity/uptake and the highest priority items are "beefed up." If the component repository contains compile-time and link-time dependencies, then a utility can be easily crafted to determine the overall posture of any dependency graph of components.
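One way to picture the split between objective inputs and subjective interpretation is the Python sketch below. The field names, the two scoring models, the red/yellow/green scale, and the "worst rating wins" rule for a set of components are all assumptions made for this example; they are one possible reading of the model, not a prescribed scheme.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass(frozen=True)
class ProvenanceFacts:
    """Objective inputs only; no judgement is stored here."""
    compiled_in_house: bool
    deps_match_build_revs: bool
    digest_matches_reputable_source: bool
    license: Optional[str]            # e.g. "GPL", "MIT"; None if unknown
    test_coverage: Optional[float]    # 0.0 to 1.0; None if unknown
    static_analysis_done: bool
    dynamic_analysis_done: bool


RATINGS = ("green", "yellow", "red")  # ordered best to worst


def strict_model(f: ProvenanceFacts) -> str:
    """A hypothetical strict reading: anything not built in house is red."""
    if f.license is None or not f.compiled_in_house:
        return "red"
    if f.deps_match_build_revs and f.static_analysis_done and f.dynamic_analysis_done:
        return "green"
    return "yellow"


def lenient_model(f: ProvenanceFacts) -> str:
    """A hypothetical lenient reading: a verified binary import is yellow."""
    if f.license is None:
        return "red"
    if f.compiled_in_house:
        return "green"
    return "yellow" if f.digest_matches_reputable_source else "red"


def overall_posture(components: Iterable[ProvenanceFacts],
                    model: Callable[[ProvenanceFacts], str]) -> str:
    """Posture of a set of components, taken here as the worst individual rating."""
    worst = "green"
    for facts in components:
        rating = model(facts)
        if RATINGS.index(rating) > RATINGS.index(worst):
            worst = rating
    return worst


# Sample data (made up): one component built in house, one imported as a binary.
built = ProvenanceFacts(True, True, False, "Apache-2.0", 0.8, True, True)
imported = ProvenanceFacts(False, False, True, "MIT", None, False, False)
```

With the sample data, the same two fact records come out yellow overall under lenient_model and red under strict_model, which is the "one man's red flag" effect: the inputs are shared and only the interpretation differs.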
Establish a sandbox compilation environment to ensure that all source is compiled, scanned, analyzed, and tested in exactly the same environment, no matter who is performing the work. Many frameworks (such as Vagrant) can be used to do this.
Modify software build environments to consume components from the component repository. Ideally, the build should be able to access the repo directly, but it is acceptable to copy components from the repo as long as sufficient indexing material is also copied to provide an unambiguous link back to the repo. Specifically, a Maven repo server becomes the slave to the master repo. Do not fall into the trap of letting a Maven repo server be the master component repository for open source! Open source management is both broader than Java and deeper than just housing source and build artifacts.
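For the copy case, the "indexing material" could be as simple as a small sidecar file that travels with the artifact. The Python sketch below assumes a JSON sidecar named after the artifact with a .provenance.json suffix and a SHA-256 digest; the file naming, the field names, and the function names are all invented for this example, and the digest algorithm could just as well be SHA1 or MD5.

```python
import hashlib
import json
import shutil
from pathlib import Path


def file_digest(path: Path, algorithm: str = "sha256") -> str:
    """Hex digest of a file, read in chunks."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()


def export_component(artifact: Path, dest_dir: Path,
                     component: str, version: str, build_id: str) -> Path:
    """Copy an artifact out of the repo together with a sidecar index file."""
    dest_dir.mkdir(parents=True, exist_ok=True)
    copied = dest_dir / artifact.name
    shutil.copyfile(artifact, copied)
    index = {
        "component": component,        # identifies the entry in the repo
        "version": version,
        "build_id": build_id,          # hypothetical label for one build output
        "digest_algorithm": "sha256",
        "digest": file_digest(artifact, "sha256"),
    }
    index_path = dest_dir / (artifact.name + ".provenance.json")
    index_path.write_text(json.dumps(index, indent=2, sort_keys=True))
    return index_path


def verify_export(copied: Path) -> bool:
    """Check that a copied artifact still matches its sidecar index."""
    index_path = copied.with_name(copied.name + ".provenance.json")
    index = json.loads(index_path.read_text())
    return file_digest(copied, index["digest_algorithm"]) == index["digest"]
```

A build could call verify_export before consuming a copied jar, and anyone holding the copy can read the sidecar to find the exact component, version, and build it came from.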
If you have a sufficiently large development team footprint (more than 50 active developers) and open source is a large and vital component of your software, consider making one of the developers a full time Provenance Man to continuously improve the quality of the footprint, address version conflict issues, security issues, etc.

With such a framework in place, you will be able to accurately and reliably attest to the provenance of open source software running in production.
