Jeroen's blog

Wikibase DataModel 0.7.3 released

I am happy to announce the 0.7.3 release of Wikibase DataModel.

Wikibase DataModel is the canonical PHP implementation of the Data Model at the heart of the Wikibase software. It is primarily used by the Wikibase MediaWiki extensions, though it has no dependencies whatsoever on these or on MediaWiki itself.

This release contains a new API for working with labels, descriptions and aliases, and it deprecates the old API for those.

At the core of the new API is a simple value object representing an until now unnamed domain concept: the part of an Item or a Property that holds all those labels, descriptions and aliases. The name we gave it is Fingerprint. (Credits and blame for the name go to Lydia :)) The Entity class now has a getFingerprint and a setFingerprint method. Fingerprint itself has getLabels, getDescriptions and getAliases methods. The first two return a TermList, which is made up of Term objects. The latter returns an AliasGroupList, which consists of AliasGroup objects.
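A rough sketch of how these pieces fit together (the namespaces and constructor signatures below are recalled from the documentation and may differ slightly in 0.7.3):

```php
use Wikibase\DataModel\Entity\Item;
use Wikibase\DataModel\Term\AliasGroup;
use Wikibase\DataModel\Term\AliasGroupList;
use Wikibase\DataModel\Term\Fingerprint;
use Wikibase\DataModel\Term\Term;
use Wikibase\DataModel\Term\TermList;

// Construct a Fingerprint holding one label, one description and one alias group.
$fingerprint = new Fingerprint(
	new TermList( array( new Term( 'en', 'Berlin' ) ) ),
	new TermList( array( new Term( 'en', 'capital city of Germany' ) ) ),
	new AliasGroupList( array( new AliasGroup( 'en', array( 'Berlin, Germany' ) ) ) )
);

// Attach it to an entity and read it back via the new accessors.
$item = Item::newEmpty();
$item->setFingerprint( $fingerprint );

foreach ( $item->getFingerprint()->getLabels() as $label ) {
	echo $label->getLanguageCode() . ': ' . $label->getText() . "\n";
}
```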

Why these changes? What was wrong with the old approach?

The old API is defined by no fewer than 17 methods on Entity. They add a lot of code to it, contributing to Entity being the class with the highest complexity in DataModel. That our core (DDD) entity is also our scariest class is obviously not good. Moving the responsibility to a value object inside of Entity is a clean way to tackle this problem, and is also more in line with how the other parts of Entities, such as the list of Claims, are handled. On top of the complexity issue, the old API also fares badly in terms of interface segregation. Most code dealing with Terms (i.e. labels, descriptions and aliases) does not care about the rest of Entity. Hence it makes no sense to have to feed such code an entire Entity object when a Fingerprint, or one of the objects composited into it, would suffice.

[Image: class complexity metrics for Wikibase DataModel]

Another important reason to move in this direction is that I want to see Entity go. If you are familiar with the project, this might seem like a quite preposterous statement. Kill Entity? How can I possibly think that is a good idea, and how will the code be able to still work afterwards? In short, Entity tries to unify things that are quite different in a single class hierarchy. The difference between those objects creates a lot of weirdness. Entity contains a list of Claim, while Item, one of its derivatives, requires a list of Statement, Statement being a derivative of Claim. And not all Entities we foresee will have Claims, a Fingerprint, etc. The only thing they will all have is an EntityId, and all we need to facilitate that is a simple HasEntityId interface. All the rest, including the Fingerprint, can be composited in by classes such as Item and Property that implement the appropriate interfaces. Those changes are for the next big release, so if you are using DataModel, it is recommended you switch to the new Fingerprint API as soon as possible.
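To illustrate the direction, here is a purely speculative sketch of what such segregated interfaces could look like; none of this exists in DataModel 0.7.3:

```php
use Wikibase\DataModel\Term\Fingerprint;

// Illustrative interfaces only, sketching the proposed direction.
interface HasEntityId {
	public function getId(); // returns an EntityId (or null if none is assigned yet)
}

interface HasFingerprint {
	public function getFingerprint(); // returns a Fingerprint
	public function setFingerprint( Fingerprint $fingerprint );
}

// Item and Property would then implement whichever of these interfaces apply,
// composing in a Fingerprint instead of inheriting term handling from Entity.
```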

And I’m still not done with the list – wow. A final reason for the change is that the old API was not only ugly (weird function signatures in places), it was also quite inconsistent in its implementation. It has had TODOs in there since the start of the project that state things such as “figure out if we care about duplicates” and “are empty strings valid?”. The new implementation properly takes care of these things, and does so in all cases where it should, rather than only in assorted random functions. That those old TODOs remained there for nearly two years goes to show how likely it is that people “go back to fix it”.

You can do everything you could do with the old implementation with the new one. There are however some things that might be slightly more cumbersome for now, especially in code that takes in entire Entity objects while only needing a Fingerprint. As we migrate to the new implementation, it will become clear which convenience functions will pay for themselves, so those will likely be added in the near future. At the same time, several tasks are already easier to do now. The new value objects will also likely provide a good place to put functionality that was oddly placed before.

For a list of changes, see the release notes.

And in anticipation of the next release, have some WMDE kittens:

[Photo: WMDE kittens]

Posted in Programming, Software

Diff 1.0 released!

I’m very happy to announce the 1.0 release of the PHP Diff library.

Diff is a small PHP library for representing differences between data structures, computing such differences, and applying them as a patch. For more details see the usage instructions.
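To give an idea of what that looks like in practice, here is a small sketch; the class names are as I recall them from the usage instructions of the PSR-4 compliant 1.0 release, so double-check against the documentation:

```php
use Diff\Differ\MapDiffer;
use Diff\DiffOp\Diff\Diff;
use Diff\Patcher\MapPatcher;

$old = array( 'en' => 'foo', 'de' => 'bar' );
$new = array( 'en' => 'foo', 'de' => 'baz', 'nl' => 'bah' );

// Compute the difference between two associative arrays.
$differ = new MapDiffer();
$diffOps = $differ->doDiff( $old, $new );

// Apply the resulting operations to the original structure as a patch.
$patcher = new MapPatcher();
$patched = $patcher->patch( $old, new Diff( $diffOps ) );

// $patched should now equal $new.
```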

I created this library as part of the Wikibase software for the Wikidata project. While it is used by Wikibase, it’s fully independent of it (and has no dependencies on any other PHP code). The first release was in September 2012, and since then it has seen a new release with refinements and new features every few months. As we’ve been using it on Wikidata.org for over a year, and there are no known bugs with the library, it can be considered quite stable and robust.

The 1.0 release does not add anything in terms of functionality. The primary change it brings is PSR-4 compliance. It also removes some old constructs that had been deprecated for a long time already. A detailed list of changes can be found in the release notes.

Posted in Software

Wikidata Code Review 2014

One year ago we had the Qafoo guys come to the Wikimedia Deutschland office to review the software we had written for the Wikidata project. There is a summary of the review as well as a big PDF with all the details.

This week I presented a follow-up review to the Wikidata team. The primary goal of this review was to make the changes since the last review visible, and to suggest how to improve things further. Check out the slides of the review.

The first part of the presentation looks at some simple code metrics for Wikibase.git and how they changed over the last year. After that the main part of the presentation starts. This part looks at individual points from the 2013 review, the progress we made on them, and what can still be done. The end of the presentation looks at how we tackled the action items from the 2013 review, or rather how we did not, and also lists a number of important improvements we did make that were not mentioned in that review.

Posted in Programming

Maps 3.0.1 and Semantic Maps 3.0.3 released

I just released version 3.0.3 of Semantic Maps, and recently released Maps 3.0.1 as well. Both these releases bring some minor fixes and translation updates. You can get the new versions by running “composer update”.

Posted in Software

Status of the new Wikibase (de)serialization code

A quick update on the status of the new serialization and deserialization code for Wikibase, the software behind Wikidata.

For a long time now, we’ve had two serialization formats. One intended for external usage, and one intended for internal usage. The former one is the format our web API uses. The latter one is the format in which entities are stored in the Wikibase Repo database, and is what ends up in the current dumps.
The existing code responsible for serialization and deserialization of both formats is not reusable and carries a high degree of technical debt. To tackle both the reusability and the design problems, we started work on two new components: one for the external format and one for the internal one.

* External one: https://github.com/wmde/WikibaseDataModelSerialization
* Internal one: https://github.com/wmde/WikibaseInternalSerialization

They are now both essentially feature complete. The next step will be to replace usage of our existing code with usage of these new components, so the old code can be removed. Once this is done, the components will be 100% feature complete and very well tested, at which point their 1.0 versions will be released.
For a description of the responsibility of both components and an explanation of how to use them, see their README files.
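To give a very rough idea of how the external component is intended to be used, here is a sketch; the factory and method names are recalled from the component’s README and may well differ in the released versions, so treat everything below as illustrative rather than authoritative:

```php
use DataValues\Serializers\DataValueSerializer;
use Wikibase\DataModel\SerializerFactory;

// $item is a Wikibase DataModel Item; the method names here are assumptions.
$serializerFactory = new SerializerFactory( new DataValueSerializer() );
$entitySerializer = $serializerFactory->newEntitySerializer();

$serialization = $entitySerializer->serialize( $item );
```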

Posted in Software

Big Ball of Mud

A while back I somehow stumbled upon a little paper about the Big Ball of Mud pattern.

This was an interesting and amusing read. In this blog post I’m adding some additional thoughts from my side, on things I found to be missing or misleadingly explained, or that I disagree with altogether. To be fair to the original authors, the paper is 17 years old. I only noticed this after I finished reading it, though it does explain why certain buzzwords such as “agile” and “scrum” are not used, even though the authors are clearly describing the same concepts.

Don’t know what the Big Ball of Mud pattern is about? You can check the Wikipedia article or simply read the introductory quotes from the paper:

While much attention has been focused on high-level software architectural patterns, what is, in effect, the de-facto standard software architecture is seldom discussed. This paper examines the most frequently deployed architecture: the BIG BALL OF MUD. A BIG BALL OF MUD is a casually, even haphazardly, structured system. Its organization, if one can call it that, is dictated more by expediency than design. Yet, its enduring popularity cannot merely be indicative of a general disregard for architecture.

A Big Ball of Mud is a haphazardly structured, sprawling, sloppy, duct-tape-and-baling-wire, spaghetti-code jungle. These systems show unmistakable signs of unregulated growth, and repeated, expedient repair. Information is shared promiscuously among distant elements of the system, often to the point where nearly all the important information becomes global or duplicated. The overall structure of the system may never have been well defined. If it was, it may have eroded beyond recognition. Programmers with a shred of architectural sensibility shun these quagmires. Only those who are unconcerned about architecture, and, perhaps, are comfortable with the inertia of the day-to-day chore of patching the holes in these failing dikes, are content to work on such systems.

If you are able to distinguish good code from spaghetti code and have been in the software development field for a while, you will undoubtedly also come to the conclusion that this is indeed the most pervasive “architecture pattern”. The paper outlines some interesting factors and contains several analogies on both the cause and the effect of this. My main interest as a software craftsmanship advocate is to better understand these, so they can be dealt with more effectively.

Architecture is often seen as a luxury or a frill, or the indulgent pursuit of lily-gilding compulsives who have no concern for the bottom line.

This resonates strongly with what I have observed over my modest career so far. In fact, I had this attitude towards software design and architecture for many years, and I got out of it almost by accident. The pervasiveness of this attitude makes it very hard to realize it is short-sighted. Education is likely to blame as well to some extent. At least the education I got completely failed to stress the importance of good design.

It seems to me that many people think that when one needs to put mental effort into designing a system, the same effort will be required to understand it later on. While it is certainly possible to end up there through inexperience, or intentionally when one tries to obfuscate a system, the opposite is true when an experienced person pursues good design. The effort put into the design is to make it simple, not complicated. A well thought out design will be simpler than a pile of code that was written without thought about its organization.

Indeed, an immature architecture can be an advantage in a growing system because data and functionality can migrate to their natural places in the system unencumbered by artificial architectural constraints. Premature architecture can be more dangerous than none at all, as unproved architectural hypotheses turn into straightjackets that discourage evolution and experimentation.

I certainly do see the point being made here and agree that architectural constraints can be very dangerous to a project. What I absolutely object to in this paragraph is the notion that a mature architecture contains these constraints while an immature one does not. One of the main tasks of architecture is to avoid making choices prematurely and to delay them as long as possible.

Example: an architectural decision is to abstract the storage mechanism of an application. Now the application can be developed without creating a full MySQL implementation or whatnot. As the storage is abstracted, you can develop most of the app using some simple in-memory data. Then later on, when you have much more experience with the project and its actual needs, you can decide which technology to use for the real implementation. And you will be able to use different implementations in different contexts. Contrast this with the Big Ball of Mud approach, where no abstraction is used. In that case the code binds directly to the implementation, and you not only make the decision to go with a particular technology, you also end up binding to it in such a way that you greatly constrain the evolution options of the project.
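A minimal sketch of what such a storage abstraction could look like; the names are invented for illustration and not taken from any particular project:

```php
// Hypothetical storage abstraction; the interface is what the rest of the app depends on.
interface ItemStore {
	public function save( $id, $data );
	public function load( $id );
}

// Trivial in-memory implementation: enough to build and test most of the app against.
class InMemoryItemStore implements ItemStore {
	private $items = array();

	public function save( $id, $data ) {
		$this->items[$id] = $data;
	}

	public function load( $id ) {
		return isset( $this->items[$id] ) ? $this->items[$id] : null;
	}
}

// A MySQL-backed ItemStore can be added later, once the real requirements
// are known, without touching any code that depends only on the interface.
```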

Architecture is a hypothesis about the future that holds that subsequent change will be confined to that part of the design space encompassed by that architecture

This is much in line with the previous quote. Architecture is definitely a hypothesis about the future. Doing architecture well involves balancing many forces and probabilities. And as already mentioned, one of the main goals is to keep one’s options open. While one can guess at the future and distinguish between the likely and the unlikely, no one can predict it. What a good architect does not do is put all the money on a specific bet, unless that cannot reasonably be avoided.

BIG BALL OF MUD might be thought of as an anti-pattern, since our intention is to show how passivity in the face of forces that undermine architecture can lead to a quagmire. However, its undeniable popularity leads to the inexorable conclusion that it is a pattern in its own right. It is certainly a pervasive, recurring solution to the problem of producing a working system in the context of software development. It would seem to be the path of least resistance when one confronts the sorts of forces discussed above.

The path of least resistance.

The paper mentions somewhere that creating good architecture requires effort, as is the case with all things that decrease entropy. This effort is well worth it in many cases. If you don’t agree, I’ll not be visiting your home.

It is however not the quickest path. The resistance in writing the initial code is very low, though you will quickly pay for this with technical debt, debugging, and trying to figure out what the code you wrote a while back actually intends to accomplish. It seems that many developers do not realize these pains are mostly caused by the “shortcuts” they are taking and can largely be avoided.

When it comes to software architecture, form follows function. The distinct identities of the system’s architectural elements often don’t start to emerge until after the code is working.

Domain experience is an essential ingredient in any framework design effort. It is hard to try to follow a front-loaded, top-down design process under the best of circumstances. Without knowing the architectural demands of the domain, such an attempt is premature, if not foolhardy. Often, the only way to get domain experience early in the lifecycle is to hire someone who has worked in a domain before from someone else.

Domain Driven Design advocates iterative refinement of the model via the process of knowledge crunching in which the domain designers consult the domain experts.

Indeed some engineers are particularly skilled at learning to navigate these quagmires, and guiding others through them. Over time, this symbiosis between architecture and skills can change the character of the organization itself, as swamp guides become more valuable than architects. As per CONWAY’S LAW [Coplien 1995], architects depart in futility, while engineers who have mastered the muddy details of the system they have built in their images prevail. [Foote & Yoder 1997] went so far as to observe that inscrutable code might, in fact, have a survival advantage over good code, by virtue of being difficult to comprehend and change. This advantage can extend to those programmers who can find their ways around such code.

This definitely amuses me, and goes a long way in explaining why the most “senior” technical people at some organizations don’t have a clue about the basics of software design, and are then actually revered by their fellow developers for the mess they created.

During the PROTOTYPE and EXPANSIONARY PHASES of a systems evolution, expedient, white-box inheritance-based code borrowing, and a relaxed approach to encapsulation are common. Later, as experience with the system accrues, the grain of the architectural domain becomes discernible, and more durable black-box components begin to emerge. In other words, it’s okay if the system looks at first like a BIG BALL OF MUD, at least until you know better.

I definitely disagree with this. It is never OK for your system to look like a big ball of mud, unless you are creating a prototype (that will not go into production) or something similar. Should everything be shiny and perfect from the start? No, of course not. That is not possible. And it is often fine to create a bit of a mess in places, as long as it is under control. (Managed technical debt.) By the time your code qualifies as a big ball of mud, the technical debt will no longer be under control, and you will have a serious problem. Going from big ball of mud to sane design is VERY difficult. Thus advising that it is fine to start out with a Big Ball of Mud is rubbish.

They also can emerge as gradual maintenance and PIECEMEAL GROWTH impinges upon the structure of a mature system. Once a system is working, a good way to encourage its growth is to KEEP IT WORKING. One must take care that this gradual process of repair doesn’t erode its structure, or the result can be a BIG BALL OF MUD.

Yes, one should be vigilant and not let code rot go so far that a system turns into a big ball of mud. Continuous refactoring and following the so-called “boy scout rule” is a big part of the answer to this.

The PROTOTYPE PHASE and EXPANSION PHASE patterns in [Foote & Opdyke 1995] both emphasize that a period of exploration and experimentation is often beneficial before making enduring architectural commitments.

I agree with this, though one should keep in mind that the goal of architecture is not to pin things down and make decisions that are hard to change later.

Time, or a lack thereof, is frequently the decisive force that drives programmers to write THROWAWAY CODE. Taking the time to write a proper, well thought out, well documented program might take more time that is available to solve a problem, or more time that the problem merits.

THROWAWAY code is often written as an alternative to reusing someone else’s more complex code. When the deadline looms, the certainty that you can produce a sloppy program that works yourself can outweigh the unknown cost of learning and mastering someone else’s library or framework.

There is nothing wrong with writing prototype code just to see how something works, or to try out if a particular approach is effective.

Prototypes should be treated as prototypes though, and not be deployed as if they were production code. Deploying them anyway might often be the easy thing for developers to do, though it is also clearly irresponsible. You are introducing a huge liability into your company, or handing it over to your client. This liability is likely to cost them a lot of money as maintenance costs go through the roof, further development takes ages, and customers switch to less buggy software. This is compounded by the inability of your customer to know the real state of the code and realize what is bound to happen down the road.

Other forms of throwaway code include katas, a structured form of deliberate practice for developers, and “spikes”, little pieces of code you typically write to explore an API before writing the real deal.

Keeping them on the air takes far less energy than rewriting them.

Here I’d like to add that rewriting a Big Ball of Mud is likely to be a bad idea. Unless the system in question is very small, doing so is probably not practical. An iterative process of improving the system by breaking things apart, bringing things under test, etc., is the recommended approach when dealing with such legacy systems.

Master plans are often rigid, misguided and out of date. Users’ needs change with time.

If “master plans” refers to big design upfront, then yeah, I agree. As noted before, it is critical that one thinks about design and architecture in a way that prevents prematurely committing to decisions. Clients don’t know what they really need at the start, so an iterative approach is the logical thing to go with.

Successful software attracts a wider audience, which can, in turn, place a broader range of requirements on it. These new requirements can run against the grain of the original design. Nonetheless, they can frequently be addressed, but at the cost of cutting across the grain of existing architectural assumptions.

This again portrays architecture as a source of rigidity. It’s true that sometimes one finds out that a specific design in place is not going to work, and that it needs to be evolved or removed. To make this unavoidable process as easy as possible, one should write clean well designed code. If you have a tangled spaghetti mess, then it will be more rigid, and much less able to deal with changing requirements. Indeed, a good domain model tends to become more and more powerful over time, opening up many valuable opportunities for the business.

When designers are faced with a choice between building something elegant from the ground up, or undermining the architecture of the existing system to quickly address a problem, architecture usually loses. Indeed, this is a natural phase in a system’s evolution [Foote & Opdyke 1995]. This might be thought of as messy kitchen phase, during which pieces of the system are scattered across the counter, awaiting an eventual cleanup. The danger is that the clean up is never done. With real kitchens, the board of health will eventually intervene. With software, alas, there is seldom any corresponding agency to police such squalor. Uncontrolled growth can ultimately be a malignant force.

There certainly are moments in which parts of a system get a little messy. Sometimes it makes pragmatic sense to just copy-paste something, or to add a method where it does not really belong. Sometimes – not most of the time. And if you have a well thought out system with high cohesion and low coupling, this will be the case less often than when it lacks perceivable design. One should be careful about where this is done. If it is not behind an abstraction, the rot can easily spread to the rest of the system. It is also wise to take into account who is working on the codebase, in particular what their experience and attitude is. Inexperienced or badly disciplined people can easily not see, or not care, that they are creating bindings to something that should be cleaned up first.

When constant cleaning and refactoring is applied, messes are kept small and under control. With less vigilance they can easily cause serious problems in the entire codebase.

Maintenance needs have accumulated, but an overhaul is unwise, since you might break the system. There may be times where taking a system down for a major overhaul can be justified, but usually, doing so is fraught with peril. Once the system is brought back up, it is difficult to tell which from among a large collection of modifications might have caused a new problem. Therefore, do what it takes to maintain the software and keep it going. Keep it working.

As already mentioned, I am of the opinion that big rewrites are indeed a bad idea. One should not let a system deteriorate into a Big Ball of Mud in any case. When dealing with a legacy system is required, a careful step-by-step approach is advised: first bring the relevant part under test, then slowly refactor it. A great book on the subject is Working Effectively with Legacy Code by Michael Feathers.

If you can’t easily make a mess go away, at least cordon it off. This restricts the disorder to a fixed area, keeps it out of sight, and can set the stage for additional refactoring. One frequently constructs a FAÇADE [Gamma et. al. 1995] to put a congenial “pretty face” on the unpleasantness that is SWEPT UNDER THE RUG.

This is a very effective and pragmatic approach. To break dependencies and be able to work with clean interfaces, I often create new interfaces (the language constructs) and then make trivial implementations that act as adapters for the old code.
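A small sketch of that technique, with the interface and legacy class names invented for illustration:

```php
// The new, clean interface the rest of the code gets to depend on.
interface UserRepository {
	public function getUserName( $userId );
}

// Hypothetical legacy class buried somewhere in the big ball of mud.
class LegacyUserManagerGodClass {
	public function doEverything( $action, $args ) {
		// ... hundreds of lines of tangled code ...
		return 'SomeUser';
	}
}

// Trivial adapter: puts a pretty face on the mess without rewriting it,
// and gives us a seam for further clean-up later.
class LegacyUserRepository implements UserRepository {
	private $legacyManager;

	public function __construct( LegacyUserManagerGodClass $legacyManager ) {
		$this->legacyManager = $legacyManager;
	}

	public function getUserName( $userId ) {
		return $this->legacyManager->doEverything( 'getUserName', array( $userId ) );
	}
}
```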

While using this technique, I have gotten some complaints from people along the lines of “you are not really fixing the problem”. And indeed, the mess is not gone. This is merely the first step in making it go away, one which both decreases the damage done by the mess and enables further clean-up.

The second was that the stadium’s attempt to provide a cheap, general solution to the problem of providing a forum for both baseball and football audiences compromised the needs of both. Might there be lessons for us about unexpected requirements and designing general components here?

There is definitely a lesson here yes. The problem however does not lie with generality. This example strikes me much more as a violation of the Single Responsibility Principle, as well as bad interface segregation. Code on different levels should do one thing, and do it well. This applies to functions, classes, components and systems.

This also makes me think of YAGNI. I often see people make things at the function level more generic than they need to be, by adding support for things “that might be needed”. There are of course cases where this makes sense, as some things will be very hard to change later. However, in many other cases there just is no need for handling some arbitrary whatever, or for taking an optional argument which is never passed. This then essentially ends up being hard-to-spot dead code, which people often do not realize is dead code, and then end up doing crazy stunts to support.

Change: Even though software is a highly malleable medium, like Fulton County Stadium, new demands can, at times, cut across a system’s architectural assumptions in such a ways as to make accommodating them next to impossible. In such cases, a total rewrite might be the only answer.

Only if your architecture sucks balls. Changing requirements might cause things to be obsoleted, new things to be written and old things to be rearranged. If a total rewrite is needed, you probably want to think about firing your “architect”.

That concludes my remarks on this paper. Let’s go clean up some code :)

Posted in Programming

Semantic Extra Special Properties 1.0 released!

I am happy to announce the 1.0 release of the Semantic Extra Special Properties extension!

This release fixes various issues and makes the extension compatible with the latest MediaWiki, Semantic MediaWiki and PHP versions. It adds several new special properties such as PAGEID and EXIFDATA, as well as providing performance improvements. A more verbose, though also more technical, overview of the changes can be found here.
For upgrading you will need to run update.php and it is recommended you also run SMW_refreshData.php. Documentation on how to configure the extension can now be found here.

As of this release it is also possible to install the extension via Composer. (Manual installation is still supported.) The package name is “mediawiki/semantic-extra-special-properties”.
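For those new to Composer-based installation, requiring the extension in your wiki’s composer.json looks roughly like this (the version constraint is just an example):

```json
{
	"require": {
		"mediawiki/semantic-extra-special-properties": "~1.0"
	}
}
```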
Many kudos go to MWJames for doing essentially all of the development work in this new version. I’d also like to thank Karsten for helping with the documentation and testing.
Posted in Software

I did not do it again!

TL;DR: [image of a fail stamp]

At some point I lost the database of this blog. I can’t recall the reason or the exact date, though it was probably somewhere in 2010. Using the Google cache I was able to recover most of the posts.

A few weeks back the database of this blog managed to corrupt itself somehow. (I suspect a dist-upgrade on my server is to blame for that, though I’m not sure.) After spending several hours trying to restore the database and learning many things about InnoDB that I’d rather not have to know, I figured I’d have to resort to manually restoring my blog again.

Why manually? Didn’t I have any backups?! I’m quite lazy and had not gotten around to automatically making backups of my blog. Since I’m only adding content here rather slowly, manually doing a backup every few months is just fine. And I actually did that. Those backups were not backed up though, so I lost them when one of my disks crashed a few weeks ago. Ugh.

The timing of this was also unfortunate, as I used to have a recent SQL dump of my database lying around from when I moved my blog a few months ago. It seems I deleted it in the meantime though.

Funnily enough I was able to restore all posts up to May 2011 using a dump from my original blog database on my old webhost. The only reason it’s still there is that (a) I’m too lazy to clean it up, and (b) I’m too lazy to cancel the contract I don’t really need anymore. Sometimes laziness can be a real asset :)

The more recent posts I restored manually from the cache of my RSS reader. Sadly enough that leaves a gap from May 2011 to March 2012, which probably contains two dozen blog posts I was not able to restore.

March 2014 edit: Found the remaining posts on archive.org and inserted them \o/

Posted in Uncategorized

MediaWiki extensions to define their MediaWiki compatibility

Over the past year, support for real dependency management has been gradually added to MediaWiki and selected extensions. This support is based on the Composer software.

While extensions have been able to specify their dependencies for a while, such as PHP libraries and other extensions, they were not able to specify the MediaWiki versions they work with. The reason for this is that there is simply no MediaWiki package. Extensions are installed into MediaWiki, as opposed to people installing a MediaWiki package (and its dependencies) into something else. Changing this would resolve the issue, though I’m not sure it is a good idea. What I am sure of is that such a change is not going to be made any time soon, for mainly political reasons. I have now tackled the issue using an alternative approach that requires no change to existing workflows and can be happily ignored by people afraid of evil things such as third party software, namespaces and anonymous functions.

This alternative approach is based on adding a “provide” package link to the root package (this is the package into which extensions are installed). If this provide link points to a mediawiki package with a specific version, then extensions can specify that they need this version, and have this requirement satisfied on installation into the root package. Rather than having this link defined in composer.json, which would mean people would need to update the version there manually or an automatic composer.json modifier would need to be created, the link is added programmatically via a Composer hook.

Extensions will thus be able to specify their compatibility as follows (in their composer.json):
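Something along these lines; the package name and version constraint shown here are illustrative rather than final:

```json
{
	"require": {
		"mediawiki/mediawiki": ">=1.23"
	}
}
```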

The commit adding this capability to MediaWiki is awaiting review on Gerrit. Many thanks go to @beausimensen for providing the basis of this idea and helping with implementation. Thanks also go to @seldaek for tweaking Composer so this approach would actually work.

Posted in Uncategorized

PHP Framework Interoperability Group

Those who have worked with me in recent history know that I’m rather passionate about reuse and interoperability. With that in mind, it should not come as a surprise that I’m very happy to announce that Wikidata and Semantic MediaWiki are now represented on PHP FIG.

PHP FIG stands for “PHP Framework Interoperability Group”. The goal of this group is to enhance communication between the participating PHP projects and thus facilitate better collaboration. The PSR standards are the most obvious product of this. Since Wikidata and SMW are now represented, we have one vote on upcoming standard proposals \o/

For me this is an additional channel via which I can contribute back to the more awesome parts of the wider PHP community.

If you are not familiar with the PSR standards yet, I highly recommend you have a look at them. Also consider following the FIG mailing list.

Posted in Uncategorized