May 05, 2015
Historically, science was considered to be a domain of philosophy referred to as natural philosophy or scientia (this blog is an interesting attempt to discuss the modern state of this relationship). One aspect that remains common to scientific research and philosophy today is the fact that both involve developing and learning languages. This is certainly recognized explicitly within philosophy itself. Indeed, a major theme of 19th and 20th century philosophy is referred to as the linguistic turn. In science, the substantial if not all-encompassing role played by language is often left implicit. Part of the reason for this may be the tendency of scientists to view philosophical issues as being off the table with regard to scientific discourse.
The increasing impact of computer science on all of the other sciences has led to a linguistic turn in science itself. Part of the reason for this is expressed in a beautiful talk by Guy Steele that outlines, among other things, the fundamental challenge in writing the software constituting a compiler. That is, after settling on some primitive instructions constituting a target language, it is necessary to figure out how to automatically translate every linguistic construct from the source language into compositions of the concepts available in the target. As the distance in abstraction between the source and target language increases, this becomes increasingly difficult to do in a single step.
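As a rough illustration of why translation is easier in several steps than in one, here is a minimal sketch of a two-pass compiler for a toy arithmetic language. Everything in it is hypothetical and invented for illustration (the `square` construct, the tuple-based syntax tree, the stack-machine instructions); it is not taken from Steele's talk. The first pass rewrites a high-level construct into simpler source-level constructs, and the second pass lowers what remains into the target's primitive instructions.

```python
# Toy source language: numbers, ("add", a, b), ("mul", a, b), ("square", a).
# Toy target language: stack-machine instructions ("push", n), ("add",), ("mul",).
# All of these names are illustrative assumptions, not a real compiler's API.

def desugar(expr):
    """Pass 1: rewrite ("square", x) into ("mul", x, x); leave the rest alone."""
    if isinstance(expr, (int, float)):
        return expr
    op, *args = expr
    args = [desugar(a) for a in args]
    if op == "square":
        return ("mul", args[0], args[0])
    return (op, *args)

def lower(expr):
    """Pass 2: translate the core tree (only add/mul) into stack instructions."""
    if isinstance(expr, (int, float)):
        return [("push", expr)]
    op, left, right = expr
    if op not in ("add", "mul"):
        raise ValueError(f"unknown core operation: {op}")
    return lower(left) + lower(right) + [(op,)]

def compile_expr(expr):
    """Compose the passes: source -> core language -> target instructions."""
    return lower(desugar(expr))

def run(program):
    """Interpret the target instructions on a simple stack machine."""
    stack = []
    for instr in program:
        if instr[0] == "push":
            stack.append(instr[1])
        else:
            b = stack.pop()
            a = stack.pop()
            stack.append(a + b if instr[0] == "add" else a * b)
    return stack.pop()

program = compile_expr(("square", ("add", 1, 2)))
result = run(program)
```

Because `lower` only ever sees `add` and `mul`, adding a new high-level construct means extending `desugar` alone; this is the sense in which intermediate languages shorten each individual translation step.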
The intuitive reason for this scientific linguistic turn has been the growing desire to automate rote, but increasingly complex, parts of the scientific process. The ability to even imagine beginning such automation has built upon decades of symbiotic research in linguistics, computer science, and cognitive science. However, the natural languages that pervade scientific discourse are fluid, and partly for that reason the process of attempting to formalize the domain-specific languages (DSLs) of different scientific fields may uncover what appear to be serious logical inconsistencies pervading the academic literature. (Prior to the scientific linguistic turn, that literature served as the closest thing to a formalization of any scientific DSL.)
An interesting case study of this process in biology is the development of diagrammatic representations of molecular interactions, which some might argue has culminated in the Systems Biology Graphical Notation (SBGN), introduced in 2009. Ever since the dawn of molecular biology, biologists have drawn diagrams of molecular interactions to communicate what they have learned from previous experiments and what they hypothesize about future ones. One of the most cited examples is the diagram codifying the logic of the lac operon.
What is important is not so much the diagram itself, but rather what it represents. Beyond diagrams alone, there are several underlying XML specifications, such as the Systems Biology Markup Language (SBML) and BioPAX, that can encode the information a diagram is supposed to represent in a consistent manner, so that it can be processed automatically when reasoning about complicated networks of interactions. For example, here is the BioPAX representation of some of the data needed to encode knowledge about beta-galactosidase synthesis:
<bp:PhysicalEntity rdf:about="I2">
  <bp:displayName rdf:datatype="http://www.w3.org/2001/XMLSchema#string">PartialBetagalactosidase</bp:displayName>
</bp:PhysicalEntity>
<bp:Protein rdf:about="B">
  <bp:xref rdf:resource="http://identifiers.org/uniprot/P00722" />
  <bp:displayName rdf:datatype="http://www.w3.org/2001/XMLSchema#string">Betagalactosidase</bp:displayName>
</bp:Protein>
<bp:BiochemicalReaction rdf:about="conversion_r_b2_i2">
  <bp:xref rdf:resource="http://identifiers.org/obo.go/GO:0006412" />
  <bp:participantStoichiometry rdf:resource="LEFT_0_conversion_r_b2_i2_I2_STOICHIOMETRY" />
  <bp:participantStoichiometry rdf:resource="RIGHT_0_conversion_r_b2_i2_B_STOICHIOMETRY" />
  <bp:left rdf:resource="LEFT_0_conversion_r_b2_i2_I2" />
  <bp:displayName rdf:datatype="http://www.w3.org/2001/XMLSchema#string">Beta_galactosidase_synthesis</bp:displayName>
  <bp:right rdf:resource="RIGHT_0_conversion_r_b2_i2_B" />
</bp:BiochemicalReaction>
Machine-readable encodings of this kind matter because the biological questions on the docket for the 21st century require integrating more information than anyone could hope to hold in mind at once.
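To make "automatically processed" a little more concrete, here is a minimal sketch that reads a trimmed copy of the BioPAX fragment above using Python's standard library. Two things are assumptions rather than part of the original data: the enclosing `rdf:RDF` root element with its namespace declarations (the fragment as shown has no root), and the helper name `summarize_reactions`, which is made up for this example. The `xref` and stoichiometry elements are omitted for brevity.

```python
import xml.etree.ElementTree as ET

# Standard namespace URIs for BioPAX Level 3 and RDF.
BP = "http://www.biopax.org/release/biopax-level3.owl#"
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# Trimmed copy of the fragment, wrapped in an assumed rdf:RDF root so that
# it parses as a complete XML document.
FRAGMENT = f"""<rdf:RDF xmlns:rdf="{RDF}" xmlns:bp="{BP}">
  <bp:PhysicalEntity rdf:about="I2">
    <bp:displayName>PartialBetagalactosidase</bp:displayName>
  </bp:PhysicalEntity>
  <bp:Protein rdf:about="B">
    <bp:displayName>Betagalactosidase</bp:displayName>
  </bp:Protein>
  <bp:BiochemicalReaction rdf:about="conversion_r_b2_i2">
    <bp:left rdf:resource="LEFT_0_conversion_r_b2_i2_I2" />
    <bp:displayName>Beta_galactosidase_synthesis</bp:displayName>
    <bp:right rdf:resource="RIGHT_0_conversion_r_b2_i2_B" />
  </bp:BiochemicalReaction>
</rdf:RDF>"""

def summarize_reactions(xml_text):
    """Return one dict per BiochemicalReaction: id, name, left and right refs."""
    root = ET.fromstring(xml_text)
    reactions = []
    for rxn in root.findall(f"{{{BP}}}BiochemicalReaction"):
        reactions.append({
            "id": rxn.get(f"{{{RDF}}}about"),
            "name": rxn.findtext(f"{{{BP}}}displayName"),
            "left": [e.get(f"{{{RDF}}}resource") for e in rxn.findall(f"{{{BP}}}left")],
            "right": [e.get(f"{{{RDF}}}resource") for e in rxn.findall(f"{{{BP}}}right")],
        })
    return reactions

reactions = summarize_reactions(FRAGMENT)
for rxn in reactions:
    print(rxn["name"], ":", rxn["left"], "->", rxn["right"])
```

A real pipeline would more likely use an RDF or BioPAX-aware library than raw XML parsing, but even this small example shows how a reaction's participants become data that a program can traverse rather than shapes a reader must interpret.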
There is much more that could be said about the development of formalized representation in the natural sciences. It is interesting to speculate about how these languages will evolve and what role they will play in improving our understanding of, and communication about, natural systems. To make better use of these languages, their expressive power needs to be carefully considered. Contrary to some intuition, this is not a case where more is always better. There is a delicate balance to be struck between the amount of automated inference that can be performed and the level of detail that can be encoded in repositories like the BioModels Database. While the BioModels Database is progressive in that it allows for the representation of process as opposed to static structure alone, as models grow in size they can become too computationally complex to be reasoned about in an automated fashion [4,5]. What is most important is that explicit efforts to develop at least semi-formal language standards increase the fidelity with which information about biological systems can be communicated and integrated.
Written by Cameron Smith who works on understanding biological systems in New York.