Discussion space - W7: Community modeling, and data and model interoperability

Page: W7.W7PositionPaper - Last Modified : Sat, 17 May 08

W7.W7PositionPaper History

May 17, 2008, at 07:37 PM by Ralf Seppelt -
Added line 109:
Changed lines 115-116 from:
Participatory Modeling [40-44] is becoming a recognized approach to modeling complex systems for decision-making. However, there are no agreed standards and platforms for data sharing and group model development available so far.
to:
Changed lines 122-123 from:
The realization that interoperability is desirable and important is relatively recent. Most of the interoperability difficulties just described have arisen from institutional and cultural conventions that evolved over long periods of time, during which data producers gave little priority to coordinating their efforts.
to:
The realization that interoperability is desirable and important is relatively recent. Most of the interoperability difficulties just described have arisen from institutional and cultural conventions that evolved over long periods of time, during which data producers gave little priority to coordinating their efforts. Participatory Modeling [40-44] is becoming a recognized approach to modeling complex systems for decision-making. However, there are no agreed standards and platforms for data sharing and group model development available so far.
Added line 125:
Added line 130:
Added line 132:
Added line 134:
Added line 155:
Added line 157:
Added line 159:
Added line 162:
Added line 166:
Added line 168:
Added line 170:
Added line 176:
Changed lines 182-183 from:
It is essential to have the right software tools, and consensus between data providers (in this case, NOAA and USGS) and modelers (in this case, CBP), so that the watershed model can access the data it needs through a standard pre-processing and setup routine that finds and downloads the data for subsequent model runs.
In addition to climatic and water data, the watershed models are linked to socioeconomic information required for land-use coverages and the calculation of loading factors. This leads to further exploration of linkages to census data available from the US Census Bureau [56]. Census data are organized by census blocks and tracts, which have nothing in common with the watershed and subwatershed spatial structures assumed in watershed modeling. Additional preprocessing is needed to reorganize and resample these data for use in the model.
to:

It is essential to have the right software tools, and consensus between data providers (in this case, NOAA and USGS) and modelers (in this case, CBP), so that the watershed model can access the data it needs through a standard pre-processing and setup routine that finds and downloads the data for subsequent model runs. In addition to climatic and water data, the watershed models are linked to socioeconomic information required for land-use coverages and the calculation of loading factors. This leads to further exploration of linkages to census data available from the US Census Bureau [56]. Census data are organized by census blocks and tracts, which have nothing in common with the watershed and subwatershed spatial structures assumed in watershed modeling. Additional preprocessing is needed to reorganize and resample these data for use in the model.
May 17, 2008, at 07:25 PM by Ralf Seppelt -
Added line 52:
Deleted lines 65-66:
There is also considerable interest in making models to each other. OpenMI is one
Added line 71:
Environmental problems and their management increasingly require consideration of feedbacks between different entities, such as biotic and abiotic processes. Thus data and models need to build upon joined, integrated databases.
Added line 75:
Added line 81:
Added lines 83-85:

There is also considerable interest in making models to each other. OpenMI is one
May 17, 2008, at 09:06 AM by Ralf Seppelt -
Changed lines 30-31 from:
to:
'''Ralf Seppelt''': [[Attach:seppelt.pdf| seppelt.pdf]]
Changed lines 14-15 from:
1 - '''Upload the file''' by clicking below where it says "click here", but '''''be careful''''': when uploading, '''name the attached file like this''': YourNameofFirstAuthor_et_al.pdf
to:
0 - '''Be careful''': when uploading in the next step, '''name the file attached''' like this: YourNameofFirstAuthor_et_al.pdf

1 - '''Upload the file''' by clicking below where it says "click here"
Changed lines 21-22 from:
'''To submit an abstract''': [[Attach:abstractExample.pdf| Click here]]
to:
'''To submit an abstract''': [[Attach:Abstract.pdf| Click here]]
February 21, 2008, at 12:07 AM by 70.177.181.146 -
Changed lines 37-41 from:

!! Heading

[++'''Position Paper: Community modeling, and data and model interoperability'''++]
to:
--

!! Position Paper

[++'''Community modeling, and data and model interoperability'''++]
Changed lines 55-56 from:
1. Introduction
to:
[+1. Introduction+]
Changed lines 85-86 from:
2. Interoperability in environmental observatories, and community building
to:
[+2. Interoperability in environmental observatories, and community building+]
February 21, 2008, at 12:01 AM by 70.177.181.146 -
Changed lines 25-27 from:
to:
'''Philipp Kraft, Kellie B. Vache, Lutz Breuer, Hans-Georg Frede''': [[Attach:Kraft_et_al.pdf| Kraft_et_al.pdf]]
Changed lines 37-38 from:
Community modeling, and data and model interoperability
to:


!! Heading

[++'''Position Paper: Community modeling, and data and model interoperability'''++]
Changed lines 7-10 from:
[++'''Abstracts'''++]

'''Example of abstract''': [[Attach:abstracExample.pdf| AbstractExample.pdf]]
to:
\\

[++'''New Abstracts'''++]


Steps for submitting a new abstract:

1 - '''Upload the file''' by clicking below where it says "click here", but '''''be careful''''': when uploading, '''name the attached file like this''': YourNameofFirstAuthor_et_al.pdf

2 - After uploading, '''edit this WIKI''' (see edit button on the top right corner), add one empty line and the following line at the bottom of the next section "Abstracts already uploaded":
[='''Abstract by YourNameofFirstAuthor et al''': [[Attach:YourNameofFirstAuthor_et_al.pdf | Abstract by YourNameofFirstAuthor et al]]=]

'''To submit an abstract''': [[Attach:abstractExample.pdf| Click here]]

\\

[++'''Abstracts already uploaded'''++]


\\
December 17, 2007, at 12:33 AM by avoinov - Initial version
December 17, 2007, at 12:32 AM by avoinov - Initial version
Changed lines 16-158 from:
(:commentbox:)
to:
(:commentbox:)

Community modeling, and data and model interoperability

Community modeling is a promising paradigm for developing complex, evolving and adaptable modeling systems that can share methods, data and models more easily within specialized communities and with outsiders. Why then are cooperative modeling communities still quite rare, and why do they not propagate easily? Why has open source been so successful for software development, while open models are still quite exotic? One big difference between software and models is that software shares some common language, whereas models often use very different principles and semantics. It becomes hard for one modeler to communicate these principles to another, and difficult for one model to talk to another. Similar problems prevail in data operations, when data sets (which are also models of a sort) are hard to integrate with other data. Environmental observatories are becoming an important driver in the research community and also call for new interoperability standards and functionality.
There are two facets of the problem:
* Lack of common modeling and software tools to enable modularity and connectivity;
* Lack of social motivation and communication skills to enable communal work and sharing environments.
This paper explores both of these areas. Its goals are to:
* Understand the interoperability needs of the community in a participatory and collaborative effort;
* Develop research scenarios that would benefit from interoperability. Build consensus about interoperability architecture and standards supporting these scenarios;
* Expand on environmental system observatory ontologies, in particular for mapping variables to concepts;
* Discuss common access protocols, enabling models to automatically search for data needed and link to data servers. Design data interoperability for model input/output to help link models.


1. Introduction

There is an increasing consensus that much value can be derived from integrating different models and data sets.





Development of an integrated data sharing infrastructure to facilitate multidisciplinary collaborative analysis and modeling in the context of an environmental observatory is a pressing need. With such infrastructure, researchers should be able to publish and document their data, discover what information is available based on agreed-upon metadata descriptions, retrieve the information over common data access mechanisms, understand and resolve semantic discrepancies between datasets, integrate them for use in analysis and modeling codes, and share research findings with community members. In this research cycle, information sharing and re-use are the major underpinnings in reducing fragmentation in environmental research, and engaging a broader research community from the environmental science and related domains in advanced data collection, analysis and modeling.

The need to focus on the common data foundation for the communities involved in environmental monitoring, analysis and modeling, is underscored by the following observations:





The range of interoperability challenges, derived from differences in structure and semantics of datasets, data publication, discovery and access mechanisms, as well as in modeling approaches, has been described in recent literature on environmental observing systems [60]. Technical interoperability issues, such as those related to common procedures for real-time data management, integration of streaming data with data archives, and technologies for expressing and resolving well-understood structural and semantic heterogeneities, have been the focus of NSF attention over the last several years, within such initiatives as CEO:P (Cyberinfrastructure for Environmental Observatories: Program), Geoinformatics, CLEANER/WATERS [1, 6] and NEON [3]. However, purely technical solutions for interoperability are insufficient for establishing a shared interoperable infrastructure for environmental observatories. For the software infrastructure to be truly useful for empowering the entire research community with data sharing and collaborative research capabilities, several additional challenges must be addressed. They include:
* making the community aware of the available data and software resources,
* building consensus about information models used in the community,
* understanding and harmonizing data structures and data access mechanisms and formats used within the community, and
* building support for modeling applications that take advantage of the infrastructure.
The initial experience of infrastructure development and adoption within GEON [4] and CUAHSI [2] HIS (Hydrologic Information System) projects provides ample evidence of heterogeneities in data and resources needed to compute watershed and estuary models in particular, and of the range of interoperability challenges stemming from different data structures and semantics adopted within the community.
Development of ecosystem models in general has been limited by the ability of any single team of researchers to deal with the conceptual complexity of formulating, building, calibrating, and debugging complex models. The need for collaborative model building has been recognized in the environmental sciences. The current-generation models tend to be "idiosyncratic monoliths that are comprehensible only to the builders" [26]. Communicating the structure of the model to others can become an insurmountable obstacle to collaboration and acceptance of the model. The interoperability functions that we propose to develop will be the core of a system of middleware that would allow integrating existing models and will provide for easy integration of new models.

The models provide ample evidence of heterogeneities that need to be resolved in order to obtain more detailed and accurate model results. There is already a significant community building effort going on. For example, within the Chesapeake Bay area there is the Chesapeake Community Modeling Program (CCMP), which is building reliable working relationships among all participating modelers who are willing to share their model and data structures, and their needs, in order to develop the important interoperability functions. This will include the integration of the very large existing and continuously increasing watershed, hydrodynamic, and biogeochemical (water quality) databases. This effort is led by the Chesapeake Research Consortium (CRC), which organizes universities, as well as government and non-profit research organizations, around common problems. In addition, the CCMP can serve as the body responsible for governance of data exchange standards within the Bay area, providing linkages between various community strata grouped according to:





2. Interoperability in environmental observatories, and community building



2.1. Data availability, metadata and catalogs

Creating easy-to-use, uniform and scalable data and services publication and discovery mechanisms, and helping community members familiarize themselves with the available resources, are the basis for engaging community members in an efficient interoperable network. At the moment, many environmental stakeholders maintain their own data archives and data access systems. For the Chesapeake Bay, several community resource repositories and discovery interfaces are being developed. They include:
- The Chesapeake Information Management System (CIMS), which brings together multiple datasets assembled by federal, state and local agencies. CIMS datasets have consistent metadata descriptions based on FGDC Content Standard for Digital Geospatial Metadata and the metadata content is searchable via text search, with many datasets available for download (the search interface is at [29]).
- The GEON-based CBEO portal, developed within the on-going NSF-supported CEO:P project, award #0618986. Within this portal, users can publish their datasets and services and search for registered resources of different types, including shapefiles, raster images, Excel spreadsheets, relational databases, WMS/WFS services, web services, documents, etc. A subset of CIMS water quality database is already registered through this portal [30]. Catalog search is available via the portal, and via the GEONSearch web service.

However, the above efforts are disconnected, and additional work is required to reconcile metadata and information discovery protocols across the repositories, so that users can query and explore the data available for the area regardless of repository.

2.2 Differences in information models for observations data, and data access protocols


In addition, each of the nationally hosted environmental data sources, such as the hydrologic data repositories at USGS and EPA, has a different data access interface. The CUAHSI WaterOneFlow services provide a simplified, consistent way of accessing data from a combination of these sources. While similar in approach to the OGC web services specifications, the CUAHSI web services are not OGC compliant at the moment, though initial harmonization steps are outlined in the CUAHSI WaterML specification [39]. As a result, users wishing to access these sources can only do so using CUAHSI-compatible client software. Other non-OGC, non-CUAHSI data servers require still more custom software. Domains other than hydrology have similar issues, and may not even have web-based access.

2.3 Differences in model needs with respect to the available data

Models require increasingly large volumes of input data, which raises a performance problem for accessing the data via web services during model runs. It is necessary in many cases to download these large volumes of data from the national portals and transform the data in certain ways to prepare for use with the desired numerical or analytical models. Spatial data may be initially available as triangular grids, regular grids, or vector form such as points, lines, and polygons, while the model may only work with the data in a different form. The scale or resolution of the initial data may not be appropriate for a given model execution, and must be transformed to the correct scale. The temporal resolution and reference system of time-series observations can differ from that expected or needed by models. Each model has its own specific data requirements, resulting in increasingly painstaking manual effort to locate, obtain, and transform the data in preparation for modeling applications.
Participatory Modeling [40-44] is becoming a recognized approach to modeling complex systems for decision-making. However, there are no agreed standards and platforms for data sharing and group model development available so far.
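
The temporal-resolution mismatch described in this section can be illustrated with a minimal sketch. It assumes, purely for illustration, that observations arrive as (timestamp, value) pairs at 15-minute intervals and that the model expects hourly means; the function name `resample_hourly` and the sample readings are hypothetical and are not part of any system named in this paper.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean


def resample_hourly(observations):
    """Average (timestamp, value) pairs into hourly bins.

    Returns a list of (hour_start, mean_value) tuples sorted by time.
    """
    bins = defaultdict(list)
    for timestamp, value in observations:
        # Truncate each timestamp to the start of its hour.
        hour_start = timestamp.replace(minute=0, second=0, microsecond=0)
        bins[hour_start].append(value)
    return [(hour, mean(values)) for hour, values in sorted(bins.items())]


# Hypothetical 15-minute stream readings (illustrative values only).
readings = [
    (datetime(2008, 5, 17, 9, 0), 1.0),
    (datetime(2008, 5, 17, 9, 15), 1.2),
    (datetime(2008, 5, 17, 9, 30), 1.4),
    (datetime(2008, 5, 17, 9, 45), 1.6),
    (datetime(2008, 5, 17, 10, 0), 2.0),
]
hourly = resample_hourly(readings)
```

A real pre-processing service would also have to reconcile time zones and reference systems, which this sketch ignores.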

2.4 Differences in semantics between datasets available for different times and assembled in different research domains

Another significant issue is the difference in nomenclatures and semantics among data sources for a given type of data, usually as a function of the compilation or hosting organization. The simplest differences may be in terms of language, keywords, metadata tags, etc., especially across subject domains and time periods. For example, the depth of a stream may be called "gage," "stage," or "waterlevel." A similar example is that the classification codes (identifiers) for variables such as nutrients (e.g., nitrogen) will vary, depending on the data producer's conventions. Semantic reconciliation is a well-recognized component of the data interoperability challenge [37, 45, 46, 47, 48, 49, 50, 51, 52, etc.]. Within both GEON and CUAHSI HIS, several technical approaches to ontology management and semantics-based integration have been explored. Beyond the technical issues, significant effort is required to make common nomenclatures and ontology translations accepted as the basis for a community lingua franca.
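
The simplest form of such reconciliation, mapping source-specific variable names to a shared concept, might look like the sketch below. The synonym table uses the three stream-depth terms from the example above; the concept label `stream_depth` and the function names are assumptions made for illustration, and a real ontology service would be considerably richer.

```python
# Hypothetical synonym table: source-specific term -> shared concept.
SYNONYMS = {
    "gage": "stream_depth",
    "stage": "stream_depth",
    "waterlevel": "stream_depth",
}


def to_concept(term, synonyms=SYNONYMS):
    """Return the shared concept for a source term, or None if unmapped."""
    # Normalize case, surrounding whitespace, spaces and underscores.
    key = term.strip().lower().replace(" ", "").replace("_", "")
    return synonyms.get(key)


def harmonize(record, synonyms=SYNONYMS):
    """Rename the keys of one observation record to shared concepts.

    Unmapped keys are kept unchanged so that nothing is silently dropped.
    """
    return {to_concept(key, synonyms) or key: value for key, value in record.items()}
```

For instance, `harmonize({"Stage": 1.2, "site": "A"})` renames the first key to `stream_depth` and leaves `site` untouched.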

2.5 Difficulties in consensus building

The realization that interoperability is desirable and important is relatively recent. Most of the interoperability difficulties just described have arisen from institutional and cultural conventions that evolved over long periods of time, during which data producers gave little priority to coordinating their efforts.

In addition, significant institutional and cultural issues still remain that limit the efficacy of interoperability technologies and policies.



The most straightforward aspects of interoperability are in the design and development of technologies and practices for cyberinfrastructures supporting federated data sources linked to users through service-oriented architectures and web services. This approach serves a wide range of government, business, and scientific application areas, and is the subject of nearly all the current standards work in progress at OASIS, OGC, and other IT standards organizations. OGC has evolved a process for testbed pilot projects that has achieved high levels of sponsored participation and collaboration by government and industry data providers and users. As a result of OGC testbed projects which use rapid prototyping as a way to accelerate the design, testing, and adoption of interoperability tools and standards, numerous national agencies in the US, UK, Europe, and Australia now routinely require OGC specifications to be applied in RFPs for enterprise and federated geospatial information portals and other systems. This is an important step toward alleviating the institutional barriers to interoperability. By engaging both the leadership and the programming staff of government agencies and commercial entities in the activities of defining and adopting geospatial standards and best practices, many of the trust issues at the institutional level are addressed, and organizational policies evolve accordingly. Now the issues of most concern within the IT standards consortia are not so much about whether to share data, but how to do so while managing data security, intellectual property rights and licensing, personal privacy, cost recovery, and adequate functionality and performance of transactions for data creation and update.
Much work is also being done in core technology for bridging and mediating between different semantics and ontologies. The Geography Markup Language (GML) [63] was initially developed by OGC members in 2000, and has steadily improved to support the representation of essentially any type of geographic feature, but it is too abstract and general for direct use within a subject domain (user communities are encouraged to adopt an application schema based on some reasonable subset of GML to precisely characterize a given information model). Another broadly applicable standard more recently adopted has to do with webs of sensor data: SensorML [64], TransducerML [65], Sensor Observation Service [66], and other related specifications, were jointly developed by OGC and IEEE (Institute of Electrical and Electronics Engineers), and are beginning to be used in the environmental observatory context.
The IT standards organizations have not, in general, been involved in the harmonization of semantics and ontologies within specific subject domains. Each distinct information community needs to develop its own information models and metadata catalogs, simply because these must reflect the deep science within each subject area. These developments, including GeoSciML (GeoSciences Markup Language, based on GML), EML (Ecological Metadata Language), CML (Chemistry Markup Language), WaterML (for hydrologic applications), and other domain-specific MLs, represent important steps in the maturing of community attitudes and understanding toward sharing information models and data in effective ways. At the same time, the choice, or development, of a relevant data model and markup language that adequately describes the variety of data presentation and transformation needs of a research domain remains a challenge. Furthermore, because these developments involve agencies, universities, and other research centers around the world, there is growing pressure to overcome many of the existing institutional and international barriers to interoperability.

3. Challenges and solutions










4. Vision and Tasks

There are two major complementary components in this kind of research: software research and development, and community building and integration. Both are equally important. Within the software-related category, there are a number of distinct types of components, interfaces, and web services, which can be considered.

4.1 Data availability, and harmonizing data discovery interfaces

We envision that easy query and browse access to a community information system with multiple community-contributed resources, coupled with the ease of publishing data of common interest, is a critical component of community infrastructure. The task of unifying data discovery and exploring data availability has the following sub-tasks:
1) Register community-generated observations datasets via existing online systems. Identify the most promising systems and focus on them.



4.2 Harmonizing information models and data access protocols



2) Develop converters and translation services for other datasets. The CUAHSI WaterML development and its harmonization with the O&M specification is of great interest to OGC and the WRON project (two co-PIs of this project are partly sponsored by WRON to attend a Water Markup Languages summit in Canberra 9/25-27/2007; a delegation of WRON project managers is visiting SDSC in September 2007, and SDSC has hosted a WRON researcher during Summer 2007).
3) Expand the technologies developed for observations data within existing observatories, including biogeochemistry data, land use, atmospheric data, etc. Arriving at data interchange standards and ontologies applicable across communities is a challenging task. It requires explicating and comparing semantic frameworks used in the neighboring fields, outlining information models for commonly used data sets, standardizing data discovery and access mechanisms, and assessing common integration scenarios, e.g., in support of comprehensive modeling on watersheds and estuaries.

4.3 Matching model needs with the available data

By applying the same formalism that treats data sets as independent modules that can be accessed just like the modeling modules, we can integrate the data sets into the modeling system as well.
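
One way to read this formalism is that a data set and a model expose the same call interface, so either can be placed in a processing chain. The sketch below is only an illustration of that idea: the `Module` interface, the class names, and the toy runoff coefficient are all hypothetical and do not describe any existing middleware.

```python
from typing import Dict, Protocol


class Module(Protocol):
    """Hypothetical common interface shared by data sets and models."""

    def run(self, inputs: Dict[str, float]) -> Dict[str, float]: ...


class RainfallData:
    """A data set wrapped as a module: ignores inputs, serves a stored value."""

    def __init__(self, rainfall_mm: float) -> None:
        self.rainfall_mm = rainfall_mm

    def run(self, inputs: Dict[str, float]) -> Dict[str, float]:
        return {"rainfall_mm": self.rainfall_mm}


class RunoffModel:
    """A toy model: runoff is a fixed fraction of rainfall (illustration only)."""

    def __init__(self, coefficient: float) -> None:
        self.coefficient = coefficient

    def run(self, inputs: Dict[str, float]) -> Dict[str, float]:
        return {"runoff_mm": self.coefficient * inputs["rainfall_mm"]}


def run_chain(modules, inputs=None):
    """Run modules in order, merging each module's outputs into a shared state."""
    state = dict(inputs or {})
    for module in modules:
        state.update(module.run(state))
    return state


result = run_chain([RainfallData(20.0), RunoffModel(0.5)])
```

Because `run_chain` only relies on the `run` method, swapping the stored rainfall value for a module that fetches data from a remote server would not require changes to the model.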


Figure 1. An example of data flow between some data sets and models. There are many more models and data sets, but these will be primarily targeted for this pilot study.

Suppose we have a watershed model such as HSPF, the model at the core of the Chesapeake Bay Program (CBP). Like other watershed models, it requires information about climatic conditions and flow data for streams. These data sets are available from the web; however, substantial effort is required to download all the information needed and convert it to the proper format for landscape modeling. Each time we move from one sub-watershed to another, this effort must be repeated.
It is essential to have the right software tools, and consensus between data providers (in this case, NOAA and USGS) and modelers (in this case, CBP), so that the watershed model can access the data it needs through a standard pre-processing and setup routine that finds and downloads the data for subsequent model runs.
In addition to climatic and water data, the watershed models are linked to socioeconomic information required for land-use coverages and the calculation of loading factors. This leads to further exploration of linkages to census data available from the US Census Bureau [56]. Census data are organized by census blocks and tracts, which have nothing in common with the watershed and subwatershed spatial structures assumed in watershed modeling. Additional preprocessing is needed to reorganize and resample these data for use in the model.
Additionally, these standardized protocols for data access will be available to any models in the CCMP directory and beyond. For example, the PSU PIHM watershed model has similar data requirements, but runs over a triangular spatial grid. Resampling procedures will be developed to provide similar data access for the different geometric structure used in this model. The output from these models can then be piped into Bay models such as the CBP-QUAL-ICM or ChesROMS, which is part of the CCMP open source distribution.
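
The census resampling step discussed in this section could, under a simplifying assumption, be done by area-weighted reallocation: each tract's total is split among watersheds in proportion to the overlapping area. The sketch below assumes that the value is spread uniformly within each tract and that the overlap areas have already been computed by a GIS; the function name and all numbers are hypothetical.

```python
def reallocate(tract_values, tract_areas, overlaps):
    """Area-weighted reallocation of tract totals to watersheds.

    tract_values: {tract_id: total value, e.g. population}
    tract_areas:  {tract_id: tract area}
    overlaps:     {(tract_id, watershed_id): area of intersection}
    Assumes the value is spread uniformly within each tract.
    """
    totals = {}
    for (tract, watershed), area in overlaps.items():
        # Fraction of the tract that lies inside this watershed.
        share = area / tract_areas[tract]
        totals[watershed] = totals.get(watershed, 0.0) + share * tract_values[tract]
    return totals


# Hypothetical numbers: areas in square kilometres, values are population counts.
population = {"T1": 1000, "T2": 400}
areas = {"T1": 10.0, "T2": 8.0}
overlaps = {("T1", "W1"): 4.0, ("T1", "W2"): 6.0, ("T2", "W2"): 8.0}
by_watershed = reallocate(population, areas, overlaps)
```

The uniform-density assumption is the weak point of this approach; ancillary data such as land-use coverages could be used to refine the weights.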

4.4 Ontology and semantics



4.5 Serving data for participatory modeling efforts

In recent years, there has been a shift from top-down prescriptive management of water resources towards policy making and planning processes that require on-going active engagement and collaboration between stakeholders, scientists, and decision-makers. Participatory modeling is the process of incorporating stakeholders, often including the public, and decision-makers into an otherwise purely analytic modeling process to support decisions involving complex ecological questions. It is recognized as an important means by which non-scientists are engaged in the scientific process and is becoming an important part of watershed planning, restoration, and management. The development of unique, practical, and affordable solutions to ecological problems is often best accomplished by engaging stakeholders and decision makers in the research process. These group modeling efforts require specific types of models and data to be successful. These modeling tools are usually simpler than what we find in full-fledged research models, and oftentimes use Excel or Stella as a means of joint learning and system representation. The scoping models that are produced are designed to gain shared experience about the system and build consensus among stakeholders. However, they also require data to make them run. Moreover, they need a lasting web presence that would support group interactions, and link directly to diverse data and modeling tools.

4.6 Community consensus-building and testbed processes

Various communities need to be engaged in as many ways as possible. This can be achieved by the following means:
1) On-going "cyber-seminars" for which selected participants would submit white papers in advance for consideration by the community. On specific days, discussions of the content could take place using simple web conferencing tools such as Skype, Webex, etc. Follow-up discussion would take place on a twiki over a prescribed period of time (one to two weeks). The results of discussion could be incorporated into the initial white paper(s) and re-posted to the community, which could then decide on subsequent action: e.g., allow the paper to form the basis for subsequent research and development work; redirect research currently underway; submit for presentation to a conference; etc. The twiki could be used for further feedback and subsequent results as needed.
2) Provide means for a Web 2.0 approach to obtain community rating and annotation of submitted resources. Each key resource, such as source data sets, derived data sets, metadata catalogs, software tools developed to search the metadata and actual data sets, etc., will have a commonly identified means in the portal for users to rate the resource and enter comments explaining their ratings. The comments are essential to enable the resource developers to understand and respond to any problems or other issues. These comments and follow-up will be posted to resource-specific twikis for tracking and archive.
3) Support inter- and intra-community ontology development with a Web 2.0 approach, along the lines of urbandictionary.com. Terms of reference, classification systems, valid values for coded domains, etc., will be posted to twikis allowing all users to see and vote on competing definitions. As specific issues arise that require more extensive discussion for resolution, the community will be called together for a cyber-seminar.
Solving the cultural issues mentioned previously is more difficult than the institutional issues, because these very often come down to working with attitudes, opinions, and beliefs of strong-willed individuals who may resist external directives. But here too, lessons are being learned and progress is being made. An individual field scientist can hardly be faulted for reluctance to share her data when she doesn't know the motives and applications a given user might have in mind. She may not see or accept the value to her of allowing her data to be accessed and used without her direct involvement. But if, say, someone from the cyberinfrastructure community were to take a personal interest in her work, and earn her trust based on a relationship of shared understanding, then her perspective is more likely to open. One needs to be able to see and trust ways of benefiting from the synergistic implications of contributing one's data to a greater body of knowledge. (It is also likely that some form of external rewards could help to further encourage this outcome.) Certainly, even with personal attention and effort to create such bridging relationships with the IT community, some scientists will still reject the importance of sharing their data, and of working with others to use community-based information models and data exchange mechanisms. This is just part of human nature. Such scientists may yet respond to firm directives from funding agencies to adhere to community information models and best practices, but the difference between motivated and unmotivated adherence to recommended standards and practices can make a big difference in the fitness of data for other uses.
In summary, while the focus to date on interoperability standards and tools has been largely technical, we need to acknowledge and work with the social aspects of it now. In order to move forward, we must now focus on forming consensus around information and modeling requirements and architectures, making the differences between current information models and workflows explicit, helping to build solid relationships between IT and subject domain scientists, and otherwise engaging the various stakeholder communities in these activities as much as possible.

Added lines 3-4:
[++'''Position Paper'''++]
Added lines 1-14:
(:title Discussion space - W7: Community modeling, and data and model interoperability :)

'''Position paper under discussion''': [[Attach:positionPaper.pdf| WorkshopPaper.pdf]]

[++'''Abstracts'''++]

'''Example of abstract''': [[Attach:abstracExample.pdf| AbstractExample.pdf]]

[++'''Discussion'''++]

Comments from participants.


(:commentbox:)