GSoC mid-term: Attributes Disk Store


My name is Ernesto Aneiros and during this Google Summer of Code I am working on the Attributes Disk Store.

The problem

In Gephi, attributes are the data associated with nodes and edges. As graphs grow larger, attributes occupy more memory, even though they are often not essential to an end-user who is only applying transformations or algorithms to the graph. These attributes can be of different types, from simple Java types (byte, char, int, String, etc.) to Gephi’s internal data types (lists of primitives or versioned data). The idea for the project was to have a combined memory/disk cache system that partially off-loads these attributes to disk. The cache should be designed to handle heavy read access on the most-accessed elements.
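To make the idea of a combined memory/disk cache more concrete, here is a minimal sketch. It is not the project's code: the class name `TwoTierAttributeStore` is hypothetical, and a plain `HashMap` stands in for the disk so the example stays self-contained. Recently read attributes stay in a bounded in-memory map, and the least recently used ones are off-loaded.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical two-tier store: a bounded LRU map in memory, a plain map standing in for the disk. */
class TwoTierAttributeStore {
    // Assumption: a HashMap plays the role of the on-disk store in this sketch.
    private final Map<String, Object> disk = new HashMap<>();
    private final LinkedHashMap<String, Object> memory;

    TwoTierAttributeStore(final int memoryCapacity) {
        // accessOrder = true keeps the least recently used entry at the head of the map.
        this.memory = new LinkedHashMap<String, Object>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Object> eldest) {
                if (size() > memoryCapacity) {
                    disk.put(eldest.getKey(), eldest.getValue()); // off-load the coldest entry
                    return true;
                }
                return false;
            }
        };
    }

    void put(String key, Object value) {
        disk.remove(key);        // the in-memory copy becomes the authoritative one
        memory.put(key, value);
    }

    Object get(String key) {
        Object value = memory.get(key);
        if (value == null && disk.containsKey(key)) {
            value = disk.remove(key);
            memory.put(key, value); // promote back to memory; may off-load another entry
        }
        return value;
    }

    boolean inMemory(String key) { return memory.containsKey(key); }

    boolean onDisk(String key) { return disk.containsKey(key); }
}
```

With a capacity of 2, putting three attributes leaves the least recently used one on the "disk" tier, and reading it again brings it back into memory.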

The Solution (1st iteration)

Lucene is one of the most popular text search engines and a flagship Java open source project. It is capable of handling and indexing millions of records while remaining performant. When the idea for the Attributes API cache was born, Lucene was the first candidate for the role of data store; as an added bonus, Gephi would get full-text search capabilities with almost no extra effort. When analyzing the problem, the following criteria were developed to judge a possible data store:

  1. Reliable (resistant to corrupted disk data, failed transactions, unexpected errors)
  2. Fast
  3. Transparent (minimum complexity exposed to the end-user)

While Lucene complies with items 2 and 3, its approach to corrupted indices is to rebuild them from scratch, so it fails item 1. This doesn’t pose a problem for Lucene because, in the context where it is meant to be used (indexing of external information), the input data is always available separately from the index and can be accessed if needed. In Gephi, however, this is not the case. Once Attributes are loaded from disk they remain in memory until saved back to file. If an error occurs during a disk store transaction, the end-user can end up losing a day’s work, which is certainly not acceptable.

The Solution (2nd iteration)

After Lucene was ruled out as a data store, several options were considered, including embedded SQL databases and a combination of Ehcache plus BerkeleyDB. Both options bring a lot to the table, and embedded databases in Java have achieved impressive performance compared to mainstream database systems (see the H2 and HSQL projects, for example). Ehcache + BerkeleyDB, however, wins when complexity is considered, since it introduces almost no translation layer between Gephi and the cache. Both solutions are good fits for the problem, but in the end the balance tilted in favor of Ehcache + BDB because of the complexity consideration.

Optimizing Ehcache and BerkeleyDB

Even though Ehcache provides a great deal of functionality, it was relatively easy to get up to speed with it. The online documentation is very complete, with code samples and detailed explanations. In almost no time an in-memory cache was up, running and being tested. Traditionally, cache sizes have been specified as the maximum number of elements they can hold. The 2.5 BETA of Ehcache introduced a new feature that allows sizing caches by memory consumed instead of elements held. For our project this is a killer feature, since we can now expose a single option to the user, letting them specify how much memory the cache should consume. Even though using the new feature proved a little more complicated than expected, we obtained great feedback from the Ehcache community, especially from alexsnaps and Mike Allen, which helped us solve the issues we were having.
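The difference between sizing by element count and sizing by memory can be illustrated with a small sketch. This is not Ehcache's API or implementation: `ByteBudgetCache` is a hypothetical class, and it takes the length of each byte array as the entry's size, whereas a real cache has to estimate the size of arbitrary objects.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical cache bounded by stored bytes instead of entry count (not Ehcache's API). */
class ByteBudgetCache {
    private final long maxBytes;
    private long usedBytes = 0;
    // Access-ordered map: iteration starts at the least recently used entry.
    private final LinkedHashMap<String, byte[]> entries = new LinkedHashMap<>(16, 0.75f, true);

    ByteBudgetCache(long maxBytes) {
        this.maxBytes = maxBytes;
    }

    void put(String key, byte[] value) {
        byte[] old = entries.put(key, value);
        if (old != null) {
            usedBytes -= old.length;
        }
        usedBytes += value.length;
        // Evict least recently used entries until the byte budget is respected again.
        Iterator<Map.Entry<String, byte[]>> it = entries.entrySet().iterator();
        while (usedBytes > maxBytes && it.hasNext()) {
            Map.Entry<String, byte[]> eldest = it.next();
            usedBytes -= eldest.getValue().length;
            it.remove();
        }
    }

    byte[] get(String key) {
        return entries.get(key);
    }

    long usedBytes() {
        return usedBytes;
    }

    int size() {
        return entries.size();
    }
}
```

The only knob is `maxBytes`, which mirrors the single "how much memory" option described above: a few large values or many small ones fit in the same budget.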

BerkeleyDB, on the other hand, is a very complex piece of software. With years of development under its belt, BDB has evolved into a very robust and flexible database. In fact, it is so flexible that it can be used as a full-blown database supporting queries, as a simple key/value datastore, or through a front-end that exposes a Java collections map and greatly simplifies its use. All of this flexibility does not come free, though: configuring and optimizing BerkeleyDB requires delving into details about transactions, buffers, log file sizes and BDB internals. However, the tools are there and the information provided is quite good, especially the FAQ and the optimization section.

Integration with Gephi

Since ease of use and transparency are important considerations for the end-user of Gephi, only the minimal configuration options are exposed in the preference panel of the disk cache, but an Advanced tab provides more control for those who want it.

The General settings tab, where cache can be enabled or disabled and the memory usage configured.


The advanced settings tab allows a more advanced user to configure several of BerkeleyDB’s options.

The Disk Store in Action


Memory consumption without the disk store. It reaches 400MB.


Memory consumption with the disk store. After loading, it drops below 400 MB. Note how load time increased due to disk operations, a trade-off to consider when using the store.

Known issues

The project is still in development. Saving memory is the main goal of the disk store project, and the results are not good enough yet, for several reasons.

While BerkeleyDB provides a very convenient way of storing bytes on disk, it is still database-oriented software and therefore not the most suitable solution for our project: it uses a lot of memory for caching data and for building and maintaining its index (features that are desirable for databases but not for this project).

Trying to reduce BerkeleyDB’s memory usage through its settings produces quite different results on different systems, or even on the same system. The benchmark above shows reasonable results, but that is not always the case. Maximum heap growth is better controlled, but memory usage still peaks in ways that prevent greater savings.

The conclusion is that it is a priority to replace BerkeleyDB with another disk persistence system, or to create one specifically designed for the Gephi disk store.

It is also known that graphs with more complex data, like strings or lists, will always benefit more from a disk storage system than graphs with simple data like integers or booleans. One idea is to always keep simple data in memory, because indexing it on disk is going to need as much memory anyway, or even more.
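The idea of keeping simple data in memory could be expressed as a small placement rule. The sketch below is only an illustration of that idea: the class `AttributePlacement` and the exact set of types it checks are assumptions, not part of the disk store.

```java
import java.util.List;

/** Hypothetical placement rule: small fixed-size values stay in memory, larger ones may go to disk. */
class AttributePlacement {
    static boolean shouldOffload(Object value) {
        if (value == null) {
            return false;
        }
        if (value instanceof Number || value instanceof Boolean || value instanceof Character) {
            // Indexing such a value on disk would cost about as much memory as the value itself.
            return false;
        }
        // Strings, lists and arrays can grow large, so they are candidates for the disk store.
        return value instanceof CharSequence || value instanceof List || value.getClass().isArray();
    }
}
```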

On the other hand, Gephi was designed with in-memory data structures in mind. Adding a cache/disk store to the system is bound to create integration issues with other parts of the codebase. For example, the GEXF file importer tends to load large portions of the graph file into memory while parsing it, which is not ideal in memory-constrained environments, and using the cache here will not make a difference. Another of these issues concerns the handling of data in files with the .gephi format: due to the way .gephi files are imported, some integration problems still need to be debugged before the disk store works properly with them.

Looking to the future

This GSoC project is only scratching the surface of what a memory + disk cache system can achieve. In the future, BerkeleyDB could be replaced with another persistence provider, and it doesn’t necessarily have to persist locally to disk. For example, BerkeleyDB could be replaced with a datastore like Cassandra, or maybe some RDBMS.

Conclusion

While the Data Store API introduced by this project is still taking its first steps and can evolve significantly, it has helped iron out many issues and has paved the way for bigger and better improvements. Working during this summer has been a great experience, and I have been able to work with great mentors like Eduardo Ramos, who knows the Gephi codebase inside and out. I hope the work of all of Gephi’s GSoC’ers becomes the starting point for many new features and enhancements that the community will surely appreciate. Happy coding and see you next summer!

GSoC mid-term: a new Timeline to explore time-varying networks


My name is Daniel Bernardes and during this Google Summer of Code I am working on the new Timeline interface.

Dynamic graphs have been the subject of increasing interest, given their potential as a theoretical model and their promising applications. Following this trend, Gephi has incorporated tools to study dynamic networks. From a visualization perspective, a critical tool is the Timeline component, which allows users to select pertinent time intervals and display and explore the corresponding graph. The challenge concerning the timeline was twofold: first, redesign the component to improve the user experience and add extra features; second, introduce an animation scheme with the possibility to export the resulting video.

Together with my mentors Cezary Bartosiak and Sébastien Heymann, we have proposed a new design for the timeline component featuring a sparkline chart in the background of the (semi-transparent) interval selection drawer. This feature will help the user focus on particular moments in the evolution of the dynamic graph, like bursts of connections or changes in graph density or other simple graph metrics. The metrics currently available are the evolution of the number of nodes, the number of edges and the graph density. The sparkline chart was preferred to other chart solutions because it does not add too much visual clutter to the component and supports qualitative analysis. The interaction with the drawer remains largely the same as in the old timeline, to guarantee a smooth transition for the user.

To implement this feature we have used the chart library JFreeChart (a library already included in Gephi), customizing its XYPlot into a sparkline chart by modifying its visual attributes. To display the sparkline, one needs to measure the properties of the graph at several time instants within the global time frame where the dynamic graph exists. This represented a major challenge, since the original architecture did not allow the timeline component to access (and measure) the graph at particular instants of time; the solution was to introduce a slight modification to the DynamicGraph API to provide an object that gives us snapshots of the graph at given instants. Other challenges we dealt with included the automatic selection/switching between real-number and time units in the timeline (depending on the nature of the graph in question) and the sampling granularity of the timeline.
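The sampling step can be sketched independently of Gephi. In the example below, `TimedEdge` and `edgeCounts` are hypothetical names and the edge lifetimes are plain numbers; the real component obtains snapshots through the DynamicGraph API instead. The sketch only shows how a metric (here, the number of edges) is measured at evenly spaced instants to produce the sparkline values.

```java
import java.util.List;

/** Hypothetical sketch of sampling a graph metric at evenly spaced instants for a sparkline. */
class SparklineSampling {
    /** Assumed edge lifetime: the edge exists for start <= t <= end. */
    static final class TimedEdge {
        final double start;
        final double end;

        TimedEdge(double start, double end) {
            this.start = start;
            this.end = end;
        }
    }

    /** Counts the edges alive at `samples` evenly spaced instants in [min, max]. */
    static int[] edgeCounts(List<TimedEdge> edges, double min, double max, int samples) {
        if (samples < 2) {
            throw new IllegalArgumentException("need at least two samples");
        }
        int[] counts = new int[samples];
        double step = (max - min) / (samples - 1); // the sampling granularity
        for (int i = 0; i < samples; i++) {
            double t = min + i * step;
            for (TimedEdge e : edges) {
                if (e.start <= t && t <= e.end) {
                    counts[i]++;
                }
            }
        }
        return counts;
    }
}
```

A larger `samples` value gives a smoother sparkline at the cost of more snapshots to measure, which is the granularity trade-off mentioned above.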

Another breakthrough of this project was the introduction of the timeline animation. Once the user has selected a time frame with the drawer, they can make it slide while the corresponding graph is displayed on the screen. Besides the technical aspects of the interaction between the timeline and the animation controller, there was also an effort to calibrate the animation (i.e., in terms of speed and frames) so it would be comfortable and meaningful for the user.
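As a rough illustration of the sliding selection, the sketch below computes the interval shown in each animation frame. The class and method names are hypothetical and the calibration is reduced to a single `frameCount` parameter; it is not the animation controller's actual code.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: the [from, to] interval displayed in each frame of the animation. */
class TimelineAnimation {
    static List<double[]> frames(double min, double max, double window, int frameCount) {
        if (frameCount < 2 || window <= 0 || window > max - min) {
            throw new IllegalArgumentException("invalid window or frame count");
        }
        List<double[]> result = new ArrayList<>();
        // The window start moves from min to (max - window) in equal steps.
        double step = (max - min - window) / (frameCount - 1);
        for (int i = 0; i < frameCount; i++) {
            double from = min + i * step;
            result.add(new double[] {from, from + window});
        }
        return result;
    }
}
```

Fewer frames make the window jump further between redraws; more frames make the animation smoother but longer at a fixed frame rate.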

As far as the UI is concerned, the component has gained a new “Reset” button next to the play button, which activates the timeline drawer and displays the chart. It also serves to reset the drawer selection to the full interval when the timeline is active. The play button now has its original function, that is, controlling the animation of the timeline instead of activating the selection.

Finally, exporting the animation to a video format turned out to be trickier than expected and couldn’t be finished as planned. There were several setbacks to this feature, beginning with the selection of a convenient library to write the movie container: it turns out that the de facto options available are not fully Java-based and need an encoder working in the background. The best alternative I found was Xuggler, which is based on ffmpeg. Also, obtaining screen captures of the graph was a little bit tricky, so I exported SVG images of the graph for each frame, converted them to JPEG and then encoded them through Xuggler into a video format. As one might expect, this solution is not very efficient in terms of time, so Mathieu Bastian and my mentors suggested that I wait for the new Visualization API, whose features would make this process simpler.

In addition to ongoing bugfixes and minor improvements concerning the timeline and the animation, the movie export remains the next big step to close this project. If you have questions or suggestions, please do not hesitate! The new timeline will be available in the next release of Gephi.

DB

GSoC mid-term: GraphGL, network visualization with WebGL


My name is Urban Škudnik and during this Google Summer of Code I am developing GraphGL, an open source network visualization library for the Web.

Introduction

GraphGL is a network visualization library designed for rendering (massive) graphs in web browsers, and it takes dynamic graph exploration on the web another step forward. In short, it calculates the layout of the graph in real time and is therefore suitable both for static files (exported GraphML/GEXF files) and for dynamic files (LinkedIn InMaps would be one such example).

As such, it is both a replacement for Gephi and a complementary tool, similar to Seadragon, providing another method for displaying graphs in a Web browser.


Static demo on the Java dataset.
Static demo on a random graph with 100 nodes and 500 edges.
Static demo on a random graph with 10,000 nodes and 50,000 edges.
Static demo on a random graph with 100,000 nodes and 200,000 edges.
Dynamic demo on Java dependencies dataset.

Commands: mouse left-button to pan, mouse wheel to zoom.

Alternatives

While having Gephi (the renderer, at least) in the browser would be nice, such an alternative is not really realistic. For one, Java in the Web browser is not welcomed by many users, as it alone is a large resource hog. Another concern is its integration with the rest of the web environment, and the issues a developer can face when integrating it into a web application. Its benefit, however, would be almost native-application performance.

Flash can also be considered for our problem as it supports 3D hardware accelerated graphics but being a proprietary technology it is not particularly attractive, especially for a library that wants to be based on open and standard technologies.

An alternative is the aforementioned Seadragon plugin, which builds image tiles of the rendered graph and provides interactivity components similar to those found in Google Maps or any other mapping site. As calculating the graph layout and rendering can be very resource intensive, this method is still recommended for graphs that would require unreasonably large amounts of RAM and CPU. Its weakness is interactivity and dynamics: after the graph is rendered and exported, it cannot be easily changed, especially not in real time.

WebGL and Web Workers

However, WebGL and Web Workers present a solution that can circumvent the issues of interactivity and dynamics and at the same time offer good performance.

3D graphics on the Web was always a bit tricky and was only possible if you had the Java or Flash plugin. WebGL’s origins can be traced to Canvas 3D experiments at Mozilla in 2006, but it was in 2009 that Mozilla and Khronos, a consortium focused (among other things) on creating and maintaining open standards for graphics, started the WebGL Working Group. Its first stable specification was released in March 2011.

Since then, it has been touted both as a solution to the 3D graphics problem on the web and as a huge security vulnerability that provides a completely new vector of attack: access to kernel-mode graphics drivers and hardware.

The WebGL API is based on OpenGL ES 2.0 (with slight changes) and is exposed through the HTML Canvas element. OpenGL ES 2.0, in turn, is a subset of OpenGL primarily targeted at embedded devices; it enables fully programmable 3D graphics, with vertex and fragment shaders exposed to the developer.

Web Workers are a much less controversial technology. Basically, they are an API for starting, running and terminating Javascript scripts in the background (in a separate thread), and thus allow a web application to perform long-running calculations that could otherwise be interrupted either by user actions or by the browser’s timeout limits for Javascript.

WebGL and Web Workers are supported by Firefox (enabled by default since 4), Safari (disabled by default in 5.1), Chrome (enabled by default since 9) and Opera (though for Windows at the moment there is only a development build).

Microsoft has already indicated that they do not plan to support WebGL in its current form due to security issues, but there is a plugin, IEWebGL, that adds support for it.

Basically, all of this boils down to this: if your users are relatively tech savvy and therefore have a relatively modern web browser, and that browser is not Internet Explorer, you can give GraphGL serious consideration. If your target audience includes a large proportion of IE users that will not or cannot install a plugin, this might not be your optimal solution.

GraphGL

GraphGL’s objective is to be an open source network exploration tool for the Web. Built with open technologies, easily extensible (e.g. with other layout algorithms), easy to integrate with existing web applications, it enables easy adoption in your application and rapid development of any missing features also for developers that are not familiar with OpenGL and GLSL (shader language of OpenGL).

To achieve all of these objectives, GraphGL is built with the help of three.js, an awesome library for WebGL that abstracts away low-level graphics calls. This means that Javascript developers should not have too much trouble giving a helping hand to the project.

Currently, data is imported as JSON (JavaScript Object Notation), converted to an internal representation and displayed. Basic interactivity, such as panning, zooming and selecting a node and its connections, is already implemented, and further additions for selection are possible.

Use cases

Another factor to consider is what you are trying to achieve. As mentioned, if you have a multi-million node graph, calculating its layout in real time might be a bit too heavy-weight for your average computer. Its current best use case is a graph that is not too large, so that the layout can be calculated on the client side.

One such example could be graphs that change frequently or depend on per-rendering settings, such as the interconnections between particular Twitter users’ followers. If Twitter provided such a tool, calculating all layouts would be extremely expensive for Twitter, while for most users it wouldn’t present any problem if the layout were calculated on the client side when they visit the tool.

LinkedIn is doing something similar with its InMaps service.

What to expect in terms of performance

Performance varies greatly, as could be expected from such a library. On a modern computer one should not have a problem calculating the layout and rendering thousands, if not tens of thousands, of nodes, while on older hardware (lower) thousands of nodes should still be rendered, but performance may not be super-smooth. In the future, further optimizations should give us an even higher FPS (Frames Per Second).

If, however, you are dealing with a static graph (meaning an exported GraphML file, converted to JSON), we can easily render tens of thousands of nodes and edges, and the actual file size becomes the biggest limitation.

To put things a bit into perspective: on my notebook (Summer 2007 Macbook Pro – 2.2GHz Core2Duo, 4GB RAM, GF8600M) I can render the Java dependency dataset that comes with Gephi (1.5k nodes, 8k edges) at about 40FPS, 10k nodes with 50k edges at around 10 to 15FPS and 100k nodes with 200k edges at around 3-5FPS. However, the file with 100k nodes and 200k edges comes in at almost 22MB. At one time I tested with 2k nodes and 900k edges; the file came in at almost 37MB and sent Chrome belly up (though I haven’t tested that dataset with the latest branch that supports static layouts).

I hope we (my hopes are that more developers join in the effort) still have some space to optimize and render even larger graphs.

Limitations

As said, support for WebGL is not universal, and this can be a show stopper for you. A further limitation for the time being can be the layout calculations and the strain they put on your users’ resources. Along with that, one should also keep in mind the very real issue of file size: large datasets are large not just by number of nodes but also by megabytes.

Technicals

What follows is a more technical discussion of implementation and issues for those that are interested in development of GraphGL.

Theory

The Web is always a bit tricky because of the rather restrictive environment in which you must operate. Not only do you have to share resources with other applications, but you also share resources with other web applications, which at times have memory leaks or just burn through CPU cycles like there is no tomorrow (though GraphGL will fall into the latter category – but with layout processing and heavy rendering that is somewhat expected).

Along with these usual restrictions there is also a browser limit on the duration of execution of Javascript code, performance of Javascript itself (no call-by-value), practical file size limitations, recursion limits, etc.

As said, WebGL and Web Workers were utilized to try to circumvent these limitations. Using three.js to abstract low-level graphics calls has its advantages and potential problems, but in general the advantages outweigh the problems.

Advantages of faster development and wider developer base have already been pointed out, so I’ll just point out the biggest possible problem (and advantages at the same time). With three.js, the abstraction removes low-level control over details of implementation and optimization for our use case.

At the beginning of Summer of Code I also looked at other libraries, but in the end three.js won over the rest, primarily due to a much more active developer community around it. Most other libraries provide tools to help with things like loading shaders and sending attributes and uniforms to the shader, and leave the majority of graphics calls to the programmer. None of them provided any particular advantage over the others either, so in the end the deciding factor was really the number of semi-active developers: my hope is that GraphGL becomes the de facto open source network visualization library for the Web for the foreseeable future, and for that it needs a foundation that will not go unmaintained.

Implementation

The library imports JSON. GEXF and GraphML were considered but are unfeasible: as they are XML, they can only be properly parsed (i.e. not with regular expressions) in the main window, which would lock up the browser for a graph of any meaningful size.

At this time, there are two implementations: one which relies on meshes for rendering nodes and one that relies on three.js’s particle system. The latter is not yet quite as stable and therefore still lives in a separate branch.

For the “stable” release: nodes are rendered as Meshes – each one a plane – with a shader drawing a circle by determining whether a pixel should be colored or not, i.e., whether it satisfies the inequality x^2 + y^2 – r^2 < 0.
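The per-pixel test is easy to try outside a shader. The following sketch applies the same inequality on a small character grid; it is written in Java purely for illustration (GraphGL evaluates the test in a GLSL fragment shader), and the class name is hypothetical.

```java
/** Illustration of the per-pixel circle test x^2 + y^2 - r^2 < 0, evaluated on a character grid. */
class CircleFragmentTest {
    /** True when the point (x, y), relative to the node centre, lies strictly inside radius r. */
    static boolean inside(double x, double y, double r) {
        return x * x + y * y - r * r < 0;
    }

    /** Renders a (2r+1) x (2r+1) grid, '#' where the test passes and '.' elsewhere. */
    static String rasterize(int r) {
        StringBuilder sb = new StringBuilder();
        for (int y = -r; y <= r; y++) {
            for (int x = -r; x <= r; x++) {
                sb.append(inside(x, y, r) ? '#' : '.');
            }
            sb.append('\n');
        }
        return sb.toString();
    }
}
```

Points exactly on the circle (where the expression equals zero) are left uncolored, because the inequality is strict.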

As for the "particlesystem" branch: every node is a particle, rendered as gl.POINTS, with its size determined by gl_PointSize. Coloring and shape are yet to be implemented, but will follow the same rule.

Edges are rendered as a single Line object – three.js translates this to WebGL’s gl.LINES – to efficiently render a large number of lines. Arcs (disabled at the moment) are, like nodes, rendered as planes, with each one colored by a shader: a pixel is colored if it lies in a certain range of values and therefore satisfies an implicit equation.

Currently only one color of edges is supported.

As for layout, it is calculated in a Web Worker that (at the moment) uses a not-quite-finished version of the Force Atlas 1 algorithm. Julian Bilcke (my mentor) and I are in the process of rewriting Force Atlas 2 in Javascript, but for all practical purposes my library should be understandable enough for anyone to write any desired algorithm in Javascript. If not, do not hesitate to contact me for help/explanation/suggestions.
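For readers who have not written a layout algorithm before, here is what one iteration of a generic force-directed layout can look like. This is deliberately not Force Atlas: the force formulas, parameter names and the class `ForceLayoutStep` are simplifying assumptions, and the sketch is in Java rather than the Javascript a Web Worker would run.

```java
/** One iteration of a generic force-directed layout (illustrative only, not Force Atlas). */
class ForceLayoutStep {
    /**
     * @param pos   node positions, pos[i] = {x, y}; updated in place
     * @param edges edges[k] = {sourceIndex, targetIndex}
     */
    static void step(double[][] pos, int[][] edges, double repulsion, double attraction, double dt) {
        int n = pos.length;
        double[][] force = new double[n][2];
        // Every pair of nodes repels with magnitude repulsion / distance.
        for (int i = 0; i < n; i++) {
            for (int j = i + 1; j < n; j++) {
                double dx = pos[i][0] - pos[j][0];
                double dy = pos[i][1] - pos[j][1];
                double d2 = Math.max(dx * dx + dy * dy, 1e-9); // avoid division by zero
                double fx = repulsion * dx / d2;
                double fy = repulsion * dy / d2;
                force[i][0] += fx;
                force[i][1] += fy;
                force[j][0] -= fx;
                force[j][1] -= fy;
            }
        }
        // Connected nodes attract with magnitude attraction * distance.
        for (int[] e : edges) {
            double dx = pos[e[1]][0] - pos[e[0]][0];
            double dy = pos[e[1]][1] - pos[e[0]][1];
            force[e[0]][0] += attraction * dx;
            force[e[0]][1] += attraction * dy;
            force[e[1]][0] -= attraction * dx;
            force[e[1]][1] -= attraction * dy;
        }
        // Move each node a small step along its net force.
        for (int i = 0; i < n; i++) {
            pos[i][0] += dt * force[i][0];
            pos[i][1] += dt * force[i][1];
        }
    }
}
```

Calling `step` repeatedly until the positions stop changing much gives a layout; the pairwise repulsion loop is quadratic in the number of nodes, which is why large graphs are expensive to lay out on the client.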

Future

For what remains of Summer of Code I plan to fix bugs, write documentation, maybe finish Force Atlas 2.

Currently labels are also missing, but they should be implemented in the near future. I just have to decide whether to implement them with HTML or as text in WebGL. The first option gives us easy copy-and-paste and greater flexibility for (custom) styling; the second gives performance. One approach would be to do it with HTML and only show labels when you are close enough, removing those that are not in view, or to only show the labels of a node and its neighbors when you select it.

My long term (and at the moment still uncertain) goal is to also try to move layout calculations to GPU, though this presents serious challenges. I tried to implement this in the middle of GSoC but stumbled upon a couple of technical issues that prevented practical implementation. Since then I came upon several demos that overcame those specific issues, making me hopeful that it shouldn’t be impossible.

While implementing it with WebGL will be hard, it should be a lot easier to achieve with WebCL. Hopefully, WebCL adoption will head the same way as WebGL (meaning, generally about a year or two).

Summary

I hope this text provided a good introduction to GraphGL: what technologies it uses, how it is built, what its objectives are and what kind of problems it is best suited for. If you have a use case already but don’t see a particular feature, do not hesitate to request it – that just might bump it up the priority list.

And remember, the point of GraphGL is customization and easy changes that can be done by everyone.

Feature requests? Comments? Suggestions? Opinions?

Comments or urban.skudnik@gmail.com or github – just fork it! 😉

GSoC mid-term: new Preview API

My name is Yudi Xue and during this Google Summer of Code I am glad to work on the Core evolution of Gephi.

The current API in the Preview module provides too many granular methods and classes, and gives developers little guidance on how to extend the component. In this project, we do not seek to expand what the Preview module already has to offer. Rather, we focus on making the Preview module easy to learn, easy to use and easy to extend for Gephi developers. The new API will allow developers to focus on particular parts of the module. They may specify a new visual algorithm, such as edge bundling or convex hull, just by implementing a new type of Renderer. They may also extend the RenderTarget to display or export visualizations to different platforms.

The user story

We took the infovis reference model into consideration when we started designing the new infrastructure. The infrastructure aims to provide support to a visualization-preview workflow:

raw data -> the data builder -> renderers -> render targets.

In particular, the raw data is the graph associated with the current Gephi workspace. The data builder (DataBuilder) will interpret the information associated with the nodes and edges and generate Item objects for Preview use. The Item objects are immutable objects that are either a node item (NodeItem), an edge item (EdgeItem) or an item group (GroupItem) specified from the graph workspace or data lab. We append “Item” to indicate that they are data rather than display objects. After the data has been imported, the preview controller (PreviewController) will associate each type of item with Renderer objects. Renderer objects are functional procedures that describe how an item should be drawn. While we tell a Renderer object what it is going to draw, we also tell it which RenderTarget it will use. By default, we provide ProcessingRenderTarget, PDFRenderTarget and SVGRenderTarget. All RenderTarget objects implement the RenderTarget API, which provides granular drawing functions that developers can use to build advanced visual algorithms. In addition to the workflow, we will provide a flexible properties structure to the Preview module so it can be used to provide listeners for user interface commands. Properties will allow dynamic dependencies, where grouped properties can listen to a single parent property.

The code below demonstrates how a Renderer to a particular Item type could be updated at runtime.

Code sample:

// Obtain the default PreviewController implementation through Lookup
PreviewController prc = (PreviewController) Lookup.getDefault().lookupItem(
        PreviewController.DEFAULT_IMPL).getInstance();
// Load graph from workspace
prc.loadGraph();
// Replace the Renderer used for NodeItem objects at runtime
prc.updateRenderer(NodeItem.class, new Renderer() {
    // How I want to draw a node, edge, or item types.
    // Specify your procedural visualization algorithm here
    @Override
    public void render(Item item, RenderTarget rt) {
        NodeItem ni = (NodeItem) item;
        rt.drawImage(..);
        rt.drawline(..);
    }
    // The RenderTarget will pick up the properties and draw the rest..
});
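The dispatch idea behind this sample can also be tried without Gephi on the classpath. The sketch below uses hypothetical stand-ins: the nested `Item`, `NodeItem`, `EdgeItem`, `Renderer` and `RecordingTarget` types only borrow names from the text and are not the actual Preview API. It shows renderers registered per item type and replaced at runtime.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Self-contained mock of the item -> renderer -> render target flow (not the real Preview API). */
class PreviewPipelineSketch {
    interface Item {}

    static final class NodeItem implements Item {
        final String label;
        NodeItem(String label) { this.label = label; }
    }

    static final class EdgeItem implements Item {
        final String source;
        final String target;
        EdgeItem(String source, String target) { this.source = source; this.target = target; }
    }

    /** Stand-in render target that records draw calls instead of painting anything. */
    static final class RecordingTarget {
        final List<String> calls = new ArrayList<>();
        void draw(String what) { calls.add(what); }
    }

    interface Renderer {
        void render(Item item, RecordingTarget target);
    }

    private final Map<Class<? extends Item>, Renderer> renderers = new HashMap<>();

    /** Registers or replaces the renderer for one item type; can be called at any time. */
    void updateRenderer(Class<? extends Item> type, Renderer renderer) {
        renderers.put(type, renderer);
    }

    /** Sends every item to the renderer registered for its type; unregistered types are skipped. */
    void renderAll(List<Item> items, RecordingTarget target) {
        for (Item item : items) {
            Renderer renderer = renderers.get(item.getClass());
            if (renderer != null) {
                renderer.render(item, target);
            }
        }
    }
}
```

Because the renderer is looked up on every call to `renderAll`, swapping it between two calls changes how nodes are drawn without touching the items or the target.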

The big picture

Speaking of API flexibility, the Preview API goes from constrained to flexible in the direction from DataBuilder to RenderTarget. Here is the big picture:

Current progress

  • done:
    • a working copy based on the new architecture
    • added ProcessingRenderTarget
    • added GroupRenderer (Convex hull)
    • added ImageRenderer
    • basic unit testing
    • basic functional testing against updating Renderer in Preview API at runtime
  • in progress:
    • Property support
    • Selfloop, curved edge drawing
    • PDF and SVG RenderTarget implementation

Here is a screenshot of the new system with convex-hull enabled:

Code practice

The code base is under active development at https://code.launchpad.net/~yudi-xue/gephi/gephi-preview. The code base includes the PreviewAPI module and the PreviewImpl module.

Lookup API

We make use of the NetBeans Lookup API to instantiate singletons, and use Lookup.Template to ensure the correct implementation is called.

For example, to call the default PreviewController constructor, we call:

/*
* DEFAULT_IMPL is defined in the interface.
* It refers to default implementation class
* "org.gephi.preview.PreviewControllerImpl"
*/
(PreviewController)Lookup.getDefault().lookupItem(PreviewController.DEFAULT_IMPL).getInstance();

Accordingly, you may choose to use the API with your own implementation by creating a Template that points to your implementation class.

Functional Tests

During development, we are creating functional tests against our own API, for the sake of both flexibility and stability; see “PreviewAPIFunctionalTest”.

Conclusions

Our goal is to bring modularity and extensibility to the Preview module. We aim to give you the freedom to define your own visual algorithms (Renderer) and user interactions (Property) and to use the API without thinking about the detailed mechanisms. I would like to thank Dr. Christian Tominski, Mathieu Bastian and Sébastien Heymann for their support and feedback, which have been critical during the development of the new architecture.

GSoC mid-term: Automated build & Maven

My name is Keheliya Gallaba and during this Google Summer of Code I am working on the Automated build system for Gephi. The goal of this project is to add Maven build support to Gephi and set up a continuous integration system to speed up the release process. The NetBeans Platform, which Gephi is built upon, natively uses Apache Ant to compile, build and package the application. But now there is also a variant of NetBeans which uses Apache Maven as the build system. There are several reasons that make moving to a Maven-based system worthwhile.

Maven vs Ant

The existing Ant build system for building NetBeans Platform-based applications, called the Ant Build Harness, is very intuitive and needs almost no initial setup. The set of standard Ant scripts and tasks can easily be triggered from the IDE or from the command line. But there are reasons why Ant might not suit a rapidly growing, multi-module project like Gephi. The Gephi project consists of a team of developers who work on dependent modules and plugins. These modules have to be composed into the application regularly. With a large number of modules, many small packages, and multiple projects with many inter-dependencies and external dependencies, it’s essential to manage different versions and branches along with their dependencies. And reusing modules with the Ant build harness is not that intuitive.

[Image: Gephi modules]

Apache Maven, by contrast, is a standard, well-defined build system that can be customized. It uses a construct known as a Project Object Model (POM) to describe the software project being built, its dependencies on other external modules and components, and the build order. It comes with pre-defined targets for performing certain well-defined tasks such as compiling and packaging code. It makes dependency management easy and efficient through the concept of repositories. Most importantly, in Maven a set of unique coordinates (groupId, artifactId, version, packaging, classifier) identifies an artifact, which can be uploaded to or retrieved from a repository. This makes it easy to build modules that depend on other modules.
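To make the role of coordinates concrete, here is a minimal sketch of how a set of coordinates maps to a file location in the default Maven repository layout. The coordinates used (org.example, graph-api, 0.8) are hypothetical, and the sketch assumes that the packaging name doubles as the file extension, which holds for common types such as jar, pom and nbm.

```python
from typing import NamedTuple, Optional


class Coordinates(NamedTuple):
    """Maven coordinates; the values used below are purely illustrative."""
    group_id: str
    artifact_id: str
    version: str
    packaging: str = "jar"
    classifier: Optional[str] = None


def repository_path(c: Coordinates) -> str:
    """Return the relative path of an artifact in the default repository layout.

    Assumption: the packaging name is also the file extension.
    """
    file_name = f"{c.artifact_id}-{c.version}"
    if c.classifier:
        file_name += f"-{c.classifier}"
    file_name += f".{c.packaging}"
    # Dots in the groupId become directory separators.
    return "/".join([c.group_id.replace(".", "/"), c.artifact_id, c.version, file_name])


example = Coordinates("org.example", "graph-api", "0.8", "nbm")
print(repository_path(example))
```

Because the path is derived entirely from the coordinates, any module that declares a dependency with the same coordinates resolves to the same file, whichever repository serves it.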

Work completed so far

This project involves digging deeper into Gephi’s architecture and understanding its dependencies, building and packaging. Gephi includes 100+ submodules categorized into Core, UI, Libraries and Plugins sections. NBM, which stands for “NetBeans module”, is the deployment format of modules in NetBeans. It is a ZIP archive, with the extension .nbm, containing the JARs in the module and their configuration files. NBM files can be installed manually using the Update Center, by choosing the option for installing manually downloaded modules, or they can be downloaded and installed directly from netbeans.org or another update server.
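Since an NBM is an ordinary ZIP archive, its contents can be inspected with any ZIP library. The sketch below builds a tiny stand-in archive in memory and lists its entries; the entry names are illustrative assumptions, not a copy of a real Gephi module.

```python
import io
import zipfile
from typing import List


def list_entries(archive_bytes: bytes) -> List[str]:
    """Return the sorted entry names of a ZIP (and therefore NBM) archive."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as archive:
        return sorted(archive.namelist())


def build_demo_archive() -> bytes:
    """Build a stand-in archive in memory; entry names are hypothetical."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("Info/info.xml", "<module/>")
        archive.writestr("netbeans/modules/example-module.jar", b"")
    return buffer.getvalue()


for name in list_entries(build_demo_archive()):
    print(name)
```

To look inside a real module, the same `list_entries` function could be given the bytes read from a .nbm file on disk.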

I’m happy to say that I have successfully mavenized 75 modules and am continuing with the rest. I primarily used the NetBeans Module Maven Plugin for this, which now comes built into the NetBeans 6.9 and 7.0 IDEs. Currently the plugin handles tasks such as registering a new packaging type “nbm” (so that any project with this packaging is automatically turned into a NetBeans module project), creating NBM artifacts and managing branding. It is also capable of populating the local Maven repository with module JARs and NBM files from a given NetBeans installation.

[Screenshot: NetBeans IDE 7.0]

Some third-party libraries used in Gephi are not maintained in any public Maven repository, so I set up a local Sonatype Nexus repository to store and serve these dependencies. The basic functions of a repository manager like Sonatype Nexus are:

  • managing project dependencies,
  • artifacts & metadata,
  • proxying external repositories
  • and deployment of packaged binaries and JARs to share those artifacts with other developers and end-users.

We are in the process of setting up a Sonatype Nexus repository on the official Gephi server as well, so that not only these third-party JARs but also Gephi releases such as the Gephi Toolkit can be served as Maven dependencies to Maven-based projects all over the world.

[Screenshot: Sonatype Nexus Maven Repository Manager]

Challenges faced during the process

  • Researching existing large-scale applications that use NetBeans RCP and Maven
  • Finding documentation on how to handle NetBeans-specific Ant tasks in Maven
  • Managing transitive dependencies and versioning (especially given the slight differences between Maven and NetBeans definitions)
  • Compilation and test failures.

Continuous Integration

[Screenshot: Continuum project page]

While the Maven migration is going on, I also looked into the other aspect of the project: setting up a continuous integration server. The main tasks such a system automates are:

  • checking out source from source control,
  • running clean build,
  • deploying the artifacts in a repository
  • and running unit tests.

Furthermore, it can notify developers via email, IM or IRC on success, failure, error or warning in a build, or simply on a source code management failure. This means that when a project gets updated during development, the continuous integration system will try to build it and will notify the developers if it runs into any issues. This is very useful when working on a multi-module project with many developers, like Gephi, since developers working concurrently on the code may accidentally break the build, and they may have configurations unique to their development environments that aren’t shared by other developers. I looked at Apache Continuum, Hudson and Jenkins (a fork of Hudson), judging them on three criteria: being open source, supporting Ant and Maven, and integrating well with Java-based projects.
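The cycle described above (run the steps in order, stop at the first problem, notify the developers) can be sketched in a few lines. This is only an illustration of the idea, not how Continuum, Hudson or Jenkins are implemented; the step names mirror the list above and the step bodies are placeholders.

```python
from typing import Callable, List, Tuple

# A step is a name plus an action that returns True on success.
Step = Tuple[str, Callable[[], bool]]


def run_pipeline(steps: List[Step], notify: Callable[[str], None]) -> bool:
    """Run steps in order, stop at the first failure or error, and notify."""
    for name, action in steps:
        try:
            succeeded = action()
        except Exception as error:  # an unexpected problem, e.g. SCM unreachable
            notify(f"ERROR in '{name}': {error}")
            return False
        if not succeeded:
            notify(f"FAILURE in '{name}'")
            return False
    notify("SUCCESS")
    return True


# Placeholder steps: here the unit tests are made to fail on purpose.
messages: List[str] = []
steps: List[Step] = [
    ("checkout", lambda: True),
    ("clean build", lambda: True),
    ("deploy artifacts", lambda: True),
    ("unit tests", lambda: False),
]
result = run_pipeline(steps, messages.append)
print(result, messages)
```

In a real server the `notify` callback would send an email or an IRC message instead of appending to a list.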

Hudson is an extensible continuous integration server built by Kohsuke Kawaguchi at Sun Microsystems. Since the design of Hudson includes well thought-out extension points, developers have written plugins to support all of the major version control systems, many different notifiers, and many other options for customizing the build process, for example the Amazon EC2 plugin, which uses the Amazon “cloud” as the build cluster.

Continuum is described as a fast, lightweight, and undemanding continuous integration system built by the Apache Maven team. Like Maven, it is built on the Plexus component framework, and it comes bundled with its own Jetty application server. It uses Apache Derby, a 100% Java, fully embedded database, for its persistence needs. All of this makes Continuum self-reliant and particularly easy to install in almost any environment.

After considering all of these reasons I settled on Apache Continuum because of its ease of setup and configuration and its out-of-the-box support for Bazaar. Bazaar is the distributed version control system used on Launchpad for managing source code when many developers work together on software projects like Gephi. I have set up a local instance of Apache Continuum to check out and build the Ant-based Gephi hourly. In the future we can host this on the Gephi server to notify the developers and administrators.

Future Work

Now that the initial foundation has been laid, completing the rest of the planned work should be straightforward. This includes mavenizing the remaining modules, creating the .zip distribution, getting the final application to run properly and setting up the infrastructure on the Gephi server.

I would like to thank my mentors Julian Bilcke, Mathieu Bastian and Sébastien Heymann for providing all the guidelines and support for making this project a success. You can find my ongoing work at this repository: https://code.launchpad.net/~keheliya-gallaba/Gephi/maven-build

GSoC mid-term: Scripting Plugin

My name is Luiz Ribeiro and during this summer I am working on creating a Scripting Plugin for Gephi, mentored by Eytan Adar from the GUESS project and co-mentored by Mathieu Bastian. This article will give you an overview of the current status of the project and also what you can expect from future work.

Background

The Scripting Plugin originated as a joint proposal with the GUESS project, which aimed at porting the Gython language as a console plugin for Gephi during the Google Summer of Code. For those who are not familiar with it, GUESS is software that was originally created to support the interactive manipulation of graph structures. This is achieved through a mix of a visualization framework and a domain-specific embedded language called Gython.

As you have probably already guessed, Gython is an extension of the Python programming language or, more specifically, Jython, which is a Java implementation of Python. Thus, Gython is backwards-compatible with Python itself and can be used with many different Python 2.5 libraries without much pain.

By adding new operators for handling graph structures to Python’s grammar and exposing nodes and edges as first-class objects to the scripting language, Gython turns out to be a very powerful and concise language for working with graphs.

Since GUESS’s implementation of Gython is based on Jython 2.1.0, we opted for a complete rewrite of its source code. Gephi’s implementation of Gython runs on Jython’s latest stable release, version 2.5.2. In addition, our implementation does not support all the original features that were present in GUESS and instead focuses on better integration with the Gephi Toolkit. At the moment, our plugin supports most operations of the Graph API, Attributes API and Filters API. This means you can create and remove nodes and edges, manipulate node and edge attributes, and also build filters and run queries on the graph.

In the next section, I will take you on a short tour of some of the current features of Gephi’s Scripting Plugin.

Current Status

The scripting console can be accessed through the Window menu from Gephi’s UI. After opening up, the console looks like this:

Thanks to jythonconsole, the console supports code completion. On the screenshot above you can see that the console suggests many different attributes for a given node in the graph.

Add/remove nodes and edges

As in GUESS, there is a reserved variable name “g”, which corresponds to the main graph of the current workspace. This object has many methods for manipulating the graph, such as addNode, addUndirectedEdge and addDirectedEdge, among others. So, for example, to add two new nodes to the graph and an undirected edge connecting them:

>>> g.addNode()
v1
>>> g.addNode()
v2
>>> g.addUndirectedEdge(v1, v2)
e1

As you can see, each node of the graph can be accessed by prefixing the node id with “v” and each edge can be accessed by prefixing the edge id with “e”.

Operators

One of the most interesting features of Gython is that it has four new operators for selecting edges, ->, <->, <- and ?. These operators work as follows:

  • v1 <-> v2: selects the undirected edge connecting nodes v1 and v2;
  • v1 -> v2 and v2 <- v1: both select the directed edge from node v1 to node v2;
  • v1 ? v2: selects any edges connecting nodes v1 and v2.
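Since these operators are not valid in plain Python, their semantics can be modelled with ordinary functions over a small edge list. This is a sketch for illustration only and says nothing about how Gython implements them; the Edge type and function names are hypothetical, and it assumes that “any edges” means directed edges in either direction as well as undirected ones.

```python
from typing import List, NamedTuple


class Edge(NamedTuple):
    source: str
    target: str
    directed: bool


def select_directed(edges: List[Edge], a: str, b: str) -> List[Edge]:
    """Model of `a -> b` (equivalently `b <- a`): directed edges from a to b."""
    return [e for e in edges if e.directed and e.source == a and e.target == b]


def select_undirected(edges: List[Edge], a: str, b: str) -> List[Edge]:
    """Model of `a <-> b`: undirected edges between a and b, in either order."""
    return [e for e in edges if not e.directed and {e.source, e.target} == {a, b}]


def select_any(edges: List[Edge], a: str, b: str) -> List[Edge]:
    """Model of `a ? b`: every edge between a and b, whatever its direction."""
    return [e for e in edges if {e.source, e.target} == {a, b}]


edges = [
    Edge("v1", "v2", directed=True),
    Edge("v2", "v1", directed=True),
    Edge("v1", "v2", directed=False),
    Edge("v1", "v3", directed=True),
]
print(select_directed(edges, "v1", "v2"))
print(select_undirected(edges, "v2", "v1"))
print(len(select_any(edges, "v1", "v2")))
```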

Handling attributes

In keeping with Python conventions, all node and edge attributes from the Data Laboratory can be accessed directly as object attributes from the console. Topological attributes like a node’s degree can also be accessed directly, by evaluating v5.degree, for instance.

If you have a node attribute column called “gender” you can access it directly, for example:

>>> v42.gender = "female"
>>> v42.gender
"female"

Filtering

In my opinion, filtering is where the console really stands out. Building complex filters is as simple as typing a single command. Let’s start with an elementary example. Say that you have a graph of your Facebook social network and you want to view only the subgraph of women in your network. You would call something like this:

>>> visible = g.filter(gender == "female")

This means that you are filtering the main graph “g” for nodes whose gender attribute is equal to “female”, and then setting the resulting subgraph as the visible graph. You can also filter the visible graph further. For example, to show only the nodes of the visible graph that have a degree greater than 5:

>>> visible = visible.filter(degree > 5)

The intersection and union operators available in Gephi’s Filters window can be used through the language’s & and | operators. For example, the following command builds a filter that selects nodes that are in the human resources department and are more than 45 years old:

>>> someFilter = (dept == "HR") & (age > 45)

Note that this time the resulting filter has been assigned to a new variable (even though you could have applied the filter directly). If you want to filter the main graph with the newly created filter and set the resulting subgraph to the visible view, just run the following command:

>>> visible = g.filter(someFilter)
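For readers curious how an expression such as (dept == "HR") & (age > 45) can yield a reusable filter object, here is a minimal sketch in plain Python using operator overloading. It is not the plugin’s actual implementation: the Attribute and Filter classes are hypothetical, and nodes are modelled as simple dictionaries.

```python
from typing import Callable, Dict, List

Node = Dict[str, object]


class Filter:
    """Wraps a predicate on nodes; `&` and `|` combine two filters."""

    def __init__(self, predicate: Callable[[Node], bool]):
        self.predicate = predicate

    def __and__(self, other: "Filter") -> "Filter":
        return Filter(lambda n: self.predicate(n) and other.predicate(n))

    def __or__(self, other: "Filter") -> "Filter":
        return Filter(lambda n: self.predicate(n) or other.predicate(n))

    def apply(self, nodes: List[Node]) -> List[Node]:
        return [n for n in nodes if self.predicate(n)]


class Attribute:
    """Comparing an attribute with a value builds a Filter instead of a bool."""

    def __init__(self, name: str):
        self.name = name

    def __eq__(self, value):  # type: ignore[override]
        return Filter(lambda n: n.get(self.name) == value)

    def __gt__(self, value):
        return Filter(lambda n: n.get(self.name) is not None and n[self.name] > value)


dept, age = Attribute("dept"), Attribute("age")
some_filter = (dept == "HR") & (age > 45)

nodes = [
    {"id": "v1", "dept": "HR", "age": 50},
    {"id": "v2", "dept": "HR", "age": 30},
    {"id": "v3", "dept": "IT", "age": 60},
]
print([n["id"] for n in some_filter.apply(nodes)])
```

The parentheses around each comparison matter: in Python, & and | bind more tightly than == and >, so leaving them out would change the meaning of the expression.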

You can also add a filter created from the console to the Filters Window by executing the following command:

>>> addFilter(someFilter)

Finally, if you want to set the visible view to the main graph again, just type:

>>> visible = g

Conclusion and Future Work

Obviously, this is just a quick overview of the scripting console’s functionality. Over the next few weeks I expect to release an alpha version of the plugin to the community, along with a user guide containing more detailed examples. Documentation of the plugin’s inner workings should also be coming soon (i.e. a more up-to-date specification than the one available on the wiki), in case you’re interested in contributing to the development.

If you are interested in trying the plugin right now and testing the code, you can download it from my Bazaar branch on Launchpad:

bzr branch lp:~luizribeiro/gephi/scripting

There are many features that we still want to implement, including:

  • Allow users to import and export graphs from the console;
  • Support for running layouts with console commands;
  • Integration with the Partition and Ranking APIs.

If you have any ideas or suggestions, feel free to leave a comment. Feedback is always more than welcome!

GSoC mid-term: new Visualization API

My name is Vojtech Bardiovsky and I am working on the new Visualization API. This is done together with the new visualization engine based on shaders.

API design

The aim of the project was to design a clean and usable API for the new engine. It exposes only as much as necessary, but enough to make customization of visualization possible. The following four API classes are all services and can be retrieved through ‘Lookup’.

Visualization controller

This is the most important class in the API. It can be used to retrieve the ‘Camera’, the ‘Canvas’ used for displaying the visualization and, very importantly, the active instances of the ‘VizModel’ and ‘VizConfig’ classes, both of which contain many settings that help control the visualization. It will also allow making direct changes to the visualization, such as setting the frame rate or centering the camera according to different parameters. The ‘Camera’ class can be used to get data about its position or to perform actions such as translation or zooming.

Event and Selection managers

The Event manager can be used to register listeners to user events exactly as in the old engine. This is very important for the tools. The selection manager provides methods to retrieve all currently selected nodes or edges, to select nodes or edges and to control the selection state of the UI (dragging, rectangle selection, etc).

Motion manager

Apart from listening to all user induced events and their most basic handling (selection, translation, zoom), this class provides information about current mouse position in both screen and world coordinates.
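To illustrate the difference between the two coordinate systems, here is a minimal sketch of the conversion for a simple 2D camera with a position and a zoom factor. The Camera type, its fields and the convention that screen y grows downwards while world y grows upwards are all assumptions made for this example; the engine’s actual camera model may differ.

```python
from typing import NamedTuple, Tuple

Point = Tuple[float, float]


class Camera(NamedTuple):
    """Hypothetical 2D camera: world position at the viewport centre, plus zoom."""
    x: float
    y: float
    zoom: float  # screen pixels per world unit


def screen_to_world(camera: Camera, screen: Point, viewport: Point) -> Point:
    """Convert a mouse position in pixels to world coordinates."""
    sx, sy = screen
    width, height = viewport
    world_x = camera.x + (sx - width / 2) / camera.zoom
    world_y = camera.y - (sy - height / 2) / camera.zoom  # flip the y axis
    return (world_x, world_y)


def world_to_screen(camera: Camera, world: Point, viewport: Point) -> Point:
    """Inverse of screen_to_world."""
    wx, wy = world
    width, height = viewport
    screen_x = (wx - camera.x) * camera.zoom + width / 2
    screen_y = height / 2 - (wy - camera.y) * camera.zoom
    return (screen_x, screen_y)


camera = Camera(x=10.0, y=20.0, zoom=2.0)
viewport = (800.0, 600.0)
print(screen_to_world(camera, (500.0, 200.0), viewport))
```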

New features

There are many changes the new engine will bring and although it is not finished yet, there already are some new user-side features.

Complex selection

In the old visualization engine, only rectangular and direct (single-node) selection were possible. The new API will allow any reasonable selection shape to be implemented. At the moment it supports rectangles, ellipses and polygons.

Thanks to the variety of selection shapes and changes in the mouse event system, it is possible to make incremental and decremental selections using the Shift and Ctrl keys. The whole selection can now be dragged and moved, as opposed to only one node at a time.
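Supporting a new selection shape essentially comes down to answering one question: is a given node position inside the shape? The sketch below shows standard point-in-shape tests for the three shapes mentioned above. It is an illustration under simple assumptions (2D points, axis-aligned rectangles and ellipses, ray casting for polygons) and the function names are hypothetical, not part of the API.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]


def in_rectangle(p: Point, corner_a: Point, corner_b: Point) -> bool:
    """Axis-aligned rectangle given by two opposite corners (any order)."""
    (x, y), (ax, ay), (bx, by) = p, corner_a, corner_b
    return min(ax, bx) <= x <= max(ax, bx) and min(ay, by) <= y <= max(ay, by)


def in_ellipse(p: Point, center: Point, rx: float, ry: float) -> bool:
    """Axis-aligned ellipse with radii rx and ry."""
    dx = (p[0] - center[0]) / rx
    dy = (p[1] - center[1]) / ry
    return dx * dx + dy * dy <= 1.0


def in_polygon(p: Point, vertices: List[Point]) -> bool:
    """Ray casting: count crossings of a horizontal ray going right from p."""
    x, y = p
    inside = False
    count = len(vertices)
    for i in range(count):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % count]
        if (y1 > y) != (y2 > y):  # the edge straddles the ray's height
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


# Hypothetical node positions and a triangular selection.
positions: Dict[str, Point] = {"v1": (1.0, 1.0), "v2": (3.0, 3.0), "v3": (0.5, 2.0)}
triangle = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0)]
selected = sorted(n for n, pos in positions.items() if in_polygon(pos, triangle))
print(selected)
```

An incremental selection with Shift would then simply be the union of the previous selection and the newly selected nodes, and a decremental one their difference.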

Background image

It is now possible to change and configure the background image. Settings are similar to the CSS properties such as ‘background position’ or ‘background repeat’.

Node shapes

It is possible to have a different shape for every node in the graph. Basic shapes include ‘circle’, ‘triangle’, ‘square’, etc., and up to 8 custom images can also be imported by the user. Node shapes can be defined in the import file or set directly through the context menu.

Better 3D

Work has been done on a better way to control the scene in 3D. Graphs are not naturally suited for 3D; for example, adding new nodes or moving them will never be perfectly intuitive. But for displaying the graph, some enhancements can be made.

Current status

The engine is still under development, but the API is slowly approaching its final state. The next step for the API will be to include as many configuration possibilities as the engine will allow. The underlying data structures will be optimized for performance.
As the project consists of two parts, API and engine, Antonio Patriarca, the mentor for this GSoC project and implementor of the engine, will write an article about rendering details in the near future.

(The rendering pipeline for edges is not fully finished, so the images shown are not the actual new look of Gephi.)