Gephi boosts its performance with new “GraphStore” core

Gephi is a graph visualization and analysis platform – the entire tool revolves around the graph the user is manipulating. All modules (e.g. filter, ranking, layout etc.) touch the graph in some way or another and everything happens in real-time, reflected in the visualization. It’s therefore extremely important to rely on a robust and fast underlying graph structure. As explained in this article, we decided in 2013 to rewrite the graph structure and started the GraphStore project. Today, this project is mostly complete and it’s time to look at some of the benefits GraphStore is bringing to Gephi (whose 0.9 release is approaching).

Performance is critical when analyzing graphs. A lot can be done to optimize how graphs are represented and accessed in the code but it remains a hard problem. The first versions of Gephi didn’t always shine in that area as the graphs were using a lot of memory and some operations such as filtering were slow on large networks. A lot was learnt though, and when the time came to start from scratch we knew what would move the needle. Compared to the previous implementation, GraphStore uses simpler data structures (e.g. more arrays, fewer maps) and cache-friendly collections to make common graph operations faster. Along the way, we relied on many micro-benchmarks to understand what was expensive and what was not. As is often the case with Java, this can lead to surprises but it’s a necessary process to build a world-class graph library.

Benchmark

We wanted to compare Gephi 0.8.2 and Gephi 0.9 (development version) so we’ve created a benchmark to test the most common graph operations. Here is what we found. The table below shows the relative improvement between the two versions. For instance, “2X” means that the operation completes twice as fast. A benchmarking utility was used to guarantee the precision of the measurements and each scenario was performed at least 20 times, and up to 600 times in some cases. We used two different classic graphs, one small (1.5K nodes, 19K edges) and one medium (83K nodes, 68K edges). Larger graphs may be evaluated in a future blog article.

Benchmark | SMALL (n=1490, e=19025) | MEDIUM (n=82670, e=67851)
Node Iteration | 23.0x | 34.6x
Edge Iteration | 40.1x | 109.4x
Node Lookup | 1.6x | 2.1x
Edge Lookup | 1.2x | 2.3x
Get Edge | 1.1x | 1.2x
Get Degree | 2.5x | 2.3x
Get Neighbors | 3.4x | 1.2x
Set Attributes | 2.3x | 0.1x
Get Attributes | 3.3x | 4.0x
Add Nodes | 6.2x | 5.7x
Add & Remove Nodes | 1.4x | 2.9x
Add Edges | 7.7x | 3.8x
Add & Remove Edges | 3.3x | 1.8x
Create View | 2851.0x | 4762.3x
Iterate Nodes In View | 2.7x | 1.5x
Iterate Edges In View | 11.6x | 7.3x
Save Project | 2.4x | 1.7x
Load Project | 0.6x | 0.6x
Project File Size | 1.9x | 1.5x

These benchmarks show pretty remarkable improvements in common operations, especially read operations such as node or edge iteration. For instance, reading all the edges in the graph takes on average 40 to 100 times less CPU time. Although this benchmark focuses on low-level graph operations, it will bring material improvements to user-level features such as filter or layout. The way GraphStore creates views is different from what we were doing before and no longer requires a deep graph copy, which explains the large difference. Finally, only the set attributes operation is significantly slower, but that can be explained by the introduction of inverted indices, which are updated when attributes are set.

And what about memory usage? Saving memory has been one of our obsessions and there’s good news to report on that front as well. Below is a quick comparison between Gephi 0.8.2 and Gephi 0.9 for the same medium graph as above.

Benchmark | Gephi 0.8.2 | Gephi 0.9 | Improvement
Simple graph | 115MB | 52MB | 2.2X
Graph with 5 attribute columns | 186MB | 55MB | 3.4X

This benchmark shows a clear reduction of memory usage in Gephi’s next version. How much? It’s hard to say as it really depends on the graph, but the denser the graph (i.e. the more edges) and the more attributes it has, the more memory is saved, as significant improvements have been made in these areas. Dynamic graphs (i.e. graphs whose topology or attributes change over time) will also see a big boost as we’ve redesigned this part from scratch.

What’s next?

All of the GraphStore project benefits are included in the upcoming 0.9 release and that’s the most important thing. However, the work doesn’t end there and there are many more features and performance optimizations that can be added.

Then, we count on the community’s help to start collaborating with us on the GraphStore library – calling all database and performance experts. GraphStore will continue to live as an all-purpose Java graph library, released under the Apache 2.0 license and independent from Gephi (i.e. Gephi uses GraphStore but not the opposite). We hope to see it used in other projects in the near future.

Figure: GraphStore API, represented as a graph

Rebuilding Gephi’s core for the 0.9 version

This is the first article about the future Gephi 0.9 version. Our objective is to prepare the ground for a future 1.0 release and focus on solving some of the most difficult problems. It all starts with the core of Gephi and today we’re giving a preview of the upcoming changes in that area. In fact, we’re rewriting the core modules from scratch to improve performance and stability and to add new features. The core modules represent and store the graph and attributes in memory so they’re available to the rest of the application. Rewriting Gephi’s core is like replacing the engine of a truck and involves adapting a lot of interconnected pieces. Gephi’s current graph structure engine was designed in 2009 and hasn’t changed much over multiple releases. Although it’s working, it doesn’t have the level of quality we want for Gephi 1.0 and needs to be overhauled. The aim is to complete the new implementation and integrate it in the 0.9 version.

In November 2012, we started to develop a completely new in-memory graph structure implementation for Gephi based on what we’ve learnt over the years and our desire to design a solution that will last. The project code-name is GraphStore and we focus on four main things:

  • Performance: The graph structure is so important to the rest of the application that it has to be fast and memory efficient.
  • Stability: The new code will be the most heavily unit-tested in the history of Gephi.
  • Simplicity: The Graph API should be documented and easy to use for developers.
  • Openness: If possible, we want GraphStore to be used in other projects and keep the code free of Gephi-specific concepts.

Gephi is known to use a large amount of memory, especially for very large networks. We want to challenge ourselves and tackle this issue by redesigning the way graphs are encoded and stored. Besides memory usage, we carefully analyzed possible solutions to improve read/write performance and optimize the throughput. Stability and simplicity are like food and shelter, and whatever we try to do at Gephi should be simple to use and stable. As we’re going towards a 1.0 version, we’re putting more and more effort into testing and code quality.

Since November 2012, we have been working on GraphStore separately from Gephi’s codebase and will start the integration fairly soon. The Graph API is very similar to the existing API. However, it isn’t entirely compatible and several core things changed like attributes, views or dynamic networks and will require a lot of work in some modules. On the other hand, because the GraphStore code is decoupled, it could be leveraged in other projects. For instance, it could serve as a Blueprints implementation as an alternative to TinkerGraph.

Graph structure

A graph (also called a network) is a pair of a set of nodes and a set of edges. Edges can be undirected, or directed if the direction of the relation matters. Edges may also have weights to represent a value attached to them, like the strength of a connection or the flow capacity. Edges may also point to the same node (i.e. self-loops). Gephi currently supports these features, but they are not sufficient to describe the variety of problems graphs can be helpful with. Multigraphs permit several relationships between nodes and are, for instance, commonly used to represent RDF graphs. Multigraphs with properties (i.e. the ability to attach any property to nodes and edges) have recently become the standard representation for graph databases.

The next version of Gephi will support multigraphs and therefore allow multiple edges between nodes to be imported. The rollout will be done in two phases. The first phase is to allow this new type of graph to be imported, filtered and exported. We will update the importers and add new options to support these graphs. The second phase is to update the visualization and how multiple edges between nodes look.

Hierarchical graphs

Since the 0.7 version released in 2009, Gephi has supported hierarchical graphs. Hierarchical graphs let the user group or ungroup nodes so that they form a tree. Nodes which contain other nodes are named meta-nodes and edges are collapsed into meta-edges. Groups obtained from clustering algorithms (e.g. modularity) could also easily be collapsed into meta-nodes in order to study the network at a higher level. We initially recognized the potential of this idea for network analysis and developed a hierarchy-enabled data structure. However, we realized we didn’t completely fulfill the vision by not providing all the tools to fully explore and manage hierarchical networks. Although the data structure allows it, the software still lacks many features to really make hierarchical networks explorable.

Recently, we have been more focused on networks over time and plan to continue to do so. In the past years, users have shown steady and continuous interest in dynamic networks and we haven’t really seen a strong interest in hierarchical networks. Therefore, we propose to remove this feature from the next releases. On the developer side, cutting this feature will greatly simplify the code and improve performance.

Dynamic networks

Networks that change over time are some of the most interesting to visualize and analyze. We have heavily invested in supporting this type of network, for instance by developing the Timeline component. However, dynamic graph support was added after the current graph structure implementation was conceived and therefore remains suboptimal and difficult to scale. Now that we have enough hindsight, we can rethink how this should be done and make it simpler.

One pain point is the way we decided to represent time. Essentially, there are two ways to represent time for a particular node in a graph: timestamps or intervals. Timestamps are a list of points in time at which the particular node exists, whereas intervals have a beginning and an end. For multiple reasons, we thought intervals would be easier to manipulate and more efficient than a (possibly very large) set of timestamps. By talking to our users, we found that intervals are rarely used in real-world data. On the code side, we also found that they make things much more complex and not that efficient in the end.

In future versions, we’ll remove support for intervals and add timestamps instead. We considered supporting both intervals and timestamps but decided that it would add too much complexity and confusion.

Graph structure internals

Graph structure design is an interesting problem to solve. The objective is quite simple, yet challenging: how to best represent an interconnected graph so it’s fast to query and compact in space? Also, how to keep it simple and serve a large number of features at the same time?

Graph storage

Our goal is to develop a thread-safe, in-memory graph structure implementation in Java suitable for real-time analysis. You may ask how this differs from a graph database or a distributed graph analysis package. In a few words, one can say the requirements are quite different.

Graph databases like Neo4j, OrientDB or Titan store the graph on local disk or in a cluster and are optimized for large graphs and large numbers of concurrent users. Typically, the networks are much larger than what can fit in memory and these databases mostly focus on answering traversal queries. In the environment where graph databases operate, most of the needs can be converted into some sort of traversal query (e.g. friends of X, tweets of Y). Traversal queries are also the reason why graph databases scale to billions of nodes. Indeed, for each traversal, only a subset of the graph is accessed. This is quite different from Gephi, which, being analysis software, needs to access the complete graph. For instance, when a layout is running Gephi needs to read the X,Y position of each node as quickly as possible. Although reading from the disk can be very quick as well (e.g. GraphChi), it’s limited to sequential access and things become more complex that way.

Because of the real-time requirements, we want to keep our graph data in memory and accessible at all times. However, we want to make it easy to connect to external data sources, and graph databases in particular.

Reducing overhead

In computer science, overhead is any combination of excess or indirect computation time, memory, bandwidth, or other resources that are required to attain a particular goal.

GraphStore heavily relies on Java primitives, arrays and efficient collection libraries like fastutil. We are reducing overhead by simply avoiding using too many Java objects, which are very costly. Instead of using maps, trees or lists, nodes and edges are stored in large arrays which can be dynamically resized in blocks. For instance, iterating over the graph should be extremely fast because the CPU caches array blocks. This may sound obvious but performance optimizations are tricky in Java because of the JVM and the uncertainty of what makes a difference and what doesn’t. In his “Effective Java” book, Joshua Bloch writes “Don’t guess, measure” and that’s still true today. For our project, we rely on well-defined micro-benchmarks to see where the bottlenecks are and how to make our data structure more cache-friendly and more compact in memory. When the graph contains millions of edges, every byte saved per edge can make a large difference in the end.
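To make the idea of block-resized arrays concrete, here is a minimal sketch. It is not GraphStore’s actual code: the class name `BlockIntStore` and the block size are assumptions chosen for illustration. Values live in fixed-size primitive blocks, so growing the store allocates one new block instead of copying everything, and iteration walks each block sequentially.

```java
import java.util.Arrays;

/** Hypothetical append-only store of ints kept in fixed-size blocks (illustration only). */
final class BlockIntStore {
    private static final int BLOCK_SIZE = 1024; // illustrative value, not a GraphStore constant

    private int[][] blocks = new int[0][];
    private int size;

    /** Appends a value and returns its index; only the small outer array is ever copied. */
    int add(int value) {
        int block = size / BLOCK_SIZE;
        if (block == blocks.length) {
            blocks = Arrays.copyOf(blocks, blocks.length + 1);
            blocks[block] = new int[BLOCK_SIZE];
        }
        blocks[block][size % BLOCK_SIZE] = value;
        return size++;
    }

    int get(int index) {
        if (index < 0 || index >= size) {
            throw new IndexOutOfBoundsException("index " + index + ", size " + size);
        }
        return blocks[index / BLOCK_SIZE][index % BLOCK_SIZE];
    }

    int size() {
        return size;
    }

    /** Iterates block by block, reading each primitive array from start to end. */
    long sum() {
        long total = 0;
        int remaining = size;
        for (int[] block : blocks) {
            int limit = Math.min(remaining, BLOCK_SIZE);
            for (int i = 0; i < limit; i++) {
                total += block[i];
            }
            remaining -= limit;
        }
        return total;
    }
}
```

Whether a layout like this actually beats a plain object list on a given JVM is exactly the kind of question the micro-benchmarks are meant to answer.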

In terms of speed, we focused on optimizing the most common operations, which are iterating over all the elements and consulting nodes’ neighbors. Typically, a layout algorithm needs to read the neighbors of every node at each iteration. Neighbors can’t simply be an unsorted list because of the removal complexity: to remove a node, you need to know where it is. The current Gephi graph structure uses a binary tree to store the node’s neighbors. Although the complexity is logarithmic, every node in the tree takes extra memory and logarithmic complexity is still suboptimal. After isolating the problem in a benchmark, we found that using a doubly linked list is the best solution for our requirements and achieves an O(1) complexity, as it provides both quick iteration and quick updates. Here is a snapshot of the solution:

Every edge has four integer pointers to its in/out predecessors and successors, and a separate dictionary helps to find the right edge based on the source and destination pointers. Each node has a pointer to the first edge in the linked list (i.e. the head). Node ids are integers (32 bits) so one can easily create a long->Edge dictionary by encoding the source and destination node into a single long number (64 bits). The diagram intentionally leaves out the multigraph support for simplicity. In reality, nodes can have multiple head pointers, one for each edge type. Each edge type is represented by an integer index.
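The description above can be sketched in a few lines of Java. This is an illustration under simplifying assumptions, not the GraphStore implementation: the class `LinkedEdgeStore` and its methods are hypothetical, only the out-edge list is threaded (the in-edge list would be symmetric), capacities are fixed, removed slots are not reused, and a boxed `HashMap` stands in for a primitive long-keyed map such as the ones fastutil provides.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical array-backed edge store with a doubly linked out-edge list per node. */
final class LinkedEdgeStore {
    private static final int NONE = -1;

    private final int[] nextOut; // next edge in the source node's out-edge list
    private final int[] prevOut; // previous edge in the source node's out-edge list
    private final int[] headOut; // first out-edge of each node, or NONE
    private final Map<Long, Integer> edgeByKey = new HashMap<>(); // (source, target) -> edge id
    private int edgeCount;

    LinkedEdgeStore(int nodeCapacity, int edgeCapacity) {
        nextOut = new int[edgeCapacity];
        prevOut = new int[edgeCapacity];
        headOut = new int[nodeCapacity];
        Arrays.fill(headOut, NONE);
    }

    /** Packs two 32-bit node ids into one 64-bit key: source in the high bits, target in the low bits. */
    static long key(int source, int target) {
        return ((long) source << 32) | (target & 0xFFFFFFFFL);
    }

    /** Inserts the edge at the head of the source's list; assumes (source, target) is not present yet. */
    int addEdge(int source, int target) {
        int e = edgeCount++;
        prevOut[e] = NONE;
        nextOut[e] = headOut[source];
        if (headOut[source] != NONE) {
            prevOut[headOut[source]] = e;
        }
        headOut[source] = e;
        edgeByKey.put(key(source, target), e);
        return e;
    }

    /** The dictionary finds the edge id, then unlinking touches only its two list neighbors. */
    boolean removeEdge(int source, int target) {
        Integer found = edgeByKey.remove(key(source, target));
        if (found == null) {
            return false;
        }
        int e = found;
        int prev = prevOut[e];
        int next = nextOut[e];
        if (prev == NONE) {
            headOut[source] = next;
        } else {
            nextOut[prev] = next;
        }
        if (next != NONE) {
            prevOut[next] = prev;
        }
        return true;
    }

    boolean hasEdge(int source, int target) {
        return edgeByKey.containsKey(key(source, target));
    }

    /** Walks the linked list from the head pointer. */
    int outDegree(int node) {
        int count = 0;
        for (int e = headOut[node]; e != NONE; e = nextOut[e]) {
            count++;
        }
        return count;
    }
}
```

Masking the target with `0xFFFFFFFFL` keeps a negative id from sign-extending into the high half of the key, and because the source occupies the high bits, `key(a, b)` and `key(b, a)` differ, which preserves edge direction.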

Views

Views are one of the most useful aspects of Gephi’s graph structure and are mainly used behind the scenes in the Filter module. A view is a graph subset (i.e. a subgraph) which remains connected to the main structure, so if a node is removed from the graph, it’s removed from the views as well. For instance, when users create a ‘Degree Filter’, Gephi creates a view and removes all the nodes which don’t fulfill the degree threshold. Multiple views can co-exist at any time in the graph structure. In the current graph structure, a complete copy of the node tree is made for every view and we found that this can be very inefficient.

In the new version, the way views are implemented is very different and should yield better performance. Instead of doing a copy of the nodes, we maintain bit-vectors for nodes and edges. Because these elements are stored in large arrays with a unique identifier, it’s easy to create and maintain a bit-vector. When developers obtain the ‘Graph’ object for a particular view, the bit-vectors are used behind the scenes to adapt iterators and accessors. This solution should make filtering for large graphs much quicker. One drawback is that whereas the current implementation copies and then trims the view, GraphStore works with bit-vectors but continues to access the complete graph. In other words, if the view represents only 1% of the original graph, it still needs to iterate over the full 100% to find which elements are in that 1%. Even though this sounds bad, our benchmarks show it’s a very fast operation and we win overall because we avoid the overhead of duplicating the graph. Moreover, we can introduce some caching later to optimize this further.
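As a rough illustration of a bit-vector view, the sketch below uses `java.util.BitSet` as a stand-in for GraphStore’s own bit-vectors. The class `NodeView`, its methods and the `degrees` array (standing in for the main node storage) are all hypothetical.

```java
import java.util.BitSet;
import java.util.function.IntConsumer;

/** Hypothetical view over array-stored nodes: one bit per node id marks membership. */
final class NodeView {
    private final int[] degrees;  // stands in for the main node storage, indexed by node id
    private final BitSet members; // bit set = node belongs to the view

    NodeView(int[] degrees) {
        this.degrees = degrees;
        this.members = new BitSet(degrees.length);
        this.members.set(0, degrees.length); // a new view starts with every node
    }

    /** A degree filter clears bits; no node is copied. */
    void filterByMinDegree(int minDegree) {
        for (int id = members.nextSetBit(0); id >= 0; id = members.nextSetBit(id + 1)) {
            if (degrees[id] < minDegree) {
                members.clear(id);
            }
        }
    }

    boolean contains(int id) {
        return members.get(id);
    }

    int size() {
        return members.cardinality();
    }

    /** The view's iterator: scans the bit-vector and visits only ids whose bit is set. */
    void forEachNode(IntConsumer action) {
        for (int id = members.nextSetBit(0); id >= 0; id = members.nextSetBit(id + 1)) {
            action.accept(id);
        }
    }
}
```

Creating a view costs one bit per element rather than a copy of the node tree, while iteration still scans the bit-vector across the whole id range, which matches the trade-off described above.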

Inverted Index

When you’re using the Partition module in Gephi, you’re manipulating some sort of inverted index. Nodes and edges have properties like ‘gender’, ‘age’ or ‘country’, and these properties are contained within the node and edge objects. An index is a simple data structure which makes it possible to retrieve the list of elements for a particular value. For instance, the Partition module needs to know the number of ‘male’ or ‘female’ nodes for the ‘gender’ column. When the column is a number like ‘age’, it also needs to know the maximum and minimum values. Unlike the Ranking module and its auto-apply feature, the Partition module is not refreshed in real-time and is therefore difficult to use when the graph is changing a lot. We have decided to invest in this feature for the future release and are building a real column inverted index in the graph structure. The index will simply keep track of which values exist for each column and which elements are holding each value. The index will be updated in real-time as elements are added, removed or updated.
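A column inverted index of this kind could look roughly like the sketch below. The class `ColumnIndex` and its methods are hypothetical names for illustration and are not the index API that will ship with GraphStore.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical inverted index for one column: value -> ids of the elements holding it. */
final class ColumnIndex<V> {
    private final Map<V, Set<Integer>> idsByValue = new HashMap<>();

    void put(int elementId, V value) {
        idsByValue.computeIfAbsent(value, v -> new HashSet<>()).add(elementId);
    }

    void remove(int elementId, V value) {
        Set<Integer> ids = idsByValue.get(value);
        if (ids != null) {
            ids.remove(elementId);
            if (ids.isEmpty()) {
                idsByValue.remove(value); // forget values that no element holds anymore
            }
        }
    }

    /** An attribute update moves the element from the old value's set to the new one. */
    void update(int elementId, V oldValue, V newValue) {
        remove(elementId, oldValue);
        put(elementId, newValue);
    }

    /** Number of elements holding the value, e.g. how many 'female' nodes. */
    int count(V value) {
        return idsByValue.getOrDefault(value, Collections.emptySet()).size();
    }

    /** Distinct values currently present in the column. */
    Set<V> values() {
        return Collections.unmodifiableSet(idsByValue.keySet());
    }
}
```

Counts and value lists become simple lookups, at the price of extra bookkeeping on every attribute write. For a numeric column, a sorted map such as `TreeMap` in place of the `HashMap` would also give the minimum and maximum directly.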

The ability to quickly retrieve elements and counts based on specific values will be very useful in many different modules like Filters, Partition or Data Laboratory. New APIs will be added for developers to use the newly created index interface. As we’re working on attribute storage and manipulation, we’ll also merge the Attributes and Graph APIs because they are so interconnected that it doesn’t really make sense to have them separate. The interfaces that developers are familiar with, like Table or Columns, will remain the same.

Events

In software programming, events are a common way to inform other modules that something changed. In Gephi, we also use events to inform other modules about updated nodes or edges. In the new GraphStore, we’ll stop using events to transport graph modifications because of the large overhead due to the creation of event objects. Indeed, when 10K nodes are added to the graph, the existing structure literally creates 10K event objects and puts them in a queue. Although the event queue is compressing objects of the same type, the overhead to create, queue, send and destroy large numbers of small Java objects is too large.

Instead of a push model (i.e. the emitter pushes updates), we would rather promote a pull model (i.e. the listener pulls updates from time to time) for future releases. A similar system is already in place to link the graph and the visualization module and it has been working without a glitch. We’ll develop the tools to easily calculate graph differences between a listener module and the graph structure. By removing the bottleneck, write performance should greatly improve.
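One simple way to picture the pull model is a version counter that listeners poll. The sketch below is an assumption about how such a mechanism might look, with hypothetical class names; it only detects that something changed and does not compute the graph differences mentioned above.

```java
/** Hypothetical store that counts writes instead of creating one event object per change. */
final class VersionedStore {
    private int version;   // bumped on every write; plain int, so this sketch is single-threaded
    private int nodeCount;

    void addNode() {
        nodeCount++;
        version++;
    }

    int version() {
        return version;
    }

    int nodeCount() {
        return nodeCount;
    }
}

/** Hypothetical listener that pulls: it compares versions whenever it chooses to look. */
final class PollingObserver {
    private final VersionedStore store;
    private int lastSeenVersion;

    PollingObserver(VersionedStore store) {
        this.store = store;
        this.lastSeenVersion = store.version();
    }

    /** Returns true if the store changed since the last call, then resets the baseline. */
    boolean hasChanged() {
        int current = store.version();
        boolean changed = current != lastSeenVersion;
        lastSeenVersion = current;
        return changed;
    }
}
```

Adding 10K nodes here costs 10K integer increments and allocates nothing; the observer notices the change the next time it polls, for example once per rendered frame.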

Timestamps

As said earlier, we’ll add timestamp support to represent dynamic networks. Instead of using a time interval, a timestamp array will be associated with nodes and edges. For element (node/edge) visibility, each timestamp represents the presence of the element at that time. For example, if a network snapshot is collected every month for a year, each node will have up to 12 different timestamps. The timestamp itself is a real number and can therefore represent an epoch time but also any other value in a different context. For a dynamic attribute, the data is simply represented as a list of (time, value) pairs.

To support the timeline and dynamic network algorithms, we’re developing an inverted index for timestamps so we can make time filtering very quick. One good thing about intervals is that it’s very easy to know if two intervals overlap with each other. With a flat list of timestamps, one can’t avoid going through the entire list. The index will essentially map timestamps to the node and edge elements in the graph and therefore solve this issue. The interval tree implementation which we are currently using to store intervals is based on a binary tree and is very costly in memory because of all the Java object overhead. Using simple arrays should reduce overhead and improve performance for large dynamic networks. When computing a dynamic network algorithm (e.g. clustering coefficient over time), we’re using a sliding window over the graph so the ability to quickly filter is critical as it impacts how fast the graph refreshes.
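The timestamp index can be pictured as a sorted map from timestamp to element ids. The sketch below uses `TreeMap` and `BitSet` from the JDK purely for illustration; `TimestampIndex` and its methods are hypothetical, and the actual index is described above as relying on simple arrays.

```java
import java.util.BitSet;
import java.util.TreeMap;

/** Hypothetical timestamp index: sorted timestamp -> ids of elements present at that time. */
final class TimestampIndex {
    private final TreeMap<Double, BitSet> idsByTime = new TreeMap<>();

    void add(int elementId, double timestamp) {
        idsByTime.computeIfAbsent(timestamp, t -> new BitSet()).set(elementId);
    }

    /**
     * Ids of elements with at least one timestamp in [from, to], both ends inclusive.
     * Requires from <= to, otherwise TreeMap.subMap throws IllegalArgumentException.
     */
    BitSet elementsBetween(double from, double to) {
        BitSet result = new BitSet();
        for (BitSet ids : idsByTime.subMap(from, true, to, true).values()) {
            result.or(ids);
        }
        return result;
    }
}
```

A sliding window then becomes a sequence of `elementsBetween` calls with shifting bounds, each touching only the timestamps inside the window instead of every element’s full timestamp list.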

Saving/Loading

Saving and loading the graph structure to/from a file (or a stream) is another critical feature. When a user saves a project in Gephi, the graph data structure is serialized in XML and compressed into a .gephi file. If you have worked with project files in Gephi, you may have experienced corrupted file issues or errors when loading a file. We’ve done our best to fix these problems but some still remain. We’re rethinking how this should be done in GraphStore and are making the call to rewrite the code from scratch. Our approach will rely on a lot of unit tests to make sure the code is stable so we don’t repeat the same issues in future versions. Please note that this concerns the .gephi files only and existing importers (e.g. GEXF, GraphML) will remain the same.

Concerning the GraphStore serialization, we’re abandoning XML in favor of pure byte arrays. That should yield better performance and reduced project file size. We’ll create a custom reader for previous Gephi versions so you can still open your existing projects. Other modules like Filters or Preview will continue to use XML as it’s working just fine.
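To give a feel for what byte-array serialization means compared to XML, here is a minimal sketch using `java.nio.ByteBuffer`. The layout (an edge count followed by source/target pairs) and the class `EdgeCodec` are assumptions made up for this example; the actual GraphStore format is not described in this article.

```java
import java.nio.ByteBuffer;

/** Hypothetical binary codec for an edge list: [count][source0][target0][source1][target1]... */
final class EdgeCodec {
    private EdgeCodec() {}

    /** Assumes sources and targets have the same length. */
    static byte[] encode(int[] sources, int[] targets) {
        ByteBuffer buffer = ByteBuffer.allocate(4 + 8 * sources.length);
        buffer.putInt(sources.length);
        for (int i = 0; i < sources.length; i++) {
            buffer.putInt(sources[i]).putInt(targets[i]);
        }
        return buffer.array();
    }

    /** Returns {sources, targets} read back in the order they were written. */
    static int[][] decode(byte[] data) {
        ByteBuffer buffer = ByteBuffer.wrap(data);
        int count = buffer.getInt();
        int[] sources = new int[count];
        int[] targets = new int[count];
        for (int i = 0; i < count; i++) {
            sources[i] = buffer.getInt();
            targets[i] = buffer.getInt();
        }
        return new int[][] {sources, targets};
    }
}
```

Each edge takes exactly 8 bytes in this layout, whereas an XML element carries tag and attribute names for every edge, which is one reason a binary format tends to be smaller and faster to parse.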

Next steps

This is the first post about the Gephi 0.9 version and more will come soon. We’re excited about the current developments and hope to hear from you. Please join the gephi-dev mailing list to learn more about ongoing projects and contribute. We need your ideas!


Scientist Christian Tominski about Gephi

Guest blog post from Dr. Tominski, who agreed to review Gephi 0.7alpha4 for us.

Christian Tominski received his diploma (MCS) from the University of Rostock in 2002. In 2006 he received his doctoral degree (Dr.-Ing.) from the same university. Currently, Christian is working as a lecturer and researcher at the Institute for Computer Science at the University of Rostock. Christian has authored and co-authored several articles in the field of information visualization. His main interests concern visualization of multivariate data in time and space, visualization of graph structures, and visualization on mobile devices. In his research, a special focus is set on interactivity, including novel interaction methods and implications for software engineering.

Recently, I stumbled upon the Gephi Project – an open source graph visualization system. As I’ve done some research in the area of interactive graph visualization, I was eager to see how Gephi works and if it brings some new concepts or if it’s yet another graph visualization system. I’ll share my thoughts on Gephi from three perspectives. The first one is the user perspective. I’ll take the role of a user who is interested in getting a visual depiction of some graphs. Secondly, I’ll take the role of a developer and shed some light on the aspect of software engineering. And finally, I’ll be a scientist and try to foresee if and in which regard Gephi might have some impact on visualization research.

The User’s Perspective

Gephi has been designed with the users and their needs in mind. The system welcomes its users with a familiar look and feel. It is quite easy to load graph data into the system. Many of the known file formats for graphs are supported, for instance DOT, GML, GraphML, or Tulip’s file format TLP. A nice thing about the data import is that an import report provides essential information about the import process (e.g., number of nodes and edges, edge-directedness, potential problems, etc.). Once imported, the graph is shown as nodes and links in a main view, and several complementary views provide additional information.

The main view is the core for visual graph exploration. It allows users to zoom in, to select nodes, to adjust node size and color, to find shortest paths, and to access attributes of nodes and edges. In addition to letting users set sizes and colors manually, the system can also set these automatically based on attributes associated with nodes and edges. What is called “Partition” in Gephi is used to assign unique colors to nodes and edges based on qualitative data attributes (e.g., class affiliation). Quantitative data values can be mapped to size and color of nodes, edges, and labels using the “Ranking” tool. All these tools are customizable. It is worth mentioning that Gephi provides some nice user controls to parameterize the color coding.

Gephi also supports graph editing, i.e., insertion and deletion of nodes and edges as well as manipulation of attribute values. What is missing in terms of editing the data is the possibility to add (and delete) attributes, for instance to generate some derived data values using simple formulas.

A key aspect in graph exploration is the layout of nodes and edges. As it is usually unclear what will be the best layout for a given graph, Gephi offers various layout algorithms to choose from. While a layout is being computed, the main view constantly updates itself to provide feedback on the progress made. A big plus is that users can interrupt the layout algorithm once they deem the result to be ok or if they find that it might be more suitable to use the current result as the initial setup for another algorithm. This way users can easily tune the layout to fit the graph and the particular needs. Users may put the finishing touches to the layout by moving nodes manually in the main view.

Once a suitable visual representation has been created, the final step is to export nice pictures of the graph. To this end, Gephi follows the philosophy of providing a dedicated export interface with many options to create high quality printouts.

People who have worked with larger graphs might know that some computations on graphs (including layout computation) are quite complex and take some time. While other systems are blocked during computation and in the best case provide a progress bar, Gephi is different. Long running calculations are concurrent to the main application. From my point of view, this is one of the strongest points of Gephi: the system does not block during costly computations. The benefit for the users is that they can always interact, for instance to initiate some other computations or to cancel running ones when they recognize that a re-parameterization would yield better results.

Concurrency is Gephi’s solution to offering computations of statistics about the graph. Currently, Gephi supports a variety of classic graph statistics including degree distribution, number of connected components, and others. Based on data attributes and computed statistics, the graph can be filtered to reduce nodes and edges to those that fulfill the filter criteria. In a dynamic filtering UI, several filters can be combined using drag’n’drop and thresholds can be manipulated easily, for instance via sliders. Besides using filtering for data reduction, Gephi also provides basic support for graph clustering. However, the currently implemented MCL algorithm is still experimental. But there is the possibility to manually group nodes to build a hierarchical structure on top of the visualized graph. Yet, this is quite cumbersome for larger graphs. Additional tools are needed to support the user in creating a navigable hierarchy on top of a graph. Configurable clustering pipelines that combine several strategies for clustering (e.g., based on attributes or based on bi-connected components) in addition to a clustering wizard user interface would be helpful.

In summary, I see much potential in Gephi; the overall shape of the system impressed me as a user. I personally found it easy to work with Gephi and explore some of my own data sets and some provided on Gephi’s website. Given the fact that the version I’ve worked with is 0.7 alpha, there is also much room for improvement. In the first place I would like to mention the navigation of the graph. The main view provides just basic zoom and pan navigation, which is even imprecise in some situations. Navigation tools like those provided in Google Earth and navigation based on paths through a graph would be really helpful. Moreover, I was missing the concept of linking between views. Selecting an element (node or edge) in one view should highlight that element in all other views. Right now this is not really an issue as the number of views seen in parallel is quite low. But once additional views are needed, for instance to focus on data attributes in a Parallel Coordinates Plot or to visualize the cluster hierarchy in a dedicated view, or when one and the same graph is shown in parallel in two or more main views for comparing different analytic results, linking will be crucial for user experience. But these things are not too complex and should be easy to integrate in future versions of Gephi. Another aspect regards highlighting in the main view: instead of marking the selected node, all non-selected nodes are faded out to focus on the selected node. This implies rather big visual changes because all nodes but one change their appearance when a single node gets selected and deselected.

Pros:
  • Easy graph import and export
  • Many options for visual encoding
  • Various layout algorithms to choose from
  • Support for dynamic filtering
  • Computation of graph statistics
  • Basic support for graph clustering
  • System does not block during long running computations

Cons:
  • Graph navigation can be improved
  • No linking among views
  • Few visual glitches
  • Still an alpha version with bugs here and there

The Developer’s Perspective

Now let me switch to the developer’s view. Gephi is open source software so that everybody can participate in improving the system or can adapt the system to personal or business needs. Gephi seems to be very well designed on the back-end. The project is based on the NetBeans platform and the Java language. It is subdivided into a number of modules that define several APIs and SPIs and that provide implementations of these interfaces. Thanks to the modular structure, Gephi can be extended quite easily. The best way to do so is to implement plugins. Plugins can be used, for instance, to add further layout or clustering algorithms, statistical computations, filter components, or export methods. The modular structure also allows for using only specific parts of the Gephi project in one’s own projects. The Gephi Toolkit is a good example. It is not an end-user desktop application, but a class library that provides all the functionality of Gephi to those who want to reuse Gephi’s functionality and data structures in different ways.

As I mentioned in the user’s perspective, the way Gephi deals with long-running computations is a big plus. Given that multithreading has been inherent in the system from the very beginning and is rooted in the system’s core, I sincerely hope (no, I’m quite sure) that Gephi will not run into the problems that are likely to occur when multithreading is retrofitted into an existing single-threaded system, as I have experienced myself. I also conjecture that others will find it much easier to implement concurrent, non-blocking extensions simply by following the way existing Gephi code handles things.

As Gephi is split up into many different modules, it took me a while to get accustomed to the system and to learn which functionality can be found in which module. But I have to add that I had no prior experience with NetBeans platform development and the module concept used there. I also found that the code documentation could be improved in several parts of Gephi’s sources. On the other hand, the Gephi website provides informative wiki pages with various examples and tutorials.

My view from the developer’s perspective can be summarized as the following pros and cons:

Pros:
  • Open source
  • Modular structure
  • Well defined interfaces
  • Extensible via plugins
  • Inherently multithreaded

Cons:
  • In-code documentation can be improved

The Scientist’s Perspective

As a scientist I’m not so much interested in developing fully fledged end-user software, but in developing solutions to scientific questions and in publishing the results. A difficulty in interactive visualization is that one usually needs a broad base of fundamental functionality to be able to develop such solutions. Previous attempts at establishing a common infrastructure for interactive data exploration made notable progress, but eventually did not fully succeed or are no longer actively maintained. This is because a single researcher usually does not have the time to do decent research and at the same time maintain a larger software project.

I personally feel that Gephi can become such a fundamental infrastructure. Maintained by an active community, the system allows researchers to focus on solutions in form of plugins, while they can utilize the functionality that the system provides. Visualization researchers will be happy if they can simply plug in new visualization techniques as additional views, test new layout algorithms, and experiment with new clustering methods. Moreover, new solutions can be easily disseminated to real users in the community. This might prove beneficial when it comes to acquiring early user feedback or when more formal user evaluation is needed prior to publishing new techniques and concepts.

A big issue in visualization research is visual analytics, that is, the combination of analytical, interactive, and visual means to facilitate making sense of large volumes of data. In terms of analytic means, a goal is to break analytic black boxes and make analysis algorithms interactively steerable. With the architecture of Gephi, where parameterizable algorithms run concurrently and provide feedback in the form of intermediate results, I believe this goal can be reached in the future. One thing I’m curious about is whether it is also possible to come up with concepts that allow for plugging in new interaction techniques. As interaction is usually quite tightly bound to a view, I wonder whether interaction could be implemented as independent plugins as well, and whether novel interaction concepts (e.g., touch interaction) will be supported in the future. Furthermore, interactive collaboration of multiple users working to solve a common analysis problem could be of interest. A question related to the visual side is whether it is possible to use Gephi with different displays and display environments such as tabletop displays, display walls, smart phones, or multi-display environments.

A facet of graph visualization that I did not mention in the user’s perspective, as I felt it was better suited here, is dealing with dynamically changing graphs. Visualization of time-varying graphs is a hot research topic and Gephi is about to face this challenge. There is preliminary support for exploring time-dependent graphs via a time slider. But there is more to this than just browsing in time. Concepts have to be integrated to support easy comparison of multiple snapshots of a graph and to highlight significant changes in the development of a graph’s history.

Let me try to put my thoughts into a pros and cons list:

Pros:
  • Potential infrastructure for visualization research
  • Researchers can focus on solutions in form of plugins
  • Potential to use community for user feedback and evaluation
  • Partial results for current research questions (graph clustering, steerable algorithms, dynamic graphs)
  • Nice playground for experimentation and testing new ideas

Cons:
  • Unclear if new and alternative technologies will be supported

Summary

Since I got my hands on Gephi, I’ve been infected. Maybe I’m dazzled by the beautiful demo video or the nice pictures that have been generated using Gephi, but in my opinion Gephi has the potential to become a big player in interactive visual graph exploration and analysis. From all the perspectives I’ve taken I see many positive things, and plenty of room for improvements or additional features. I do hope that the people behind Gephi will continue their work to the benefit of all users, developers, and researchers.

Related Stuff

There are many other systems and frameworks out there that do a great job in interactive graph visualization or in supporting it as a toolkit. I would like to give credit to these systems, because they can be the source of many ideas and much inspiration:

To go further about Gephi design, see also this article about semiotics.

GSoC 2010 mid-term: Adding support for Neo4j in Gephi

Martin Škurla

During this summer, six students are working on Gephi with the Google Summer of Code. They contribute to Gephi by developing new features that will be integrated in the 0.8 version, released later this year.

My name is Martin Škurla and this summer I have been working on a GSoC project called “Adding support for Neo4j in Gephi”. In this article we will look at the implemented features, including those under the hood, pictures of dialogs, common use cases and future plans.

 

Gephi project

First, I want to give a quick introduction to the Gephi project. Gephi is an open source visualization platform built on top of the NetBeans platform. It is written in Java, so you can run it on various operating systems including Windows, Linux and Mac OS. It supports many interesting graph analysis capabilities, including:

  • Real-time visualization
  • Layout
  • Metrics
  • Dynamic network analysis
  • Cartography clustering and hierarchical graphs
  • Dynamic filtering

The story so far

The main idea of my project is to add support for Neo4j in Gephi. This means the ability to transform a Neo4j graph into a Gephi graph. The two graph models are different, so the first task was to define a mapping between Neo4j graph items and Gephi graph items and vice versa.

There was also a mismatch between the types supported in Neo4j and those supported by Gephi. This mismatch was solved by adding new “List” types to Gephi, so now every type in Neo4j has an appropriate type in Gephi.

There were also some changes under the hood which are not visible to the end user, but had to be defined and implemented. The most interesting one is the “delegating mechanism”. This mechanism is responsible for getting values from the storing engine (Neo4j) as well as for manipulating the data. During the importing process, a representation of the Neo4j graph is created in Gephi, but the values are not stored directly; they are queried using the delegating mechanism.

Other minor tasks were to customize the open dialogs used for importing a local Neo4j database and for debugging the imported database. The open dialog for importing accepts only valid Neo4j database directories. I defined what a valid Neo4j database directory structure is, and every valid directory is now shown with a Neo4j picture in the open dialog. The user is able to open only valid Neo4j directories when importing. The open dialog for debugging accepts only Java class files that can be used for the debugging process. This simply means they have to implement the required interface and have a public no-argument constructor. Every valid class file is shown with a Neo4j picture, and after selecting a valid debug file, the Target and Visualization options are filled in automatically based on data from the selected class file.

 

Open Neo4j directory dialog customization

Open Neo4j debug file dialog customization

 

Neo4j integration

Menu integration

All possible actions start in the menu. As we can see, this is the entry point to import from, export to and debug a Neo4j graph. Both importing and exporting support local as well as remote Neo4j databases.

Importing

Whole graph import dialog

The importing process consists of 2 approaches:

  • whole import
  • traversal import

The whole graph import dialog is designed for importing the whole graph. We can customize the rules responsible for returning nodes by defining filtering expressions. For example, the previous dialog can be used when we want to find all people working on the Gephi project with a maximum age of 30 years. Only people with at least 5 years of experience and those who have driver licence types A, B and C will be included.

Let’s have a deeper look at the dialog:

  • Property key is the name of the property we want to filter on
  • Property value is the value which will be compared to the actual node property value using the chosen operator. Values are automatically converted into the appropriate types, and if a value cannot be converted, the node will not be included in the graph. All types supported in Neo4j are supported in this dialog. We can also see the support for array types in the last filter expression.
  • Operator is applied to the final expression, and if the expression evaluates to true, the node will be included
  • Match case means that String, char, String[] and char[] values are compared case-sensitively
  • Restrict mode is used to exclude some nodes. Imagine we have people stored in the database who have only a subset of the property names used in the filtering expressions. If Restrict mode is on, only nodes which have all property names and for which all filtering expressions evaluate to true will be included. If Restrict mode is off, every node which has any subset of the required property names (even the empty subset) will be included if all the filtering expressions applicable to that subset evaluate to true.

All the filtering expressions are combined using AND, and the list of currently supported operators consists of: ==, !=, <, <=, >, >=.

In fact, the usefulness of adding new operators, as well as including OR and other import options, is the main idea behind the questionnaire which is part of this article.
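The filtering rules described above can be sketched in a few lines of Python. This is only an illustration of the AND combination and of Restrict mode, not the code used in Gephi: the function name `node_matches`, the tuple layout of a filter and the sample data are all hypothetical, and type conversion and Match case are left out.

```python
import operator

# Operators listed in the article, mapped to Python comparison functions.
OPERATORS = {
    "==": operator.eq,
    "!=": operator.ne,
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
}


def node_matches(properties, filters, restrict_mode):
    """Return True if a node's property dict passes all filter expressions.

    filters is a list of (property_key, operator_symbol, value) tuples that
    are combined with AND. This layout is hypothetical, not Gephi's.
    """
    for key, op_symbol, expected in filters:
        if key not in properties:
            if restrict_mode:
                # Restrict mode on: a missing property excludes the node.
                return False
            # Restrict mode off: the expression is not applicable, skip it.
            continue
        if not OPERATORS[op_symbol](properties[key], expected):
            return False
    return True


# Made-up sample data, loosely following the example in the text.
people = {
    "alice": {"project": "Gephi", "age": 28, "experience": 6},
    "bob": {"project": "Gephi", "age": 35, "experience": 10},
    "carol": {"project": "Gephi"},  # no age or experience stored
}
filters = [("project", "==", "Gephi"), ("age", "<=", 30), ("experience", ">=", 5)]

strict = [name for name, props in people.items() if node_matches(props, filters, True)]
relaxed = [name for name, props in people.items() if node_matches(props, filters, False)]
print(strict)   # nodes kept with Restrict mode on
print(relaxed)  # nodes kept with Restrict mode off
```

With Restrict mode off, "carol" is kept because the only expression that applies to her (the project name) is true.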

The traversal graph import dialog is designed for importing any subgraph using the traversal capabilities of Neo4j v1.1. Traversal import adds the following options:

  • Start node can be set in two ways, either by its id or by its indexing key and value pair
  • Order can be set to the depth-first or breadth-first algorithm
  • Max depth can be set to a concrete number or to the end of the graph
  • Relationships can be restricted too. We can set any combination of relationship types and directions which the traversal should include. The list of relationship types is filled dynamically from the database with existing values.
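To make these options concrete, here is a minimal Python sketch of a traversal with a start node, an order, a maximum depth and a relationship filter. It does not use the Neo4j traversal API; the function `traverse`, the adjacency layout and the sample graph are assumptions made for illustration only.

```python
from collections import deque


def traverse(adjacency, start, order="breadth", max_depth=None, allowed=None):
    """Return node ids in the order they are visited, starting from start.

    adjacency maps a node id to a list of (rel_type, direction, neighbour)
    tuples, where direction is "out" or "in" as seen from that node.
    allowed is a set of (rel_type, direction) pairs, or None for no filter.
    max_depth=None means "to the end of the graph".
    """
    visited = []
    seen = set()
    frontier = deque([(start, 0)])
    while frontier:
        # Breadth first takes the oldest entry, depth first the newest.
        node, depth = frontier.popleft() if order == "breadth" else frontier.pop()
        if node in seen:
            continue
        seen.add(node)
        visited.append(node)
        if max_depth is not None and depth >= max_depth:
            continue  # do not expand nodes at the depth limit
        neighbours = [
            target
            for rel_type, direction, target in adjacency.get(node, [])
            if allowed is None or (rel_type, direction) in allowed
        ]
        if order == "depth":
            neighbours.reverse()  # so the first listed neighbour is visited first
        for target in neighbours:
            if target not in seen:
                frontier.append((target, depth + 1))
    return visited


# Made-up sample graph.
adjacency = {
    "a": [("KNOWS", "out", "b"), ("KNOWS", "out", "c")],
    "b": [("KNOWS", "in", "a"), ("WORKS_ON", "out", "d")],
    "c": [("KNOWS", "in", "a"), ("KNOWS", "out", "e")],
    "d": [("WORKS_ON", "in", "b")],
    "e": [("KNOWS", "in", "c")],
}
print(traverse(adjacency, "a", order="breadth"))
print(traverse(adjacency, "a", order="depth"))
print(traverse(adjacency, "a", max_depth=1))
print(traverse(adjacency, "a", allowed={("KNOWS", "out")}))
```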

 

Traversal graph import dialog

This was a quick summary of the Neo4j importing capabilities implemented in the project. We also worked on other features, and one of them is support for exporting. We can export any loaded graph into a local or remote Neo4j database. The exporting process can be customized in a similar way to importing.

Exporting

Export dialog

Exporting is the opposite process to importing. The previous dialog shows the exporting options as well as validation. We can customize the exporting process by setting:

  • From column is used to set the RelationshipType from the values of any of the Gephi edge columns. When a Neo4j graph is imported, a column named “Neo4j Relationship Type” is created automatically.
  • Default value is used when the processed Gephi edge does not have a value in the selected From column
  • Export Node columns is the set of Gephi columns in the node table which will be exported
  • Export Edge columns is the set of Gephi columns in the edge table which will be exported
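As a rough illustration of how these export options interact, the following Python sketch picks a relationship type for an edge and selects the columns to export. Edge rows are modelled as plain dictionaries and both function names are hypothetical; only the column name “Neo4j Relationship Type” comes from the text above.

```python
NEO4J_TYPE_COLUMN = "Neo4j Relationship Type"  # column name quoted in the article


def relationship_type(edge_row, from_column, default_value):
    """Pick the relationship type for one edge row (a plain dict here)."""
    value = edge_row.get(from_column)
    if value is None or value == "":
        # The edge has no value in the From column: use the Default value.
        return default_value
    return str(value)


def export_properties(row, export_columns):
    """Keep only the selected export columns that actually hold a value."""
    return {column: row[column] for column in export_columns if row.get(column) is not None}


imported_edge = {NEO4J_TYPE_COLUMN: "KNOWS", "weight": 2.0, "label": None}
new_edge = {"weight": 1.0}

print(relationship_type(imported_edge, NEO4J_TYPE_COLUMN, "RELATED_TO"))
print(relationship_type(new_edge, NEO4J_TYPE_COLUMN, "RELATED_TO"))
print(export_properties(imported_edge, ["weight", "label"]))
```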

Remote importing/exporting

The only difference between local and remote importing/exporting is the existence of the Remote dialog, where we need to set the following connection information:

  • Remote database URL
  • Login
  • Password

All of them must be filled in to successfully import/export a remote graph.

Remote import/export dialog

Delegation process

Nodes values exploration (click on the image to enlarge)

As we can see from the previous picture, we can very simply explore all the node and edge values. This is exactly the place where the delegating mechanism is used. The values are not stored directly in memory in some kind of Gephi data structure; instead, the storing engine (Neo4j) is asked for the actual values every time we need them.
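The idea behind the delegating mechanism can be sketched as follows. This is a minimal Python illustration under stated assumptions, not Gephi's implementation: `DelegatingNode`, `FakeStore` and `read_property` are hypothetical names, and the store is an in-memory stand-in rather than a real Neo4j database.

```python
class FakeStore:
    """Hypothetical in-memory stand-in for the storing engine (Neo4j)."""

    def __init__(self, data):
        self._data = data
        self.reads = 0  # counts how often the store was queried

    def read_property(self, node_id, key):
        self.reads += 1
        return self._data[node_id].get(key)


class DelegatingNode:
    """Graph node that keeps only its id and asks the store for values."""

    def __init__(self, node_id, store):
        self.node_id = node_id
        self._store = store

    def get(self, key):
        # Nothing is cached on the node: every read goes back to the store.
        return self._store.read_property(self.node_id, key)


store = FakeStore({1: {"name": "Alice", "age": 28}})
node = DelegatingNode(1, store)

print(store.reads)        # no values were copied when the node was created
print(node.get("name"))   # value fetched from the store on demand
print(store.reads)
```

The trade-off of this design is memory against latency: the graph structure stays small, but every value lookup costs a round trip to the store.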

Debugging

Debugging in action

We can see debugging in action in the previous picture. The dialog is initialized with data from the chosen debug class file, but we can change all of it at runtime too. Any change in the options automatically updates the graph visualization. We can change the visibility of nodes and edges as well as the colors of both nodes and edges. The user proceeds to the next step of debugging/traversal by clicking the Next button.

Use cases

That was a quick summary of all the implemented features; now we can summarize common use cases users may be interested in.

Visualizing Neo4j graphs

One of the main ideas of my project was to implement the ability to visualize Neo4j graphs, even big ones. As we saw in the dialog pictures, we have many options for customizing the importing process, including filtering. After the import we can use all the rich graph analysis features Gephi provides.

Analyzing only part of the whole graphs

A quite common use case is to analyze only part of the graph, which is possible in Gephi too. We can take advantage of traversal, where we can set the starting node and other traversal options. After that we can visualize and analyze only part of the graph.

Export graph stored in text files/databases into Neo4j

Another use case could be exporting graphs stored in graph text files or relational databases into Neo4j. In fact, every graph loaded into Gephi can easily be exported to a Neo4j database. The available input formats depend on Gephi itself; currently the following are supported:

  • Text formats: GEXF, GDF, GML, GraphML, Pajek NET, GraphViz DOT, CSV, UCINET DL, Tulip TPL, XGMML
  • Relational databases: MySQL, PostgreSQL, SQL Server

Future plans

There are more things which we want to implement, including:

  • support for the Gephi Toolkit, which is the set of Gephi core libraries you can use in your own Java projects for graph visualization and manipulation
  • implementing a proof-of-concept Web application using both the Gephi Toolkit and Neo4j to manipulate a Neo4j database and show the results (probably using GWT)
  • more features, bug fixing, performance optimizations

Questionnaire

One of the big advantages of Gephi is that it is developed as an open source project. We want to add features according to user requests and opinions. That’s why we created a questionnaire focusing on the usefulness of the proposed additions. We would be very happy if you filled it in, because it is a very valuable source of information and lets us focus on the features Neo4j users find useful. Please fill in the questionnaire.

Conclusion

I am very happy that I can be part of the Gephi developer community and introduce integration with Neo4j. During this summer I learned a lot and I am proud that I was chosen as a GSoC student. None of these features could have been done without the great help of my mentors, so a big thank you to both of them: Mathieu Bastian & Tobias Ivarsson.

If you are interested and want to test the code, you can download the source code from my branch using bzr branch lp:~bujacik/gephi/support-for-neo4j

All the pictures were made on data stored in testing Neo4j database which can be created using Java SE project and you can download it using:
bzr branch lp:~bujacik/+junk/testing-new-neo4j-traversal-api

 

Martin Škurla


GSoC 2010 mid-term: Dynamic attributes and statistics

Cezary Bartosiak

During this summer, six students are working on Gephi with the Google Summer of Code. They contribute to Gephi by developing new features that will be integrated in the 0.8 version, released later this year.

 

The project carried out by Cezary Bartosiak focuses on further development of dynamic network analysis (DNA) in Gephi. The aim is to create a framework which makes it possible to build and query a dynamic graph through a proper API. It has a practical purpose, for instance analyzing the evolution of networks (see in particular M. Argollo de Menezes, A.-L. Barabási, Fluctuations in Network Dynamics) or dynamic network visualization. This article shows the most important features provided by this GSoC project.

 

In the current 0.7 version we can import dynamic graphs written in GEXF syntax and then filter them using the Timeline component. Unfortunately, it only filters the graph topology, which means hiding nodes and/or edges.

The obvious next step is to make it possible to handle dynamic changes not only of the graph topology but also of the attributes attached to nodes and edges. This can be done by creating a proper API. This API could be used by other modules, like Statistics, to make dynamic versions of them. Computing metrics like Degree Distribution or Clustering Coefficient for each time interval in the time series is of great interest for analyzing graphs over time.

So, getting down to brass tacks, the most important tasks are:

  • A data structure to host dynamic attributes efficiently, which makes it possible to present them in the Data Laboratory module.
  • A Dynamic API with the following features: the Dynamic Graph Decorator, which wraps the graph and a time interval, and returns static graph copies for given time intervals as well as arrays of attribute values for given nodes/edges and time intervals.
  • Adapting the Metrics framework to use the Dynamic API in order to offer dynamic versions of existing metrics.

There are also additional features which will be done in the future (they will probably not be included in the next release):

  • Dynamic visualization of attributes.
  • A dynamic version of the Ranking module, i.e. dynamic transformation of visualization attributes.

I’ll try to briefly describe how these features are done.

Dynamic attributes

This is a very interesting task from a programmer’s point of view, since it requires implementing a complicated data structure such as an Interval Tree (see also Antoine Vigneron – Segment trees and interval trees). But users will also find it necessary. The purpose is to make it possible to read dynamic attributes from GEXF files and host them efficiently. Thanks to that we are able to get the values of attributes for different time intervals. It goes without saying how powerful this feature is. To show how it works, let’s consider one node (written in GEXF syntax):

<node id="1" label="Some node">
<attvalues>
<attvalue for="0" value="abcdefgh"/>
<attvalue for="2" value="1" end="2009-03-01"/>
<attvalue for="2" value="2" start="2009-03-01" end="2009-03-10"/>
<attvalue for="2" value="1" start="2009-03-10"/>
</attvalues>
</node>

As we can see, we have one dynamic attribute (id = 2) which has three different values in different time intervals. The first interval starts at “negative infinity”: we simply assume that it only ends and never starts. But if we have some bounds, for instance if the related graph has start and end times, this attribute would “start” at the same moment as the graph. This is rather intuitive. The second interval lasts from 2009-03-01 to 2009-03-10 and the last one lasts from 2009-03-10 to “positive infinity” or the graph’s bound.

After importing this into Gephi we can simply get values for ANY time interval we want, for example [-inf, +inf]. But we need to know how to estimate a final value. In the above example we have three values: 1, 2 and 1. To decide which of them should be returned, we provide a set of estimators: AVERAGE, MEDIAN, MODE, SUM, MIN, MAX, FIRST and LAST. Each of them behaves differently depending on the type of the attribute; for real numbers, for instance, they behave as in statistics.

So, users will be able to get values for different time intervals on demand, for instance in the Data Laboratory module, or (in the future) see them on the screen as part of a rendered graph. Say we have some attribute like priority. A user will be able to choose between several possibilities such as: nothing (meaning this attribute should not be visualized), color, stroke, thickness etc. This means, for instance, that if a node has this attribute close to its upper bound, its stroke thickness would be very high. On the other hand, if a node has this attribute close to its lower bound, only its internal color could be visualized.
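The behaviour of the estimators on the example above can be sketched in Python. This is only an illustration under assumptions: it uses a plain list scan instead of an interval tree, treats interval endpoints as inclusive, uses date.min and date.max as stand-ins for the infinities, and the function name `estimate` is hypothetical rather than part of the Dynamic API.

```python
import statistics
from datetime import date

# (start, end, value) for attribute 2 of the node above.
# date.min / date.max stand in for negative / positive infinity.
intervals = [
    (date.min, date(2009, 3, 1), 1),
    (date(2009, 3, 1), date(2009, 3, 10), 2),
    (date(2009, 3, 10), date.max, 1),
]

# Estimator names from the article, mapped to plain Python functions.
ESTIMATORS = {
    "AVERAGE": statistics.mean,
    "MEDIAN": statistics.median,
    "MODE": statistics.mode,
    "SUM": sum,
    "MIN": min,
    "MAX": max,
    "FIRST": lambda values: values[0],
    "LAST": lambda values: values[-1],
}


def estimate(intervals, low, high, estimator):
    """Reduce the values whose interval overlaps [low, high] to one value."""
    ordered = sorted(intervals, key=lambda item: item[0])
    # Endpoints are treated as inclusive here (an assumption of this sketch).
    values = [value for start, end, value in ordered if start <= high and end >= low]
    if not values:
        return None
    return ESTIMATORS[estimator](values)


# Whole time line: the three values 1, 2 and 1 are combined.
for name in ESTIMATORS:
    print(name, estimate(intervals, date.min, date.max, name))

# A query inside the second interval sees only the value 2.
print(estimate(intervals, date(2009, 3, 2), date(2009, 3, 5), "AVERAGE"))
```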

Metrics framework

For now it is possible to compute a set of important metrics, but all of them take a “static graph” into consideration. The idea of dynamic metrics is then to execute the static ones in a loop, where the graph changes according to the time interval. The following screen shows that using these additional metrics is similar to using their static counterparts:

Dynamic Metric (click on the image)

In the screen we can see only Dynamic Degree Power Law, but of course every dynamic metric will be implemented (at the time of writing this module was still under development, so the final product could differ from the one presented above). The user enters the required information, such as the time interval, and gets a separate report for every time interval. What are the other results?

  • The result for each node/edge is written in the graph, so one can see it in Data Laboratory.
  • The general result is also written and presented in the report.
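The idea of running a static metric in a loop over time intervals can be sketched as follows. This is a simplified Python illustration, not the Metrics framework: edges carry numeric start and end times, average degree stands in for a real Gephi metric, and the names `snapshot`, `average_degree` and `dynamic_metric` are hypothetical.

```python
def snapshot(edges, low, high):
    """Return the (source, target) pairs alive at some point in [low, high]."""
    return [(u, v) for u, v, start, end in edges if start <= high and end >= low]


def average_degree(edge_list):
    """Static metric: average degree over the nodes that have at least one edge."""
    nodes = {node for edge in edge_list for node in edge}
    if not nodes:
        return 0.0
    return 2 * len(edge_list) / len(nodes)


def dynamic_metric(edges, windows, metric):
    """Run a static metric once per time window and collect the results."""
    return {window: metric(snapshot(edges, *window)) for window in windows}


# Made-up dynamic graph: (source, target, start, end).
edges = [("a", "b", 0, 10), ("b", "c", 5, 15), ("a", "c", 12, 20)]
windows = [(0, 4), (5, 11), (12, 20)]

for window, value in dynamic_metric(edges, windows, average_degree).items():
    print(window, round(value, 3))
```

Each window yields its own result, which corresponds to the separate report per time interval mentioned above.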

Conclusion

Evolution of networks, network dynamics and dynamic network analysis are hot topics nowadays, and there is growing interest in studying them. As a result, there is a growing need for DNA tools. In my opinion Gephi is heading towards being one of the best…

Cezary Bartosiak

GSoC 2010 mid-term: Graph Streaming API

andre-panisson

During this summer, six students are working on Gephi with the Google Summer of Code. They contribute to Gephi by developing new features that will be integrated in the 0.8 version, released later this year.

The purpose of the Graph Streaming API project, run by André Panisson, is to build a unified framework for streaming graph objects. Gephi’s data structure and visualization engine have been built with the idea that a graph is not static and might change continuously. By connecting Gephi with external data sources, we leverage its power to visualize and monitor complex systems or enterprise data in real time. Moreover, the idea of streaming graph data goes beyond Gephi: a unified and standardized API could bring interoperability with other available tools for graph and network analysis, which could then cooperate in a distributed fashion.

 

With the increasing level of connectivity and cooperation between systems, it is imperative for a system that aims to be interoperable to comply with the available standards. Graph objects are abstractions that can represent a wide range of real-world structures, from computer networks to human interactions, and there are a lot of standards for exchanging graph data in different formats, from text-based formats to XML-based formats. But real-world structures are constantly changing, and the current formats are not suitable for exchanging this type of dynamic data.

A lot of well-established systems already stream data to its users using a streaming API. Twitter for example defined a Streaming API to allow near-realtime access to its data. They are using two different formats: XML and JSON, but JSON is strongly encouraged over XML, as JSON is more compact and parsing is greatly simplified.

We are not the first to implement a Graph Streaming API, and another very interesting experience is the GraphStream Java Library. It is composed of an API that gives a way to add edges and nodes in a graph and make them evolve. The graphs are composed of nodes and edges that can appear, disappear or be modified, and these operations are called events. The sequence of operations that occur in a graph is seen as a stream of events.

So, as other people already had successful experiences with graph streaming, why not start our work based on these experiences? That’s what we are doing, and beyond finding these experiences very useful, we are also trying to be compatible with the available work. The first Gephi Graph Streaming release will use two formats: JSON for flexibility, and a text-based format based on the GraphStream implementation.

The first version of the Graph Streaming features will be available in the next release of Gephi, but it’s already possible to taste some of these features. To illustrate how simple it will be to connect to a master, the following video shows Gephi connecting to a master and visualizing the received graph data in real time. The graph in this demo is a part of the Amazon.com library, where the nodes represent books and the edges represent their similarities. For each book, a node is added, the similar books are explored, adding the similar ones as nodes and the similarity as an edge.

 

 

The Graph Streaming specification goes beyond the simple fact that a client can pull data from a master: in fact, clients can interact with the master pushing data to it, in a REST architecture. The same data format used by the master to send graph events to the clients is used by clients to interact with the master.

In the next example, we will turn Gephi into a master that provides graph information to its clients. In the Streaming tab of the Gephi application, you can access all the graph streaming features. You can connect to a master by clicking the ‘+’ button, but you can also turn your Gephi into a master by right-clicking “Master Server” and selecting “Start” (you are not limited to a single master per host: each Gephi workspace can be available as a master). By default, the HTTP server will listen on port 8080 in plain HTTP, and on port 8443 using SSL. The server path depends on your workspace: each workspace uses a different path. You can configure these parameters (and also Basic Authentication) with the “Settings…” button:

 

Graph Streaming Server start

Graph Streaming Settings Panel

 

Now, you can connect to it using some simple HTTP client. For example, you could use curl to see the data flowing. First of all, open a shell window and execute the following command:

curl "http://localhost:8080/workspace0"

With this, you are connecting to your workspace in Gephi. If the workspace is empty, you will receive no data, but you will remain connected, so you will receive all events from now on.

Now open another shell prompt, and with the following commands, you could see a triangle appearing at Gephi:

curl "http://localhost:8080/workspace0?operation=updateGraph" -d $'
{"an":{"A":{"size":10,"r":1,"g":0,"b":0,"z":0,"y":500,"x":70}}}\r
{"an":{"B":{"size":10,"r":1,"g":0,"b":0,"z":0,"y":90,"x":250}}}\r
{"ae":{"AB":{"source":"A","target":"B","weight":10,"r":0,"g":0,"b":0,"directed":false}}}\r
{"an":{"C":{"size":10,"r":1,"g":0,"b":0,"z":0,"y":90,"x":-90}}}\r
{"ae":{"BC":{"source":"B","target":"C","weight":10,"r":0,"g":0,"b":0,"directed":false}}}\r
{"ae":{"CA":{"source":"C","target":"A","weight":10,"r":0,"g":0,"b":0,"directed":false}}}'

At the same time, all events will be sent to your connected client, in the other shell window.
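The same update can be prepared from a script instead of curl. The sketch below builds the request with Python's standard library; it assumes that events are JSON objects separated by a carriage return, and the helper names `encode_events` and `build_request` are hypothetical. The request is only constructed here; sending it requires a running Gephi master at the given URL.

```python
import json
from urllib import parse, request


def encode_events(events):
    """Serialize events as compact JSON, separated by carriage returns (assumed)."""
    return "\r".join(json.dumps(event, separators=(",", ":")) for event in events)


def build_request(base_url, events):
    """Build a POST request for the updateGraph operation shown above."""
    url = base_url + "?" + parse.urlencode({"operation": "updateGraph"})
    body = encode_events(events).encode("utf-8")
    return request.Request(url, data=body, method="POST")


events = [
    {"an": {"A": {"size": 10, "x": 70, "y": 500}}},
    {"an": {"B": {"size": 10, "x": 250, "y": 90}}},
    {"ae": {"AB": {"source": "A", "target": "B", "directed": False}}},
]
req = build_request("http://localhost:8080/workspace0", events)
print(req.get_method(), req.full_url)
print(req.data)

# To actually send it to a running master:
# with request.urlopen(req) as response:
#     print(response.status)
```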

With the following commands you can retrieve some of the data:

curl "http://localhost:8080/workspace0?operation=getNode&id=A"
curl "http://localhost:8080/workspace0?operation=getEdge&id=AB"

And you can start manipulating your graph from the command line as you like. There are other event types for changing and removing edges and nodes; for more information about them see the current status of the JSON Streaming Format, available at this page. Note that this format is subject to change, as the API was built to be very flexible and more requirements are being added to it.

But what about connecting two different Gephi instances together? One instance will be the master, and the other the client. Using the Graph Streaming API, a change in a graph in the master’s workspace should cause a change in the client’s workspace, and a change in the client’s workspace will cause it to send requests to the master to update its graph accordingly. Both instances then work in a distributed mode. In fact, different people could work in a distributed mode to construct a graph: this is Collaborative Graph Construction.
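On the receiving side, a client has to turn the stream of events back into a graph. The Python sketch below shows one way to do that for the two event types used in the example above ("an" for adding a node, "ae" for adding an edge); it assumes carriage-return separated JSON objects, rejects any other event type, and the function name `apply_events` and the dictionary layout of the graph are hypothetical.

```python
import json


def apply_events(payload, graph=None):
    """Apply "an" (add node) and "ae" (add edge) events to a dict-based graph."""
    if graph is None:
        graph = {"nodes": {}, "edges": {}}
    for line in payload.split("\r"):
        line = line.strip()
        if not line:
            continue  # skip blank chunks between events
        event = json.loads(line)
        for event_type, objects in event.items():
            if event_type == "an":
                graph["nodes"].update(objects)
            elif event_type == "ae":
                graph["edges"].update(objects)
            else:
                # Other event types exist in the format but are not handled here.
                raise ValueError("unsupported event type: " + event_type)
    return graph


payload = "\r".join([
    '{"an":{"A":{"size":10,"x":70,"y":500}}}',
    '{"an":{"B":{"size":10,"x":250,"y":90}}}',
    '{"ae":{"AB":{"source":"A","target":"B","directed":false}}}',
    '{"an":{"C":{"size":10,"x":-90,"y":90}}}',
    '{"ae":{"BC":{"source":"B","target":"C","directed":false}}}',
    '{"ae":{"CA":{"source":"C","target":"A","directed":false}}}',
])
triangle = apply_events(payload)
print(sorted(triangle["nodes"]))
print(sorted(triangle["edges"]))
```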

My personal impressions about it

For me as a researcher, Gephi has the potential to become a de-facto standard for manipulating and visualizing large scale graphs. I believe that the research community is still lacking a high-quality, general-purpose, community-supported framework for exploratory analysis of large-scale dynamical graph data, and I believe that Gephi has the potential to fill this gap. I’m working also in collaboration with ISI Foundation at the SocioPatterns project, an example of research use case that currently uses Gephi for exploratory data analysis and visualization. The support for dynamic networks, the readiness of the Gephi data model for dynamical update of graph topology and attributes and, in a near future, the support for graph streaming are exciting features that suit very well the large-scale real-time data sources we are dealing with. The potential for processing live streams from our experiments is a unique feature that we are eager to see working.

André Panisson

GSoC 2009 student interview: Jérémy Subtil

Today I am pleased to interview Jeremy Subtil, Gephi student at Google Summer of Code 2009.

Jeremy is a French postgraduate student in Computer Science at Compiègne University of Technology. Fond of FLOSS, he took part in the 2009 Google Summer of Code on a Gephi project.

 

Sebastien Heymann: Hi Jérémy Subtil, you took part in Gephi with the Google Summer of Code 2009 (GSoC), by handling the vectorial preview module and implementing the SVG export. Could you explain why you chose this project, in particular why getting involved in Gephi although there are such great other organizations like Debian, WordPress or Mozilla?

Jérémy Subtil: I heard about Gephi in one of the courses I chose in February last year. From web crawling inputs, we visualized links between websites and identified the emerging clusters, in order to approximate part of the shape of the world wide web. I discovered that Gephi was driven by a small but very active team, so I thought it was a very nice opportunity to join a FLOSS community. In addition, I was very interested in doing some graphic work with the vectorial preview, and I wanted to know more about the SVG format.