Gephi boosts its performance with new “GraphStore” core

Gephi is a graph visualization and analysis platform – the entire tool revolves around the graph the user is manipulating. All modules (e.g. filter, ranking, layout etc.) touch the graph in some way or another and everything happens in real-time, reflected in the visualization. It’s therefore extremely important to rely on a robust and fast underlying graph structure. As explained in this article we decided in 2013 to rewrite the graph structure and started the GraphStore project. Today, this project is mostly complete and it’s time to look at some of the benefits GraphStore is bringing into Gephi (which its 0.9 release is approaching).

Performance is critical when analyzing graphs. A lot can be done to optimize how graphs are represented and accessed in the code but it remains a hard problem. The first versions of Gephi didn’t always shine in that area as the graphs were using a lot of memory and some operations such as filter were slow on large networks. A lot was learnt though and when the time came to start from scratch we knew what would move the needle. Compared to the previous implementation, GraphStore uses simpler data structures (e.g. more arrays, less maps) and cache-friendly collections to make common graph operations faster. Along the way, we relied on many micro-benchmarks to understand what was expensive and what was not. As often with Java, this can lead to surprises but it’s a necessary process to build a world-class graph library.

Benchmark

We wanted to compare Gephi 0.8.2 and Gephi 0.9 (development version) so we’ve  created a benchmark to test the most common graph operations. Here is what we found. The table below represents the relative improvement between the two versions. For instance, “2X” means that the operation is twice faster to complete. A benchmarking utility was used to guarantee the measurements precision and each scenario was performed at least 20 times, and up to 600 times in some cases. We used two different classic graphs, one small (1.5K nodes, 19K edges) and one medium (83K nodes, 68K edges) . Larger graphs may be evaluated in a future blog article.

Benchmark / Graph SMALL (n=1490, e=19025) MEDIUM (n=82670, e=67851)
Node Iteration 23.0x 34.6x
Edge Iteration 40.1x 109.4x
Node Lookup 1.6x 2.1x
Edge Lookup 1.2x 2.3x
Get Edge 1.1x 1.2x
Get Degree 2.5x 2.3x
Get Neighbors 3.4x 1.2x
Set Attributes 2.3x 0.1x
Get Attributes 3.3x 4.0x
Add Nodes 6.2x 5.7x
Add & Remove Nodes 1.4x 2.9x
Add Edges 7.7x 3.8x
Add & Remove Edges 3.3x 1.8x
Create View 2851.0x 4762.3x
Iterate Nodes In View 2.7x 1.5x
Iterate Edges In View 11.6x 7.3x
Save Project 2.4x 1.7x
Load Project 0.6x 0.6x
Project File Size 1.9x 1.5x

These benchmarks show pretty remarkable improvements in common operations, especially read ones such as node or edge iteration. For instance, in average it takes 40 to 100 times less CPU to read all the edges in the graph. Although this benchmark focus on low-level graph operations it will bring material improvements to user-level features such as filter or layout. The way GraphStore creates views is different from what we were doing before, and doesn’t require a deep graph copy anymore – explaining the large difference. Finally, only the set attribute is significantly slower but that can be explained by the introduction of inverted indices, which are updated when attributes are set.

And what about memory usage? Saving memory has been one of our obsession and there’s good news to report on that front as well. Below is a quick comparaison between Gephi 0.8.2 and Gephi 0.9 for the same medium graph above.

Benchmark Gephi 0.8.2 Gephi 0.9 Improvement
Simple graph 115MB 52MB 2.2X
Graph with 5 attribute columns 186MB 55MB 3.4X

This benchmark shows a clear reduction of memory usage in Gephi’s next version. How much? It’s hard to say as it really depends on the graph but the denser (i.e. more edges) and the more attributes, the more memory saved as significant improvements have been made in these areas. Dynamic graphs (i.e. graphs that have their topology or attributes change over time) will also see a big boost as we’ve redesigned this part from scratch.

What’s next?

All of the GraphStore project benefits are included in the upcoming 0.9 release and that’s the most important. However, the work doesn’t end and there’s many more features and performance optimization that can be added.

Then, we count on the community’s help to start collaborating with us on the GraphStore library – calling all database and performance experts. GraphStore will continue to live as an all-purpose Java graph library, released under the Apache 2.0 license and independent from Gephi (i.e. Gephi uses GraphStore but not the opposite). We hope to see it used in other projects in the near future.

graphstore-api

GraphStore API, represented as a graph

Improving the Gephi User Experience

This is an effort to rethink the design of Gephi authored by Donato Ricci, co-founder of Density Design and senior designer at the Sciences Po médialab in Paris, and me, creator of Gephi and an engineer at the same lab (note that I am not Mathieu Bastian, our lead developer and actual powerhouse of the project for the past 10 years).

While this text presents possible improvements and practical solutions, it does not address practical considerations of available labor. Also, be aware that this is not a formal roadmap for future releases but rather a way to open the current state of our reflections for brainstorming. So feel free to share your ideas and comments with us.

There are five main categories that structure the improvements that we currently envision:

  • Design strategy: Ensure that a coherent design philosophy is applied across the entirety of the project
  • User interface: Identify and correct user-facing errors
  • Network-focus: Re-focus design and architecture around the network’s position of primary importance
  • Filling in the Gaps: Providing expected functionality
  • Miscellany: Other minor issues

In addition, we have drafted a UI mockup illustrating some of our propositions.

Design strategy

Gephi was built by engineers without a comprehensive design strategy. This situation is fairly common: engineers approach design in an ad-hoc fashion learning by trial and errors and through casual user observation but without a formal user-testing protocol. Should the tool succeed, it is mostly because the utility triumphs over the pain of use. Gephi is an embodiment of this phenomenon in its current state. Some computer scientists may find it simple, partly because using terrible interfaces is a part of their job, but for many users Gephi is confusing. Geeks of a masochistic tendency may love the tool as a result of digital Stockholm Syndrome, but the bulk of users that could benefit from Gephi find it to be confusing and opaque. In our defense, developing desktop applications is heavily constrained and the Java technology was not helping us to overcome this difficulty. What could a designer do to alleviate this situation? Apply a strategy.

A designer does more than treating the symptoms of poor usability; he or she approaches user experience from its fundamentals. Improving Gephi requires rethinking some of its longest standing features from a new standpoint. A design strategy is the solid foundation upon which we build both a satisfying user experience and underlying software architecture.

Our design strategy fits in five basic points: obtaining substantial and organized user feedback, giving Gephi a clear workflow, implementing a facet-oriented interface layout, reordering panels from the user standpoint, and removing unnecessary features. Each point is explained in further detail along with practical guidelines for implementing potential solutions in the future.

User feedback

We cannot build a sustainable user interface without a quantitative measure of user activity. These data are necessary to support and validate design choices. One approach to obtain this information is to log and collect feedback about interface usage.

An optional logger could be implemented in Gephi to allow users to opt-in to the collection of logs in order to improve the software. Data harvesting can be done as a campaign: for example, we may ask some users to activate it for one month to evaluate the usage of a new interface paradigm that we are testing.

A clear workflow

Users need a clear and visible path to start with Gephi, in particular when opening a new file. We need to remove information to allow users focusing on what is important.

Gephi involves not only the software itself, but also the installer, website and documentation. Our ultimate goal is to make the entry process as simple as possible by coordinating these different elements. We begin by focusing here on just the software itself. We propose to consider that there are only two proper ways to enter in Gephi:

  • Opening a file (constructing a network from a pre-generated file)
  • Connecting to a data source (embedded scraper or API connection)

We also need to clarify the roles of the “open” and “import” functions. We have to clarify that:

  • If the user has a file and needs to see it in Gephi, then “Open” is the right answer
  • If the user has an external data source he or she wishes to connect to, there is distinct menu option for this function

Use case: we have observed that some users try to get in Gephi with a table of nodes and a table of links, and do not succeed in finding the right path. The problem there is that it is not explicit that it is necessary to create an empty document, go to the data table, and then import the tables. Since the pattern is “I have files and I want to see them in Gephi”, then the answer should be under the “Open” menu item.

Facet-oriented interface layout

Rethinking overall design has the virtue of allowing for the reorganization of the interface from a user-centric perspective. The current interface relies on the panels system provided by the Netbeans Platform, which provides some beneficial properties for design. We were inspired by Ben Shneiderman’s motto of “overview first, zoom and filter, then details-on-demand” and it has been quite successful. However the different views are not articulated in a coherent way and the features sometimes struggle to find the right place in terms of visibility.

We propose three simple guidelines for a better organization of the panels:

  • The global hierarchy of containers should reflect the generality of the features
  • Some panels are not facet-dependent: they should not change with the facet
  • The network should occupy a single place whatever its facet, since it is always the same object

Panels guidelines

These guidelines have two consequences. First: facet-dependent panels should be contained inside facet-independent panels; which is to say that there is a single container for all facet-dependent features. Second: the facet selector (currently the three tabs on top) has to be inside this container.

We illustrate this with a comparison between a representation of the current layout and a new simplified structure.

Web

Reordering panels from the user standpoint

A part of our design strategy is to reduce visual clutter by grouping panels that are not used simultaneously. Though it is not intuitive, we prioritize separating panels that work well together over grouping similar features. For instance the following panels should be placed in different groups in order to be used combined:

  • Filters + Layout
  • Filters + Partition or Ranking
  • Timeline + almost anything else

With the current panels there are at least three obvious groups: one with filters, another with layout, and the third with the timeline. Generic contextual information is a fourth possible group, but could be placed in a non-intrusive location like the footer. Some panels, like statistics, could theoretically be at home in any group or even as a separate window that could be invoked from a menu.

Collapsing panels concept

Panels and groups of panels create two levels, so the “window” menu should have two levels too.

Removing unnecessary features

Reducing complexity can also be accomplished by removing features. We have detected at least one clear candidate for removal, but we may find more unnecessary complications to remove.

The “preview” panel of Gephi has been increasingly simplified over iterative updates. The goal of this feature is to provide a quick way to export cartographies. Users with competence in design tend to rely on third-party tools that provide finer-grained control over the visualization, like Illustrator and Scriptographer. Thus, the focus in Gephi is to provide a quick way to export images that can be manipulated in other tools.

We propose to further streamline preview functionality by removing some advanced label features: they infrequently used, complicated, and at times internally inconsistent with other preview settings. Furthermore, it is not necessary to facilitate changes to features like label and node size and color when such adjustments can be made much more easily using other tools.

User Interface

Donato Ricci has identified various flaws in the Gephi UI. Fixing them is a priority for the future.

User-centric features: reordering workflow

Users think in terms of results they want to obtain. They have an action in mind and they search how to do it. By displaying features according to their result, we can both improve user orientation and reduce the tool’s learning curve. A few examples of follow.

We propose to aggregate Partition and Ranking under the more accessible term “Appearance”, and to reverse the order of what is asked to the user. The current interface is organized in the following way: if the user has metadata that can rank the nodes then the user can visualize it using different attributes like color and size. The new interface inverts this approach: if the user wants to color nodes then he or she chooses which metadata to use. The panel may progress like a wizard to reduce cognitive load by drawing attention only to information that is necessary for a given step.

Unified panel appearance

Collapsing advanced layout settings

The current design of Gephi does not respect the general principle of drawing attention to information that is commonly used while obscuring information that is infrequently used.

Tools of the Overview: many problems to fix

The small tool buttons on the left side of the overview panel have a number of problems including:

  • Confusing icons that do not easily communicate the use of tools
  • Indistinct icons that do not sufficiently distinguish between different tools
  • Missing tools that are commonly expected

These issues are compounded by evidence that the tools themselves do not provide sufficient utility for common use.

We propose to alleviate some of these shortcomings by putting most of these tools in a collapsed panel and to have a normal panel dedicated to the settings of each tool. We also propose to implement a default tool cursor that draws on common mouse usage paradigms to provide intuitive functionality to users:

  • A click-drag starting on the background (or an edge) of the view makes a rectangle selection
  • A click-drag starting on a node moves this node
  • A click on a node selects the node, the shift key is used for multiple selection
  • A click on the background deselects
  • A set of meta-keys changes the click function, for instance the spacebar to switch to the view-panning tool (i.e. the hand in Photoshop)
  • The secondary-click works the same

Fixing highlight colors

Highlighting works by tweaking colors so that some nodes get more contrast than others. The contrast should depend on the background color, but this is not current implementation. At the very leastthe following should be done:

  • White background: highlighted nodes darker and other nodes lighter
  • Black background: highlighted nodes lighter and other nodes darker

Network-focus

As a network analysis tool, the network itself plays an obvious central role in Gephi. We have explored different ways to incorporate the network into the software’s presentation and have developed some suggestions for modifications that would increase interface coherency.

A different layout for the panels: network as background

In this approach, the network is contained by a “background sheet” and floating panels support functions. Such a philosophy has been successfully implemented in other systems like Photoshop or Google Maps. Using the root panel for the network, like grouping facet-dependent panels, fosters the feeling that we always deal with the same object, the network. This operates on the metaphor that we are always manipulating a primary canvas that consists of the network to be analyzed.

Network as background

Statistics as an invoked panel or window

Statistics tend to be used on demand, and thus do not need to be displayed permanently. Rather, a discrete menu or button could invoke the statistics panel when needed. Removing this visual information leaves more room to focus on what is important, i.e. the network.

Workspaces: more visible, on top

The workspaces need more attention. We propose to show them as tabs on the top of Gephi. It is more natural to have the workspaces above the facet selector in the hierarchy of panels. This is consistent with the prevalence of the “tab” paradigm in the browser space.

Workspaces-on-top

Filling in the Gaps: Providing Expected Functionality

In addition to the different aspects listed above, users need some well-known common features such as an “undo” function, even if they are complex to implement.

History and undo: feasible if limited to network structure

A visible trace of previous steps, like a proverbial breadcrumb trail, provides users with a sense of orientation and confidence when exploring and manipulating data in a speculative fashion. This also accelerates the learning process by alleviating cognitive load by not forcing users to have to remember a series of unfamiliar steps. This works in tandem with an “undo” feature, which facilitates experimentation without fear of permanently corrupting data.

History and Undo are complex to implement and burden the development of plugins and modules as these functions tend to be deeply embedded in a piece of software’s architecture. This partly explains why they are not currently available in Gephi. However a prudent approach in Gephi would be to focus recording and reversal of changes to the structure of the network: Nodes and edges, their attributes (including color, size and coordinates), but not the state of panels such as filters, statistics…

An initial approach would be to cover only a minimal set of modifications of the network structure. The history would then contain information about the type of the modification, but not its exact content nor the way it was done (manually, filter, data table…). For instance:

  • Modifying attribute X for node N / n nodes / all nodes
  • Modifying color / size / position for node N / n nodes / all nodes
  • Adding / removing: node N / n nodes / all nodes
  • Adding / removing: node attribute X / n attributes / all attributes
  • …and the same for links

The history would not include operations such as exporting files, taking screenshots, modification of views, changes to settings, or other changes that did not directly affect the structure of the network

Protecting irreversible operations

Some operations are irreversible: removing nodes, edges and attributes (and possibly more). Because these operations are definitive and may cause the loss of a certain quantity of work, they should be protected. A classical solution is to ask confirmation for any definitive operation. This is a simple guideline but the result is quite user hostile. We propose a better solution, as implemented in Photoshop: when an irreversible operation has been done, when the user tries to save the network the “Save as…” window appears instead and proposes a different name (with a suffix number or “Copy of X”).

Miscellany

A few additional points deserve to be listed, and are done so in no particular order.

Manual versioning

A basic versioning feature would be appreciated: just the opportunity to save with incrementing / adding a number suffix:

  • “My Network.gexf” is saved as “My Network 01.gexf”
  • “My Network 11.gexf” is saved as “My Network 12.gexf”

A common shortcut for this is Ctrl+Alt+S.

Generalizing zoom options (more internal consistency)

We can currently set how the zoom impacts text labels. The same feature would be useful for edges, for instance to keep 1 pixel lines whatever the zoom, as well as for nodes, for instance to keep small points whatever the zoom.

Generalized zoom options

Size nodes according to area

Our eyes perceive areas. Setting a ranking to the diameter of nodes is less intuitive than to apply it to the area of nodes. We propose to offer an option to customize ranking by either diameter or area, but set the default to area.

Removing unnecessary settings about labels

Node labels and edge labels should help the user identifying nodes. However, using the color or size of labels to visualize attributes is confusing. Gephi presently contains settings to manipulate labels in this way, these settings should be removed and replaced with a simpler interface.

UI Mockup sample (work in progress)

We present here a possible approach to integrate some of the different suggestions made in this document. Consider the following image as a way to help imagine the future of the Gephi user experience.

MockupA-01

As we stated earlier, the purpose of this document is to open up the floor to brainstorming ideas about improving the Gephi UX. Please share your ideas in the comments!

PS: Thanks to Niranjan Sivakumar for his excellent proof-reading 🙂

Rebuilding Gephi’s core for the 0.9 version

This is the first article about the future Gephi 0.9 version. Our objective is to prepare the ground for a future 1.0 release and focus on solving some of the most difficult problems. It all starts with the core of Gephi and we’re giving today a preview of the upcoming changes in that area. In fact, we’re rewriting the core modules from scratch to improve performance, stability and add new features. The core modules represent and store the graph and attributes in memory so it’s available to the rest of the application. Rewriting Gephi’s core is like replacing the engine of a truck and involves adapting a lot of interconnected pieces. Gephi’s current graph structure engine was designed in 2009 and didn’t change much in multiple releases. Although it’s working, it doesn’t have the level of quality we want for Gephi 1.0 and needs to be overhauled. The aim is to complete the new implementation and integrate it in the 0.9 version.

In November 2012, we started to develop a completely new in-memory graph structure implementation for Gephi based on what we’ve learnt over the years and our desire to design a solution that will last. The project code-name is GraphStore and we focus on four main things:

  • Performance: The graph structure is so important to the rest of the application that is has to be fast and memory efficient.
  • Stability: The new code will be the most heavily unit-tested in the history of Gephi.
  • Simplicity: The Graph API should be documented and easy to use for developers.
  • Openness: If possible, we want GraphStore to be used in other projects and keep the code free of Gephi-specific concepts.

Gephi is known to use a large amount of memory, especially for very large networks. We want to challenge ourselves and tackle this issue by redesigning the way graphs are encoded and stored. Besides memory usage, we carefully analyzed possible solutions to improve read/write performance and optimize the throughput. Stability and simplicity are like food and shelter, and whatever we try to do at Gephi should be simple to use and stable. As we’re going towards a 1.0 version, we’re putting more and more efforts to testing and code quality.

Since November 2012, we have been working on GraphStore separately from Gephi’s codebase and will start the integration fairly soon. The Graph API is very similar to the existing API. However, it isn’t entirely compatible and several core things changed like attributes, views or dynamic networks and will require a lot of work in some modules. On the other hand, because the GraphStore code is decoupled, it could be leveraged in other projects. For instance, it could serve as a Blueprints implementation as an alternative to TinkerGraph.

Graph structure

A graph (also called network) is a pair of a set of nodes and a set of edges. Edges can be undirected, or directed if the direction of the relation matters. Edges may also have weights to represent a value attached to the edges, like the strength of a connection or the flow capacity. Edges may also point to the same node (i.e. self-loops). Gephi currently supports these features, but they are not sufficient to describe the variety of problems graphs can be helpful with. Multigraphs permit several relationships between nodes and is for instance commonly used to represent RDF graphs. Multigraphs with properties (i.e. ability to attach any property to nodes and edges) have recently become the standard representation for graph databases.

The next version of Gephi will support multigraphs and therefore allow multiple edges between nodes to be imported. The rollout will be done in two phases. The first phase is to allow this new type of graph to be imported, filtered and exported. We will update the importers and add new options to support these graphs. The second phase is to update the visualization and the way multiple edges between nodes look like.

Hierarchical graphs

Since the 0.7 version released in 2009, Gephi has supported hierarchical graphs. Hierarchical graphs let the user group or ungroup nodes so it forms a tree. Nodes which contain other nodes are named meta-nodes and edges are collapsed into meta-edges. Groups obtained from clustering algorithm (e.g. modularity) could also easily be collapsed into meta-nodes in order to study the network at a higher level. We initially recognized the potential of this idea for network analysis and developed a hierarchy-enabled data structure. However, we realized we didn’t completely fulfill the vision by not providing all the tools to fully explore and manage hierarchical networks. Although the data structure allows it, the software still lacks many features to really make hierarchical networks explorable.

Recently, we are more focused on networks over time and plan to continue to do so. In the past years, users have shown steady and continuous interest in dynamic networks and we haven’t really seen a strong interest in hierarchical networks. Therefore, we propose to remove this feature from next releases. On the developer side, cutting this feature will greatly simplify the code and improve performance.

Dynamic networks

Networks that change over time are some of the most interesting to visualize and analyze. We have heavily invested in supporting this type of network, for instance by developing the Timeline component. However, dynamic graph support was added after the current graph structure implementation was conceived and therefore remains suboptimal and difficult to scale. Now that we have enough hindsight, we can rethink how this should be done and make it simpler.

One pain point is the way we decided to represent the time. Essentially, there are two ways to represent time for a particular node in a graph: timestamps or intervals. Timestamps are a list of points where the particular nodes exist and intervals have a beginning and an end. For multiple reasons, we thought intervals would be easier to manipulate and more efficient than a (possibly very large) set of timestamps. By talking to our users, we found that intervals are rarely used in real-world data. On the code side, we also found that it makes things much more complex and not that efficient at the end.

In future versions, we’ll remove support for intervals and add timestamps instead. We considered supporting both intervals and timestamps but decided that it would add too much complexity and confusion.

Graph structure internals

Graph structures design is an interesting problem to solve. The objective is quite simple, yet challenging: how to best represent an interconnected graph so it’s fast to query and compact in space? Also, how to keep it simple and serve a large number of features at the same time?

Graph storage

Our goal is to develop a thread-safe, in-memory graph structure implementation in Java suitable for real-time analysis. You may ask how this differs from a graph database or a distributed graph analysis package. In a few words, one can say the requirements are quite different.

Graph databases like Neo4j, OrientDB or Titan store the graph on local disk or in a cluster and are optimized for large graphs and large number of concurrent users. Typically, the networks are much larger than what can fit in memory and these databases mostly focus on answering traversal queries. In the environment where graph databases operate most of the needs can be converted in some sort of traversal query (e.g. friends of X, tweets of Y). Traversal queries are also the reason why graph databases scale to billions of nodes. Indeed, for each traversal, only a subset of the graph is accessed. This is quite different from Gephi, which by its nature of being an analysis software needs to access the complete graph. For instance, when a layout is running Gephi needs to read the X,Y position of each node as quickly as possible. Although reading from the disk can be very quick as well (e.g. GraphChi), it’s limited to sequential access and things become more complex that way.

Because of the real-time requirements, we want to keep our graph data in memory accessible at all time. However, we want to make it easy to connect to external data sources, and graph databases in particular.

Reducing overhead

In computer science, overhead is any combination of excess or indirect computation time, memory, bandwidth, or other resources that are required to attain a particular goal.

GraphStore heavily relies on Java primitives, arrays and efficient collections library like fastutil. We are reducing overhead by simply avoiding using too many Java objects, which are very costly. Instead of using maps, trees or lists, Nodes and Edges are stored in large arrays which can be dynamically resized in blocks. For instance, iterating over the graph should be extremely fast because the CPU caches array blocks. This may sounds obvious but performance optimizations are tricky in Java because of the JVM and the uncertainty of what makes a difference and what doesn’t. In his “Effective Java” book, Joshua Bloch writes “Don’t guess, measure” and that’s still true today. For our project, we rely on well-defined micro-benchmarks to see where the bottlenecks are and how to make our data-structure more cache-friendly and more compact in memory. When the graph contains millions of edges, every byte saved per edge can make a large difference at the end.

In terms of speed, we focused on optimizing the most common operations, which are iterating over all the elements and consult nodes’ neighbors. Typically, a layout algorithm needs to read the neighbors of every node at each iteration. Neighbors can’t simply be an unsorted list because of the removal complexity: to remove a node, you need to know where it is. The current Gephi graph structure uses a binary tree to store the node’s neighbors. Although the complexity is logarithmic, every node in the tree takes extra memory and logarithmic complexity is still suboptimal. After isolating the problem in a benchmark, we found that using a double linked-list is the best solution for our requirements and achieves a O(1) complexity, as it fulfills both a quick iteration and quick update. Here is a snapshot of the solution:

Every edge has 4 integer pointers to the next in/out predecessor and successors and a separate dictionary would help to find the right edge based on the source and destination pointers. Each node has a pointer to the first edge in the linked list (i.e the head). Node ids are integers (32 bits) so one can easily create a long->Edge dictionary by encoding the source and destination node into a single long number (64 bits). The diagram intentionally leaves out the multigraph support for simplicity. In reality, nodes can have multiple head pointers, one for each edge type. Each edge type is represented by a integer index.

Views

Views are one of the most useful aspects of Gephi’s graph structure and are mainly used behind the scenes in the Filter module. A view is a graph subset (i.e. a subgraph) which remains connected to the main structure, so if a node is removed from the graph, it’s removed from the views as well. For instance, when users create a ‘Degree Filter’, Gephi creates a view and removes all the nodes which don’t fulfill the degree threshold. Multiple views can co-exist at any time in the graph structure. In the current graph structure, a node tree complete copy is done for every view and we found that this can be very inefficient.

In the new version, the way views are implemented is very different and should yield to better performance. Instead of doing a copy of the nodes, we maintain bit-vectors for nodes and edges. Because these elements are stored in large arrays with a unique identifier, it’s easy to create and maintain a bit-vector. When developers obtain the ‘Graph’ object for a particular view, the bit-vectors are used behind the scenes to adapt iterators and accessors. This solution should make filtering for large graphs much quicker. One drawback is that whereas the current implementation copies and then trims the view, GraphStore work with bit-vectors but continues to access the complete graph. In other words, if the view represents only 1% of the original graph, it still needs to iterate over the 100% to find which elements are the 1%. Even though this sounds bad, our benchmarks show it’s a very fast operation and we win overall because of the reduced overhead of duplicating the graph. Moreover, we can introduce some caching later to optimize this further.

Inverted Index

When you’re using the Partition module in Gephi, you’re manipulating some sort of inverted index. Nodes and edges have properties like ‘gender’, ‘age’ or ‘country’, and these properties are contained within the nodes and edges objects. An index is a simple data structure which allows to retrieve the list of elements for a particular value. For instance, the partition module needs to know what is the number of ‘male’ or ‘female’ nodes for the ‘gender’ column. When the column is a number like ‘age’, it also needs to know what is the maximum and minimum value. Unlike the Ranking module and its auto-apply feature, the Partition module is not refreshed in real-time and therefore difficult to use when the graph is changing a lot. We have decided to invest in this feature for the future release and are building a real column inverted index in the graph structure. The index will simply keep track of which values exist for each column and which elements are holding this value. The index will be updated in real-time as elements are added, removed or updated.

The ability to quickly retrieve elements and counts based on specific values will be very useful in many different modules like Filters, Partition or Data Laboratory. New APIs will be added for developers to use the newly created index interface. As we’re working on attributes storage and manipulation, we’ll also merge the Attributes and Graph API because they are so interconnected that it doesn’t really make sense to have them separate. The interfaces that developers are familiar with like Table or Columns will remain the same.

Events

In software programming, events are a common way to inform other modules that something changed. In Gephi, we also use events to convey graph updates events to inform other modules about updated nodes or edges. In the new GraphStore, we’ll stop using events to transport graph modifications because of the large overhead due to the creation of event objects. Indeed, when 10K nodes are added to the graph, the existing structure literally creates 10K event objects and puts them in a queue. Although the event queue is compressing objects of the same type, the overhead to create, queue, send and destroy large amount of small Java objects is too large.

Instead of a push model (i.e. the emitter is pushing updates), we want to rather promote a pull model (i.e. the listener pull updates from time to time) for future releases. A similar system is already in place to link the graph and the visualization module and it has been working without a glitch. We’ll develop the tools to easily calculate graph differences between a listener module and the graph structure. By removing the bottleneck, write performance should greatly improve.

Timestamps

As said earlier, we’ll add timestamps support to represent dynamic networks. Instead of using a time interval, a timestamp array will be associated with nodes and edges. For element (node/edge) visibility, each timestamp represents the presence of the element at that time. For example if a network snapshot is collected every month for a year, each node will have up to 12 different timestamps. The timestamp itself is a real number and can therefore represents an epoch time but also any other value in a different context. For a dynamic attribute, the time+value is simply represented as a list of (time, value) pairs.

To support the timeline and dynamic networks algorithm, we’re developing an inverted index for timestamps so we can make time filtering very quick. One good thing about intervals is that it’s very easy to know if two intervals overlaps with each other. With a flat list of timestamps, one can’t avoid to go through the entire list. The index will essentially map timestamps to the nodes and edges elements in the graph and therefore solves this issue. The Interval tree implementation which we are currently using to store intervals is based on a binary tree and is very costly in memory because of all the Java objects overhead. Using simple arrays should reduce overhead and improve performance for large dynamic networks. When computing a dynamic network algorithm (ex: Clustering coefficient over time), we’re using a sliding window over the graph so the ability to quickly filter is critical as it impacts how fast the graph refreshes.

Saving/Loading

Saving and loading the graph structure into into/from a file (or a stream) is another critical feature. When a user saves a project in Gephi, the graph data structure is serialized in XML and compressed into a .gephi file. If you worked with project files in Gephi, you may have experienced corrupted files issues or errors when loading a file. We’ve done our best to fix these problems but some still remain. We’re rethinking how this should be done in GraphStore and are making a call to rewrite the code from scratch. Our approach will rely on a lot of unit tests to make sure the code is stable so we don’t repeat the same issues in future versions. Please note that this concerns the .gephi files only and existing importers (e.g. GEXF, GraphML) will remain the same.

Concerning the GraphStore serialization, we’re abandoning XML in favor of pure byte arrays. That should yield to better performance and reduced project file size. We’ll create a custom reader for previous Gephi versions so you can still open your existing projects. Other modules like Filters or Preview will continue to use XML as it’s working just fine.

Next steps

This is the first post about the Gephi 0.9 version and more will come soon. We’re excited about the current developments and hope to hear from you. Please join the gephi-dev mailing list to learn more about ongoing projects and contribute. We need your ideas!

Follow us on Twitter!

GSoC: Legend module

My Name is Eduardo Gonzalo Espinoza Carreon and during this summer I developed the new Legend Module for Gephi, with the mentoring of Eduardo Ramos and Sébastien Heymann. This article will give you an overview of the work done.

Problem statement

Currently Gephi offers the possibility of visualizing graphs, but what about legends? Legends provide basic and extra information related to the graph and they are useful when interpreting any kind of network map. If a person is not familiar with the content of a graph, missing or wrong legends could lead to misleading interpretations and sometimes wrong decisions. When a visualization is used by multiple people for discussing, analyzing or communicating data, legends are of great importance.

For instance, the following graph represents the coappearance of characters in the novel Les Miserables. After performing a visual analysis we could only conclude that the graph has 9 groups. This is probably a little of the information the creator wanted to transmit. The graph has no information related to the number of nodes explored, or what the groups represent and how many elements each group has, etc.

A current workaround to solve this problem is to export the graph as an image, and then manually add the legends using Inkscape, Adobe Illustrator or another graphics editor. However this task is time-consuming and can be automated. The new Legend Module proposes a solution to this problem.

Solution

We propose an extension to the Preview module for generating legend items. The following legend items are available: Table, Text, Image, Groups and Description. They can be added using the Legend Manager, which is shown in a new tab under the Preview Settings:

After selecting a type of legend, the user chooses a sub-type builder, e.g. “Table” > “Partition interaction table”, or “Top 10 nodes with greatest degree”, as shown in the following figure:

When a new Legend item is added, it is displayed in the list of active legend items, where the user can edit its properties. The user can also edit its label and assign a user-friendly name to remember the content of the legend easily.

Every item has a set of common properties: label, position, width, height, background color and border, title, description; and also each type of item has its own properties and data. The values of those properties are editable through a Property Editor like the one used in the preview settings.

Some properties like scale and translation can be modified using the mouse like most of the graphic design applications. All legend items are designed with a smart way of autoresize. It’s not the common scale feature, e.g. if the text included in the Text Item is bigger than the size assigned, then the Text Renderer overrides the text font defined by the user and decreases the font size until the text is able to fit in the specified dimensions. The results of this feature are shown in the next figure:

Workflow

The legend builder retrieves the graph data (partitions, node labels, edge labels, etc) and creates a new Legend item for each of them. Then a legend renderer makes use of these information, plus the properties set by the user, to render the Legend item to the specified target: PNG, PDF or SVG.

For developers

The renderers can be extended. For instance, the default Group Renderer is:

Using external libraries like JFreeChart, we can extend it to create a Pie Chart Renderer like as follows:

Other types of items can be created by combining other available Legend Items or by extending Legend Item, Legend Item Builder and Legend Item Renderer.

The Legend Module also provides a save/load feature. So you can save your legends for future editing.

Limitations

Currently there are some limitations like selecting a specific renderer for each type of item, and also exporting legends to SVG format is not done automatically like PNG and PDF, e.g. Exporting an Image (they will be embedded in the SVG file).

Conclusions

I would like to thank Eduardo Ramos and Sébastien Heymann for their support and feedback, which was critical during the development of this new module. The Legend module will be available as core feature in next Gephi release.

This GSoC was a great opportunity to learn and it also represents my first important contribution to the open-source community.

GSoC: Force Directed Edge Bundling

My name is Taras Klaskovsky and during this Summer Of Code I have implemented the Force Directed Edge Bundling algorithm.

Force Directed Edge Bundling (FDEB) is an edge layout algorithm. Gephi already has node layouts, which are placing the nodes (usually using force-directed algorithms). FDEB helps to further improve graph visualization by changing shapes and colors of the edges to merge them in bundles, analogous to the way electrical wires and network cables are merged into bundles along their joint paths and fanned out again at the end. This reduces visual clutter, inevitable in large graphs, allowing to find high-level edge patterns and get an overview of the graph just by looking at it. As example, US flights graph below, with nodes as airports and edges as flights.

The algorithm

Edges are modeled as flexible springs that can attract each other while node positions remain fixed. A force-directed technique is used to calculate the bundling and it is relatively slow. On small graphs it works pretty fast due to optimizations, but consumes large amount of memory to store precomputed similar pairs of edges; for average and large graphs a special slow low-memory mode implemented. After every iteration preview is being refreshed, so it’s possible to observe formation of bundles in real-time.

Full algorithm description can be found on this research paper.

original
mode-1

mode-2
mode-3

Renderer Modes

FDEB has 3 renderer modes:

  • Simple renderer: Draws all edges with the same color and transparency (color, transparency and thickness are set in the preview settings in the bottom-left panel), bundles are emphasized by combined transparency.
  • Gradient renderer: Draws all edges with color from gradient slider. Edges that are similar to higher number of other edges get higher color.
  • Gradient slow renderer: Uses more precise method to determine intensity of edge (needs precalculate points checkbox and to be re-runned) to set personal color for every segment.

FDEB can be customized with lots of options and they all have descriptions, some of the most influential on result are:

  • Use inverse-qudratic model: Makes more localized bundling
  • Compatibility threshold: Ignore pairs of edges that are not similar enough, which makes FDEB faster (ignored in low-memory mode) and bundling become more localized.

To make FDEB faster it’s also possible to decrease number of cycles/subdivision points increase rate.

gui

The Edge Layout API

An edge layout API has been created to simplify the integration of other edge layouts into Gephi. This API is very similar to the existing node layout API, with the following additions. Since edge layouts do not only change shapes of edges, but are also responsible for their visualization, a modifyAlgo() is called each time when the Preview is refreshed, to control the modification of parameters. The edge layout data, which is accessible for each edge, provides polyline points and it’s colors for all renderer modes.

How to get the FDEB algorithm

It will be available in the next release of Gephi. The current source code is available on Github.

GSoC mid-term: a new Timeline to explore time-varying networks

daniel

My name is Daniel Bernardes and during this Google Summer of Code I am working on the new Timeline interface.

Dynamic graphs have been the subject of increasing interest, given their potential as a theoretical model and their promising applications. Following this trend, Gephi has incorporated tools to study dynamic networks. From a visualization perspective, a critical tool is the Timeline component, which allows users to select pertinent time intervals and display and explore the corresponding graph. The challenge concerning the timeline was twofold: redesign the component to improve user experience and add extra features and introduce an animation scheme with the possibility to export the resulting video.

Together with my mentors Cezary Bartosiak and Sébastien Heymann, we have proposed a new design for the timeline component featuring a sparkline chart in the background of the interval selection drawer (which is semi transparent): this feature will help the user to focus on particular moments of the evolution of the dynamic graph, like bursts of connections or changes in graph density or other simple graph metrics. Current metrics are the evolution of the number of nodes, the number of edges and the graph density. The sparkline chart was preferred to other chart solutions because it does not add too much visual pollution to the component and adds to the qualitative analysis. The interaction with the drawer remains globally the same of the old timeline, to guarantee a smooth transition for the user.

To implement this feature we have used the chart library JFreeChart (a library already incorporated to Gephi), customizing their XYPlot into a Sparkline chart by modifying their visual attributes. To display the Sparkine, one needs to measure the properties of the graph in several time instants of the global time frame where the dynamic graph exists. This represented a major challenge, since the original architecture did not allow the timeline component to access (and measure) the graph in particular instants of time; the solution was to introduce a slight modification to the DynamicGraph API to provide an object which gave us snapshots of the graph at given instants. Other challenges we dealt with included the automatic selection/switching of real number/time units in the timeline (depending on the nature of the graph in question) and sampling granularity of the timeline.

Another breakthrough of this project was the introduction of the timeline animation. Once the user has selected a time frame with the drawer it can make it slide as the corresponding graph is being displayed on the screen. Besides the technical aspects of interaction between the timeline and the animation controller, there were also an effort to calibrate the animation (ie, in terms of speed and frames) so it would be comfortable and meaningful for the user.

As far as the UI is concerned, the component has gained a new “Reset” button next to the play button which activates the timeline drawer and displays the chart. It also serves to reset the drawer selection to the full interval when the timeline is active. The play button gained its original function, that is, to control the animation of the timeline — instead of activating the selection.

Finally, the animation export to a video format revealed to be more tricky than expected and couldn’t be finished as planned. There were several setbacks to this feature, beginning with the selection of a convenient library to write de movie container: it turns out that the de facto options available are not fully Java-based and need an encoder working in the background. The best alternative I found was Xuggler, which is based on ffmpeg. Also, obtaining screen captures of the graph to were a little bit tricky so I have exported SVG images from the graph corresponding to each frame, converted them to jpeg and than encoded them though Xuggler to a video format. As one might expect, this solution is not very efficient in terms of time, so Mathieu Bastien and my mentors suggested me to wait for the new features from the new Visualization API that would make this process simpler.

In addition to current bugfixes and minor improvements concerning the timeline and the animation, the movie export remains the the next big step to close this project. If you have questions or suggestion, please do not hesitate! The new timeline will be available in the next release of Gephi.

DB

GSoC mid-term: new Preview API

My name is Yudi Xue and during this Google Summer of Code am glad to work on the Core evolution of Gephi.

Current API in the Preview module provides too many granular methods and classes. Developers are clueless about how they may extend the component. In this project, we do not seek to expand what the Preview module already have to offer. Rather, we focus on making the Preview module easy to learn, easy to use and easy to extend to the Gephi developers. The new API will allow developers to focus on particular parts of the module. They may specify a new visual algorithms just by implementing a new type of Renderer, such as edge bundling and convex hull. They may also extend the RenderTarget to allow display or export visualization to different platform.

The user story

We took the infovis reference model into consideration when we started designing the new infrastructure. The infrastructure aims to provide support to a visualization-preview workflow:

raw data -> the data builder -> renderers -> render targets.

In particular, the raw data is the graph associated with the current gephi workspace. The data builder (DataBuilder) will interpret information associated with the nodes and edges and generate Item objects for Preview use. The Item objects are immutable objects that are either node item (NodeItem), edge item (EdgeItem) or item group (GroupItem) specified from the graph workspace or data lab. We append “Item” to refer that they are data rather than display objects. After the data has been imported, the preview controller (PreviewController) will associate each type of entity items with Renderer objects. Renderer objects are functional procedures that describe how an item should be drew. While we give information to an Renderer object what it is going to draw, we also tell it what RenderTarget it will use. By default, we provide ProcessingRenderTarget, PDFRenderTarget and SVGRenderTarget. All RenderTarget objects contribute to the RenderTarget API, which provide granular drawing functions that can be used by developers to form advanced visual algorithm. In addition to the workflow, we will provide a flexible properties structure to the Preview module so it may be used to provide listener to user interface commands. The property will allow dynamic dependency where grouped properties can listen for a single parent property.

The code below demonstrates how a Renderer to a particular Item type could be updated at runtime.

Code sample:

PreviewController prc = (PreviewController)Lookup.getDefault().lookupItem(
                                                       PreviewController.DEFAULT_IMPL).getInstance();
prc.loadGraph();
// Load graph from workspace
prc.updateRenderer(NodeItem.class, new Renderer() {
    // How I want to draw a node, edge, or item types.
    // Specify your procedural visualization algorithm here
    @Override
    public void render(Item item, RenderTarget rt) {
        NodeItem ni = (NodeItem) item;
        rt.drawImage(..);
        rt.drawline(..);
    }
    // The RenderTarget will pick up the properties and draw the rest..
});

The big picture

Speaking of API flexibility, the Preview API goes from constrained to flexible in the direction from DataBuilder to RenderTarget. Here is the big picture:

Current progress

  • done:
    • a working copy based on the new architecture
    • added ProcessingRengerTarget
    • added GroupRenderer (Convex hull)
    • added ImageRenderer
    • basic unit testing
    • basic functional testing against updating Renderer in Preview API at runtime
  • in progress:
    • Property support
    • Selfloop, curved edge drawing
    • PDF and SVG RenderTarget implementation

Here is a screenshot of the new system with convex-hull enabled:

Code practice

The code base is under active development at https://code.launchpad.net/~yudi-xue/gephi/gephi-preview. The code base includes the PreviewAPI module and the PreviewImpl module.

Lookup API

We make use of Netbeans Lookup API to instantiate singleton and use Lookup. Template to ensure the correct implementation been called.

For example, to call the default PreviewController constructor, we call:

/*
* DEFAULT_IMPL is defined in the interface.
* It refers to default implementation class
* "org.gephi.preview.PreviewControllerImpl"
*/
(PreviewController)Lookup.getDefault().lookupItem(PreviewController.DEFAULT_IMPL).getInstance();

Accordingly, you may choose to use the API with your implementation by creating a Template that points your implementation class.

Functional Tests

During the development, we are creating functional tests against our own API for the purpose of both flexibility and stability. the “PreviewAPIFunctionalTest”

Conclusions

Our goal is to bring modularity and extensibility to the Preview module. We aim to deliver the freedom in defining your own visual algorithms (Renderer) and user interaction (Property) and make use of API without thinking about the detailed mechanism. I would like to give my thanks to Dr. Christian Tominski, Mathieu Bastian and Sébastien Heymann for their support and feedback, which is critical during the development for the new architecture.