DiNgHy

This repository contains the implementation of the algorithms discussed in "DiNgHy: Null Models for Non-Degenerate Directed Hypergraphs", published in the proceedings of ECML PKDD '25. [1]

The implementation leverages the NuDHy [2] codebase. This README is adapted to reflect the updated code, and includes the portions of the original README that still apply.

Overview

This repository implements a suite of Markov chain Monte Carlo (MCMC) algorithms for sampling directed hypergraphs from two null models, one of which is novel.

The existing null model from [2] preserves the left and right in- and out-degree sequences of the bipartite graph corresponding to the directed hypergraph. The corresponding sampler, NuDHy-Degs, is based on Parity Swap Operations (PSOs), which preserve the degree sequences.

Our novel null model preserves the same properties as the aforementioned model, as well as the non-degeneracy of the sampled dihypergraph. The corresponding samplers, DiNgHy and DiNgHy-B, are based on Non-Degenerating Swaps, which preserve the degree sequences without introducing degeneracy.
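For intuition, the degree sequences these null models preserve can be computed directly from the head and tail lists of the hyperedges. The following is an illustrative Python sketch, not the repository's Java implementation, using one common convention in which a vertex's out-degree counts its head memberships and its in-degree its tail memberships (the paper's exact definitions may differ):

```python
from collections import Counter

def degree_sequences(hyperedges):
    """Compute vertex out- and in-degree sequences of a directed
    hypergraph, where each hyperedge is a (head, tail) pair of
    vertex iterables. Out-degree counts head memberships,
    in-degree counts tail memberships."""
    out_deg, in_deg = Counter(), Counter()
    for head, tail in hyperedges:
        for v in head:
            out_deg[v] += 1
        for v in tail:
            in_deg[v] += 1
    return out_deg, in_deg

# Two tiny hyperedges: ({1, 2} -> {3}) and ({3} -> {1})
edges = [([1, 2], [3]), ([3], [1])]
out_deg, in_deg = degree_sequences(edges)
```

A swap operation accepted by the samplers leaves both returned counters unchanged; DiNgHy and DiNgHy-B additionally reject swaps that would make the resulting dihypergraph degenerate.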

Reproducing Paper Results

Below are the commands we used to reproduce the results from the paper [1]. More details on the code, datasets, and configuration follow later in this README.

  1. Build the Java code by running mvn clean package from the code directory, then copy the JAR from the code/target directory to the root directory of this repository and rename it DINGHY.jar.

  2. For each of the samplerType values "DiNgHy", "DiNgHy_B", and "NuDHy_Degs", set samplerType in the configuration file config.cfg and run run.sh from the main directory. This should produce a folder for each sampler and dataset under out/samples and out/convergence.

  3. For each dataset folder in out/samples/[samplerType], zip the folder into a zip file of the same name. This is how the Python helpers access the samples.

  4. Run the python helpers count_degen_samples.py, plot_degen_while_sampling.py, plot_p_values.py, and plot_convergence.py to generate the tables and plots from the paper.

    Specifically, count_degen_samples.py writes a table containing all the data from the main text and supplementary materials to table.txt. The plots from the main text are written to the folders degeneracy_plots/, p_value_plots/, and convergence_plots/. To generate the appendix plots with plot_degen_while_sampling.py and plot_p_values.py, change the datasets as described in the comments at the beginning of each file, change the file that the output is saved to, and rerun each helper.
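Step 3 above can be scripted. The following is a minimal sketch assuming the out/samples/[samplerType] layout described in the text; whether the helpers expect the dataset files at the top level of each archive is an assumption here:

```python
import shutil
from pathlib import Path

def zip_sample_folders(samples_root):
    """Zip every dataset folder directly under samples_root into a
    sibling archive of the same name (e.g. enron/ -> enron.zip).
    The archive contains the folder's files at its top level."""
    root = Path(samples_root)
    archives = []
    # Materialize the folder list first so the new .zip files
    # created alongside them do not affect iteration.
    for folder in sorted(p for p in root.iterdir() if p.is_dir()):
        # shutil.make_archive appends ".zip" to the base name itself
        archives.append(shutil.make_archive(str(folder), "zip", root_dir=folder))
    return archives

# e.g. zip_sample_folders("out/samples/DiNgHy")
```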

Content

code/            ... Java source files of the samplers
data/            ... datasets used in the experimental evaluation
helpers/         ... Python and Bash scripts to compute hypergraph metrics of interest
run.sh           ... Bash script for running the Java code
config.cfg       ... configuration file for the Java code
requirements.txt ... Python libraries required to run the scripts in helpers

The directory helpers includes the Python and Bash scripts used to calculate the metrics presented in the tables and charts in the paper.

Requirements

  • To build the Java code, you need Maven;

  • To run the Java code (the source files are in the folder code), you need Java JRE v1.8.0;

  • To run the Python scripts and check the results of our experimental evaluation, you must install the libraries listed in the file requirements.txt.

Input Format

Each line of the input file for directed hypergraphs consists of a tab-separated pair of two comma-separated lists of integers. The first list in the pair is the head of a hyperedge, and the second list is the tail.

All the scripts assume that the file extension is .tsv. The folder data includes all the datasets used in our experimental evaluation.
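For illustration, one line of this format can be parsed as follows (a sketch, not the repository's reader):

```python
def parse_hyperedge(line):
    """Parse one line of the .tsv input format: a tab-separated pair
    of comma-separated integer lists, returned as (head, tail)."""
    head_str, tail_str = line.rstrip("\n").split("\t")
    head = [int(v) for v in head_str.split(",")]
    tail = [int(v) for v in tail_str.split(",")]
    return head, tail

# The hyperedge with head {1, 2} and tail {3, 4} is the line "1,2\t3,4"
```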

How to Use the Samplers

Build the project and generate the JAR file by running mvn clean package from the code directory, then copy the JAR from the code/target directory to the root directory of this repository and rename it DINGHY.jar, the name the configuration file assumes.

Now you can use DiNgHy, DiNgHy-B, and NuDHy-Degs by running the script run.sh in the root directory of this repository.

The value of each parameter used by the samplers must be set in the configuration file config.cfg (format described below).

General Settings

  • datasetsDir: directory where the input files are placed.
  • resultsDir: directory where the output files will be stored.
  • seed: seed for reproducibility.
  • numThreads: number of threads for parallel computation.
  • maxNumSwapsFactor: integer used in the Convergence experiment.
  • k: number of frequent itemsets to return in the Convergence experiment.
  • size: minimum size of a frequent itemset to return in the Convergence experiment.
  • samplerType: name of the sampler to use (admissible values are "DiNgHy", "DiNgHy_B", and "NuDHy_Degs").
  • store: whether the samples generated must be stored on disk.
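For orientation, a hypothetical config.cfg fragment covering these settings might look like the following. The key names come from the list above, but the values shown and the key=value syntax are assumptions; consult the shipped config.cfg for the actual format:

```
datasetsDir=data/
resultsDir=out/
seed=0
numThreads=4
maxNumSwapsFactor=10
k=100
size=2
samplerType=DiNgHy
store=true
```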

Dataset-related Settings

  • Dataset names: names of the dataset files under datasetsDir.
  • Default values: comma-separated list of default values for each dataset, i.e., number of swaps to perform before returning the random sample, and number of random samples to generate.
  • Minimum frequency thresholds: minimum frequency thresholds used to return the top-k frequent itemsets of size greater than size in the Convergence experiment.
  • Experimental flags: test to perform among (1) convergence (ConvergenceFI.java), (2) structural entropy (StructuralEntropy.java), (3) scalability (Scalability.java), (4) generation of random samples (Sampling.java), and (5) generation of random samples with snapshots recorded (SamplingForConvergence.java).

The arrays that store the names, the default values, the frequency thresholds, and the experimental flags of each dataset to test must be declared at the beginning of the script run.sh. Only results from (4) and (5) are included in the paper, as we did not run (1)-(3) on DiNgHy.

How to Compute the Metrics

The directory helpers includes the code that preprocesses the datasets as described in the paper and creates the tables and graphs included in the main text and supplementary material. Most of the scripts take no input; the exception is clean_datasets_irreducibility, which takes the percentage of largest edges to remove from each dataset in which this removal ensures irreducibility. The values and datasets that we cleaned to remove degeneracy or ensure irreducibility are included in the data folder.

The directory also includes:

  • config: configuration files to set the value of each relevant parameter.
  • io: methods to read the data from disk in various formats.
  • parser_for_output_file_names: methods to parse the file names of the samples and extract relevant information about the sampling process.
  • utils: some useful methods.

To run the scripts listed above, you first need to generate some random samples using the chosen sampler. The generated samples must be placed in a .zip archive named after the chosen dataset, and the archive must be placed in a directory named after the chosen sampler.

For example, if you generated some samples using NuDHy-Degs starting from the observed dataset enron, you must place the samples at the following path:

sample_path/NuDHy/enron.zip

where sample_path is a parameter that can be set in the config.cfg file.
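Sticking with that example, the expected location can be built programmatically. A small sketch, where "sample_path" stands in for whatever value is configured in config.cfg:

```python
from pathlib import Path

def expected_archive(sample_path, sampler, dataset):
    """Return the path where the helpers look for the samples:
    sample_path/<sampler>/<dataset>.zip"""
    return Path(sample_path) / sampler / f"{dataset}.zip"

p = expected_archive("sample_path", "NuDHy", "enron")
```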

License

This package is released under the GNU General Public License 2.0.

References

  1. Abuissa, M., Riondato, M., and Upfal, E. "DiNgHy: Null Models for Non-Degenerate Directed Hypergraphs", to appear in the proceedings of ECML PKDD '25.
  2. Preti, G., Fazzone, A., Petri, G., and De Francisci Morales, G. "Higher-order null models as a lens for social systems", Physical Review X 14(3), 031032 (2024).
