This repository contains the implementation of the algorithms discussed in "DiNgHy: Null Models for Non-Degenerate Directed Hypergraphs", published in the proceedings of ECML PKDD '25. [1]
The implementation leverages the NuDHy [2] codebase. This README is adapted to reflect the updated code and includes the portions of the original README that still apply.
This repository implements a suite of Markov-Chain Monte-Carlo algorithms for sampling directed hypergraphs from two null models, one of which is novel.
The existing null model from [2] preserves the left and right in- and out-degree sequences of the bipartite graph corresponding to the directed hypergraph. The corresponding sampler, NuDHy-Degs, is based on Parity Swap Operations (PSOs), which preserve the degree sequences.
Our novel null model preserves the same properties as the aforementioned one, and the non-degeneracy of the sampled dihypergraph. The corresponding samplers, DiNgHy and DiNgHy-B, are based on Non-Degenerating Swaps, which preserve the degree sequences without introducing degeneracy.
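To illustrate the general idea of degree-preserving swaps on the bipartite representation, here is a minimal sketch of the classic double-edge ("checkerboard") swap. This is not the PSO or Non-Degenerating Swap used by the samplers, which operate on directed incidences and carry extra conditions; see [1, 2] for the actual operations.

```python
import random
from collections import Counter

def double_edge_swap(edges: set[tuple[int, int]], rng: random.Random) -> None:
    """Pick two bipartite edges (e1, v1), (e2, v2) and rewire them to
    (e1, v2), (e2, v1); every left and right degree is preserved."""
    (e1, v1), (e2, v2) = rng.sample(sorted(edges), 2)
    if (e1, v2) in edges or (e2, v1) in edges:
        return  # rewiring would duplicate an existing edge; reject the move
    edges -= {(e1, v1), (e2, v2)}
    edges |= {(e1, v2), (e2, v1)}

# Toy bipartite graph of (hyperedge id, vertex id) incidences.
edges = {(0, 0), (0, 1), (1, 1), (1, 2), (2, 0), (2, 2)}
before = (Counter(e for e, _ in edges), Counter(v for _, v in edges))
rng = random.Random(7)
for _ in range(100):
    double_edge_swap(edges, rng)
after = (Counter(e for e, _ in edges), Counter(v for _, v in edges))
print(before == after)  # degree sequences are unchanged
```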
We include below the commands used to reproduce the results from the paper [1]. More details on the code, datasets, and configuration are given later on.
- Build the Java code by running `mvn clean package` from the `code` directory, then copy the JAR from the `code/target` directory to the root directory of this repository and rename it `DINGHY.jar`.
- Edit the configuration file `config.cfg` so that `samplerType` takes each of the values "DiNgHy", "DiNgHy_B", and "NuDHy_Degs", running `run.sh` from the main directory once per value of `samplerType`. This should produce folders for each sampler and dataset under `out/samples` and `out/convergence`.
- For each dataset folder in `out/samples/[samplerType]`, zip the folder into a zip file of the same name. This is how the Python helpers access the samples.
- Run the Python helpers `count_degen_samples.py`, `plot_degen_while_sampling.py`, `plot_p_values.py`, and `plot_convergence.py` to generate the tables and plots from the paper. Specifically, `count_degen_samples.py` writes a table with all the data in the main text and supplementary materials to `table.txt`. The plots from the main text are written to the folders `degeneracy_plots/`, `p_value_plots/`, and `convergence_plots/`. To generate the plots from the appendix with `plot_degen_while_sampling.py` and `plot_p_values.py`, change the datasets as described in the comments at the beginning of each file, change the file that the output is saved to, and rerun each helper.
code/ ... Java source files of the samplers
data/ ... datasets used in the experimental evaluation
helpers/ ... Python and Bash scripts to compute hypergraph metrics of interest
run.sh ... Bash script for running the Java code
config.cfg ... configuration file for the Java code
requirements.txt ... Python libraries required to run the scripts in helpers
The directory `helpers` includes the Python and Bash scripts used to calculate the metrics presented in the tables and charts in the paper.
- To build the Java code, you need Maven.
- To run the Java code (the source files are in the folder `code`), you need Java JRE v1.8.0.
- To run the Python scripts and check the results of our experimental evaluation, you must install the libraries listed in the file `requirements.txt`.
Each line of the input file for directed hypergraphs consists of a tab-separated pair of two comma-separated lists of integers. The first list in the pair is the head of a hyperedge, and the second list is the tail.
All the scripts assume that the file extension is `.tsv`. The folder `data` includes all the datasets used in our experimental evaluation.
Build the project and generate the JAR file by running `mvn package` from the `code` directory, then copy the JAR from the `code/target` directory to the root directory of this repository. Note that the configuration file assumes that the name of the JAR file is `NUDHY.jar`.
Now you can use DiNgHy, DiNgHy-B, and NuDHy-Degs by running the script `run.sh` in the root directory of this repository. The value of each parameter used by the samplers must be set in the configuration file `config.cfg` (format described below).
- `datasetsDir`: directory where the input files are placed
- `resultsDir`: directory where the output files will be stored
- `seed`: seed for reproducibility
- `numThreads`: number of threads for parallel computation
- `maxNumSwapsFactor`: integer used in the Convergence experiment
- `k`: number of frequent itemsets to return in the Convergence experiment
- `size`: min size of a frequent itemset to return in the Convergence experiment
- `samplerType`: name of the sampler to use (admissible values are "DiNgHy", "DiNgHy_B", and "NuDHy_Degs")
- `store`: whether the samples generated must be stored on disk
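For reference, a hypothetical `config.cfg` sketch covering the parameters above; the actual syntax of the file may differ, and all values here are illustrative.

```
datasetsDir=data
resultsDir=out
seed=0
numThreads=8
maxNumSwapsFactor=10
k=100
size=2
samplerType=DiNgHy
store=true
```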
- Dataset names: names of the dataset files under `datasetsDir`.
- Default values: comma-separated list of default values for each dataset, i.e., the number of swaps to perform before returning the random sample and the number of random samples to generate.
- Minimum frequency thresholds: minimum frequency thresholds used to return the top-k frequent itemsets of size greater than `size` in the Convergence experiment.
- Experimental flags: test to perform among (1) convergence (`ConvergenceFI.java`), (2) structural entropy (`StructuralEntropy.java`), (3) scalability (`Scalability.java`), (4) generation of random samples (`Sampling.java`), and (5) generation of random samples with snapshots recorded (`SamplingForConvergence.java`).

The arrays that store the names, the default values, the frequency thresholds, and the experimental flags of each dataset to test must be declared at the beginning of the script `run.sh`. Only results from (4) and (5) are included in the paper, as we did not run (1)-(3) on DiNgHy.
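A hypothetical sketch of how such per-dataset arrays might look at the top of `run.sh`; the actual variable names and values in the script may differ, and the dataset names here (other than enron) are made up.

```shell
datasets=("enron" "toy")      # dataset file names under datasetsDir
defaults=("1000,8" "5000,8")  # swaps before sampling, number of samples
thresholds=("0.1" "0.05")     # min frequency thresholds (Convergence)
flags=("4" "4")               # experiment to run (4 = Sampling.java)
```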
The directory `helpers` includes the code that preprocesses the datasets as described in the paper and creates the tables and graphs included in the main text and supplementary material. Most of the scripts take no input, except `clean_datasets_irreducibility`, which takes the percentage of largest edges to remove from each dataset in which doing so ensures irreducibility. The values used and the datasets that we cleaned to remove degeneracy or ensure irreducibility are included in the `data` folder.
The directory also includes:
- `config`: configuration files to set the value of each relevant parameter.
- `io`: methods to read the data from disk in various formats.
- `parser_for_output_file_names`: methods to parse the file names of the samples and extract relevant information about the sampling process.
- `utils`: some useful methods.
To run the scripts listed above, you first need to generate some random samples using the chosen sampler. The generated samples must be placed in a `.zip` archive named after the chosen dataset, and that archive must be placed in a directory named after the chosen sampler.

For example, if you generated some samples using NuDHy-Degs starting from the observed dataset enron, you must place the samples at the path `sample_path/NuDHy/enron.zip`, where `sample_path` is a parameter that can be set in the `config.cfg` file.
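The packaging step can be sketched as follows; the directory layout and file names below are illustrative stand-ins for samples actually produced by a sampler run.

```python
import shutil
from pathlib import Path

sampler, dataset = "NuDHy", "enron"
sample_dir = Path("out/samples") / sampler / dataset
sample_dir.mkdir(parents=True, exist_ok=True)       # stand-in for generated samples
(sample_dir / "sample_0.tsv").write_text("1,2\t3\n")  # dummy sample file

# Create out/samples/NuDHy/enron.zip next to the sample folder, so the
# helpers can read it at sample_path/NuDHy/enron.zip.
shutil.make_archive(str(sample_dir), "zip",
                    root_dir=sample_dir.parent, base_dir=dataset)
```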
This package is released under the GNU General Public License 2.0.
1. Abuissa, M., Riondato, M., Upfal, E. "DiNgHy: Null Models for Non-Degenerate Directed Hypergraphs", to appear in the proceedings of ECML PKDD '25.
2. Preti, G., Fazzone, A., Petri, G., De Francisci Morales, G. "Higher-order null models as a lens for social systems", Physical Review X 14(3), 031032 (2024).