IntelliGraphs: Benchmarking Generative Models for Knowledge Graphs
IntelliGraphs is a collection of graph datasets designed to benchmark generative models for knowledge graphs. It provides a Python package that includes easy access to datasets, pre- and post-processing functions, baseline models, and evaluation tools. It is aimed at researchers and developers working on knowledge graph generation and evaluation.
Benefits
- Easy Access to Datasets: IntelliGraphs provides a simple way to download and access various graph datasets, making it easier to benchmark and evaluate models.
- Pre- and Post-Processing Functions: The package includes functions to preprocess and post-process data, ensuring consistency and accuracy in your experiments.
- Baseline Models: IntelliGraphs offers baseline models for comparison, helping you understand the performance of your models against established benchmarks.
- Evaluation Tools: The package includes tools to evaluate the performance of generative models, providing insights into their strengths and weaknesses.
- Customization: IntelliGraphs allows for customization of dataset generators, enabling researchers to create tailored datasets for specific use cases.
Use Cases
- Research and Development: IntelliGraphs is ideal for researchers and developers working on generative models for knowledge graphs. It provides the necessary tools and datasets to benchmark and evaluate new models.
- Education and Training: The package can be used in educational settings to teach students about knowledge graph generation and evaluation. It offers a hands-on approach to learning about graph datasets and models.
- Industry Applications: IntelliGraphs can be used in various industries where knowledge graphs are essential, such as healthcare, finance, and e-commerce. It helps in understanding the performance of models in real-world scenarios.
Installation
IntelliGraphs can be installed using either pip or conda. Dependencies are installed automatically during installation.
Install with pip:
pip install intelligraphs # Standard pip
uv pip install intelligraphs # Using UV (faster)
Install with conda:
conda install -c thiv intelligraphs
Verifying the Installation
After installation, you can verify that IntelliGraphs has been installed successfully by running the following command in your Python environment:
python -c "import intelligraphs; print(intelligraphs.__version__)"
It is recommended to use the latest version. If you are on an older version, update your installation before using the package:
pip install --upgrade intelligraphs # or conda install -c thiv intelligraphs --upgrade
Downloading the Datasets
The datasets required for this project can be obtained either manually or automatically through the IntelliGraphs Python package.
Manual Download
The datasets are hosted on Zenodo. You can download the datasets and extract the files to your preferred directory.
Automatic Dataset Download
To download, verify, and extract datasets automatically, use:
python -m intelligraphs.data_loaders.download
This command will download all IntelliGraphs datasets, verify their integrity using MD5 checksums, and then extract them into the .data directory in your current working directory.
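If you download the archives manually, the same kind of integrity check can be reproduced with the standard library. The following is a minimal sketch, not the package's own implementation: the helper names md5_of_file and verify_archive are hypothetical, and the expected checksum must come from the dataset's hosting page.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large archives do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_archive(path, expected_md5):
    """Hypothetical helper: True if the file's MD5 matches the expected value."""
    return md5_of_file(path) == expected_md5.lower()
```

A mismatch usually indicates an incomplete or corrupted download, in which case the archive should be fetched again before extraction.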
IntelliGraphs Data Loader
The DataLoader class is a utility for loading IntelliGraphs datasets, simplifying the process of accessing and organizing the data for machine learning tasks. It provides functionality to download, extract, and load the datasets into PyTorch tensors.
Usage
- Instantiate the DataLoader:
from intelligraphs import DataLoader
data_loader = DataLoader(dataset_name='syn-paths')
- Load the Data:
train_loader, valid_loader, test_loader = data_loader.load_torch(
    batch_size=32,
    padding=True,
    shuffle_train=False,
    shuffle_valid=False,
    shuffle_test=False
)
- Access the Data:
for batch in train_loader:
    # Perform training steps with the batch
    pass
for batch in valid_loader:
    # Perform validation steps with the batch
    pass
for batch in test_loader:
    # Perform testing steps with the batch
    pass
IntelliGraphs Synthetic KG Generator
SynPathsGenerator
This generator creates path graphs where each node represents a city in the Netherlands and each edge represents a mode of transport (cycle_to, drive_to, train_to).
- Entities: Dutch cities
- Relations: Modes of transport between cities
- Use case: Structural learning
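To make the structure concrete, here is a rough sketch of how a path graph of this kind could be sampled as a list of (head, relation, tail) triples. It is not the SynPathsGenerator code: the city list, the sample_path_graph function, and the triple layout are assumptions made for illustration; only the three relation names come from the description above.

```python
import random

# Illustrative values only; the real generator's city list may differ.
CITIES = ["Amsterdam", "Utrecht", "Rotterdam", "Delft", "Leiden", "Groningen"]
RELATIONS = ["cycle_to", "drive_to", "train_to"]


def sample_path_graph(num_edges, rng):
    """Sample one path graph as a list of (head, relation, tail) triples."""
    # A path with num_edges edges visits num_edges + 1 distinct cities.
    cities = rng.sample(CITIES, num_edges + 1)
    return [
        (cities[i], rng.choice(RELATIONS), cities[i + 1])
        for i in range(num_edges)
    ]


rng = random.Random(42)
graph = sample_path_graph(num_edges=3, rng=rng)
for triple in graph:
    print(triple)
```

Because each triple's tail is the next triple's head, a model that has learned the structure should generate chains rather than arbitrary sets of edges.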
SynTIPRGenerator
This generator creates graphs representing academic roles, timelines, and people. The nodes represent individuals, roles, and years, and the edges represent relationships like has_name, has_role, start_year, and end_year.
- Entities: Names, roles, years
- Relations: Relationships between academic roles and timeframes
- Use case: Basic temporal reasoning and type checking
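The temporal reasoning this dataset targets can be illustrated with a small check that an end year never precedes its start year. This is a sketch under assumptions: graphs are taken to be lists of (head, relation, tail) triples, the node names are made up, and tenure_violations is a hypothetical helper rather than part of the package.

```python
def tenure_violations(graph):
    """Return (start, end) triple pairs where end_year is before start_year."""
    start = {h: int(t) for h, r, t in graph if r == "start_year"}
    end = {h: int(t) for h, r, t in graph if r == "end_year"}
    violations = []
    for node, start_year in start.items():
        # Only compare nodes that carry both a start and an end year.
        if node in end and end[node] < start_year:
            violations.append(
                ((node, "start_year", start_year), (node, "end_year", end[node]))
            )
    return violations


# Hypothetical graph with an inconsistent timeline.
example = [
    ("_academic", "has_name", "Ada"),
    ("_academic", "has_role", "professor"),
    ("_time", "start_year", "1996"),
    ("_time", "end_year", "1994"),
]
print(tenure_violations(example))
```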
SynTypesGenerator
This generator creates graphs where nodes represent countries, languages, and cities, and edges represent relationships like spoken_in, part_of, and same_as.
- Entities: Countries, languages, cities
- Relations: Geographical and linguistic relationships
- Use case: Type checking
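Type checking here means that each relation should only connect entities of compatible types. The sketch below shows one way to express that; the entity table, the relation directions (language to country for spoken_in, city to country for part_of), and the treatment of same_as as linking same-typed entities are all assumptions for illustration, not the dataset's official rules.

```python
# Hypothetical entity typing, for illustration only.
ENTITY_TYPES = {
    "Dutch": "language",
    "French": "language",
    "Netherlands": "country",
    "France": "country",
    "Amsterdam": "city",
    "Paris": "city",
}
# relation -> (expected head type, expected tail type); assumed directions.
SIGNATURES = {
    "spoken_in": ("language", "country"),
    "part_of": ("city", "country"),
}


def is_well_typed(triple):
    """Check one (head, relation, tail) triple against the assumed signatures."""
    head, relation, tail = triple
    head_type = ENTITY_TYPES.get(head)
    tail_type = ENTITY_TYPES.get(tail)
    if head_type is None or tail_type is None:
        return False
    if relation == "same_as":
        # Assumption: same_as only links entities of the same type.
        return head_type == tail_type
    return SIGNATURES.get(relation) == (head_type, tail_type)


def type_errors(graph):
    """Return the triples in the graph that are not well typed."""
    return [triple for triple in graph if not is_well_typed(triple)]
```

For example, type_errors([("Amsterdam", "spoken_in", "Dutch")]) returns that triple, since a city is not a language under these assumed signatures.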
Customization
Each generator class inherits from BaseSyntheticDatasetGenerator and can be customized by overriding methods or adjusting parameters. The base class provides utility methods for splitting datasets, checking for unique graphs, and visualizing graphs.
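The two data utilities mentioned above, uniqueness checking and splitting, can be sketched in plain Python as follows. This does not reproduce the base class: unique_graphs and split_graphs are hypothetical names, and graphs are assumed to be lists of hashable (head, relation, tail) tuples.

```python
import random


def unique_graphs(graphs):
    """Drop duplicate graphs, treating a graph as an unordered set of triples."""
    seen = set()
    unique = []
    for graph in graphs:
        key = frozenset(graph)  # order of triples does not matter
        if key not in seen:
            seen.add(key)
            unique.append(graph)
    return unique


def split_graphs(graphs, train_size, val_size, test_size, seed=42):
    """Shuffle the graphs and cut them into three disjoint splits."""
    if train_size + val_size + test_size > len(graphs):
        raise ValueError("not enough graphs for the requested split sizes")
    shuffled = list(graphs)
    random.Random(seed).shuffle(shuffled)
    train = shuffled[:train_size]
    val = shuffled[train_size:train_size + val_size]
    test = shuffled[train_size + val_size:train_size + val_size + test_size]
    return train, val, test
```

Deduplicating before splitting keeps the same graph from appearing in both the training and test sets, which would otherwise inflate evaluation results.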
Extending Functionality
To create a new dataset generator, create a new class that inherits from BaseSyntheticDatasetGenerator and implement the sample_synthetic_data method to define your dataset's logic.
class MyCustomDatasetGenerator(BaseSyntheticDatasetGenerator):
    def sample_synthetic_data(self, num_graphs):
        # Implement your custom logic here
        pass
Data Generation
You can generate synthetic datasets by running the corresponding script for each generator. Each generator allows customization of dataset size, random seed, and other parameters.
python intelligraphs/generator/synthetic/synpaths_generator.py --train_size 60000 --val_size 20000 --test_size 20000 --num_edges 3 --random_seed 42 --dataset_name "syn-paths"
python intelligraphs/generator/synthetic/syntypes_generator.py --train_size 60000 --val_size 20000 --test_size 20000 --num_edges 3 --random_seed 42 --dataset_name "syn-types"
python intelligraphs/generator/synthetic/syntipr_generator.py --train_size 50000 --val_size 10000 --test_size 10000 --num_edges 3 --random_seed 42 --dataset_name "syn-tipr"
IntelliGraphs Verifier
Rules
Every dataset comes with a set of rules that describe the nature of the graphs. The ConstraintVerifier class includes a convenient method called print_rules() that allows you to view all the rules and their descriptions in a clean and organized format.
To use the print_rules() method, instantiate a subclass of ConstraintVerifier, such as SynPathsVerifier, and then call print_rules() on that instance to list the logical rules for a given dataset.
Example Usage
from intelligraphs.verifier.synthetic import SynPathsVerifier
# Initialize the verifier for the syn-paths dataset
verifier = SynPathsVerifier()
# Print the rules and their descriptions for the syn-paths dataset
verifier.print_rules()
When you call print_rules(), you'll get a formatted list of all the rules along with their corresponding descriptions. For example:
List of Rules and Descriptions:
-> Rule 1:
    FOL: ∀x, y, z: connected(x, y) ∧ connected(y, z) ⇒ connected(x, z)
    Description: Ensures transitivity. If x is connected to y, and y is connected to z, then x should be connected to z.
-> Rule 2:
    FOL: ∀x, y: edge(x, y) ⇒ connected(x, y)
    Description: If there's an edge between two nodes x and y, then x should be connected to y.
...
Baseline Models
Importing Baseline Models
Our baseline models are also available through the Python API, under intelligraphs.baseline_models.
To import the Uniform Baseline model:
from intelligraphs.baseline_models import UniformBaseline
To import the Knowledge Graph Embedding (KGE) models:
from intelligraphs.baseline_models.knowledge_graph_embedding_model import KGEModel
Setup
To recreate our experiments, we recommend using a fresh virtual environment with Python 3.10 installed.
1. Install package
pip install -e . # or: pip install intelligraphs # or: conda install -c thiv intelligraphs
2. Install dependencies
pip install torch pyyaml tqdm wandb numpy scipy
3. Configure tracking
wandb login # or disable with: export WANDB_MODE=disabled
Uniform Baseline Model
The uniform baseline model is designed to serve as a simple reference baseline. It applies a random compression strategy for the synthetic and real-world datasets. You can run this baseline using the following commands:
python benchmark/experiments/uniform_baseline_compression_test.py
It should complete in about a minute without any GPU acceleration.
To run the graph sampling experiment using the uniform sampler, run the command:
python benchmark/experiments/uniform_baseline_graph_sampling.py
Probabilistic KGE Models
We've developed three CUDA-compatible probabilistic Knowledge Graph Embedding models: TransE, ComplEx, and DistMult. Run experiments using the commands below:
Synthetic Datasets
# syn-paths
python benchmark/experiments/probabilistic_kge_baselines.py --config benchmark/configs/syn-paths-[model].yaml
# syn-tipr
python experiments/train_baseline.py --config benchmark/configs/syn-tipr-[model].yaml
# syn-types
python benchmark/experiments/probabilistic_kge_baselines.py --config benchmark/configs/syn-types-[model].yaml
Wikidata Datasets
# wd-articles and wd-movies
python benchmark/experiments/probabilistic_kge_baselines.py --config benchmark/configs/wd-[dataset]-[model].yaml
Replace [model] with transe, complex, or distmult and [dataset] with the appropriate dataset name.
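To run one dataset across all three models without editing the command by hand, the placeholder substitution can be scripted. The sketch below only builds and prints the command lines for syn-paths, using the script path and config naming pattern quoted above; build_commands is a hypothetical helper, and whether each config file actually exists in your checkout is an assumption to verify.

```python
MODELS = ["transe", "complex", "distmult"]
SCRIPT = "benchmark/experiments/probabilistic_kge_baselines.py"
CONFIG_TEMPLATE = "benchmark/configs/syn-paths-{model}.yaml"


def build_commands(models=MODELS):
    """Return one argument list per model, with [model] filled in."""
    return [
        ["python", SCRIPT, "--config", CONFIG_TEMPLATE.format(model=model)]
        for model in models
    ]


for command in build_commands():
    # Print rather than execute, so the commands can be inspected first.
    print(" ".join(command))
```

Each argument list can be passed to subprocess.run once the printed commands look right.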
Dataset Verification
We have written test functions to check the graphs in the datasets against the list of rules. They can be run using:
python intelligraphs/data_validation/validate_data.py
If there are any errors in the data, the script will raise a DataError exception with an error message similar to this:
intelligraphs.errors.custom_error.DataError: Violations found in a graph from the training dataset:
- Rule 6: An academic's tenure end year cannot be before its start year. The following violation(s) were found: (_time, start_year, 1996), (_time, end_year, 1994).
How to Cite
If you use IntelliGraphs in your research, please cite the following paper:
@article{thanapalasingam2023intelligraphs,
  title={IntelliGraphs: Datasets for Benchmarking Knowledge Graph Generation},
  author={Thanapalasingam, Thiviyan and van Krieken, Emile and Bloem, Peter and Groth, Paul},
  journal={arXiv preprint arXiv:2307.06698},
  year={2023}
}
Reporting Issues
If you encounter any bugs or have any feature requests, please file an issue on GitHub.
License
The IntelliGraphs datasets and the Python package are licensed under the CC-BY License. See the repository's license file for more information.
Platform Compatibility/Issues
This package has been developed and tested on macOS and Linux operating systems. If you experience any problems on Windows or any other issues, please open an issue on GitHub.
Unit tests
Make sure to activate the virtual environment with the installation of the intelligraphs package.
To run the unit tests, install pytest:
pip install pytest
or
conda install pytest
pytest --version # verify installation
Execute the unit tests using:
pytest
Contributing
If you would like to contribute code for a new feature or bug fix, here's how to get started:
First, set up your development environment:
git clone https://github.com/thiviyanT/IntelliGraphs.git
cd IntelliGraphs
python -m venv venv
source venv/bin/activate # On Windows, use: venv\Scripts\activate
# Install development dependencies
pip install -e .
For submitting changes:
# Create a new branch from dev
git checkout dev
git checkout -b feature/your-feature-name
# Make your changes and commit
git add .
git commit -m "Description of your changes"
# Push to GitHub
git push -u origin feature/your-feature-name
To submit changes:
1. Ensure all tests pass by running pytest.
2. Update the README.md, if needed.
3. Create a pull request from your feature branch to the dev branch.
4. The CI pipeline will automatically run tests on your pull request.
Changes must pass all tests and be approved before they can be merged into the main branch. For questions or discussions, please open an issue on GitHub.