src.datamodules.mind package§
Submodules§
src.datamodules.mind.datamodule_BERT module§
- class src.datamodules.mind.datamodule_BERT.MINDDataModuleBERT(mind_size='demo', data_dir=None, batch_size: int = 64, num_workers: int = 0, pin_memory: bool = False, download=True, column='title', bert_model=None, tokenizer=None)§
Bases: MINDDataModule
- news_dataframe(step, device=None)§
- prepare_data() None§
Use this to download and prepare data. Downloading and saving data with multiple processes (distributed settings) will result in corrupted data. Lightning ensures this method is called only within a single process, so you can safely add your downloading logic within.
Warning
DO NOT set state to the model (use setup instead) since this is NOT called on every device.
Example:
def prepare_data(self):
    # good
    download_data()
    tokenize()
    etc()

    # bad
    self.split = data_split
    self.some_state = some_other_state()
In a distributed environment, prepare_data can be called in two ways (using prepare_data_per_node):
- Once per node. This is the default and is only called on LOCAL_RANK=0.
- Once in total. Only called on GLOBAL_RANK=0.
Example:
# DEFAULT
# called once per node on LOCAL_RANK=0 of that node
class LitDataModule(LightningDataModule):
    def __init__(self):
        super().__init__()
        self.prepare_data_per_node = True


# call on GLOBAL_RANK=0 (great for shared file systems)
class LitDataModule(LightningDataModule):
    def __init__(self):
        super().__init__()
        self.prepare_data_per_node = False
This is called before requesting the dataloaders:
model.prepare_data()
initialize_distributed()
model.setup(stage)
model.train_dataloader()
model.val_dataloader()
model.test_dataloader()
model.predict_dataloader()
- setup(stage=None)§
Called at the beginning of fit (train + validate), validate, test, or predict. This is a good hook when you need to build models dynamically or adjust something about them. This hook is called on every process when using DDP.
- Parameters:
stage – either 'fit', 'validate', 'test', or 'predict'
Example:
class LitModel(...):
    def __init__(self):
        self.l1 = None

    def prepare_data(self):
        download_data()
        tokenize()

        # don't do this
        self.something = else

    def setup(self, stage):
        data = load_data(...)
        self.l1 = nn.Linear(28, data.num_classes)
- test_dataloader(batch_size=1)§
Implement one or multiple PyTorch DataLoaders for testing.
For data processing use the following pattern:
- download in prepare_data()
- process and split in setup()
However, the above are only necessary for distributed processing.
Warning
do not assign state in prepare_data
- test()
Note
Lightning adds the correct sampler for distributed and arbitrary hardware. There is no need to set it yourself.
- Returns:
A torch.utils.data.DataLoader or a sequence of them specifying testing samples.
Example:
def test_dataloader(self):
    transform = transforms.Compose([transforms.ToTensor(),
                                    transforms.Normalize((0.5,), (1.0,))])
    dataset = MNIST(root='/path/to/mnist/', train=False,
                    transform=transform, download=True)
    loader = torch.utils.data.DataLoader(
        dataset=dataset,
        batch_size=self.batch_size,
        shuffle=False
    )
    return loader

# can also return multiple dataloaders
def test_dataloader(self):
    return [loader_a, loader_b, ..., loader_n]
Note
If you don’t need a test dataset and a test_step(), you don’t need to implement this method.
Note
In the case where you return multiple test dataloaders, the test_step() will have an argument dataloader_idx which matches the order here.
- train_dataloader(batch_size=1)§
Implement one or more PyTorch DataLoaders for training.
- Returns:
A collection of torch.utils.data.DataLoader specifying training samples. In the case of multiple dataloaders, please see this section.
The dataloader you return will not be reloaded unless you set the Trainer argument reload_dataloaders_every_n_epochs to a positive integer.
For data processing use the following pattern:
- download in prepare_data()
- process and split in setup()
However, the above are only necessary for distributed processing.
Warning
do not assign state in prepare_data
- fit()
Note
Lightning adds the correct sampler for distributed and arbitrary hardware. There is no need to set it yourself.
Example:
# single dataloader
def train_dataloader(self):
    transform = transforms.Compose([transforms.ToTensor(),
                                    transforms.Normalize((0.5,), (1.0,))])
    dataset = MNIST(root='/path/to/mnist/', train=True,
                    transform=transform, download=True)
    loader = torch.utils.data.DataLoader(
        dataset=dataset,
        batch_size=self.batch_size,
        shuffle=True
    )
    return loader

# multiple dataloaders, return as list
def train_dataloader(self):
    mnist = MNIST(...)
    cifar = CIFAR(...)
    mnist_loader = torch.utils.data.DataLoader(
        dataset=mnist, batch_size=self.batch_size, shuffle=True
    )
    cifar_loader = torch.utils.data.DataLoader(
        dataset=cifar, batch_size=self.batch_size, shuffle=True
    )
    # each batch will be a list of tensors: [batch_mnist, batch_cifar]
    return [mnist_loader, cifar_loader]

# multiple dataloaders, return as dict
def train_dataloader(self):
    mnist = MNIST(...)
    cifar = CIFAR(...)
    mnist_loader = torch.utils.data.DataLoader(
        dataset=mnist, batch_size=self.batch_size, shuffle=True
    )
    cifar_loader = torch.utils.data.DataLoader(
        dataset=cifar, batch_size=self.batch_size, shuffle=True
    )
    # each batch will be a dict of tensors: {'mnist': batch_mnist, 'cifar': batch_cifar}
    return {'mnist': mnist_loader, 'cifar': cifar_loader}
- val_dataloader(batch_size=1)§
Implement one or multiple PyTorch DataLoaders for validation.
The dataloader you return will not be reloaded unless you set the Trainer argument reload_dataloaders_every_n_epochs to a positive integer.
It’s recommended that all data downloads and preparation happen in prepare_data().
- fit()
- validate()
Note
Lightning adds the correct sampler for distributed and arbitrary hardware. There is no need to set it yourself.
- Returns:
A torch.utils.data.DataLoader or a sequence of them specifying validation samples.
Examples:
def val_dataloader(self):
    transform = transforms.Compose([transforms.ToTensor(),
                                    transforms.Normalize((0.5,), (1.0,))])
    dataset = MNIST(root='/path/to/mnist/', train=False,
                    transform=transform, download=True)
    loader = torch.utils.data.DataLoader(
        dataset=dataset,
        batch_size=self.batch_size,
        shuffle=False
    )
    return loader

# can also return multiple dataloaders
def val_dataloader(self):
    return [loader_a, loader_b, ..., loader_n]
Note
If you don’t need a validation dataset and a validation_step(), you don’t need to implement this method.
Note
In the case where you return multiple validation dataloaders, the validation_step() will have an argument dataloader_idx which matches the order here.
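A minimal usage sketch for MINDDataModuleBERT, assuming the bert_model and tokenizer come from Hugging Face transformers (the hyperparameter values and paths below are illustrative only):
from transformers import AutoModel, AutoTokenizer
import pytorch_lightning as pl

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert_model = AutoModel.from_pretrained("bert-base-uncased")

dm = MINDDataModuleBERT(
    mind_size="demo",          # demo, small or large
    data_dir="data/MIND",      # illustrative path
    batch_size=64,
    download=True,
    column="title",            # which news text column to embed
    bert_model=bert_model,
    tokenizer=tokenizer,
)

trainer = pl.Trainer(max_epochs=1)
# trainer.fit(model, datamodule=dm)  # model: a compatible LightningModule (not shown)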
src.datamodules.mind.datamodule_Base module§
- class src.datamodules.mind.datamodule_Base.MINDDataModule(data_dir, mind_size, batch_size, num_workers, download, pin_memory)§
Bases: LightningDataModule
Base Datamodule for the MIND dataset
- Parameters:
data_dir (str) – Data directory
batch_size (int) – Batch size for dataloaders
num_workers (int) – Number of workers for dataloaders
pin_memory (bool) – Whether to use pin memory
download (bool) – Whether the mind dataset should be downloaded
mind_size (str) – Which dataset size should be used
train_val_test_split (list) – Whether to use automatic train-validation-test data splits
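Because MINDDataModule is a LightningDataModule, its subclasses follow the standard Lightning lifecycle. A hedged sketch of driving one of the subclasses documented below by hand (values and paths are illustrative):
# Illustrative lifecycle sketch for a MINDDataModule subclass.
dm = MINDDataModuleCollaborativeFiltering(
    data_dir="data/MIND",
    batch_size=128,
    num_workers=4,
    pin_memory=False,
    download=True,
    mind_size="demo",
)
dm.prepare_data()            # download / preprocess (runs in a single process)
dm.setup(stage="fit")        # build datasets (runs on every process)
train_loader = dm.train_dataloader()
batch = next(iter(train_loader))
In practice a pytorch_lightning.Trainer calls these hooks for you when the datamodule is passed to trainer.fit.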
- prepare() None§
- setup(stage=None)§
Called at the beginning of fit (train + validate), validate, test, or predict. This is a good hook when you need to build models dynamically or adjust something about them. This hook is called on every process when using DDP.
- Parameters:
stage – either 'fit', 'validate', 'test', or 'predict'
Example:
class LitModel(...):
    def __init__(self):
        self.l1 = None

    def prepare_data(self):
        download_data()
        tokenize()

        # don't do this
        self.something = else

    def setup(self, stage):
        data = load_data(...)
        self.l1 = nn.Linear(28, data.num_classes)
- test_dataloader()§
Implement one or multiple PyTorch DataLoaders for testing.
For data processing use the following pattern:
- download in prepare_data()
- process and split in setup()
However, the above are only necessary for distributed processing.
Warning
do not assign state in prepare_data
- test()
- prepare_data()
Note
Lightning adds the correct sampler for distributed and arbitrary hardware. There is no need to set it yourself.
- Returns:
A torch.utils.data.DataLoader or a sequence of them specifying testing samples.
Example:
def test_dataloader(self):
    transform = transforms.Compose([transforms.ToTensor(),
                                    transforms.Normalize((0.5,), (1.0,))])
    dataset = MNIST(root='/path/to/mnist/', train=False,
                    transform=transform, download=True)
    loader = torch.utils.data.DataLoader(
        dataset=dataset,
        batch_size=self.batch_size,
        shuffle=False
    )
    return loader

# can also return multiple dataloaders
def test_dataloader(self):
    return [loader_a, loader_b, ..., loader_n]
Note
If you don’t need a test dataset and a test_step(), you don’t need to implement this method.
Note
In the case where you return multiple test dataloaders, the test_step() will have an argument dataloader_idx which matches the order here.
- train_dataloader()§
Implement one or more PyTorch DataLoaders for training.
- Returns:
A collection of torch.utils.data.DataLoader specifying training samples. In the case of multiple dataloaders, please see this section.
The dataloader you return will not be reloaded unless you set the Trainer argument reload_dataloaders_every_n_epochs to a positive integer.
For data processing use the following pattern:
- download in prepare_data()
- process and split in setup()
However, the above are only necessary for distributed processing.
Warning
do not assign state in prepare_data
- fit()
- prepare_data()
Note
Lightning adds the correct sampler for distributed and arbitrary hardware. There is no need to set it yourself.
Example:
# single dataloader
def train_dataloader(self):
    transform = transforms.Compose([transforms.ToTensor(),
                                    transforms.Normalize((0.5,), (1.0,))])
    dataset = MNIST(root='/path/to/mnist/', train=True,
                    transform=transform, download=True)
    loader = torch.utils.data.DataLoader(
        dataset=dataset,
        batch_size=self.batch_size,
        shuffle=True
    )
    return loader

# multiple dataloaders, return as list
def train_dataloader(self):
    mnist = MNIST(...)
    cifar = CIFAR(...)
    mnist_loader = torch.utils.data.DataLoader(
        dataset=mnist, batch_size=self.batch_size, shuffle=True
    )
    cifar_loader = torch.utils.data.DataLoader(
        dataset=cifar, batch_size=self.batch_size, shuffle=True
    )
    # each batch will be a list of tensors: [batch_mnist, batch_cifar]
    return [mnist_loader, cifar_loader]

# multiple dataloaders, return as dict
def train_dataloader(self):
    mnist = MNIST(...)
    cifar = CIFAR(...)
    mnist_loader = torch.utils.data.DataLoader(
        dataset=mnist, batch_size=self.batch_size, shuffle=True
    )
    cifar_loader = torch.utils.data.DataLoader(
        dataset=cifar, batch_size=self.batch_size, shuffle=True
    )
    # each batch will be a dict of tensors: {'mnist': batch_mnist, 'cifar': batch_cifar}
    return {'mnist': mnist_loader, 'cifar': cifar_loader}
- val_dataloader()§
Implement one or multiple PyTorch DataLoaders for validation.
The dataloader you return will not be reloaded unless you set the Trainer argument reload_dataloaders_every_n_epochs to a positive integer.
It’s recommended that all data downloads and preparation happen in prepare_data().
- fit()
- validate()
- prepare_data()
Note
Lightning adds the correct sampler for distributed and arbitrary hardware. There is no need to set it yourself.
- Returns:
A torch.utils.data.DataLoader or a sequence of them specifying validation samples.
Examples:
def val_dataloader(self):
    transform = transforms.Compose([transforms.ToTensor(),
                                    transforms.Normalize((0.5,), (1.0,))])
    dataset = MNIST(root='/path/to/mnist/', train=False,
                    transform=transform, download=True)
    loader = torch.utils.data.DataLoader(
        dataset=dataset,
        batch_size=self.batch_size,
        shuffle=False
    )
    return loader

# can also return multiple dataloaders
def val_dataloader(self):
    return [loader_a, loader_b, ..., loader_n]
Note
If you don’t need a validation dataset and a validation_step(), you don’t need to implement this method.
Note
In the case where you return multiple validation dataloaders, the validation_step() will have an argument dataloader_idx which matches the order here.
src.datamodules.mind.datamodule_CollaborativeFiltering module§
- class src.datamodules.mind.datamodule_CollaborativeFiltering.MINDDataModuleCollaborativeFiltering(data_dir, batch_size, num_workers, pin_memory, download, mind_size)§
Bases: MINDDataModule
Datamodule for the Collaborative Filtering model using the MIND dataset
- Parameters:
data_dir (str) – Data directory
batch_size (int) – Batch size for dataloaders
num_workers (int) – Number of workers for dataloaders
pin_memory (bool) – Whether to use pin memory
download (bool) – Whether the mind dataset should be downloaded
mind_size (str) – Dataset size
- prepare()§
Prepare data for model usage, prior to model instantiation. Create ratings files for training, validation, testing.
- prepare_data()§
Use this to download and prepare data. Downloading and saving data with multiple processes (distributed settings) will result in corrupted data. Lightning ensures this method is called only within a single process, so you can safely add your downloading logic within.
Warning
DO NOT set state to the model (use setup instead) since this is NOT called on every device.
Example:
def prepare_data(self):
    # good
    download_data()
    tokenize()
    etc()

    # bad
    self.split = data_split
    self.some_state = some_other_state()
In a distributed environment, prepare_data can be called in two ways (using prepare_data_per_node):
- Once per node. This is the default and is only called on LOCAL_RANK=0.
- Once in total. Only called on GLOBAL_RANK=0.
Example:
# DEFAULT
# called once per node on LOCAL_RANK=0 of that node
class LitDataModule(LightningDataModule):
    def __init__(self):
        super().__init__()
        self.prepare_data_per_node = True


# call on GLOBAL_RANK=0 (great for shared file systems)
class LitDataModule(LightningDataModule):
    def __init__(self):
        super().__init__()
        self.prepare_data_per_node = False
This is called before requesting the dataloaders:
model.prepare_data()
initialize_distributed()
model.setup(stage)
model.train_dataloader()
model.val_dataloader()
model.test_dataloader()
model.predict_dataloader()
- setup(stage=None)§
Create ratings datasets, knowledge graph dataset for dataloaders.
- train_dataloader()§
Implement one or more PyTorch DataLoaders for training.
- Returns:
A collection of torch.utils.data.DataLoader specifying training samples. In the case of multiple dataloaders, please see this section.
The dataloader you return will not be reloaded unless you set the Trainer argument reload_dataloaders_every_n_epochs to a positive integer.
For data processing use the following pattern:
- download in prepare_data()
- process and split in setup()
However, the above are only necessary for distributed processing.
Warning
do not assign state in prepare_data
- fit()
Note
Lightning adds the correct sampler for distributed and arbitrary hardware. There is no need to set it yourself.
Example:
# single dataloader
def train_dataloader(self):
    transform = transforms.Compose([transforms.ToTensor(),
                                    transforms.Normalize((0.5,), (1.0,))])
    dataset = MNIST(root='/path/to/mnist/', train=True,
                    transform=transform, download=True)
    loader = torch.utils.data.DataLoader(
        dataset=dataset,
        batch_size=self.batch_size,
        shuffle=True
    )
    return loader

# multiple dataloaders, return as list
def train_dataloader(self):
    mnist = MNIST(...)
    cifar = CIFAR(...)
    mnist_loader = torch.utils.data.DataLoader(
        dataset=mnist, batch_size=self.batch_size, shuffle=True
    )
    cifar_loader = torch.utils.data.DataLoader(
        dataset=cifar, batch_size=self.batch_size, shuffle=True
    )
    # each batch will be a list of tensors: [batch_mnist, batch_cifar]
    return [mnist_loader, cifar_loader]

# multiple dataloaders, return as dict
def train_dataloader(self):
    mnist = MNIST(...)
    cifar = CIFAR(...)
    mnist_loader = torch.utils.data.DataLoader(
        dataset=mnist, batch_size=self.batch_size, shuffle=True
    )
    cifar_loader = torch.utils.data.DataLoader(
        dataset=cifar, batch_size=self.batch_size, shuffle=True
    )
    # each batch will be a dict of tensors: {'mnist': batch_mnist, 'cifar': batch_cifar}
    return {'mnist': mnist_loader, 'cifar': cifar_loader}
src.datamodules.mind.datamodule_MKR module§
- class src.datamodules.mind.datamodule_MKR.MINDDataModuleMKR(use_categories, use_subcategories, use_title_entities, use_abstract_entities, use_title_tokens, use_wikidata, data_dir, batch_size, num_workers, pin_memory, download, mind_size)§
Bases: MINDDataModule
Datamodule for the MKR model using the MIND dataset
- Parameters:
use_categories (bool) – Whether the data preprocessing includes news categories
use_subcategories (bool) – Whether the data preprocessing includes news subcategories
use_title_entities (bool) – Whether the data preprocessing includes news title entities
use_abstract_entities (bool) – Whether the data preprocessing includes news abstract entities
use_title_tokens (bool) – Whether the data preprocessing includes news title tokens
use_wikidata (bool) – Whether the data preprocessing includes additional news entity wikidata knowledge graph
data_dir (str) – Data directory
batch_size (int) – Batch size for dataloaders
num_workers (int) – Number of workers for dataloaders
pin_memory (bool) – Whether to use pin memory
download (bool) – Whether the mind dataset should be downloaded
mind_size (str) – Dataset size
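A hedged configuration sketch for the MKR datamodule; the flag choices and the data path are illustrative, not recommended defaults:
dm = MINDDataModuleMKR(
    use_categories=True,
    use_subcategories=True,
    use_title_entities=True,
    use_abstract_entities=False,
    use_title_tokens=False,
    use_wikidata=True,       # also download and use the Wikidata knowledge graph
    data_dir="data/MIND",
    batch_size=128,
    num_workers=4,
    pin_memory=False,
    download=True,
    mind_size="demo",
)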
- prepare()§
Prepare data for model usage, prior to model instantiation. Download wikidata knowledge graph, create knowledge graph and ratings files for training, validation, testing.
- prepare_data() None§
Use this to download and prepare data. Downloading and saving data with multiple processes (distributed settings) will result in corrupted data. Lightning ensures this method is called only within a single process, so you can safely add your downloading logic within.
Warning
DO NOT set state to the model (use setup instead) since this is NOT called on every device.
Example:
def prepare_data(self):
    # good
    download_data()
    tokenize()
    etc()

    # bad
    self.split = data_split
    self.some_state = some_other_state()
In a distributed environment, prepare_data can be called in two ways (using prepare_data_per_node):
- Once per node. This is the default and is only called on LOCAL_RANK=0.
- Once in total. Only called on GLOBAL_RANK=0.
Example:
# DEFAULT
# called once per node on LOCAL_RANK=0 of that node
class LitDataModule(LightningDataModule):
    def __init__(self):
        super().__init__()
        self.prepare_data_per_node = True


# call on GLOBAL_RANK=0 (great for shared file systems)
class LitDataModule(LightningDataModule):
    def __init__(self):
        super().__init__()
        self.prepare_data_per_node = False
This is called before requesting the dataloaders:
model.prepare_data()
initialize_distributed()
model.setup(stage)
model.train_dataloader()
model.val_dataloader()
model.test_dataloader()
model.predict_dataloader()
- setup(stage)§
Create ratings datasets and knowledge graph dataset for dataloaders.
- train_dataloader()§
Implement one or more PyTorch DataLoaders for training.
- Returns:
A collection of torch.utils.data.DataLoader specifying training samples. In the case of multiple dataloaders, please see this section.
The dataloader you return will not be reloaded unless you set the Trainer argument reload_dataloaders_every_n_epochs to a positive integer.
For data processing use the following pattern:
- download in prepare_data()
- process and split in setup()
However, the above are only necessary for distributed processing.
Warning
do not assign state in prepare_data
- fit()
Note
Lightning adds the correct sampler for distributed and arbitrary hardware. There is no need to set it yourself.
Example:
# single dataloader
def train_dataloader(self):
    transform = transforms.Compose([transforms.ToTensor(),
                                    transforms.Normalize((0.5,), (1.0,))])
    dataset = MNIST(root='/path/to/mnist/', train=True,
                    transform=transform, download=True)
    loader = torch.utils.data.DataLoader(
        dataset=dataset,
        batch_size=self.batch_size,
        shuffle=True
    )
    return loader

# multiple dataloaders, return as list
def train_dataloader(self):
    mnist = MNIST(...)
    cifar = CIFAR(...)
    mnist_loader = torch.utils.data.DataLoader(
        dataset=mnist, batch_size=self.batch_size, shuffle=True
    )
    cifar_loader = torch.utils.data.DataLoader(
        dataset=cifar, batch_size=self.batch_size, shuffle=True
    )
    # each batch will be a list of tensors: [batch_mnist, batch_cifar]
    return [mnist_loader, cifar_loader]

# multiple dataloaders, return as dict
def train_dataloader(self):
    mnist = MNIST(...)
    cifar = CIFAR(...)
    mnist_loader = torch.utils.data.DataLoader(
        dataset=mnist, batch_size=self.batch_size, shuffle=True
    )
    cifar_loader = torch.utils.data.DataLoader(
        dataset=cifar, batch_size=self.batch_size, shuffle=True
    )
    # each batch will be a dict of tensors: {'mnist': batch_mnist, 'cifar': batch_cifar}
    return {'mnist': mnist_loader, 'cifar': cifar_loader}
src.datamodules.mind.datamodule_NAML module§
- class src.datamodules.mind.datamodule_NAML.MINDDataModuleNAML(dataset_attributes, mind_size='small', data_dir=None, batch_size: int = 64, num_workers: int = 0, pin_memory: bool = False, num_clicked_news_a_user=50, num_words_title=20, num_words_abstract=50, word_freq_threshold=1, entity_freq_threshold=2, entity_confidence_threshold=0.5, negative_sampling_ratio=2, word_embedding_dim=300, entity_embedding_dim=100, download=True, glove_size=6)§
Bases: MINDDataModule
Datamodule for the NAML model using the MIND dataset
Code based on https://github.com/Microsoft/Recommenders
- Parameters:
dataset_attributes (dict) – Attributes are set based on the model
mind_size (string) – Size of the MIND Dataset (demo, small, large)
data_dir (Optional[string]) – Path of the data directory for the dataset
batch_size (int) – Batch size for dataloaders
num_workers (int) – Number of workers for dataloaders
pin_memory (bool) – Requires more memory but might improve performance
num_clicked_news_a_user (int) – Number of clicked news for each user
num_words_title (int) – Number of words in the title
num_words_abstract (int) – Number of words in the abstract
word_freq_threshold (int) – Frequency threshold of words
entity_freq_threshold (int) – Frequency threshold of entities
entity_confidence_threshold (float) – Confidence threshold of entities
negative_sampling_ratio (int) – Negative sampling ratio
word_embedding_dim (int) – Dimension of word embeddings
entity_embedding_dim (int) – Dimension of entity embeddings
download (bool) – Enable the download and extraction of the MIND dataset. When set to false, extract data must be available in data_dir.
glove_size (int) – Size of Glove embeddings to download
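A hedged configuration sketch for the NAML datamodule. The structure of dataset_attributes is model-specific and the keys shown here are assumptions; the remaining values mirror the documented defaults, and the path is illustrative:
# dataset_attributes keys below are illustrative assumptions.
dataset_attributes = {
    "news": ["category", "subcategory", "title", "abstract"],
    "record": [],
}
dm = MINDDataModuleNAML(
    dataset_attributes=dataset_attributes,
    mind_size="small",
    data_dir="data/MIND",          # illustrative path
    batch_size=64,
    num_clicked_news_a_user=50,
    num_words_title=20,
    num_words_abstract=50,
    negative_sampling_ratio=2,
    word_embedding_dim=300,
    entity_embedding_dim=100,
    glove_size=6,                  # use the glove.6B embeddings
    download=True,
)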
- news_dataloader(step, device=None)§
- prepare() None§
- setup(stage)§
Called at the beginning of fit (train + validate), validate, test, or predict. This is a good hook when you need to build models dynamically or adjust something about them. This hook is called on every process when using DDP.
- Parameters:
stage – either 'fit', 'validate', 'test', or 'predict'
Example:
class LitModel(...):
    def __init__(self):
        self.l1 = None

    def prepare_data(self):
        download_data()
        tokenize()

        # don't do this
        self.something = else

    def setup(self, stage):
        data = load_data(...)
        self.l1 = nn.Linear(28, data.num_classes)
- test_dataloader(batch_size=1)§
Implement one or multiple PyTorch DataLoaders for testing.
For data processing use the following pattern:
- download in prepare_data()
- process and split in setup()
However, the above are only necessary for distributed processing.
Warning
do not assign state in prepare_data
- test()
- prepare_data()
Note
Lightning adds the correct sampler for distributed and arbitrary hardware. There is no need to set it yourself.
- Returns:
A torch.utils.data.DataLoader or a sequence of them specifying testing samples.
Example:
def test_dataloader(self):
    transform = transforms.Compose([transforms.ToTensor(),
                                    transforms.Normalize((0.5,), (1.0,))])
    dataset = MNIST(root='/path/to/mnist/', train=False,
                    transform=transform, download=True)
    loader = torch.utils.data.DataLoader(
        dataset=dataset,
        batch_size=self.batch_size,
        shuffle=False
    )
    return loader

# can also return multiple dataloaders
def test_dataloader(self):
    return [loader_a, loader_b, ..., loader_n]
Note
If you don’t need a test dataset and a test_step(), you don’t need to implement this method.
Note
In the case where you return multiple test dataloaders, the test_step() will have an argument dataloader_idx which matches the order here.
- user_dataloader(step)§
- val_dataloader(batch_size=1)§
Implement one or multiple PyTorch DataLoaders for validation.
The dataloader you return will not be reloaded unless you set the Trainer argument reload_dataloaders_every_n_epochs to a positive integer.
It’s recommended that all data downloads and preparation happen in prepare_data().
- fit()
- validate()
- prepare_data()
Note
Lightning adds the correct sampler for distributed and arbitrary hardware. There is no need to set it yourself.
- Returns:
A torch.utils.data.DataLoader or a sequence of them specifying validation samples.
Examples:
def val_dataloader(self):
    transform = transforms.Compose([transforms.ToTensor(),
                                    transforms.Normalize((0.5,), (1.0,))])
    dataset = MNIST(root='/path/to/mnist/', train=False,
                    transform=transform, download=True)
    loader = torch.utils.data.DataLoader(
        dataset=dataset,
        batch_size=self.batch_size,
        shuffle=False
    )
    return loader

# can also return multiple dataloaders
def val_dataloader(self):
    return [loader_a, loader_b, ..., loader_n]
Note
If you don’t need a validation dataset and a validation_step(), you don’t need to implement this method.
Note
In the case where you return multiple validation dataloaders, the validation_step() will have an argument dataloader_idx which matches the order here.
src.datamodules.mind.datamodule_RippleNet module§
- class src.datamodules.mind.datamodule_RippleNet.MINDDataModuleRippleNet(use_categories, use_subcategories, use_title_entities, use_abstract_entities, use_title_tokens, use_wikidata, data_dir, batch_size, num_workers, pin_memory, download, mind_size)§
Bases: MINDDataModule
Datamodule for the RippleNet model using the MIND dataset
- Parameters:
use_categories (bool) – Whether the data preprocessing includes news categories
use_subcategories (bool) – Whether the data preprocessing includes news subcategories
use_title_entities (bool) – Whether the data preprocessing includes news title entities
use_abstract_entities (bool) – Whether the data preprocessing includes news abstract entities
use_title_tokens (bool) – Whether the data preprocessing includes news title tokens
use_wikidata (bool) – Whether the data preprocessing includes additional news entity wikidata knowledge graph
data_dir (str) – Data directory
batch_size (int) – Batch size for dataloaders
num_workers (int) – Number of workers for dataloaders
pin_memory (bool) – Whether to use pin memory
download (bool) – Whether the mind dataset should be downloaded
mind_size (str) – Dataset size
- prepare()§
Prepare data for model usage, prior to model instantiation. Download wikidata knowledge graph, create knowledge graph and ratings files for training, validation, testing.
- prepare_data() None§
Use this to download and prepare data. Downloading and saving data with multiple processes (distributed settings) will result in corrupted data. Lightning ensures this method is called only within a single process, so you can safely add your downloading logic within.
Warning
DO NOT set state to the model (use setup instead) since this is NOT called on every device.
Example:
def prepare_data(self):
    # good
    download_data()
    tokenize()
    etc()

    # bad
    self.split = data_split
    self.some_state = some_other_state()
In a distributed environment, prepare_data can be called in two ways (using prepare_data_per_node):
- Once per node. This is the default and is only called on LOCAL_RANK=0.
- Once in total. Only called on GLOBAL_RANK=0.
Example:
# DEFAULT
# called once per node on LOCAL_RANK=0 of that node
class LitDataModule(LightningDataModule):
    def __init__(self):
        super().__init__()
        self.prepare_data_per_node = True


# call on GLOBAL_RANK=0 (great for shared file systems)
class LitDataModule(LightningDataModule):
    def __init__(self):
        super().__init__()
        self.prepare_data_per_node = False
This is called before requesting the dataloaders:
model.prepare_data()
initialize_distributed()
model.setup(stage)
model.train_dataloader()
model.val_dataloader()
model.test_dataloader()
model.predict_dataloader()
- setup(stage=None)§
Create ratings datasets for dataloaders.
- train_dataloader()§
Implement one or more PyTorch DataLoaders for training.
- Returns:
A collection of torch.utils.data.DataLoader specifying training samples. In the case of multiple dataloaders, please see this section.
The dataloader you return will not be reloaded unless you set the Trainer argument reload_dataloaders_every_n_epochs to a positive integer.
For data processing use the following pattern:
- download in prepare_data()
- process and split in setup()
However, the above are only necessary for distributed processing.
Warning
do not assign state in prepare_data
- fit()
Note
Lightning adds the correct sampler for distributed and arbitrary hardware. There is no need to set it yourself.
Example:
# single dataloader
def train_dataloader(self):
    transform = transforms.Compose([transforms.ToTensor(),
                                    transforms.Normalize((0.5,), (1.0,))])
    dataset = MNIST(root='/path/to/mnist/', train=True,
                    transform=transform, download=True)
    loader = torch.utils.data.DataLoader(
        dataset=dataset,
        batch_size=self.batch_size,
        shuffle=True
    )
    return loader

# multiple dataloaders, return as list
def train_dataloader(self):
    mnist = MNIST(...)
    cifar = CIFAR(...)
    mnist_loader = torch.utils.data.DataLoader(
        dataset=mnist, batch_size=self.batch_size, shuffle=True
    )
    cifar_loader = torch.utils.data.DataLoader(
        dataset=cifar, batch_size=self.batch_size, shuffle=True
    )
    # each batch will be a list of tensors: [batch_mnist, batch_cifar]
    return [mnist_loader, cifar_loader]

# multiple dataloaders, return as dict
def train_dataloader(self):
    mnist = MNIST(...)
    cifar = CIFAR(...)
    mnist_loader = torch.utils.data.DataLoader(
        dataset=mnist, batch_size=self.batch_size, shuffle=True
    )
    cifar_loader = torch.utils.data.DataLoader(
        dataset=cifar, batch_size=self.batch_size, shuffle=True
    )
    # each batch will be a dict of tensors: {'mnist': batch_mnist, 'cifar': batch_cifar}
    return {'mnist': mnist_loader, 'cifar': cifar_loader}
src.datamodules.mind.dataset module§
- class src.datamodules.mind.dataset.BaseDataset(behaviors_path, news_path, dataset_attributes, num_words_title, num_words_abstract, num_clicked_news_a_user)§
Bases: Dataset
Base Dataset for training
- Parameters:
behaviors_path (str) – Path to behaviors file
news_path (str) – Path to news file
dataset_attributes (list) – Dataset attributes
num_words_title – Number of title words
num_words_abstract – Number of abstract words
num_clicked_news_a_user – Number of clicked news
- class src.datamodules.mind.dataset.BehaviorsBERTDataset(behaviors_path)§
Bases: Dataset
Behaviors dataset for BERT model
- Parameters:
behaviors_path (str) – Path to behaviors file
- class src.datamodules.mind.dataset.BehaviorsDataset(behaviors_path)§
Bases: Dataset
User behaviors dataset for evaluation. (user, time) pair as session
- Parameters:
behaviors_path (str) – Path to behaviors file
- class src.datamodules.mind.dataset.KGDataset(numpy_data)§
Bases: Dataset
News knowledge graph dataset for dataloaders
- Parameters:
numpy_data (numpy.ndarray) – Knowledge graph numpy data
- class src.datamodules.mind.dataset.NewsBERTDataset(news_path)§
Bases: Dataset
News dataset for BERT model
- Parameters:
news_path (str) – Path to news file
- class src.datamodules.mind.dataset.NewsDataset(news_path, dataset_attributes)§
Bases: Dataset
News dataset for evaluation
- Parameters:
news_path (str) – Path to news file
dataset_attributes (list) – Dataset attributes
- to(device)§
- class src.datamodules.mind.dataset.RatingsDataset(numpy_data, train: bool)§
Bases: Dataset
User Ratings knowledge graph dataset for dataloaders
- Parameters:
numpy_data (numpy.ndarray) – Ratings numpy data
train (bool) – Whether the dataset contains training data
- class src.datamodules.mind.dataset.UserDataset(behaviors_path, user2int_path, num_clicked_news_a_user)§
Bases: Dataset
Users dataset for evaluation. Duplicated rows will be dropped
- Parameters:
behaviors_path (str) – Path to behaviors file
user2int_path (str) – Path to user index file
num_clicked_news_a_user – Number of clicked news for each user
src.datamodules.mind.download module§
- src.datamodules.mind.download.download_and_extract_glove(zip_path=None, dest_path=None, glove_size=6)§
Download and extract the Glove embedding
- Parameters:
dest_path (str) – Destination directory path for the downloaded file
- Returns:
File path where Glove was extracted.
- src.datamodules.mind.download.download_and_extract_mind(size='small', dest_path=None)§
Download and extract the MIND dataset
- Parameters:
size (str) – Dataset size
dest_path (str) – Save path for the zip dataset
- Returns:
Tuple (train_path, valid_path, test_path) where train_path is the path to the train folder, valid_path is the path to the validation folder and test_path is the path to the test folder
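A hedged usage sketch; the destination path is illustrative:
# Download the small MIND split and capture the extracted folder paths.
train_path, valid_path, test_path = download_and_extract_mind(
    size="small",
    dest_path="data/MIND",
)
print(train_path, valid_path, test_path)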
- src.datamodules.mind.download.download_and_extract_wikidata_kg(dest_path, clean_zip_file)§
Download and extract the wikidata knowledge graph for the MIND dataset
- Parameters:
dest_path (str) – Path for saving the downloaded zip file
clean_zip_file (bool) – Whether to delete the zip file after unzipping
- Returns:
Path to the unzipped wikidata knowledge graph folder
- src.datamodules.mind.download.extract_mind(train_zip, valid_zip, test_zip, root_folder=None, train_folder='train', valid_folder='valid', test_folder='test', clean_zip_file=False)§
Extract MIND dataset
- Parameters:
train_zip (str) – Path to train zip file
valid_zip (str) – Path to valid zip file
train_folder (str) – Destination folder for train set
valid_folder (str) – Destination folder for validation set
- Returns:
Tuple (path_train, path_valid) where path_train is the path to the training folder and path_valid is the path to the validation folder
- src.datamodules.mind.download.generate_embeddings(data_path, news_words, news_entities, train_entities, valid_entities, max_sentence=10, word_embedding_dim=100)§
Generate embeddings.
- Parameters:
data_path (str) – Data path.
news_words (dict) – News word dictionary.
news_entities (dict) – News entity dictionary.
train_entities (str) – Train entity file.
valid_entities (str) – Validation entity file.
max_sentence (int) – Max sentence size.
word_embedding_dim (int) – Word embedding dimension.
- Returns:
Tuple containing the paths to the news, word and entity embeddings
- src.datamodules.mind.download.get_train_input(session, train_file_path, npratio=4)§
Generate train file.
- Parameters:
session (list) – List of user session with user_id, clicks, positive and negative interactions.
train_file_path (str) – Path to file.
npratio (int) – Ratio for negative sampling.
- src.datamodules.mind.download.get_user_history(train_history, valid_history, user_history_path)§
Generate user history file.
- Parameters:
train_history (list) – Train history.
valid_history (list) – Validation history
user_history_path (str) – Path to file.
- src.datamodules.mind.download.get_valid_input(session, valid_file_path)§
Generate validation file.
- Parameters:
session (list) – List of user session with user_id, clicks, positive and negative interactions.
valid_file_path (str) – Path to file.
- src.datamodules.mind.download.load_glove_matrix(path_emb, word_dict, word_embedding_dim)§
Load the pretrained embedding matrix for the words in word_dict
- Parameters:
path_emb (string) – Folder path of downloaded glove file
word_dict (dict) – Word dictionary
word_embedding_dim – Dimension of word embedding vectors
- Returns:
Tuple containing the pretrained word embedding matrix (numpy.ndarray) and the list of words that were found in the GloVe files
- src.datamodules.mind.download.read_clickhistory(path, filename)§
Read click history file
- Parameters:
path (str) – Folder path
filename (str) – Filename
- Returns:
Tuple (list, dict) where list is a list of user session with user_id, clicks, positive and negative interactions and dict is a dictionary with user_id click history.
- src.datamodules.mind.download.read_news(filepath, tokenizer)§
Read news file
- Parameters:
filepath (str) – Path to news file
tokenizer (tokenizer) – Tokenizer for news title tokenization
- Returns:
Tuple (news_words, news_entities, news_abstract_entities, news_categories, news_subcategories) where each item is a dictionary containing, for each news article, the items specified by the dictionary name
- src.datamodules.mind.download.read_news_ids(filepath)§
Read news ids
- Parameters:
filepath (str) – Path to news file
- Returns:
Dictionary containing news identifiers and generated ids
- src.datamodules.mind.download.word_tokenize(sent)§
Tokenize a sentence
- Parameters:
sent (str) – Sentence to be tokenized
- Returns:
Word list
src.datamodules.mind.parse module§
- src.datamodules.mind.parse.generate_word_embedding(source, target, word2int_path, word_embedding_dim)§
Generate a word embedding matrix from a pretrained word embedding file. If a word is not in the embedding file, initialize its embedding from N(0, 1).
- Parameters:
source (str) – Path of pretrained word embedding file, e.g. glove.840B.300d.txt
target (str) – Path for saving word embedding
word2int_path (str) – Path to vocabulary file
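The fallback described above (unknown words drawn from N(0, 1)) can be illustrated with a small standalone sketch; this is conceptual, not the project's implementation:
import numpy as np

def init_embedding(word2int, pretrained, word_embedding_dim=300):
    # Start every row from N(0, 1); row 0 is reserved for padding here (assumption).
    matrix = np.random.normal(size=(len(word2int) + 1, word_embedding_dim))
    for word, idx in word2int.items():
        if word in pretrained:        # pretrained: dict mapping word -> vector
            matrix[idx] = pretrained[word]
    return matrix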
- src.datamodules.mind.parse.parse_behaviors(source, target, user2int_path, negative_sampling_ratio)§
Parse behaviors file in training set.
- Parameters:
source (str) – Source behaviors file
target (str) – Target behaviors file
user2int_path (str) – Path for saving user2int file
- Returns:
Number of users
- src.datamodules.mind.parse.parse_behaviors_bert(source, target, news_ids_set)§
Parse behaviors for the BERT baseline. Get the full click history (news IDs) and the candidate news from the impressions for each user.
- Parameters:
source – source behaviors file
target – target behaviors file
- Returns:
behaviors_parsed(id, history:<NEWS_IDS>, candidate_news<NEWS_IDS>, labels:<y_true>)
- Return type:
DataFrame
- src.datamodules.mind.parse.parse_mind(train_dir, val_dir, test_dir, glove_dir, glove_size, negative_sampling_ratio, num_words_title, num_words_abstract, entity_confidence_threshold, word_freq_threshold, entity_freq_threshold, word_embedding_dim, entity_embedding_dim)§
Parse the MIND dataset
- Parameters:
train_dir (str) – Path to train directory
val_dir (str) – Path to validation directory
test_dir (str) – Path to test directory
glove_dir (str) – Path to glove directory
glove_size (int) – Glove size
negative_sampling_ratio (int) – Negative sampling ratio
num_words_title (long) – Number of words in title
num_words_abstract (long) – Number of words in abstract
entity_confidence_threshold (float) – Confidence threshold of entities
word_freq_threshold (float) – Threshold for word frequency
entity_freq_threshold (float) – Threshold for entity frequency
word_embedding_dim (int) – Word embedding dimension
entity_embedding_dim (int) – Entity embedding dimension
- Returns:
Tuple (num_users, num_categories, num_words, num_entities) containing number of users, number of categories, number of words, number of entities
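A hedged call sketch tying the arguments together; the directory layout is illustrative and the threshold values mirror the NAML datamodule defaults documented above:
num_users, num_categories, num_words, num_entities = parse_mind(
    train_dir="data/MIND/train",
    val_dir="data/MIND/valid",
    test_dir="data/MIND/test",
    glove_dir="data/glove",
    glove_size=6,
    negative_sampling_ratio=2,
    num_words_title=20,
    num_words_abstract=50,
    entity_confidence_threshold=0.5,
    word_freq_threshold=1,
    entity_freq_threshold=2,
    word_embedding_dim=300,
    entity_embedding_dim=100,
)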
- src.datamodules.mind.parse.parse_mind_bert(train_dir, val_dir, test_dir, column)§
- src.datamodules.mind.parse.parse_news(source, target, category2int_path, word2int_path, entity2int_path, mode, num_words_title, num_words_abstract, entity_confidence_threshold, word_freq_threshold, entity_freq_threshold)§
Parse news for training set and test set
- Parameters:
source (str) – Source news file
target (str) – Target news file
category2int_path (str) – Path to category2int file. If mode == ‘train’: Path to save. If mode == ‘test’: Path to load from.
word2int_path (str) – Path to word2int file. If mode == ‘train’: Path to save. If mode == ‘test’: Path to load from.
entity2int_path (str) – Path to entity2int file. If mode == ‘train’: Path to save. If mode == ‘test’: Path to load from.
mode (str) – Either ‘train’ or ‘test’
num_words_title (long) – number of words in title
num_words_abstract (long) – number of words in abstract
entity_confidence_threshold (float) – Confidence threshold of entities
word_freq_threshold (float) – Threshold for word frequency
entity_freq_threshold (float) – Threshold for entity frequency
- src.datamodules.mind.parse.parse_news_bert(source, target, column)§
Parse news for the BERT baseline. Generate a BERT embedding for the text in the news dataframe.
- Parameters:
source – source news file
target – target news file
column – the text column that will represent the news
- Returns:
news_parsed(news_id, text)
- Return type:
DataFrame
- src.datamodules.mind.parse.transform_entity_embedding(source, target, entity2int_path, entity_embedding_dim)§
Transform entity embedding
- Parameters:
source (str) – Path of embedding file
target (str) – Path of transformed embedding file in numpy format
entity2int_path (str) – Path to entity ids file
src.datamodules.mind.preprocessing module§
- src.datamodules.mind.preprocessing.create_knowledge_graph_file(model, paths, use_categories, use_subcategories, use_title_entities, use_abstract_entities, use_title_tokens, use_wikidata)§
Creates a news article knowledge graph file including the properties enabled by the use_* arguments.
- Parameters:
model (str) – Model name, either “MKR” or “RippleNet”. Necessary for filepath.
paths (list) – Contains paths to the train and/or validation and/or test news files.
use_categories (boolean) – Whether to use news categories in knowledge graph.
use_subcategories (boolean) – Whether to use news subcategories in knowledge graph.
use_title_entities (boolean) – Whether to use news title entities in knowledge graph.
use_abstract_entities (boolean) – Whether to use news abstract entities in knowledge graph.
use_title_tokens (boolean) – Whether to use news title tokens in knowledge graph.
use_wikidata (boolean) – Whether to use additional wikidata knowledge graph in knowledge graph.
- Returns:
Tuple (path1, path2) where path1 is the path to the file containing the knowledge graph and path2 is the path to the file containing the item index to entity id hashes.
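A hedged call sketch; the news file paths and flag choices are illustrative:
kg_path, index_path = create_knowledge_graph_file(
    model="MKR",
    paths=["data/MIND/train/news.tsv", "data/MIND/valid/news.tsv"],
    use_categories=True,
    use_subcategories=True,
    use_title_entities=True,
    use_abstract_entities=False,
    use_title_tokens=False,
    use_wikidata=False,
)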
- src.datamodules.mind.preprocessing.create_rating_file(paths, path_to_news_ids, model)§
Creates a file specifying which news articles have been read by users and which have not
- Parameters:
paths (list) – List containing paths to behaviours files (train, validation, test)
path_to_news_ids (str) – Path to file containing news ids and corresponding item index
model (str) – Model name, either “MKR” or “RippleNet”. Necessary for path
- src.datamodules.mind.preprocessing.create_rating_file_collaborative(paths, model)§
Creates a file specifying which news articles have been read by users and which have not. Exclusively for the Collaborative Filtering model
- Parameters:
paths (list) – List containing paths to behaviours files (train, validation, test)
model (str) – Model name, either “MKR” or “RippleNet”. Necessary for path
- src.datamodules.mind.preprocessing.prepare_numpy_data(path, model)§
Converts the rating.txt file to numpy format
- Parameters:
path (str) – Path to the ratings.txt file
model (str) – Model name, either “MKR” or “RippleNet”. Necessary for path
- Returns:
Numpy.ndarray containing the rating data in numpy format
- src.datamodules.mind.preprocessing.prepare_numpy_kg(path, model)§
Converts the knowledge graph.txt file to numpy format
- Parameters:
path (str) – Path to the knowledge graph.txt file
model (str) – Model name, either “MKR” or “RippleNet”. Necessary for path
- Returns:
numpy.ndarray containing the knowledge graph data in numpy format