src.datamodules.mind package§

Submodules§

src.datamodules.mind.datamodule_BERT module§

class src.datamodules.mind.datamodule_BERT.MINDDataModuleBERT(mind_size='demo', data_dir=None, batch_size: int = 64, num_workers: int = 0, pin_memory: bool = False, download=True, column='title', bert_model=None, tokenizer=None)§

Bases: MINDDataModule
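
A minimal, illustrative usage sketch (not part of the generated docstring): constructing the datamodule with a Hugging Face tokenizer and encoder. The bert-base-uncased checkpoint and the data_dir path are assumptions.

from transformers import AutoModel, AutoTokenizer

from src.datamodules.mind.datamodule_BERT import MINDDataModuleBERT

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")   # assumed checkpoint
bert_model = AutoModel.from_pretrained("bert-base-uncased")

dm = MINDDataModuleBERT(
    mind_size="demo",
    data_dir="data/MIND",        # hypothetical path
    batch_size=64,
    num_workers=4,
    download=True,
    column="title",              # news text column to embed
    bert_model=bert_model,
    tokenizer=tokenizer,
)
dm.prepare_data()
dm.setup(stage="fit")
train_loader = dm.train_dataloader()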

news_dataframe(step, device=None)§
prepare_data() None§

Use this to download and prepare data. Downloading and saving data with multiple processes (distributed settings) will result in corrupted data. Lightning ensures this method is called only within a single process, so you can safely add your downloading logic within.

Warning

DO NOT set state to the model (use setup instead) since this is NOT called on every device

Example:

def prepare_data(self):
    # good
    download_data()
    tokenize()
    etc()

    # bad
    self.split = data_split
    self.some_state = some_other_state()

In a distributed environment, prepare_data can be called in two ways (using prepare_data_per_node)

  1. Once per node. This is the default and is only called on LOCAL_RANK=0.

  2. Once in total. Only called on GLOBAL_RANK=0.

Example:

# DEFAULT
# called once per node on LOCAL_RANK=0 of that node
class LitDataModule(LightningDataModule):
    def __init__(self):
        super().__init__()
        self.prepare_data_per_node = True


# call on GLOBAL_RANK=0 (great for shared file systems)
class LitDataModule(LightningDataModule):
    def __init__(self):
        super().__init__()
        self.prepare_data_per_node = False

This is called before requesting the dataloaders:

model.prepare_data()
initialize_distributed()
model.setup(stage)
model.train_dataloader()
model.val_dataloader()
model.test_dataloader()
model.predict_dataloader()
setup(stage=None)§

Called at the beginning of fit (train + validate), validate, test, or predict. This is a good hook when you need to build models dynamically or adjust something about them. This hook is called on every process when using DDP.

Parameters:

stage – either 'fit', 'validate', 'test', or 'predict'

Example:

class LitModel(...):
    def __init__(self):
        self.l1 = None

    def prepare_data(self):
        download_data()
        tokenize()

        # don't do this
        self.something = something_else()

    def setup(self, stage):
        data = load_data(...)
        self.l1 = nn.Linear(28, data.num_classes)
test_dataloader(batch_size=1)§

Implement one or multiple PyTorch DataLoaders for testing.

For data processing use the following pattern:

  • download in prepare_data()

  • process and split in setup()

However, the above are only necessary for distributed processing.

Warning

do not assign state in prepare_data

Note

Lightning adds the correct sampler for distributed and arbitrary hardware. There is no need to set it yourself.

Returns:

A torch.utils.data.DataLoader or a sequence of them specifying testing samples.

Example:

def test_dataloader(self):
    transform = transforms.Compose([transforms.ToTensor(),
                                    transforms.Normalize((0.5,), (1.0,))])
    dataset = MNIST(root='/path/to/mnist/', train=False, transform=transform,
                    download=True)
    loader = torch.utils.data.DataLoader(
        dataset=dataset,
        batch_size=self.batch_size,
        shuffle=False
    )

    return loader

# can also return multiple dataloaders
def test_dataloader(self):
    return [loader_a, loader_b, ..., loader_n]

Note

If you don’t need a test dataset and a test_step(), you don’t need to implement this method.

Note

In the case where you return multiple test dataloaders, the test_step() will have an argument dataloader_idx which matches the order here.

train_dataloader(batch_size=1)§

Implement one or more PyTorch DataLoaders for training.

Returns:

A collection of torch.utils.data.DataLoader specifying training samples. In the case of multiple dataloaders, please see this section.

The dataloader you return will not be reloaded unless you set :paramref:`~pytorch_lightning.trainer.Trainer.reload_dataloaders_every_n_epochs` to a positive integer.

For data processing use the following pattern:

  • download in prepare_data()

  • process and split in setup()

However, the above are only necessary for distributed processing.

Warning

do not assign state in prepare_data

Note

Lightning adds the correct sampler for distributed and arbitrary hardware. There is no need to set it yourself.

Example:

# single dataloader
def train_dataloader(self):
    transform = transforms.Compose([transforms.ToTensor(),
                                    transforms.Normalize((0.5,), (1.0,))])
    dataset = MNIST(root='/path/to/mnist/', train=True, transform=transform,
                    download=True)
    loader = torch.utils.data.DataLoader(
        dataset=dataset,
        batch_size=self.batch_size,
        shuffle=True
    )
    return loader

# multiple dataloaders, return as list
def train_dataloader(self):
    mnist = MNIST(...)
    cifar = CIFAR(...)
    mnist_loader = torch.utils.data.DataLoader(
        dataset=mnist, batch_size=self.batch_size, shuffle=True
    )
    cifar_loader = torch.utils.data.DataLoader(
        dataset=cifar, batch_size=self.batch_size, shuffle=True
    )
    # each batch will be a list of tensors: [batch_mnist, batch_cifar]
    return [mnist_loader, cifar_loader]

# multiple dataloader, return as dict
def train_dataloader(self):
    mnist = MNIST(...)
    cifar = CIFAR(...)
    mnist_loader = torch.utils.data.DataLoader(
        dataset=mnist, batch_size=self.batch_size, shuffle=True
    )
    cifar_loader = torch.utils.data.DataLoader(
        dataset=cifar, batch_size=self.batch_size, shuffle=True
    )
    # each batch will be a dict of tensors: {'mnist': batch_mnist, 'cifar': batch_cifar}
    return {'mnist': mnist_loader, 'cifar': cifar_loader}
val_dataloader(batch_size=1)§

Implement one or multiple PyTorch DataLoaders for validation.

The dataloader you return will not be reloaded unless you set :paramref:`~pytorch_lightning.trainer.Trainer.reload_dataloaders_every_n_epochs` to a positive integer.

It’s recommended that all data downloads and preparation happen in prepare_data().

Note

Lightning adds the correct sampler for distributed and arbitrary hardware. There is no need to set it yourself.

Returns:

A torch.utils.data.DataLoader or a sequence of them specifying validation samples.

Examples:

def val_dataloader(self):
    transform = transforms.Compose([transforms.ToTensor(),
                                    transforms.Normalize((0.5,), (1.0,))])
    dataset = MNIST(root='/path/to/mnist/', train=False,
                    transform=transform, download=True)
    loader = torch.utils.data.DataLoader(
        dataset=dataset,
        batch_size=self.batch_size,
        shuffle=False
    )

    return loader

# can also return multiple dataloaders
def val_dataloader(self):
    return [loader_a, loader_b, ..., loader_n]

Note

If you don’t need a validation dataset and a validation_step(), you don’t need to implement this method.

Note

In the case where you return multiple validation dataloaders, the validation_step() will have an argument dataloader_idx which matches the order here.

src.datamodules.mind.datamodule_Base module§

class src.datamodules.mind.datamodule_Base.MINDDataModule(data_dir, mind_size, batch_size, num_workers, download, pin_memory)§

Bases: LightningDataModule

Base Datamodule for the MIND dataset

Parameters:
  • data_dir (str) – Data directory

  • batch_size (int) – Batch size for dataloaders

  • num_workers (int) – Number of workers for dataloaders

  • pin_memory (bool) – Whether to use pin memory

  • download (bool) – Whether the MIND dataset should be downloaded

  • mind_size (str) – Which dataset size should be used

  • train_val_test_split (list) – Whether to use automatic train-validation-test data splits

prepare() None§
setup(stage=None)§

Called at the beginning of fit (train + validate), validate, test, or predict. This is a good hook when you need to build models dynamically or adjust something about them. This hook is called on every process when using DDP.

Parameters:

stage – either 'fit', 'validate', 'test', or 'predict'

Example:

class LitModel(...):
    def __init__(self):
        self.l1 = None

    def prepare_data(self):
        download_data()
        tokenize()

        # don't do this
        self.something = something_else()

    def setup(self, stage):
        data = load_data(...)
        self.l1 = nn.Linear(28, data.num_classes)
test_dataloader()§

Implement one or multiple PyTorch DataLoaders for testing.

For data processing use the following pattern:

  • download in prepare_data()

  • process and split in setup()

However, the above are only necessary for distributed processing.

Warning

do not assign state in prepare_data

Note

Lightning adds the correct sampler for distributed and arbitrary hardware. There is no need to set it yourself.

Returns:

A torch.utils.data.DataLoader or a sequence of them specifying testing samples.

Example:

def test_dataloader(self):
    transform = transforms.Compose([transforms.ToTensor(),
                                    transforms.Normalize((0.5,), (1.0,))])
    dataset = MNIST(root='/path/to/mnist/', train=False, transform=transform,
                    download=True)
    loader = torch.utils.data.DataLoader(
        dataset=dataset,
        batch_size=self.batch_size,
        shuffle=False
    )

    return loader

# can also return multiple dataloaders
def test_dataloader(self):
    return [loader_a, loader_b, ..., loader_n]

Note

If you don’t need a test dataset and a test_step(), you don’t need to implement this method.

Note

In the case where you return multiple test dataloaders, the test_step() will have an argument dataloader_idx which matches the order here.

train_dataloader()§

Implement one or more PyTorch DataLoaders for training.

Returns:

A collection of torch.utils.data.DataLoader specifying training samples. In the case of multiple dataloaders, please see this section.

The dataloader you return will not be reloaded unless you set :paramref:`~pytorch_lightning.trainer.Trainer.reload_dataloaders_every_n_epochs` to a positive integer.

For data processing use the following pattern:

  • download in prepare_data()

  • process and split in setup()

However, the above are only necessary for distributed processing.

Warning

do not assign state in prepare_data

Note

Lightning adds the correct sampler for distributed and arbitrary hardware. There is no need to set it yourself.

Example:

# single dataloader
def train_dataloader(self):
    transform = transforms.Compose([transforms.ToTensor(),
                                    transforms.Normalize((0.5,), (1.0,))])
    dataset = MNIST(root='/path/to/mnist/', train=True, transform=transform,
                    download=True)
    loader = torch.utils.data.DataLoader(
        dataset=dataset,
        batch_size=self.batch_size,
        shuffle=True
    )
    return loader

# multiple dataloaders, return as list
def train_dataloader(self):
    mnist = MNIST(...)
    cifar = CIFAR(...)
    mnist_loader = torch.utils.data.DataLoader(
        dataset=mnist, batch_size=self.batch_size, shuffle=True
    )
    cifar_loader = torch.utils.data.DataLoader(
        dataset=cifar, batch_size=self.batch_size, shuffle=True
    )
    # each batch will be a list of tensors: [batch_mnist, batch_cifar]
    return [mnist_loader, cifar_loader]

# multiple dataloader, return as dict
def train_dataloader(self):
    mnist = MNIST(...)
    cifar = CIFAR(...)
    mnist_loader = torch.utils.data.DataLoader(
        dataset=mnist, batch_size=self.batch_size, shuffle=True
    )
    cifar_loader = torch.utils.data.DataLoader(
        dataset=cifar, batch_size=self.batch_size, shuffle=True
    )
    # each batch will be a dict of tensors: {'mnist': batch_mnist, 'cifar': batch_cifar}
    return {'mnist': mnist_loader, 'cifar': cifar_loader}
val_dataloader()§

Implement one or multiple PyTorch DataLoaders for validation.

The dataloader you return will not be reloaded unless you set :paramref:`~pytorch_lightning.trainer.Trainer.reload_dataloaders_every_n_epochs` to a positive integer.

It’s recommended that all data downloads and preparation happen in prepare_data().

  • fit()

  • validate()

  • prepare_data()

  • setup()

Note

Lightning adds the correct sampler for distributed and arbitrary hardware. There is no need to set it yourself.

Returns:

A torch.utils.data.DataLoader or a sequence of them specifying validation samples.

Examples:

def val_dataloader(self):
    transform = transforms.Compose([transforms.ToTensor(),
                                    transforms.Normalize((0.5,), (1.0,))])
    dataset = MNIST(root='/path/to/mnist/', train=False,
                    transform=transform, download=True)
    loader = torch.utils.data.DataLoader(
        dataset=dataset,
        batch_size=self.batch_size,
        shuffle=False
    )

    return loader

# can also return multiple dataloaders
def val_dataloader(self):
    return [loader_a, loader_b, ..., loader_n]

Note

If you don’t need a validation dataset and a validation_step(), you don’t need to implement this method.

Note

In the case where you return multiple validation dataloaders, the validation_step() will have an argument dataloader_idx which matches the order here.

src.datamodules.mind.datamodule_CollaborativeFiltering module§

class src.datamodules.mind.datamodule_CollaborativeFiltering.MINDDataModuleCollaborativeFiltering(data_dir, batch_size, num_workers, pin_memory, download, mind_size)§

Bases: MINDDataModule

Datamodule for the Collaborative Filtering model using the MIND dataset

Parameters:
  • data_dir (str) – Data directory

  • batch_size (int) – Batch size for dataloaders

  • num_workers (int) – Number of workers for dataloaders

  • pin_memory (bool) – Whether to use pin memory

  • download (bool) – Whether the MIND dataset should be downloaded

  • mind_size (str) – Dataset size
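
Hedged usage sketch (not from the docstrings): the standard LightningDataModule flow for this class, following the call order documented under prepare_data() below. The data_dir path is an assumption.

from src.datamodules.mind.datamodule_CollaborativeFiltering import (
    MINDDataModuleCollaborativeFiltering,
)

dm = MINDDataModuleCollaborativeFiltering(
    data_dir="data/MIND",    # hypothetical path
    batch_size=128,
    num_workers=4,
    pin_memory=False,
    download=True,
    mind_size="demo",
)
dm.prepare_data()            # download MIND and create the ratings files
dm.setup(stage="fit")        # build the ratings datasets
train_loader = dm.train_dataloader()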

prepare()§

Prepare data for model usage, prior to model instantiation. Create ratings files for training, validation, testing.

prepare_data()§

Use this to download and prepare data. Downloading and saving data with multiple processes (distributed settings) will result in corrupted data. Lightning ensures this method is called only within a single process, so you can safely add your downloading logic within.

Warning

DO NOT set state to the model (use setup instead) since this is NOT called on every device

Example:

def prepare_data(self):
    # good
    download_data()
    tokenize()
    etc()

    # bad
    self.split = data_split
    self.some_state = some_other_state()

In a distributed environment, prepare_data can be called in two ways (using prepare_data_per_node)

  1. Once per node. This is the default and is only called on LOCAL_RANK=0.

  2. Once in total. Only called on GLOBAL_RANK=0.

Example:

# DEFAULT
# called once per node on LOCAL_RANK=0 of that node
class LitDataModule(LightningDataModule):
    def __init__(self):
        super().__init__()
        self.prepare_data_per_node = True


# call on GLOBAL_RANK=0 (great for shared file systems)
class LitDataModule(LightningDataModule):
    def __init__(self):
        super().__init__()
        self.prepare_data_per_node = False

This is called before requesting the dataloaders:

model.prepare_data()
initialize_distributed()
model.setup(stage)
model.train_dataloader()
model.val_dataloader()
model.test_dataloader()
model.predict_dataloader()
setup(stage=None)§

Create ratings datasets, knowledge graph dataset for dataloaders.

train_dataloader()§

Implement one or more PyTorch DataLoaders for training.

Returns:

A collection of torch.utils.data.DataLoader specifying training samples. In the case of multiple dataloaders, please see this section.

The dataloader you return will not be reloaded unless you set :paramref:`~pytorch_lightning.trainer.Trainer.reload_dataloaders_every_n_epochs` to a positive integer.

For data processing use the following pattern:

  • download in prepare_data()

  • process and split in setup()

However, the above are only necessary for distributed processing.

Warning

do not assign state in prepare_data

Note

Lightning adds the correct sampler for distributed and arbitrary hardware. There is no need to set it yourself.

Example:

# single dataloader
def train_dataloader(self):
    transform = transforms.Compose([transforms.ToTensor(),
                                    transforms.Normalize((0.5,), (1.0,))])
    dataset = MNIST(root='/path/to/mnist/', train=True, transform=transform,
                    download=True)
    loader = torch.utils.data.DataLoader(
        dataset=dataset,
        batch_size=self.batch_size,
        shuffle=True
    )
    return loader

# multiple dataloaders, return as list
def train_dataloader(self):
    mnist = MNIST(...)
    cifar = CIFAR(...)
    mnist_loader = torch.utils.data.DataLoader(
        dataset=mnist, batch_size=self.batch_size, shuffle=True
    )
    cifar_loader = torch.utils.data.DataLoader(
        dataset=cifar, batch_size=self.batch_size, shuffle=True
    )
    # each batch will be a list of tensors: [batch_mnist, batch_cifar]
    return [mnist_loader, cifar_loader]

# multiple dataloader, return as dict
def train_dataloader(self):
    mnist = MNIST(...)
    cifar = CIFAR(...)
    mnist_loader = torch.utils.data.DataLoader(
        dataset=mnist, batch_size=self.batch_size, shuffle=True
    )
    cifar_loader = torch.utils.data.DataLoader(
        dataset=cifar, batch_size=self.batch_size, shuffle=True
    )
    # each batch will be a dict of tensors: {'mnist': batch_mnist, 'cifar': batch_cifar}
    return {'mnist': mnist_loader, 'cifar': cifar_loader}

src.datamodules.mind.datamodule_MKR module§

class src.datamodules.mind.datamodule_MKR.MINDDataModuleMKR(use_categories, use_subcategories, use_title_entities, use_abstract_entities, use_title_tokens, use_wikidata, data_dir, batch_size, num_workers, pin_memory, download, mind_size)§

Bases: MINDDataModule

Datamodule for the MKR model using the MIND dataset

Parameters:
  • use_categories (bool) – Whether the data preprocessing includes news categories

  • use_subcategories (bool) – Whether the data preprocessing includes news subcategories

  • use_title_entities (bool) – Whether the data preprocessing includes news title entities

  • use_abstract_entities (bool) – Whether the data preprocessing includes news abstract entities

  • use_title_tokens (bool) – Whether the data preprocessing includes news title tokens

  • use_wikidata (bool) – Whether the data preprocessing includes additional news entity wikidata knowledge graph

  • data_dir (str) – Data directory

  • batch_size (int) – Batch size for dataloaders

  • num_workers (int) – Number of workers for dataloaders

  • pin_memory (bool) – Whether to use pin memory

  • download (bool) – Whether the MIND dataset should be downloaded

  • mind_size (str) – Dataset size
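
Illustrative sketch (assumption, not part of the docstring): constructing the MKR datamodule with the knowledge-graph feature flags and handing it to a Lightning Trainer. MKRModel is a hypothetical model class used only to show the hand-off.

import pytorch_lightning as pl

from src.datamodules.mind.datamodule_MKR import MINDDataModuleMKR

dm = MINDDataModuleMKR(
    use_categories=True,
    use_subcategories=True,
    use_title_entities=True,
    use_abstract_entities=False,
    use_title_tokens=False,
    use_wikidata=True,
    data_dir="data/MIND",    # hypothetical path
    batch_size=128,
    num_workers=4,
    pin_memory=False,
    download=True,
    mind_size="demo",
)

# model = MKRModel(...)                               # hypothetical model class
# pl.Trainer(max_epochs=5).fit(model, datamodule=dm)  # Lightning drives the hooks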

prepare()§

Prepare data for model usage, prior to model instantiation. Download wikidata knowledge graph, create knowledge graph and ratings files for training, validation, testing.

prepare_data() None§

Use this to download and prepare data. Downloading and saving data with multiple processes (distributed settings) will result in corrupted data. Lightning ensures this method is called only within a single process, so you can safely add your downloading logic within.

Warning

DO NOT set state to the model (use setup instead) since this is NOT called on every device

Example:

def prepare_data(self):
    # good
    download_data()
    tokenize()
    etc()

    # bad
    self.split = data_split
    self.some_state = some_other_state()

In a distributed environment, prepare_data can be called in two ways (using prepare_data_per_node)

  1. Once per node. This is the default and is only called on LOCAL_RANK=0.

  2. Once in total. Only called on GLOBAL_RANK=0.

Example:

# DEFAULT
# called once per node on LOCAL_RANK=0 of that node
class LitDataModule(LightningDataModule):
    def __init__(self):
        super().__init__()
        self.prepare_data_per_node = True


# call on GLOBAL_RANK=0 (great for shared file systems)
class LitDataModule(LightningDataModule):
    def __init__(self):
        super().__init__()
        self.prepare_data_per_node = False

This is called before requesting the dataloaders:

model.prepare_data()
initialize_distributed()
model.setup(stage)
model.train_dataloader()
model.val_dataloader()
model.test_dataloader()
model.predict_dataloader()
setup(stage)§

Create ratings datasets and knowledge graph dataset for dataloaders.

train_dataloader()§

Implement one or more PyTorch DataLoaders for training.

Returns:

A collection of torch.utils.data.DataLoader specifying training samples. In the case of multiple dataloaders, please see this section.

The dataloader you return will not be reloaded unless you set :paramref:`~pytorch_lightning.trainer.Trainer.reload_dataloaders_every_n_epochs` to a positive integer.

For data processing use the following pattern:

  • download in prepare_data()

  • process and split in setup()

However, the above are only necessary for distributed processing.

Warning

do not assign state in prepare_data

Note

Lightning adds the correct sampler for distributed and arbitrary hardware. There is no need to set it yourself.

Example:

# single dataloader
def train_dataloader(self):
    transform = transforms.Compose([transforms.ToTensor(),
                                    transforms.Normalize((0.5,), (1.0,))])
    dataset = MNIST(root='/path/to/mnist/', train=True, transform=transform,
                    download=True)
    loader = torch.utils.data.DataLoader(
        dataset=dataset,
        batch_size=self.batch_size,
        shuffle=True
    )
    return loader

# multiple dataloaders, return as list
def train_dataloader(self):
    mnist = MNIST(...)
    cifar = CIFAR(...)
    mnist_loader = torch.utils.data.DataLoader(
        dataset=mnist, batch_size=self.batch_size, shuffle=True
    )
    cifar_loader = torch.utils.data.DataLoader(
        dataset=cifar, batch_size=self.batch_size, shuffle=True
    )
    # each batch will be a list of tensors: [batch_mnist, batch_cifar]
    return [mnist_loader, cifar_loader]

# multiple dataloader, return as dict
def train_dataloader(self):
    mnist = MNIST(...)
    cifar = CIFAR(...)
    mnist_loader = torch.utils.data.DataLoader(
        dataset=mnist, batch_size=self.batch_size, shuffle=True
    )
    cifar_loader = torch.utils.data.DataLoader(
        dataset=cifar, batch_size=self.batch_size, shuffle=True
    )
    # each batch will be a dict of tensors: {'mnist': batch_mnist, 'cifar': batch_cifar}
    return {'mnist': mnist_loader, 'cifar': cifar_loader}

src.datamodules.mind.datamodule_NAML module§

class src.datamodules.mind.datamodule_NAML.MINDDataModuleNAML(dataset_attributes, mind_size='small', data_dir=None, batch_size: int = 64, num_workers: int = 0, pin_memory: bool = False, num_clicked_news_a_user=50, num_words_title=20, num_words_abstract=50, word_freq_threshold=1, entity_freq_threshold=2, entity_confidence_threshold=0.5, negative_sampling_ratio=2, word_embedding_dim=300, entity_embedding_dim=100, download=True, glove_size=6)§

Bases: MINDDataModule

Datamodule for the NAML model using the MIND dataset

Code based on https://github.com/Microsoft/Recommenders

Parameters:
  • dataset_attributes (dict) – Attributes are set based on the model

  • mind_size (string) – Size of the MIND Dataset (demo, small, large)

  • data_dir (Optional[string]) – Path of the data directory for the dataset

  • batch_size (int) – Batch size for dataloaders

  • num_workers (int) – Number of workers for dataloaders

  • pin_memory (bool) – Requires more memory but might improve performance

  • num_clicked_news_a_user (int) – Number of clicked news for each user

  • num_words_title (int) – Number of words in the title

  • num_words_abstract (int) – Number of words in the abstract

  • word_freq_threshold (int) – Frequency threshold of words

  • entity_freq_threshold (int) – Frequency threshold of entities

  • entity_confidence_threshold (float) – Confidence threshold of entities

  • negative_sampling_ratio (int) – Negative sampling ratio

  • word_embedding_dim (int) – Dimension of word embeddings

  • entity_embedding_dim (int) – Dimension of entity embeddings

  • download (bool) – Enable the download and extraction of the MIND dataset. When set to false, extract data must be available in data_dir.

  • glove_size (int) – Size of Glove embeddings to download
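
Illustrative sketch (assumption): instantiating the NAML datamodule. The contents of dataset_attributes are model-dependent; the keys shown here are hypothetical.

from src.datamodules.mind.datamodule_NAML import MINDDataModuleNAML

dataset_attributes = {
    # hypothetical structure; use the attributes your NAML variant expects
    "news": ["category", "subcategory", "title", "abstract"],
    "record": ["user", "clicked_news_length"],
}

dm = MINDDataModuleNAML(
    dataset_attributes=dataset_attributes,
    mind_size="small",
    data_dir="data/MIND",        # hypothetical path
    batch_size=64,
    num_workers=4,
    negative_sampling_ratio=2,
    word_embedding_dim=300,
    glove_size=6,                # presumably the GloVe 6B embeddings
)
dm.prepare_data()
dm.setup(stage="fit")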

news_dataloader(step, device=None)§
prepare() None§
setup(stage)§

Called at the beginning of fit (train + validate), validate, test, or predict. This is a good hook when you need to build models dynamically or adjust something about them. This hook is called on every process when using DDP.

Parameters:

stage – either 'fit', 'validate', 'test', or 'predict'

Example:

class LitModel(...):
    def __init__(self):
        self.l1 = None

    def prepare_data(self):
        download_data()
        tokenize()

        # don't do this
        self.something = something_else()

    def setup(self, stage):
        data = load_data(...)
        self.l1 = nn.Linear(28, data.num_classes)
test_dataloader(batch_size=1)§

Implement one or multiple PyTorch DataLoaders for testing.

For data processing use the following pattern:

  • download in prepare_data()

  • process and split in setup()

However, the above are only necessary for distributed processing.

Warning

do not assign state in prepare_data

Note

Lightning adds the correct sampler for distributed and arbitrary hardware. There is no need to set it yourself.

Returns:

A torch.utils.data.DataLoader or a sequence of them specifying testing samples.

Example:

def test_dataloader(self):
    transform = transforms.Compose([transforms.ToTensor(),
                                    transforms.Normalize((0.5,), (1.0,))])
    dataset = MNIST(root='/path/to/mnist/', train=False, transform=transform,
                    download=True)
    loader = torch.utils.data.DataLoader(
        dataset=dataset,
        batch_size=self.batch_size,
        shuffle=False
    )

    return loader

# can also return multiple dataloaders
def test_dataloader(self):
    return [loader_a, loader_b, ..., loader_n]

Note

If you don’t need a test dataset and a test_step(), you don’t need to implement this method.

Note

In the case where you return multiple test dataloaders, the test_step() will have an argument dataloader_idx which matches the order here.

user_dataloader(step)§
val_dataloader(batch_size=1)§

Implement one or multiple PyTorch DataLoaders for validation.

The dataloader you return will not be reloaded unless you set :paramref:`~pytorch_lightning.trainer.Trainer.reload_dataloaders_every_n_epochs` to a positive integer.

It’s recommended that all data downloads and preparation happen in prepare_data().

  • fit()

  • validate()

  • prepare_data()

  • setup()

Note

Lightning adds the correct sampler for distributed and arbitrary hardware. There is no need to set it yourself.

Returns:

A torch.utils.data.DataLoader or a sequence of them specifying validation samples.

Examples:

def val_dataloader(self):
    transform = transforms.Compose([transforms.ToTensor(),
                                    transforms.Normalize((0.5,), (1.0,))])
    dataset = MNIST(root='/path/to/mnist/', train=False,
                    transform=transform, download=True)
    loader = torch.utils.data.DataLoader(
        dataset=dataset,
        batch_size=self.batch_size,
        shuffle=False
    )

    return loader

# can also return multiple dataloaders
def val_dataloader(self):
    return [loader_a, loader_b, ..., loader_n]

Note

If you don’t need a validation dataset and a validation_step(), you don’t need to implement this method.

Note

In the case where you return multiple validation dataloaders, the validation_step() will have an argument dataloader_idx which matches the order here.

src.datamodules.mind.datamodule_RippleNet module§

class src.datamodules.mind.datamodule_RippleNet.MINDDataModuleRippleNet(use_categories, use_subcategories, use_title_entities, use_abstract_entities, use_title_tokens, use_wikidata, data_dir, batch_size, num_workers, pin_memory, download, mind_size)§

Bases: MINDDataModule

Datamodule for the RippleNet model using the MIND dataset

Parameters:
  • use_categories (bool) – Whether the data preprocessing includes news categories

  • use_subcategories (bool) – Whether the data preprocessing includes news subcategories

  • use_title_entities (bool) – Whether the data preprocessing includes news title entities

  • use_abstract_entities (bool) – Whether the data preprocessing includes news abstract entities

  • use_title_tokens (bool) – Whether the data preprocessing includes news title tokens

  • use_wikidata (bool) – Whether the data preprocessing includes additional news entity wikidata knowledge graph

  • data_dir (str) – Data directory

  • batch_size (int) – Batch size for dataloaders

  • num_workers (int) – Number of workers for dataloaders

  • pin_memory (bool) – Whether to use pin memory

  • download (bool) – Whether the MIND dataset should be downloaded

  • mind_size (str) – Dataset size

prepare()§

Prepare data for model usage, prior to model instantiation. Download wikidata knowledge graph, create knowledge graph and ratings files for training, validation, testing.

prepare_data() None§

Use this to download and prepare data. Downloading and saving data with multiple processes (distributed settings) will result in corrupted data. Lightning ensures this method is called only within a single process, so you can safely add your downloading logic within.

Warning

DO NOT set state to the model (use setup instead) since this is NOT called on every device

Example:

def prepare_data(self):
    # good
    download_data()
    tokenize()
    etc()

    # bad
    self.split = data_split
    self.some_state = some_other_state()

In a distributed environment, prepare_data can be called in two ways (using prepare_data_per_node)

  1. Once per node. This is the default and is only called on LOCAL_RANK=0.

  2. Once in total. Only called on GLOBAL_RANK=0.

Example:

# DEFAULT
# called once per node on LOCAL_RANK=0 of that node
class LitDataModule(LightningDataModule):
    def __init__(self):
        super().__init__()
        self.prepare_data_per_node = True


# call on GLOBAL_RANK=0 (great for shared file systems)
class LitDataModule(LightningDataModule):
    def __init__(self):
        super().__init__()
        self.prepare_data_per_node = False

This is called before requesting the dataloaders:

model.prepare_data()
initialize_distributed()
model.setup(stage)
model.train_dataloader()
model.val_dataloader()
model.test_dataloader()
model.predict_dataloader()
setup(stage=None)§

Create ratings datasets for dataloaders.

train_dataloader()§

Implement one or more PyTorch DataLoaders for training.

Returns:

A collection of torch.utils.data.DataLoader specifying training samples. In the case of multiple dataloaders, please see this section.

The dataloader you return will not be reloaded unless you set :paramref:`~pytorch_lightning.trainer.Trainer.reload_dataloaders_every_n_epochs` to a positive integer.

For data processing use the following pattern:

  • download in prepare_data()

  • process and split in setup()

However, the above are only necessary for distributed processing.

Warning

do not assign state in prepare_data

Note

Lightning adds the correct sampler for distributed and arbitrary hardware. There is no need to set it yourself.

Example:

# single dataloader
def train_dataloader(self):
    transform = transforms.Compose([transforms.ToTensor(),
                                    transforms.Normalize((0.5,), (1.0,))])
    dataset = MNIST(root='/path/to/mnist/', train=True, transform=transform,
                    download=True)
    loader = torch.utils.data.DataLoader(
        dataset=dataset,
        batch_size=self.batch_size,
        shuffle=True
    )
    return loader

# multiple dataloaders, return as list
def train_dataloader(self):
    mnist = MNIST(...)
    cifar = CIFAR(...)
    mnist_loader = torch.utils.data.DataLoader(
        dataset=mnist, batch_size=self.batch_size, shuffle=True
    )
    cifar_loader = torch.utils.data.DataLoader(
        dataset=cifar, batch_size=self.batch_size, shuffle=True
    )
    # each batch will be a list of tensors: [batch_mnist, batch_cifar]
    return [mnist_loader, cifar_loader]

# multiple dataloader, return as dict
def train_dataloader(self):
    mnist = MNIST(...)
    cifar = CIFAR(...)
    mnist_loader = torch.utils.data.DataLoader(
        dataset=mnist, batch_size=self.batch_size, shuffle=True
    )
    cifar_loader = torch.utils.data.DataLoader(
        dataset=cifar, batch_size=self.batch_size, shuffle=True
    )
    # each batch will be a dict of tensors: {'mnist': batch_mnist, 'cifar': batch_cifar}
    return {'mnist': mnist_loader, 'cifar': cifar_loader}

src.datamodules.mind.dataset module§

class src.datamodules.mind.dataset.BaseDataset(behaviors_path, news_path, dataset_attributes, num_words_title, num_words_abstract, num_clicked_news_a_user)§

Bases: Dataset

Base Dataset for training

Parameters:
  • behaviors_path (str) – Path to behaviors file

  • news_path (str) – Path to news file

  • dataset_attributes (list) – Dataset attributes

  • num_words_title (int) – Number of title words

  • num_words_abstract (int) – Number of abstract words

  • num_clicked_news_a_user (int) – Number of clicked news

class src.datamodules.mind.dataset.BehaviorsBERTDataset(behaviors_path)§

Bases: Dataset

Behaviors dataset for BERT model

Parameters:

behaviors_path (str) – Path to behaviors file

class src.datamodules.mind.dataset.BehaviorsDataset(behaviors_path)§

Bases: Dataset

User behaviors dataset for evaluation. Each (user, time) pair is treated as a session.

Parameters:

behaviors_path (str) – Path to behaviors file

class src.datamodules.mind.dataset.KGDataset(numpy_data)§

Bases: Dataset

News knowledge graph dataset for dataloaders

Parameters:

numpy_data (numpy.ndarray) – Knowledge graph numpy data

class src.datamodules.mind.dataset.NewsBERTDataset(news_path)§

Bases: Dataset

News dataset for BERT model

Parameters:

news_path (str) – Path to news file

class src.datamodules.mind.dataset.NewsDataset(news_path, dataset_attributes)§

Bases: Dataset

News dataset for evaluation

Parameters:
  • news_path (str) – Path to news file

  • dataset_attributes (list) – Dataset attributes

to(device)§
class src.datamodules.mind.dataset.RatingsDataset(numpy_data, train: bool)§

Bases: Dataset

User Ratings knowledge graph dataset for dataloaders

Parameters:
  • numpy_data (numpy.ndarray) – Ratings numpy data

  • train (bool) – Whether the dataset contains training data
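
Minimal sketch (assumption): wrapping prepared ratings data in RatingsDataset and a standard DataLoader. The (user, item, label) column layout shown here is hypothetical.

import numpy as np
from torch.utils.data import DataLoader

from src.datamodules.mind.dataset import RatingsDataset

ratings = np.array([[0, 10, 1],
                    [0, 11, 0],
                    [1, 10, 1]])      # hypothetical (user, item, label) rows

train_set = RatingsDataset(ratings, train=True)
loader = DataLoader(train_set, batch_size=2, shuffle=True)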

class src.datamodules.mind.dataset.UserDataset(behaviors_path, user2int_path, num_clicked_news_a_user)§

Bases: Dataset

Users dataset for evaluation. Duplicated rows will be dropped

Parameters:
  • behaviors_path (str) – Path to behaviors file

  • user2int_path (str) – Path to user index file

  • num_clicked_news_a_user (int) – Number of clicked news for each user

src.datamodules.mind.download module§

src.datamodules.mind.download.download_and_extract_glove(zip_path=None, dest_path=None, glove_size=6)§

Download and extract the Glove embedding

Parameters:

dest_path (str) – Destination directory path for the downloaded file

Returns:

File path where Glove was extracted.

src.datamodules.mind.download.download_and_extract_mind(size='small', dest_path=None)§

Download and extract the MIND dataset

Parameters:
  • size (str) – Dataset size

  • dest_path (str) – Save path for the zip dataset

Returns:

Tuple (train_path, valid_path, test_path) where train_path is the path to the train folder, valid_path is the path to the validation folder and test_path is the path to the test folder
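
Hedged usage sketch: downloading the demo-sized split and unpacking the returned folder paths. The destination directory is an assumption.

from src.datamodules.mind.download import download_and_extract_mind

train_path, valid_path, test_path = download_and_extract_mind(
    size="demo",
    dest_path="data/MIND",    # hypothetical destination
)
print(train_path, valid_path, test_path)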

src.datamodules.mind.download.download_and_extract_wikidata_kg(dest_path, clean_zip_file)§

Download and extract the wikidata knowledge graph for the MIND dataset

Parameters:
  • dest_path (str) – Path for saving the downloaded zip file

  • clean_zip_file (bool) – Whether to delete the zip file after unzipping

Returns:

Path to the unzipped wikidata knowledge graph folder

src.datamodules.mind.download.extract_mind(train_zip, valid_zip, test_zip, root_folder=None, train_folder='train', valid_folder='valid', test_folder='test', clean_zip_file=False)§

Extract MIND dataset

Parameters:
  • train_zip (str) – Path to train zip file

  • valid_zip (str) – Path to valid zip file

  • train_folder (str) – Destination folder for train set

  • valid_folder (str) – Destination folder for validation set

Returns:

Tuple (path_train, path_valid) where path_train is the path to the training folder and path_valid is the path to the validation folder

src.datamodules.mind.download.generate_embeddings(data_path, news_words, news_entities, train_entities, valid_entities, max_sentence=10, word_embedding_dim=100)§

Generate embeddings.

Parameters:
  • data_path (str) – Data path.

  • news_words (dict) – News word dictionary.

  • news_entities (dict) – News entity dictionary.

  • train_entities (str) – Train entity file.

  • valid_entities (str) – Validation entity file.

  • max_sentence (int) – Max sentence size.

  • word_embedding_dim (int) – Word embedding dimension.

Returns:

Tuple containing the paths to the news, word and entity embeddings

src.datamodules.mind.download.get_train_input(session, train_file_path, npratio=4)§

Generate train file.

Parameters:
  • session (list) – List of user session with user_id, clicks, positive and negative interactions.

  • train_file_path (str) – Path to file.

  • npratio (int) – Ratio for negative sampling.

src.datamodules.mind.download.get_user_history(train_history, valid_history, user_history_path)§

Generate user history file.

Parameters:
  • train_history (list) – Train history.

  • valid_history (list) – Validation history

  • user_history_path (str) – Path to file.

src.datamodules.mind.download.get_valid_input(session, valid_file_path)§

Generate validation file.

Parameters:
  • session (list) – List of user session with user_id, clicks, positive and negative interactions.

  • valid_file_path (str) – Path to file.

src.datamodules.mind.download.load_glove_matrix(path_emb, word_dict, word_embedding_dim)§

Load the pretrained embedding matrix of the words in word_dict

Parameters:
  • path_emb (string) – Folder path of downloaded glove file

  • word_dict (dict) – Word dictionary

  • word_embedding_dim – Dimension of word embedding vectors

Returns:

Tuple (numpy.ndarray, list) containing the pretrained word embedding matrix and the list of words that were found in the GloVe files

src.datamodules.mind.download.read_clickhistory(path, filename)§

Read click history file

Parameters:
  • path (str) – Folder path

  • filename (str) – Filename

Returns:

Tuple (list, dict) where list contains user sessions (user_id, clicks, positive and negative interactions) and dict maps each user_id to its click history.

src.datamodules.mind.download.read_news(filepath, tokenizer)§

Read news file

Parameters:
  • filepath (str) – Path to news file

  • tokenizer (tokenizer) – Tokenizer for news title tokenization

Returns:

Tuple (news_words, news_entities, news_abstract_entities, news_categories, news_subcategories) where each item is a dictionary containing, per news article, the items indicated by the dictionary name

src.datamodules.mind.download.read_news_ids(filepath)§

Read news ids

Parameters:

filepath (str) – Path to news file

Returns:

Dictionary containing news identifiers and generated ids

src.datamodules.mind.download.word_tokenize(sent)§

Tokenize a sentence

Parameters:

sent (str) – Sentence to be tokenized

Returns:

Word list
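
Quick illustrative call; the exact token list depends on the implementation.

from src.datamodules.mind.download import word_tokenize

tokens = word_tokenize("Breaking: stocks rally after Fed decision")
# e.g. something like ['breaking', 'stocks', 'rally', 'after', 'fed', 'decision'],
# depending on how the tokenizer handles case and punctuation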

src.datamodules.mind.parse module§

src.datamodules.mind.parse.generate_word_embedding(source, target, word2int_path, word_embedding_dim)§

Generate a word embedding matrix from a pretrained word embedding file. If a word is not in the embedding file, its embedding is initialized from N(0, 1).

Parameters:
  • source (str) – Path of pretrained word embedding file, e.g. glove.840B.300d.txt

  • target (str) – Path for saving word embedding

  • word2int_path (str) – Path to vocabulary file
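
Illustration (not the project's code) of the fallback described above: words found in the pretrained file keep their vectors, missing words are drawn from a standard normal distribution N(0, 1).

import numpy as np

pretrained = {"news": np.array([0.1, 0.2, 0.3])}   # toy pretrained embeddings
word2int = {"news": 1, "unseenword": 2}            # toy vocabulary
dim = 3
rng = np.random.default_rng(0)

embedding = np.zeros((len(word2int) + 1, dim))     # row 0 left as padding
for word, idx in word2int.items():
    embedding[idx] = pretrained.get(word, rng.standard_normal(dim))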

src.datamodules.mind.parse.parse_behaviors(source, target, user2int_path, negative_sampling_ratio)§

Parse behaviors file in training set.

Parameters:
  • source (str) – Source behaviors file

  • target (str) – Target behaviors file

  • user2int_path (str) – Path for saving user2int file

Returns:

Number of users

src.datamodules.mind.parse.parse_behaviors_bert(source, target, news_ids_set)§

Parse behaviors for the BERT baseline. Gets the full click history (news IDs) and the candidate news from the impressions for each user.

Parameters:
  • source (str) – Source behaviors file

  • target (str) – Target behaviors file

Returns:

behaviors_parsed(id, history:<NEWS_IDS>, candidate_news<NEWS_IDS>, labels:<y_true>)

Return type:

DataFrame

src.datamodules.mind.parse.parse_mind(train_dir, val_dir, test_dir, glove_dir, glove_size, negative_sampling_ratio, num_words_title, num_words_abstract, entity_confidence_threshold, word_freq_threshold, entity_freq_threshold, word_embedding_dim, entity_embedding_dim)§

Parse MIND dataset

Parameters:
  • train_dir (str) – Path to train directory

  • val_dir (str) – Path to validation directory

  • test_dir (str) – Path to test directory

  • glove_dir (str) – Path to glove directory

  • glove_size (int) – Glove size

  • negative_sampling_ratio (int) – Negative sampling ratio

  • num_words_title (long) – Number of words in title

  • num_words_abstract (long) – Number of words in abstract

  • entity_confidence_threshold (float) – Confidence threshold of entities

  • word_freq_threshold (float) – Threshold for word frequency

  • entity_freq_threshold (float) – Threshold for entity frequency

  • word_embedding_dim (int) – Word embedding dimension

  • entity_embedding_dim (int) – Entity embedding dimension

Returns:

Tuple (num_users, num_categories, num_words, num_entities) containing number of users, number of categories, number of words, number of entities

src.datamodules.mind.parse.parse_mind_bert(train_dir, val_dir, test_dir, column)§
src.datamodules.mind.parse.parse_news(source, target, category2int_path, word2int_path, entity2int_path, mode, num_words_title, num_words_abstract, entity_confidence_threshold, word_freq_threshold, entity_freq_threshold)§

Parse news for training set and test set

Parameters:
  • source (str) – Source news file

  • target (str) – Target news file

  • category2int_path (str) – Path to category2int file. If mode == ‘train’: Path to save. If mode == ‘test’: Path to load from.

  • word2int_path (str) – Path to word2int file. If mode == ‘train’: Path to save. If mode == ‘test’: Path to load from.

  • entity2int_path (str) – Path to entity2int file. If mode == ‘train’: Path to save. If mode == ‘test’: Path to load from.

  • mode (str) – Either ‘train’ or ‘test’

  • num_words_title (long) – number of words in title

  • num_words_abstract (long) – number of words in abstract

  • entity_confidence_threshold (float) – Confidence threshold of entities

  • word_freq_threshold (float) – Threshold for word frequency

  • entity_freq_threshold (float) – Threshold for entity frequency

src.datamodules.mind.parse.parse_news_bert(source, target, column)§

Parse news for the BERT baseline. Generates a BERT embedding for the text in the news dataframe.

Parameters:
  • source (str) – Source news file

  • target (str) – Target news file

  • column (str) – The text column that will represent the news

Returns:

news_parsed(news_id, text)

Return type:

DataFrame

src.datamodules.mind.parse.transform_entity_embedding(source, target, entity2int_path, entity_embedding_dim)§

Transform entity embedding

Parameters:
  • source (str) – Path of embedding file

  • target (str) – Path of transformed embedding file in numpy format

  • entity2int_path (str) – Path to entity ids file

src.datamodules.mind.preprocessing module§

src.datamodules.mind.preprocessing.create_knowledge_graph_file(model, paths, use_categories, use_subcategories, use_title_entities, use_abstract_entities, use_title_tokens, use_wikidata)§

Creates a news article knowledge graph file including the properties set in the constructor.

Parameters:
  • model (str) – Model name, either “MKR” or “RippleNet”. Necessary for filepath.

  • paths (list) – Contains paths to the train and/or validation and/or test news files.

  • use_categories (boolean) – Whether to use news categories in knowledge graph.

  • use_subcategories (boolean) – Whether to use news subcategories in knowledge graph.

  • use_title_entities (boolean) – Whether to use news title entities in knowledge graph.

  • use_abstract_entities (boolean) – Whether to use news abstract entities in knowledge graph.

  • use_title_tokens (boolean) – Whether to use news title tokens in knowledge graph.

  • use_wikidata (boolean) – Whether to use additional wikidata knowledge graph in knowledge graph.

Returns:

Tuple (path1, path2) where path1 is the path to the file containing the knowledge graph and path2 is the path to the file containing the item index to entity id hashes.
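
Hedged usage sketch: building the knowledge graph file for the MKR preprocessing. The news-file paths are assumptions.

from src.datamodules.mind.preprocessing import create_knowledge_graph_file

kg_path, news_ids_path = create_knowledge_graph_file(
    model="MKR",
    paths=["data/MIND/train/news.tsv", "data/MIND/valid/news.tsv"],  # hypothetical
    use_categories=True,
    use_subcategories=True,
    use_title_entities=True,
    use_abstract_entities=False,
    use_title_tokens=False,
    use_wikidata=False,
)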

src.datamodules.mind.preprocessing.create_rating_file(paths, path_to_news_ids, model)§

Creates a file specifying which news articles have been read by users and which have not

Parameters:
  • paths (list) – List containing paths to behaviours files (train, validation, test)

  • path_to_news_ids (str) – Path to file containing news ids and corresponding item index

  • model (str) – Model name, either “MKR” or “RippleNet”. Necessary for path

src.datamodules.mind.preprocessing.create_rating_file_collaborative(paths, model)§

Creates a file specifying which news articles have been read by users and which have not. Exclusively for the Collaborative Filtering model

Parameters:
  • paths (list) – List containing paths to behaviours files (train, validation, test)

  • model (str) – Model name, either “MKR” or “RippleNet”. Necessary for path

src.datamodules.mind.preprocessing.prepare_numpy_data(path, model)§

Converts the rating.txt file to numpy format

Parameters:
  • path (str) – Path to the ratings.txt file

  • model (str) – Model name, either “MKR” or “RippleNet”. Necessary for path

Returns:

Numpy.ndarray containing the rating data in numpy format

src.datamodules.mind.preprocessing.prepare_numpy_kg(path, model)§

Converts the knowledge graph.txt file to numpy format

Parameters:
  • path (str) – Path to the knowledge graph.txt file

  • model (str) – Model name, either “MKR” or “RippleNet”. Necessary for path

Returns:

numpy.ndarray containing the knowledge graph data in numpy format

Module contents§