Skip to content

Datasource and GenerationLength#56

Closed
chkla wants to merge 23 commits into
bigscience-workshop:masterfrom
chkla:datasource-example
Closed

Datasource and GenerationLength#56
chkla wants to merge 23 commits into
bigscience-workshop:masterfrom
chkla:datasource-example

Conversation

@chkla
Copy link
Copy Markdown
Collaborator

@chkla chkla commented Nov 5, 2021

@timoschick my PR adds some new basic functions for processing the metadata types DATASOURCE and LENGTH:

  • datasource parser (w/o threshold)
  • datasource preprocessor
  • datasource tests
  • length parser (global)
  • length preprocessor
  • length tests

@chkla chkla changed the title Datasource example Datasource and GenerationLength Nov 17, 2021
Copy link
Copy Markdown
Collaborator

@changjonathanc changjonathanc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not a full review. I still need some time to understand the Submodule diff.

Comment thread .gitignore Outdated
Comment thread bsmetadata/preprocessing_utils.py Outdated
@chkla chkla marked this pull request as draft November 18, 2021 18:05
Copy link
Copy Markdown
Contributor

@timoschick timoschick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @chkla, thanks for your work 👍 This looks good to me overall, I've added a few comments.

# We represent the DATASOURCE by using meaningful information of the URL.
# URL: http://www.example.de/2015/forum/article/21-new-project
# Example: example.de forum article new project
return "".join(["Datasource", self.cfg.metadata_key_value_sep, metadata_attrs["value"]])
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on the preprocessor logic, shouldn't the example rather be example.de > forum > article > new project?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ you're right, I've updated the example

# We represent the length of a text by the number of characters.
# Example: Length: 123

return "".join(["Text Length", self.cfg.metadata_key_value_sep, metadata_attrs["value"]])
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does the model always get the exact number? I'm not sure if we've discussed this already, but wouldn't it make sense to have some kind of bucketing (e.g., using powers of 2)?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And another quick thought: Did we decide to have generation length only at a global level? It might be useful to view this as a sentence-level (or <p> / <div> level, if we have HTML tags) kind of metadata.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And another quick thought: Did we decide to have generation length only at a global level? It might be useful to view this as a sentence-level (or <p> / <div> level, if we have HTML tags) kind of metadata.

✅ yeah, the local level was on my to-do list, and I've now added a new function to the length preprocessor to do this at the local (sentence-based) and global (text-based) level. but we can also think of a more complex splitting method than just using „dots“ to identify the relevant „parts" of a text.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does the model always get the exact number? I'm not sure if we've discussed this already, but wouldn't it make sense to have some kind of bucketing (e.g., using powers of 2)?

🚧 If I remember correctly, we talked about alternative categories, e.g. small, medium, large, but not about a specific mapping between the number of characters and a category or so. So I think we decided to continue with the specific number of characters, but I would be also a fan of more abstract categories/ buckets as well.

Comment thread bsmetadata/preprocessing_utils.py Outdated
class DatasourcePreprocessor(MetadataPreprocessor):
"""An exemplary metadata preprocessor for adding datasource information based on URLs."""

def _check_numbers(self, sub_part):
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think readability could be improved here by adding type hints (_check_numbers(self, sub_part: List[str]) -> List[str]) but I guess that's a matter of taste. At first, I thought that sub_part was a str and was confused why you would only remove the first/last character.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ good point, I like the idea. I've updated the function in this regard. thanks a lot

@chkla chkla marked this pull request as ready for review November 26, 2021 12:38
@manandey
Copy link
Copy Markdown
Member

manandey commented May 4, 2022

Merged in #81.

@manandey manandey closed this May 4, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants