Datasource and GenerationLength by chkla · Pull Request #56 · bigscience-workshop/metadata

chkla · 2021-11-05T15:19:24Z

@timoschick my PR adds some new basic functions for processing the metadata types DATASOURCE and LENGTH:

datasource parser (w/o threshold)
datasource preprocessor
datasource tests
length parser (global)
length preprocessor
length tests

changjonathanc

This is not a full review. I still need some time to understand the Submodule diff.

timoschick

Hi @chkla, thanks for your work 👍 This looks good to me overall, I've added a few comments.

timoschick · 2021-11-19T13:56:47Z

+        # We represent the DATASOURCE by using meaningful information of the URL.
+        # URL: http://www.example.de/2015/forum/article/21-new-project
+        # Example: example.de forum article new project
+        return "".join(["Datasource", self.cfg.metadata_key_value_sep, metadata_attrs["value"]])


Based on the preprocessor logic, shouldn't the example rather be example.de > forum > article > new project?

chkla · 2021-11-21

✅ You're right, I've updated the example.
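To make the corrected example concrete, here is a minimal sketch of how a URL could be turned into the `example.de > forum > article > new project` form. The function name `url_to_datasource` and the exact cleaning rules (dropping a leading `www.`, splitting path segments on hyphens and underscores, discarding purely numeric tokens) are assumptions made for illustration; the actual `DatasourcePreprocessor` logic is not shown in this thread and may differ.

```python
import re
from typing import List
from urllib.parse import urlparse


def url_to_datasource(url: str) -> str:
    """Hypothetical helper: build a readable datasource string from a URL."""
    parsed = urlparse(url)

    # Assumption: a leading "www." carries no useful information.
    domain = parsed.netloc
    if domain.startswith("www."):
        domain = domain[len("www."):]

    parts: List[str] = [domain]
    for segment in parsed.path.split("/"):
        # Split each path segment into words and drop purely numeric tokens
        # such as years ("2015") or article ids ("21").
        words = [w for w in re.split(r"[-_]", segment) if w and not w.isdigit()]
        if words:
            parts.append(" ".join(words))

    # Join the hierarchy levels with " > ", as in the reviewer's example.
    return " > ".join(parts)


print(url_to_datasource("http://www.example.de/2015/forum/article/21-new-project"))
# example.de > forum > article > new project
```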

timoschick · 2021-11-19T13:57:48Z

+        # We represent the length of a text by the number of characters.
+        # Example: Length: 123
+
+        return "".join(["Text Length", self.cfg.metadata_key_value_sep, metadata_attrs["value"]])


Does the model always get the exact number? I'm not sure if we've discussed this already, but wouldn't it make sense to have some kind of bucketing (e.g., using powers of 2)?

And another quick thought: Did we decide to have generation length only at a global level? It might be useful to view this as a sentence-level (or <p> / <div> level, if we have HTML tags) kind of metadata.

chkla · 2021-11-21

> And another quick thought: Did we decide to have generation length only at a global level? It might be useful to view this as a sentence-level (or <p> / <div> level, if we have HTML tags) kind of metadata.

✅ Yeah, the local level was on my to-do list, and I've now added a new function to the length preprocessor that does this at both the local (sentence-based) and the global (text-based) level. We could also think about a more sophisticated splitting method than just using periods ("dots") to identify the relevant parts of a text.
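As a rough illustration of the global versus local distinction discussed here, the following sketch computes one character count for the whole text and one per sentence, splitting naively on periods. The function names are hypothetical and the code is not taken from the PR; the period-based split is exactly the simple heuristic the reply describes as a candidate for later improvement.

```python
from typing import List, Tuple


def global_length(text: str) -> int:
    """Global (text-based) length: number of characters in the whole text."""
    return len(text)


def sentence_lengths(text: str) -> List[Tuple[str, int]]:
    """Local (sentence-based) lengths, using a naive split on periods.

    Assumption: a period always ends a sentence. Abbreviations, decimal
    numbers, and other punctuation are not handled.
    """
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    return [(s, len(s)) for s in sentences]


text = "Hello world. Bye."
print(global_length(text))     # 17
print(sentence_lengths(text))  # [('Hello world', 11), ('Bye', 3)]
```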

chkla · 2021-11-21

> Does the model always get the exact number? I'm not sure if we've discussed this already, but wouldn't it make sense to have some kind of bucketing (e.g., using powers of 2)?

🚧 If I remember correctly, we talked about alternative categories (e.g. small, medium, large), but not about a specific mapping from the number of characters to a category. So I think we decided to continue with the exact number of characters, but I would also be a fan of more abstract categories/buckets.
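One possible reading of the power-of-2 bucketing suggestion is to round each character count up to the next power of two, so the model sees a small set of distinct values instead of every exact number. This is only a sketch of that idea: the names `length_bucket` and `format_length`, the choice to round up rather than down, and the `": "` separator (standing in for `cfg.metadata_key_value_sep`) are all assumptions, not something the thread settled on.

```python
def length_bucket(num_chars: int) -> int:
    """Round a character count up to the nearest power of two."""
    if num_chars < 0:
        raise ValueError("num_chars must be non-negative")
    if num_chars <= 1:
        return 1
    # (n - 1).bit_length() is the exponent of the smallest power of two >= n.
    return 1 << (num_chars - 1).bit_length()


def format_length(num_chars: int, sep: str = ": ") -> str:
    """Hypothetical formatter mirroring the 'Text Length' prefix in the diff."""
    return f"Text Length{sep}{length_bucket(num_chars)}"


print(length_bucket(123))  # 128
print(length_bucket(128))  # 128
print(length_bucket(129))  # 256
print(format_length(123))  # Text Length: 128
```

A coarser scheme such as small/medium/large could reuse the same structure by mapping the bucket value to a label.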

timoschick · 2021-11-19T14:08:21Z

+class DatasourcePreprocessor(MetadataPreprocessor):
+    """An exemplary metadata preprocessor for adding datasource information based on URLs."""
+
+    def _check_numbers(self, sub_part):


I think readability could be improved here by adding type hints (_check_numbers(self, sub_part: List[str]) -> List[str]) but I guess that's a matter of taste. At first, I thought that sub_part was a str and was confused why you would only remove the first/last character.

chkla · 2021-11-21

✅ Good point, I like the idea. I've updated the function accordingly. Thanks a lot!
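For illustration, a type-hinted version along the lines the reviewer suggests might look like the sketch below. The body of `_check_numbers` is not shown in this thread, so the behavior here (dropping a purely numeric first or last token from a list of strings) is a guess based on the remark about removing the first/last element, and it is written as a standalone function rather than a method.

```python
from typing import List


def check_numbers(sub_part: List[str]) -> List[str]:
    """Hypothetical stand-in for `_check_numbers`.

    Takes the tokens of one URL path segment and drops a leading and/or
    trailing token if it consists only of digits. The type hints make it
    clear that `sub_part` is a list of tokens, not a single string.
    """
    if sub_part and sub_part[0].isdigit():
        sub_part = sub_part[1:]
    if sub_part and sub_part[-1].isdigit():
        sub_part = sub_part[:-1]
    return sub_part


print(check_numbers(["21", "new", "project"]))  # ['new', 'project']
```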

manandey · 2022-05-04T03:58:13Z

Merged in #81.

chkla and others added 16 commits November 4, 2021 23:50

add datasource processor

11c16b8

add generation length processor

0532df8

add GenerationLengthPreprocessor

3ed077d

add GenerationLengthPreprocessor

82a2528

add DatasourcePreprocessor w/o threshold

2a8c491

add comments + format

241a5a5

add tests for generation length and datasource + format errors

db01514

tests for the DatasourcePreprocessor

86b4a7b

solved var and format issues

c4a45cb

add datautil to ignore

e132b2c

add new processors to dict

b0d4c88

updated GenerationLengthPreprocessor + test

8d6c673

format

eb7e21c

updated style

1d5ca29

updated datasource test

40f967e

fomat issues

de83058

chkla changed the title ~~Datasource example~~ Datasource and GenerationLength Nov 17, 2021

changjonathanc reviewed Nov 18, 2021

Comment thread .gitignore Outdated

Comment thread bsmetadata/preprocessing_utils.py Outdated

chkla and others added 2 commits November 18, 2021 18:42

deleted local sys path

6ed8fb6

updated gitignore

818da53

chkla marked this pull request as draft November 18, 2021 18:05

timoschick reviewed Nov 19, 2021

chkla added 3 commits November 21, 2021 13:39

add local length preprocessor + test + dsexample

6ddc1d2

Merge branch 'datasource-example' of https://github.com/chkla/metadata …

fc4b17e

…into datasource-example

format

15018f6

chkla marked this pull request as ready for review November 26, 2021 12:38

chkla added 2 commits December 7, 2021 14:02

formatting

e814331

add experiment files

32bae98

manandey closed this May 4, 2022

chkla wants to merge 23 commits into bigscience-workshop:master from chkla:datasource-example

4 participants