update: generation length and datasource by chkla · Pull Request #81 · bigscience-workshop/metadata

chkla · 2021-12-08T00:05:15Z

Hi @ALL, the previous PR had fallen far behind the master branch, so I have updated the branch and it should now be ready to merge into master. The PR includes:

datasource parser (w/o threshold)
datasource preprocessor
datasource tests
length parser (global and local)
length preprocessor
length tests

Best,
Chris

timoschick

I didn't carefully check the details, but from a high-level perspective, this looks good to me 👍

timoschick · 2021-12-17T12:18:42Z

+class GenerationLengthPreprocessor(MetadataPreprocessor):
+    """An exemplary metadata preprocessor for adding generation length information based on text."""
+
+    def __init__(self, mode):


I would use "global" as the default value here
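
A minimal sketch of what this suggestion could look like. The class below is a simplified stand-in without the MetadataPreprocessor base class, and whether "global" is one of the mode names the preprocessor actually accepts is an assumption taken from the comment above.

```python
class GenerationLengthPreprocessor:
    """Simplified stand-in for the class in the diff (no MetadataPreprocessor base)."""

    def __init__(self, mode: str = "global"):
        # Assumption: "global" is a valid mode name, as suggested in the review comment.
        self.mode = mode
```

With a default in place, callers that only want the whole-text length could construct the preprocessor without arguments.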

SaulLu · 2021-12-21T16:36:12Z

+            example_metadata.append(example_length)
+        examples["metadata"] = (
+            [m[0] for m in examples["metadata"]] if self.mode == "sentence" else examples["metadata"]
+        )  # reformatting of nested lists
+


@chkla, reading these lines, I have the impression that we could run into problems if some metadata is already stored in the column (m[0] will raise an error).

If I understand correctly, self._extract_length_from_sentences(example_text) returns a list?

What do you think of doing this instead:

    for example_text, example_metadata in zip(examples["text"], examples["metadata"]):
        if self.mode == "text":
            text_length = self._extract_length_from_text(example_text)
            example_length = {"key": "length", "type": "global", "value": text_length}
            if not example_length:
                continue
            example_metadata.append(example_length)
        elif self.mode == "sentence":
            example_length = self._extract_length_from_sentences(example_text)
            example_metadata.extend(example_length)
        else:
            print("Please select a valid length type [text or sentence].")
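
The proposal above can be tried out as a standalone function. This is only a sketch under assumptions: the two helpers are hypothetical stand-ins for the private methods in the diff (the sentence helper omits character offsets), and raising ValueError for an unknown mode is a swapped-in choice, not part of the suggestion, which prints a message instead.

```python
from typing import Any, Dict, List


def extract_length_from_text(text: str) -> str:
    """Char-based length, returned as a string as in the diff."""
    return str(len(text))


def extract_length_from_sentences(text: str) -> List[Dict[str, Any]]:
    """Hypothetical stand-in: one entry per non-empty '.'-separated sentence."""
    return [
        {"key": "length", "type": "local", "value": extract_length_from_text(sent)}
        for sent in text.split(".")
        if sent.strip()
    ]


def add_length_metadata(examples: Dict[str, List[Any]], mode: str) -> Dict[str, List[Any]]:
    """Add length entries in place without touching metadata that is already there."""
    if mode not in ("text", "sentence"):
        # Assumption: fail loudly instead of printing a message.
        raise ValueError("Please select a valid length type [text or sentence].")
    for example_text, example_metadata in zip(examples["text"], examples["metadata"]):
        if mode == "text":
            # One global entry per example.
            example_metadata.append(
                {"key": "length", "type": "global", "value": extract_length_from_text(example_text)}
            )
        else:
            # Several local entries per example, so extend rather than append.
            example_metadata.extend(extract_length_from_sentences(example_text))
    return examples
```

Because each example's list is only appended to or extended, an example that already carries, say, a url entry keeps it, and no reformatting of nested lists is needed afterwards.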

SaulLu · 2021-12-21T16:38:00Z

+
+        return str(len(text))  # char-based length
+
+    def _extract_length_from_sentences(self, text: str) -> Optional[str]:


Reading the content of the method, I guessed that the output format is a list (and not a str). Do you agree?

SaulLu · 2021-12-21T16:39:15Z

+        len_sentences = [self._extract_length_from_text(sent) for sent in text.split(".")]
+
+        # Iterate through the sentences of a text, storing the absolute beginning and end of each sentence and the associated length of each sentence.
+        for sent_pos, sent_len, i in zip(pos_sentences, len_sentences, range(len(len_sentences))):


Here I have the impression that sent_pos and sent_len are not used. If that is the case, maybe we can simplify this loop a little.
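
One way the loop might be simplified is to consume the zipped values directly instead of carrying an extra index. This is a sketch under assumptions: the loop body is not shown in the diff, so pos_sentences is taken to be a list of (start, end) character offsets, and the char_start_idx / char_end_idx field names are assumed rather than confirmed by this PR.

```python
from typing import Any, Dict, List, Tuple


def build_local_length_entries(
    pos_sentences: List[Tuple[int, int]], len_sentences: List[str]
) -> List[Dict[str, Any]]:
    """Pair each sentence span with its length; no index variable is needed."""
    return [
        {
            "key": "length",
            "type": "local",
            "char_start_idx": start,  # assumed field name
            "char_end_idx": end,  # assumed field name
            "value": sent_len,
        }
        for (start, end), sent_len in zip(pos_sentences, len_sentences)
    ]
```

If the body really only needs the index, iterating over range(len(len_sentences)) alone would be the other straightforward simplification.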

SaulLu · 2021-12-21T16:40:08Z

+
+        for example_url, example_meta in zip(examples["url"], examples["metadata"]):
+            example_datasource = self._extract_datasource_from_url(example_url)
+            print(example_datasource)


I think there was a little oversight here 🙂
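
The comment presumably refers to the print call left in the loop. If the output is still useful while debugging, one option is to route it through the logging module. The sketch below is an assumption-laden illustration: the URL helper is a hypothetical stand-in that uses the host name as the datasource, which may differ from what the PR's _extract_datasource_from_url does.

```python
import logging
from typing import Any, Dict, List, Optional
from urllib.parse import urlparse

logger = logging.getLogger(__name__)


def extract_datasource_from_url(url: str) -> Optional[str]:
    """Hypothetical stand-in: treat the URL's host as the datasource."""
    return urlparse(url).netloc or None


def add_datasource_metadata(examples: Dict[str, List[Any]]) -> Dict[str, List[Any]]:
    for example_url, example_meta in zip(examples["url"], examples["metadata"]):
        example_datasource = extract_datasource_from_url(example_url)
        # Debug-level logging stays silent unless explicitly enabled.
        logger.debug("datasource for %s: %s", example_url, example_datasource)
        if not example_datasource:
            continue
        example_meta.append({"key": "datasource", "type": "global", "value": example_datasource})
    return examples
```

Simply deleting the print line is of course the smaller change.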

chkla and others added 26 commits November 4, 2021 23:50

add datasource processor

11c16b8

add generation length processor

0532df8

add GenerationLengthPreprocessor

3ed077d

add GenerationLengthPreprocessor

82a2528

add DatasourcePreprocessor w/o threshold

2a8c491

add comments + format

241a5a5

add tests for generation length and datasource + format errors

db01514

tests for the DatasourcePreprocessor

86b4a7b

solved var and format issues

c4a45cb

add datautil to ignore

e132b2c

add new processors to dict

b0d4c88

updated GenerationLengthPreprocessor + test

8d6c673

format

eb7e21c

updated style

1d5ca29

updated datasource test

40f967e

fomat issues

de83058

deleted local sys path

6ed8fb6

updated gitignore

818da53

add local length preprocessor + test + dsexample

6ddc1d2

Merge branch 'datasource-example' of https://github.com/chkla/metadata …

fc4b17e

…into datasource-example

format

15018f6

formatting

e814331

Merge branch 'datasource-example'

73f6177

Merge branch 'master' of https://github.com/chkla/metadata

46666c1

delete merge comments

6d86ada

quality check

bc51d1e

timoschick approved these changes Dec 17, 2021

chkla and others added 3 commits December 20, 2021 17:59

Merge branch 'master' into master

a378877

imports

9f21863

style & quality

0e34e7a

style check

8caa6a5

chkla merged commit b56d3b8 into bigscience-workshop:master Dec 20, 2021

SaulLu mentioned this pull request Dec 21, 2021

Revert "update: generation length and datasource" #107

Closed

SaulLu reviewed Dec 21, 2021

SaulLu mentioned this pull request Dec 22, 2021

feat: add a feature to choose where to extract metadata #116

Merged

manandey mentioned this pull request May 4, 2022

Datasource and GenerationLength #56

Closed

chkla merged 30 commits into bigscience-workshop:master from chkla:master
