update: generation length and datasource#81
Conversation
…into datasource-example
timoschick
left a comment
There was a problem hiding this comment.
I didn't carefully check the details, but from a high-level perspective, this looks good to me 👍
| class GenerationLengthPreprocessor(MetadataPreprocessor): | ||
| """An exemplary metadata preprocessor for adding generation length information based on text.""" | ||
|
|
||
| def __init__(self, mode): |
There was a problem hiding this comment.
I would use "global" as the default value here
| example_metadata.append(example_length) | ||
| examples["metadata"] = ( | ||
| [m[0] for m in examples["metadata"]] if self.mode == "sentence" else examples["metadata"] | ||
| ) # reformatting of nested lists | ||
|
|
There was a problem hiding this comment.
@chkla , reading these lines, I have the impression that we can have some problems if there is already some metadata stored in the column (m[0] will return an error).
If I understand correctly, self._extract_length_from_sentences(example_text) will return a list ?
What do you think of doing instead:
for example_text, example_metadata in zip(examples["text"], examples["metadata"]):
if self.mode == "text":
text_length = self._extract_length_from_text(example_text)
example_length = {"key": "length", "type": "global", "value": text_length}
if not example_length:
continue
example_metadata.append(example_length)
elif self.mode == "sentence":
example_length = self._extract_length_from_sentences(example_text)
example_metadata.extend(example_length)
else:
print("Please select a valid length type [text or sentence].")
|
|
||
| return str(len(text)) # char-based length | ||
|
|
||
| def _extract_length_from_sentences(self, text: str) -> Optional[str]: |
There was a problem hiding this comment.
Reading the content of the method, I guessed that the output format is a list (and not a str). Are you agree?
| len_sentences = [self._extract_length_from_text(sent) for sent in text.split(".")] | ||
|
|
||
| # Iterate through the sentences of a text, storing the absolute beginning and end of each sentence and the associated length of each sentence. | ||
| for sent_pos, sent_len, i in zip(pos_sentences, len_sentences, range(len(len_sentences))): |
There was a problem hiding this comment.
here I have the impression that sent_pos and sent_len are not used. If this is the case, maybe we can simply a little bit this loop.
|
|
||
| for example_url, example_meta in zip(examples["url"], examples["metadata"]): | ||
| example_datasource = self._extract_datasource_from_url(example_url) | ||
| print(example_datasource) |
There was a problem hiding this comment.
I think there was a little oversight here 🙂
Hi @ALL, the last PR was now far behind the master branch, so I have updated the branch and it should be ready for the master branch. The PR includes:
Best,
Chris