Datasource and GenerationLength#56
Conversation
changjonathanc
left a comment
There was a problem hiding this comment.
This is not a full review. I still need some time to understand the Submodule diff.
timoschick
left a comment
There was a problem hiding this comment.
Hi @chkla, thanks for your work 👍 This looks good to me overall, I've added a few comments.
| # We represent the DATASOURCE by using meaningful information of the URL. | ||
| # URL: http://www.example.de/2015/forum/article/21-new-project | ||
| # Example: example.de forum article new project | ||
| return "".join(["Datasource", self.cfg.metadata_key_value_sep, metadata_attrs["value"]]) |
There was a problem hiding this comment.
Based on the preprocessor logic, shouldn't the example rather be example.de > forum > article > new project?
There was a problem hiding this comment.
✅ you're right, I've updated the example
| # We represent the length of a text by the number of characters. | ||
| # Example: Length: 123 | ||
|
|
||
| return "".join(["Text Length", self.cfg.metadata_key_value_sep, metadata_attrs["value"]]) |
There was a problem hiding this comment.
Does the model always get the exact number? I'm not sure if we've discussed this already, but wouldn't it make sense to have some kind of bucketing (e.g., using powers of 2)?
There was a problem hiding this comment.
And another quick thought: Did we decide to have generation length only at a global level? It might be useful to view this as a sentence-level (or <p> / <div> level, if we have HTML tags) kind of metadata.
There was a problem hiding this comment.
And another quick thought: Did we decide to have generation length only at a global level? It might be useful to view this as a sentence-level (or
<p>/<div>level, if we have HTML tags) kind of metadata.
✅ yeah, the local level was on my to-do list, and I've now added a new function to the length preprocessor to do this at the local (sentence-based) and global (text-based) level. but we can also think of a more complex splitting method than just using „dots“ to identify the relevant „parts" of a text.
There was a problem hiding this comment.
Does the model always get the exact number? I'm not sure if we've discussed this already, but wouldn't it make sense to have some kind of bucketing (e.g., using powers of 2)?
🚧 If I remember correctly, we talked about alternative categories, e.g. small, medium, large, but not about a specific mapping between the number of characters and a category or so. So I think we decided to continue with the specific number of characters, but I would be also a fan of more abstract categories/ buckets as well.
| class DatasourcePreprocessor(MetadataPreprocessor): | ||
| """An exemplary metadata preprocessor for adding datasource information based on URLs.""" | ||
|
|
||
| def _check_numbers(self, sub_part): |
There was a problem hiding this comment.
I think readability could be improved here by adding type hints (_check_numbers(self, sub_part: List[str]) -> List[str]) but I guess that's a matter of taste. At first, I thought that sub_part was a str and was confused why you would only remove the first/last character.
There was a problem hiding this comment.
✅ good point, I like the idea. I've updated the function in this regard. thanks a lot
|
Merged in #81. |
@timoschick my PR adds some new basic functions for processing the metadata types DATASOURCE and LENGTH: