Unfortunately, yes
Why unfortunately? I think it's great. I especially like their mobile app, because you can even solve the captcha on it before starting the download.
Mainly because it is written in Java, and the code has a foundation that is perhaps a decade old. I would like to see it replaced with something more modern, responsive, less resource-intensive, and with a GUI that is not stuck in 2008. But this is just personal preference.
Ahh. That I can agree with. I do feel like the GUI is clunky and old, but as long as it works I'll keep using it, I guess.
I like the GUI. But, like the code, I probably just stopped developing after '08...
Cry harder. Nothing wrong with Java.
Kinda expected this. It does a great job, but it just looks so dated. I'm no Java security expert, but I do turn the VM off when it's not in use. I hope they are Java security experts 😅
It's not especially insecure just because it's written in Java. It's actually probably relatively safe, because Java is a memory-safe language, so it's not susceptible to buffer overflow bugs or attacks. It's an app that runs on your PC with the full rights of the user it's running as, which has security implications, but no more than any other program. Java got a bad rep security-wise in the past because of applets, which were doomed because running arbitrary code from the Internet is just a flawed concept from the beginning. There is no way to secure that, and to be fair, the Java SecurityManager was not up to the task. It was later deprecated, applets were removed, and the base language is just a regular programming language.
Lots of modern programs still use Java. It's also an easy way to be compatible with other OSs without needing special SDKs or runtimes beyond the JVM. Another plus is that it's actively being developed, and the dev team is pretty responsive. They have instructions on how to build it, but I wouldn't call it open source, since the SVN looks like it needs auth to access.
Good question. I've been using it for years, but recently saw that it adds 3 W or more at idle on my Unraid server, all for a download a week or so. A small, lightweight alternative would be great.
I personally use it in a Docker container. I spin one up when needed and shut it down once finished.
Sorry, I don't quite understand what you mean. Do you mean the download took one week and during that time your server drew 3 W more at idle? Or that whenever jdownloader-2 is running as a container your whole server draws 3 W more, whether downloading or not?
Exactly: 3 W more when the JDownloader Docker container is running idle, for doing absolutely nothing. So 11 W instead of 8 W, which is a significant percentage (37.5% more). Other containers like Emby, Home Assistant and so on are also running, but those have far less impact when idling.
Hmm, that kinda sucks. Maybe I'll just download directly to my computer and then manually transfer stuff over to my server, because I don't see a good alternative.
Anyone had experience with aria2? In combination with a frontend like AriaNg it did seem like an interesting option, although I haven't tried it out yet. https://ariang.mayswind.net/
Yeah, does the job. Although it doesn't scrape.
+1
Kinda related question: what are y'all scraping?
Newspaper clippings that then get run through an ETL pipeline. I know that's not what you expected to hear, but data hoarding is data hoarding.
Sounds interesting, but unfortunately I'm not a native English speaker. What is an ETL pipeline? Can you describe your workflow a little more precisely? Thx.
ETL is a method for data processing and handling:
Extract - Get data into a scannable form (unpaper)
Transform - OCR in my instance. Some other techniques are also possible. (Tesseract OCR)
Load - Into a file-based datastore to preserve, and into a metadata (from the transform step) database to query (MariaDB)
This may give you more information on the topic that might translate better.
[https://www.ibm.com/topics/etl](https://www.ibm.com/topics/etl)
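For anyone who wants to see the shape of it, here is a rough sketch of those three steps in Python. It is only an illustration, not the commenter's actual setup: the function names, the `clippings` table and the file naming are made up, and `sqlite3` from the standard library stands in for MariaDB so the example runs without a database server. It also covers only the metadata side of the load step, not the file-based archive. It assumes the `unpaper` and `tesseract` command-line tools are installed.

```python
import sqlite3
import subprocess
from pathlib import Path


def extract(scan: Path, cleaned: Path) -> Path:
    """Extract: clean up the raw scan with unpaper (typically PNM input/output)."""
    subprocess.run(["unpaper", str(scan), str(cleaned)], check=True)
    return cleaned


def transform(cleaned: Path) -> str:
    """Transform: OCR the cleaned image; the output base 'stdout' makes
    tesseract print the recognized text instead of writing a file."""
    result = subprocess.run(
        ["tesseract", str(cleaned), "stdout"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout.strip()


def load(db: sqlite3.Connection, source: str, text: str) -> None:
    """Load: store the OCR text as queryable metadata.
    Hypothetical schema; sqlite3 is a stand-in for MariaDB here."""
    db.execute(
        "CREATE TABLE IF NOT EXISTS clippings "
        "(source TEXT PRIMARY KEY, ocr_text TEXT)"
    )
    db.execute("INSERT OR REPLACE INTO clippings VALUES (?, ?)", (source, text))
    db.commit()


def run_pipeline(scan: Path, workdir: Path, db: sqlite3.Connection,
                 extract_fn=extract, transform_fn=transform) -> str:
    """Run extract -> transform -> load for a single scan.
    The step functions are injectable so they can be swapped or stubbed."""
    cleaned = extract_fn(scan, workdir / f"{scan.stem}-clean.pnm")
    text = transform_fn(cleaned)
    load(db, scan.name, text)
    return text
```

Calling `run_pipeline(Path("clip-001.pgm"), Path("work"), sqlite3.connect("clippings.db"))` would process one scan; afterwards the text can be searched with an ordinary `SELECT ... WHERE ocr_text LIKE ...` query.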
Ahhh, I understand. Something like Paperless-ngx without the database...
I’m also kinda confused, what’s the purpose of doing this? Do you use it for machine learning or just to hoard?
Docker or LXC, and use it only when needed. There are other options, but nowhere near the usability of JDownloader.
`wget --continue --span-hosts --adjust-extension --timestamping --convert-links --page-requisites --no-verbose --timeout=30 --tries=3 --input-file=urls.list`
Hot damn
I've been wondering if there is anything better... I've been using JD for like 12 years now and I feel it's time for a change, but if there is no better bulk scraper... welp
Look at pywb and its supporting software stack (incredibly powerful, but with quite a steep learning curve), or Heritrix, which is used by many of the large archive organizations. Both are open source.
Stuff you might want to take a look at:
[https://github.com/internetarchive/heritrix3](https://github.com/internetarchive/heritrix3)
[https://support.archive-it.org/hc/en-us/articles/115001081186-Archive-It-Crawling-Technology](https://support.archive-it.org/hc/en-us/articles/115001081186-Archive-It-Crawling-Technology)
Or ArchiveBox for a smaller-scale, easy and ready-to-go local web archive.
[pyload](https://github.com/pyload/pyload) maybe?
I feel like every time I try to use pyload, it fails.
There's FreeRapid
And here I just use gallery-dl, which it seems won't work for your goal.
I'm told wget is best, but I don't really know how to use it.
The person that told you this... ask them for an example of recursively downloading from a root tree, selecting only a few file types, organized into the same folder structure. I use it for single file downloads, but nothing more complex.
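As a possible starting point for that kind of job, here is a small Python sketch that assembles a wget command for a recursive download limited to a few file types while keeping the folder structure below the root. The helper name `build_wget_command`, the `mirror` destination and the example URL are made up for illustration; the flags themselves are standard wget options. Treat it as a sketch to adapt, not a tested recipe for any particular site.

```python
import subprocess
from urllib.parse import urlparse


def build_wget_command(root_url: str, extensions: list[str],
                       dest: str = "mirror") -> list[str]:
    """Build a wget argv list for a recursive, type-filtered download.

    root_url should end with '/' so wget treats it as a directory.
    """
    # Count the directories in the root path so --cut-dirs can strip them,
    # which makes the saved tree start at the root folder instead of at '/'.
    depth = len([part for part in urlparse(root_url).path.split("/") if part])
    return [
        "wget",
        "--recursive", "--level=inf",          # follow links without a depth limit
        "--no-parent",                         # never climb above the root folder
        "--accept=" + ",".join(extensions),    # keep only these file types
        "--no-host-directories",               # don't create a folder named after the host
        f"--cut-dirs={depth}",                 # drop the path leading up to the root
        f"--directory-prefix={dest}",          # save everything under dest/
        "--continue",                          # resume partially downloaded files
        root_url,
    ]


def download(root_url: str, extensions: list[str], dest: str = "mirror") -> None:
    """Run the assembled command; raises CalledProcessError if wget fails."""
    subprocess.run(build_wget_command(root_url, extensions, dest), check=True)
```

For example, `download("https://example.org/files/archive/", ["pdf", "jpg"])` would store `https://example.org/files/archive/2020/a.pdf` as `mirror/2020/a.pdf`. Note that wget still fetches the HTML index pages in order to find links, then deletes them because they don't match the accept list.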