Question thread #62
It's time for another question thread!
The rules:
- You may ask any dev-related question you have in a comment. (It doesn't even need to be about Dreamwidth, although if it involves a language/library/framework/database Dreamwidth doesn't use, you will probably get answers pointing that out and suggesting a better place to ask.)
- You may also answer any question, using the guidelines given in To Answer, Or Not To Answer and in this comment thread.
enhanced Livejournal import
I don't know how to ask this question without a really long introduction, but if you want to save some reading, skip to the final paragraph once you get the idea.
I have a Livejournal with almost 8000 entries, over 1000 pictures, and 5000 comments received. Frequently I link one entry to another when referring to things I wrote about in past messages (and sometimes I even revise past messages to link later messages).
For example, a link like http://steve98052.livejournal.com/999.html embedded in a later entry.
Those intra-journal links are a mess for importing from Livejournal to Dreamwidth, because message serial numbers (the "999" in "999.html" above, in case that's not the standard term for those numbers) aren't necessarily going to be the same on different journal hosts.
The simple answer is that the intra-journal links stay as-is, linking the original journal rather than within the new journal, but that sends readers off to Russia. Not ideal.
My thought about how to make it work was to build a tool that opens an entry, finds intra-journal links, and inserts an HTML data-link-conversion="(value)" attribute into the <a href="(URL)"> tag. Then, after an import, the tool could find the data-link-conversion attributes and modify the URL to match the new host of the imported journal.
Note: the data-link-conversion tag is a user-defined attribute; search data-* attribute if it's unfamiliar.
My thought is that the attribute value of the data-link-conversion attribute could be either the year-month-date-hour:minute or the year-month-date-serial of the destination journal entry, or maybe even year-month-date-hour:minute-serial for error checking.
Of course, using a message's position in the calendar breaks if an edit to a message changes its calendar position without correcting all data-link-conversion attributes that point to it, which is messy business. Best answer there is, "Don't do that!" More precisely, it's probably best to strip the data-link-conversion attributes from both source and destination journal versions once the import and link conversion is finished.
So, finally, my questions are: 1. Is this a problem that anyone else sees as a problem? 2. Does my solution outline look like a good way to re-link intra-journal links? 3. Is anyone else interested in teaming up on such a tool? 4. Where should I look for code to use as a starting point for such a tool?
Re: enhanced Livejournal import
Are you thinking this would be an onsite tool or a downloadable/scripted/third party tool? We do store the original LJ URL on imported entries, so I can think of a couple of much simpler ways to do it onsite, actually. I'm probably missing some but here's what comes immediately to mind:
1) as an addition to the on-site import tool, at time of import, add a final step that will kick off a worker to scan through each entry and look for links to the journal that was imported from; populate a table of LJ<->DW entry URLs (the data exists but only searchable one way, I think), then go back and replace every LJ link in the journal to the corresponding DW link. Advantage: once-and-done. Disadvantage: see above re: complicating the importer, has the potential to be edge-casey as all get out (you'd be surprised at the horrible things people do to an innocent <a href=> in practice), the failure mode could result in data loss, did I mention we really REALLY don't want to add any other features to the importer that we'd have to support (and in fact have been discussing dropping support for a few of the more support-load-generating features), will make imports take longer since they'll be scanning every journal for internal links even if the journal doesn't have 'em.
2) Create a separate page that lets people push a button to launch a worker that scans through each entry and looks for links to the journal that was imported from; populate a table of LJ<->DW entry URLs, then add a thing to the HTML cleaner that says "look for whether anything was ever imported to this journal; if it was, and you encounter a link leading to the import source, check the correspondence tables and rewrite the link at pageload". Advantage: doesn't complicate the importer, only happens if people ask the site to do it, and if there's an edge-case malformed HTML tag in the entry it will preserve the original, which you can correct if the post is edited. Disadvantage: higher load and possible performance issues, since you're repeating the on-the-fly rewrite every time someone loads an entry.
3) Blah blah worker that will populate a table of LJ<->DW URLs as above; the HTML cleaner permanently rewrites links in the entry the first time an entry is accessed after the URL-mapping worker is done. Advantage: once-and-done. Disadvantage: higher load, and the failure mode risks data loss when your regex encounters the edge-case malformed HTML.
4) Blah blah worker that will populate a table of LJ<->DW URLs as above. If a journal contains anything that's been imported and the user ticks the "replace LJ links with DW links" button or whatever it's called, at pageload any link to the original journal source gets changed to https://username.dreamwidth.org/redirect?orig=http://username.livejournal.com/999.html instead, and the lookup doesn't happen until someone clicks the link (the same way previous/next entries are calculated). Advantage: the data is only looked up if it's needed, not on every load. Disadvantage: once again, the risk of "your regex meets edge-case data and cries".
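The option-4 lookup might be sketched like this (the /redirect endpoint and its orig parameter follow the illustrative URL above; the correspondence table is assumed to already exist):

```javascript
// Option 4 sketch: only resolve the LJ->DW mapping when the redirect
// link is actually clicked. urlMap is the LJ<->DW correspondence table.
function resolveRedirect(redirectUrl, urlMap) {
  const orig = new URL(redirectUrl).searchParams.get("orig");
  // Fall back to the original LJ URL if no mapping exists.
  return urlMap.get(orig) || orig;
}
```

The fallback matters: entries that were skipped or failed during import simply keep sending readers to the original LJ URL rather than to a broken link.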
Of all those options, I would wager 4 is most likely the only approach that would be workable.
If you're thinking of third-party/downloadable tool, there still *might* be an easier way of doing it than with attributes, although I'm not familiar enough with attributes to say which would work better/faster. To explain what I'm thinking of, lemme quickly cover how URLs are formed:
* Each entry in a journal gets a unique 'jitemid' (journal item ID). It's the next number in sequence of the count of entries in your journal, with the caveat that if you post and then delete an entry, its jitemid is never reused. (So, if I have 10 entries in my journal, then post an 11th, the jitemid of that new entry is 11. If I delete jitemid 11, the next entry I post will be jitemid 12 and the entries will go 9, 10, 12.)
* But! If we used those numbers in the URLs or showed them in journals or whatever, people looking at my journal in the above example would immediately be able to say "oho! Denise either deleted entry 11, or it's a private or filtered entry that I can't see!" Lesson learned painfully in the good old days of LJ: people get really weird if they know there's content their friend has posted that they can't see. So we add a "fuzz factor" by generating a random number from 0 to 255 (called the 'anum'; I don't remember what it stands for).
* The jitemid and the anum together are used to form the entry's itemid, which is what you're calling the message serial numbers here. The itemid of an entry is formed by the following formula:
($jitemid * 256) + $anum = $itemid
* Soooo, given an entry with a URL of https://dw-dev.dreamwidth.org/205028.html, you can reconstruct that the itemid is 205028, and you can reconstruct the jitemid by dividing the 205028 by 256 and getting 800.890625 (so the jitemid is 800, meaning this is the 800th entry ever posted to dw-dev including entries that have been deleted) and you can reconstruct the anum by multiplying 800*256=204800 and subtracting that from the itemid (205028-204800 = 228).
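That decoding can be sketched in a few lines of Javascript (function and variable names are mine, not Dreamwidth's):

```javascript
// Decode a public itemid (the "205028" in ".../205028.html") into its
// internal jitemid and anum, per $itemid = ($jitemid * 256) + $anum.
function decodeItemid(itemid) {
  const jitemid = Math.floor(itemid / 256); // entry's sequence number
  const anum = itemid % 256;                // random 0-255 "fuzz factor"
  return { jitemid, anum };
}

console.log(decodeItemid(205028)); // { jitemid: 800, anum: 228 }
```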
ALL THAT HAVING BEEN INTRODUCED:
I honestly don't remember off the top of my head if the importer makes sure the DW version of an imported journal keeps the same itemid count as the LJ version. If it does (and I reiterate that I'm not sure whether it does), the following will be easy; if it doesn't, the following will be more complex, but still doable.
-- Connect to your LJ account, download its contents (using the old LJ version of jbackup.pl, which I can't guarantee still works on LJ because they closed their source in 2014), store the data locally
-- Connect to your DW account, download its contents (using the DW version of jbackup.pl, which may also work on LJ if the old LJ one doesn't work for you anymore), store the data locally
----- Easy mode (jitemids are the same): looking up each jitemid in your exported data from each site, make a table of which URL on LJ corresponds to which URL on DW by finding the itemid for each jitemid
----- Harder mode (jitemids aren't the same): Compare contents across the two, using things like the jitemid, anum, itemid, subject line, user-provided time, calculations of the various identifiers as described above etc, to figure out which entries are the same (don't use server time; the server time on the DW side will be the time the import ran, not the time the original LJ post was made). (You will probably want to have your script keep count of how many jitemids were skipped and where, in case you have to manually intervene later if something goes horribly awry.) Make a table of which URL on LJ corresponds to which URL on DW once you're sure you've got it right.
-- Once you've identified which LJ post goes with which DW post, iterate over your DW exported data and find all your DW entries that contain links to http://steve98052.livejournal.com (by DW entry)
-- For each DW entry that contains links to http://steve98052.livejournal.com, consult your correspondence table and identify what the URL of that entry is on DW
-- Either have the script spit out a long list of DW entries for you to manually edit and change the links on (easier, doesn't require you to write an interface to DW, more about that in a sec) or teach the script how to ask DW to edit a particular entry and have it do the edits itself.
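The easy-mode steps above might be sketched like this (the entry-record shape here is invented for illustration; real jbackup.pl output would need parsing first, and plain string replacement is no substitute for real HTML parsing):

```javascript
// Build an LJ->DW URL table from two exports that share jitemids,
// using itemid = jitemid * 256 + anum to reconstruct each URL.
function buildUrlMap(ljEntries, dwEntries, ljBase, dwBase) {
  const dwByJitemid = new Map(dwEntries.map(e => [e.jitemid, e]));
  const urlMap = new Map();
  for (const lj of ljEntries) {
    const dw = dwByJitemid.get(lj.jitemid);
    if (!dw) continue; // no counterpart on DW: leave those links alone
    urlMap.set(
      `${ljBase}/${lj.jitemid * 256 + lj.anum}.html`,
      `${dwBase}/${dw.jitemid * 256 + dw.anum}.html`
    );
  }
  return urlMap;
}

// Naive string replacement; a real tool should parse the HTML instead,
// to survive the malformed <a href=> edge cases mentioned above.
function rewriteLinks(body, urlMap) {
  for (const [ljUrl, dwUrl] of urlMap) body = body.split(ljUrl).join(dwUrl);
  return body;
}
```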
*pant pant pant*
OKAY COMING TO THE LAST PIECE I SWEAR but I'm gonna post this first because I bet I'm running up against the limit
Re: enhanced Livejournal import
First, a minor point: I read some Livejournal code years ago, and I seem to recall the itemid generation being slightly different: I thought the new itemid was the highest existing itemid (even if deleted) plus 1 to 256 (possibly implemented as something like a next-item-id plus 0 to 255). If it were jitemid*256 plus random, one could derive the jitemid by dividing the itemid by 256 and discarding the fraction.
In the case of the serial number in the "year-month-date-serial" I mentioned, it's something in the portion of the client interface that I read and understood: one way that the server presents messages to clients is by year-month-date plus a serial number, or alternatively, flags that say whether it's the first, last, or neither entry for a given date.
I agree that importing an entire journal, searching for intra-journal links and building an associative array, and then generating a list would be fewer steps than the way I imagined. But I was interested in a design that would allow me to examine only a user-specified selection of entries, whether it's a single entry, a date range, or the whole works.
Of course there's a need for all sorts of bullet-proofing. And anything that would be designated as an official Dreamwidth tool would need to be designed to minimize support load from users. The wider the intended audience for such a tool, the better the error messages would have to be.
If you want your script to do the edits for you, it will need to be a DW client. For something to be a DW client, it will need to understand the DW API. The caveat: the current (version 1.0) version of the API is... well, I'll be charitable and call it "outdated". (By which I mean, 'I don't think it was created in this century'.) It's XML-RPC, it has not aged well, and it's not really documented (because it's so old we're a little ashamed of it and we've been trying to write a replacement API that, you know, isn't old enough to vote.)
https://github.com/dreamwidth/dw-free/blob/develop/src/xmlrpc-client/journal.pl is a very old and not very featureful command-line client Brad used to use that will give you some ideas, and https://github.com/dreamwidth/dw-free/blob/develop/src/jbackup/jbackup.pl is relevant here again because it's also a client (it just isn't geared to pulling up specific entries for edits). I don't THINK we have any sample code for interacting-with-already-posted-entries clients but https://www.livejournal.com/doc/server/ljp.csp.protocol.html is the old LJ server docs for the XML-RPC API and we haven't changed THAT much. I think dw-free/cgi-bin/LJ/Protocol.pm is where the magic happens if you need to consult it.
(I am a lot fuzzier on that last bit than on the stuff in the other comment; if you get stuck anywhere, it's best to make a new top-level dw-dev post and hope someone who knows more than I do sees it.)
Re: enhanced Livejournal import
My thought, based on the implementation that I proposed, was to build it as a client, because things like the data-link-conversion tag I imagined would require mucking with entries on both the source and destination journals.
I see that XML-RPC was the latest and greatest technology back when Livejournal went live, but yeah, in software years, it's antiquated. (And "old enough to vote" gave me a giggle.) But although it's an archaic interface, it's not a horribly complicated one. On the other hand, my Perl skills are negligible, so it might not be something I'd know how to do. On the other-other hand, if the client to do the job is substantial, learning Perl properly might be easier than coding the whole works in Java.
I am sort of envisioning being able to approve a small batch, or something.
SVG images
(Anonymous) 2018-03-28 08:22 pm (UTC)
I know there's an HTML cleaner that strips all HTML that can be abused. Would an SVG cleaner that passes only white-listed SVG elements be a substantial task? I know I don't use all that many different SVG elements for my vector images, even if we count all the headers and helpers that Inkscape adds as things I use.
Probably obvious, but:
Any kind of cleaner would have to be based on a white-list rather than a black-list, because a black-list can't reasonably be expected to strip every future element with abuse potential, while a white-list that strips safe elements can be updated if enough people are affected to make revising it worthwhile.
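As a toy illustration of the white-list idea (the element names and tree shape are invented; a real sanitizer must also filter attributes, since on* event handlers are where much of the danger lives):

```javascript
// Toy white-list filter over an already-parsed element tree: keep an
// element only if its tag is on the allow list, and recurse into
// children. Anything not explicitly allowed is dropped.
const SVG_ALLOW = new Set(["svg", "g", "path", "rect", "circle", "text"]);

function filterTree(node) {
  if (!SVG_ALLOW.has(node.tag)) return null; // strip unknown elements
  return {
    tag: node.tag,
    children: (node.children || []).map(filterTree).filter(Boolean),
  };
}
```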
I found an article about SVG on Wordpress, explaining a little about the problems (which hadn't occurred to me, since I have only ever used the harmless parts of SVG). It points to a sanitizer that might be a starting point for Dreamwidth, though it turns out to be PHP rather than Perl.
https://bjornjohansen.no/svg-in-wordpress
This is a PHP "sanitizer" for SVG, mentioned in the Wordpress article:
https://github.com/darylldoyle/svg-sanitizer
This is a Javascript sanitizer for HTML, MathML, and SVG, which was the basis for the PHP sanitizer:
https://github.com/cure53/DOMPurify
This is another PHP sanitizer, which is a lot simpler. That suggests that it's lighter on server load, and easier to validate, but also that it's more likely to reject non-malicious SVG:
https://github.com/alnorris/SVG-Sanitizer
Also, Mediawiki, used on Wikipedia and other Wikimedia sites, supports SVG. But in spite of some searching for how its SVG support works, all I could find is general information that it can serve SVG to users whose browsers can render it correctly, but normally it automatically converts to PNG for more consistent behavior between browsers.
One of the points about consistent rendering behavior is that if an SVG file includes text, but requests a font that isn't installed on the machine that does the rendering (the server if the server is automatically converting to PNG, or the client if it's delivering SVG directly), the results may not be as intended.
Wikipedia user documentation advises users who generate SVG images that include text that they should upload two versions: one that has all the text vectorized, and one that includes the native text, the latter so that other users can edit the images without trying to reverse engineer the vectorized text. It also includes a link to a long list of fonts that Wikimedia can correctly render, most of which apparently are somewhat obscure free alternatives to more familiar proprietary fonts.
Sorry I don't have links for the Mediawiki articles, but I bounced around a lot gathering that information, and lost track of where I found it all among the abundance of information scattered around on Wikimedia sites.
Re: SVG images
Hmm, reading up on your links a bit, I don't think we could be confident enough in the sanitizers, especially since images in entries can display in a lot of contexts (and with a lot of permissions) and therefore need strict scrutiny. But thank you so much for looking into it more and pointing me to the resources!
Re: SVG images
I'll keep looking, and if I can find something that avoids the abundant risks of SVG (even if at the cost of greatly narrowing the extent of what one can do with it) I'll follow up more.
Also, I came across a more limited vector format that might be a possibility, if I can find it again. That might offer another path to vector image support, particularly if there's a good way to save an SVG as the other format.
no subject
I'm still writing a small browser script to enable the "Repost" functionality, similar to what LJ has.
It is a simple GreaseMonkey/TamperMonkey Javascript tool; no LJ API is used.
At the moment, I have managed to populate Subject and Entry Text, but I seem to be stuck populating "to Community" dropdown, Tags, and "Icon" fields. These fields are apparently JS-rich and heavily depend on user interaction (mouse clicks etc).
Is there a tutorial of how to fill these fields programmatically?
no subject
Yes, my goal is to pre-populate the form so that it only remains to press the "Post" button. I tried to directly populate the fields that are submitted along with the form, but:
1. this does not display on screen
2. if you simply press the "Post" button, some fields (Tags and something else, I don't remember for sure) are not posted to the server upon form submission.
no subject
[…] .selected), Icon dropdown, and Tags string input (after setting its .value). Here's how to do it for Tags:
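A minimal sketch of the usual pattern for JS-driven fields (assuming the page's widgets listen for standard input/change events; the selector is a guess, not DW's actual markup):

```javascript
// Set a form field's value and fire the DOM events a JS-driven widget
// (like a tags autocomplete) typically listens for, so it notices the
// change even though no real keystrokes happened.
function setFieldValue(field, value) {
  field.value = value;
  for (const type of ["input", "change"]) {
    field.dispatchEvent(new Event(type, { bubbles: true }));
  }
}

// In a GreaseMonkey script, something like (selector hypothetical):
// setFieldValue(document.querySelector("#taglist"), "repost, from-lj");
```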
no subject
https://www.dreamwidth.org/update?subject=SUBJECT&event=BODY
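A sketch of building such a pre-filled URL (only the subject and event parameter names are confirmed by the URL above; the rest is my guess):

```javascript
// Build a pre-filled /update URL. URLSearchParams handles the
// form-style encoding (spaces, ampersands, etc.) for us.
function buildUpdateUrl(subject, body) {
  const params = new URLSearchParams({ subject, event: body });
  return `https://www.dreamwidth.org/update?${params}`;
}
```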
no subject
Icons either a) duplicate somehow, replacing other icons, b) load after refreshing the page 3-4 times, or c) won't load at all. Is there something wrong with my databases or mogile, I wonder?
no subject
1- Do icons that have one of problems a-c have that problem only, consistently? Or do the same icons have sometimes a, sometimes b, sometimes c, and (maybe) sometimes no problem at all?
2- What is your %MOGILEFS_CONFIG and @BLOBSTORES in ext/local/etc/config* ? (Sanitize as necessary.)
3- What do your Apache logs show when problems b and c happen?
4- For problem a, is it just the icon that gets replaced, or also the keywords, comment, and description?
no subject
1) These problems are not consistent. Sometimes refreshing a few times fixes it, but not always. The ones that have problem a (duplicated icon) ARE consistent and the only way to fix it is to delete it and upload the correct icon again. Some icons have absolutely no problem at all.
2) Here's my %MOGILEFS_CONFIG:
@BLOBSTORES is nowhere to be found in the config file.
3) Here is part of my apache error log. Currently it's at 209 pages just from the last 24 hours with the same errors, so I only included the first 1-2 pages. The first line confuses me: "NOTE: Google::Checkout::* Perl modules were not found." I'm not sure what that means exactly.
4) Only the picture is replaced. Keywords/description/comments do not change.