Question thread #62
It's time for another question thread!
The rules:
- You may ask any dev-related question you have in a comment. (It doesn't even need to be about Dreamwidth, although if it involves a language/library/framework/database Dreamwidth doesn't use, you will probably get answers pointing that out and suggesting a better place to ask.)
- You may also answer any question, using the guidelines given in To Answer, Or Not To Answer and in this comment thread.
enhanced Livejournal import
I don't know how to ask this question without a really long introduction, but if you want to save some reading, skip to the final paragraph once you get the idea.
I have a Livejournal with almost 8000 entries, over 1000 pictures, and 5000 comments received. Frequently I link one entry to another when referring to things I wrote about in past messages (and sometimes I even revise past messages to link later messages).
For example, a link like http://steve98052.livejournal.com/999.html embedded in a later entry.
Those intra-journal links are a mess for importing from Livejournal to Dreamwidth, because message serial numbers (the "999" in "999.html" above, in case that's not the standard term for those numbers) aren't necessarily going to be the same on different journal hosts.
The simple answer is that the intra-journal links stay as-is, linking the original journal rather than within the new journal, but that sends readers off to Russia. Not ideal.
My thought about how to make it work was to build a tool that opens an entry, finds intra-journal links, and inserts an HTML data-link-conversion="(value)" attribute into the <a href="(URL)"> tag. Then, after an import, the tool could find the data-link-conversion attributes and modify the URL to match the new host of the imported journal.
Note: the data-link-conversion tag is a user-defined attribute; search data-* attribute if it's unfamiliar.
My thought is that the attribute value of the data-link-conversion attribute could be either the year-month-date-hour:minute or the year-month-date-serial of the destination journal entry, or maybe even year-month-date-hour:minute-serial for error checking.
Of course, using a message's position in the calendar breaks if an edit to a message changes its calendar position without correcting all data-link-conversion attributes that point to it, which is messy business. Best answer there is, "Don't do that!" More precisely, it's probably best to strip the data-link-conversion attributes from both source and destination journal versions once the import and link conversion is finished.
So, finally, my questions are: 1. Is this a problem that anyone else sees as a problem? 2. Does my solution outline look like a good way to re-link intra-journal links? 3. Is anyone else interested in teaming up on such a tool? 4. Where should I look for code to use as a starting point for such a tool?
Re: enhanced Livejournal import
Are you thinking this would be an onsite tool or a downloadable/scripted/third party tool? We do store the original LJ URL on imported entries, so I can think of a couple of much simpler ways to do it onsite, actually. I'm probably missing some but here's what comes immediately to mind:
1) as an addition to the on-site import tool, at time of import, add a final step that will kick off a worker to scan through each entry and look for links to the journal that was imported from; populate a table of LJ<->DW entry URLs (the data exists but only searchable one way, I think), then go back and replace every LJ link in the journal to the corresponding DW link. Advantage: once-and-done. Disadvantage: see above re: complicating the importer, has the potential to be edge-casey as all get out (you'd be surprised at the horrible things people do to an innocent <a href=> in practice), the failure mode could result in data loss, did I mention we really REALLY don't want to add any other features to the importer that we'd have to support (and in fact have been discussing dropping support for a few of the more support-load-generating features), will make imports take longer since they'll be scanning every journal for internal links even if the journal doesn't have 'em.
2) Create a separate page that lets people push a button to launch a worker that scans through each entry and looks for links to the journal that was imported from; populate a table of LJ<->DW entry URLs, then add a thing to the HTML cleaner that says "look for whether anything was ever imported to this journal; if it was, and you encounter a link leading to the import source, check the correspondence tables and rewrite the link at pageload". Advantage: doesn't complicate the importer, only happens if people ask the site to do it, and if there's an edge-case malformed HTML tag in the entry it will preserve the original, which you can correct if the post is edited. Disadvantage: higher load and possible performance issues, since you're repeating the on-the-fly rewrite every time someone loads an entry.
3) Blah blah worker that will populate a table of LJ<->DW URLs as above; the HTML cleaner permanently rewrites links in the entry the first time an entry is accessed after the URL-mapping worker is done. Advantage: once-and-done. Disadvantage: higher load, and the failure mode risks data loss when your regex encounters the edge-case malformed HTML.
4) Blah blah worker that will populate a table of LJ<->DW URLs as above. If a journal contains anything that's been imported and the user ticks the "replace LJ links with DW links" button or whatever it's called, at pageload any link to the original journal source gets changed to https://username.dreamwidth.org/redirect?orig=http://username.livejournal.com/999.html instead, and the lookup doesn't happen until someone clicks the link (the same way previous/next entries are calculated). Advantage: the data is only looked up if it's needed, not on every load. Disadvantage: once again, the risk of "your regex meets edge-case data and cries".
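The option-4 lookup might be sketched like this (the /redirect endpoint and its orig parameter follow the illustrative URL above; the correspondence table is assumed to already exist):

```javascript
// Option 4 sketch: only resolve the LJ->DW mapping when the redirect
// link is actually clicked. urlMap is the LJ<->DW correspondence table.
function resolveRedirect(redirectUrl, urlMap) {
  const orig = new URL(redirectUrl).searchParams.get("orig");
  // Fall back to the original LJ URL if no mapping exists.
  return urlMap.get(orig) || orig;
}
```

The fallback matters: entries that were skipped or failed during import simply keep sending readers to the original LJ URL rather than to a broken link.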
Of all those options, I would wager 4 is most likely the only approach that would be workable.
If you're thinking of third-party/downloadable tool, there still *might* be an easier way of doing it than with attributes, although I'm not familiar enough with attributes to say which would work better/faster. To explain what I'm thinking of, lemme quickly cover how URLs are formed:
* Each entry in a journal gets a unique 'jitemid' (journal item ID). It's the next number in sequence of the count of entries in your journal, with the caveat that if you post and then delete an entry, its jitemid is never reused. (So, if I have 10 entries in my journal, then post an 11th, the jitemid of that new entry is 11. If I delete jitemid 11, the next entry I post will be jitemid 12 and the entries will go 9, 10, 12.)
* But! If we used those numbers in the URLs or showed them in journals or whatever, people looking at my journal in the above example would immediately be able to say "oho! Denise either deleted entry 11, or it's a private or filtered entry that I can't see!" Lesson learned painfully in the good old days of LJ: people get really weird if they know there's content their friend has posted that they can't see. So we add a "fuzz factor" by generating a random number from 0 to 255 (called the 'anum'; I don't remember what it stands for).
* The jitemid and the anum together are used to form the entry's itemid, which is what you're calling the message serial numbers here. The itemid of an entry is formed by the following formula:
($jitemid * 256) + $anum = $itemid
* Soooo, given an entry with a URL of https://dw-dev.dreamwidth.org/205028.html, you can reconstruct that the itemid is 205028, and you can reconstruct the jitemid by dividing the 205028 by 256 and getting 800.890625 (so the jitemid is 800, meaning this is the 800th entry ever posted to dw-dev including entries that have been deleted) and you can reconstruct the anum by multiplying 800*256=204800 and subtracting that from the itemid (205028-204800 = 228).
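That decoding can be sketched in a few lines of Javascript (function and variable names are mine, not Dreamwidth's):

```javascript
// Decode a public itemid (the "205028" in ".../205028.html") into its
// internal jitemid and anum, per $itemid = ($jitemid * 256) + $anum.
function decodeItemid(itemid) {
  const jitemid = Math.floor(itemid / 256); // entry's sequence number
  const anum = itemid % 256;                // random 0-255 "fuzz factor"
  return { jitemid, anum };
}

console.log(decodeItemid(205028)); // { jitemid: 800, anum: 228 }
```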
ALL THAT HAVING BEEN INTRODUCED:
I honestly don't remember off the top of my head if the importer makes sure the DW version of an imported journal keeps the same itemid count as the LJ version. If it does (and I reiterate that I'm not sure whether it does), the following will be easy; if it doesn't, the following will be more complex, but still doable.
-- Connect to your LJ account, download its contents (using the old LJ version of jbackup.pl, which I can't guarantee still works on LJ because they closed their source in 2014), store the data locally
-- Connect to your DW account, download its contents (using the DW version of jbackup.pl, which may also work on LJ if the old LJ one doesn't work for you anymore), store the data locally
----- Easy mode (jitemids are the same): looking up each jitemid in your exported data from each site, make a table of which URL on LJ corresponds to which URL on DW by finding the itemid for each jitemid
----- Harder mode (jitemids aren't the same): Compare contents across the two, using things like the jitemid, anum, itemid, subject line, user-provided time, calculations of the various identifiers as described above etc, to figure out which entries are the same (don't use server time; the server time on the DW side will be the time the import ran, not the time the original LJ post was made). (You will probably want to have your script keep count of how many jitemids were skipped and where, in case you have to manually intervene later if something goes horribly awry.) Make a table of which URL on LJ corresponds to which URL on DW once you're sure you've got it right.
-- Once you've identified which LJ post goes with which DW post, iterate over your DW exported data and find all your DW entries that contain links to http://steve98052.livejournal.com (by DW entry)
-- For each DW entry that contains links to http://steve98052.livejournal.com, consult your correspondence table and identify what the URL of that entry is on DW
-- Either have the script spit out a long list of DW entries for you to manually edit and change the links on (easier, doesn't require you to write an interface to DW, more about that in a sec) or teach the script how to ask DW to edit a particular entry and have it do the edits itself.
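The easy-mode steps above might be sketched like this (the entry-record shape here is invented for illustration; real jbackup.pl output would need parsing first, and plain string replacement is no substitute for real HTML parsing):

```javascript
// Build an LJ->DW URL table from two exports that share jitemids,
// using itemid = jitemid * 256 + anum to reconstruct each URL.
function buildUrlMap(ljEntries, dwEntries, ljBase, dwBase) {
  const dwByJitemid = new Map(dwEntries.map(e => [e.jitemid, e]));
  const urlMap = new Map();
  for (const lj of ljEntries) {
    const dw = dwByJitemid.get(lj.jitemid);
    if (!dw) continue; // no counterpart on DW: leave those links alone
    urlMap.set(
      `${ljBase}/${lj.jitemid * 256 + lj.anum}.html`,
      `${dwBase}/${dw.jitemid * 256 + dw.anum}.html`
    );
  }
  return urlMap;
}

// Naive string replacement; a real tool should parse the HTML instead,
// to survive the malformed <a href=> edge cases mentioned above.
function rewriteLinks(body, urlMap) {
  for (const [ljUrl, dwUrl] of urlMap) body = body.split(ljUrl).join(dwUrl);
  return body;
}
```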
*pant pant pant*
OKAY COMING TO THE LAST PIECE I SWEAR but I'm gonna post this first because I bet I'm running up against the limit
Re: enhanced Livejournal import
First, a minor point: I read some Livejournal code years ago, and I seem to recall the itemid generation being slightly different: I thought the new itemid was the highest existing itemid (even if deleted) plus 1 to 256 (possibly implemented as something like a next-item-id plus 0 to 255). If it were jitemid*256 plus random, one could derive the jitemid by dividing the itemid by 256 and discarding the fraction.
In the case of the serial number in the "year-month-date-serial" I mentioned, it's something in the portion of the client interface that I read and understood: one way that the server presents messages to clients is by year-month-date plus a serial number, or alternatively, flags that say whether it's the first, last, or neither entry for a given date.
I agree that importing an entire journal, searching for intra-journal links and building an associative array, and then generating a list would be fewer steps than the way I imagined. But I was interested in a design that would allow me to examine only a user-specified selection of entries, whether it's a single entry, a date range, or the whole works.
Of course there's a need for all sorts of bullet-proofing. And anything that would be designated as an official Dreamwidth tool would need to be designed to minimize support load from users. The wider the intended audience for such a tool, the better the error messages would have to be.
If you want your script to do the edits for you, it will need to be a DW client. For something to be a DW client, it will need to understand the DW API. The caveat: the current (version 1.0) version of the API is... well, I'll be charitable and call it "outdated". (By which I mean, 'I don't think it was created in this century'.) It's XML-RPC, it has not aged well, and it's not really documented (because it's so old we're a little ashamed of it and we've been trying to write a replacement API that, you know, isn't old enough to vote.)
https://github.com/dreamwidth/dw-free/blob/develop/src/xmlrpc-client/journal.pl is a very old and not very featureful command-line client Brad used to use that will give you some ideas, and https://github.com/dreamwidth/dw-free/blob/develop/src/jbackup/jbackup.pl is relevant here again because it's also a client (it just isn't geared to pulling up specific entries for edits). I don't THINK we have any sample code for interacting-with-already-posted-entries clients but https://www.livejournal.com/doc/server/ljp.csp.protocol.html is the old LJ server docs for the XML-RPC API and we haven't changed THAT much. I think dw-free/cgi-bin/LJ/Protocol.pm is where the magic happens if you need to consult it.
(I am a lot fuzzier on that last bit than on the stuff in the other comment; if you get stuck anywhere, it's best to make a new top-level dw-dev post and hope someone who knows more than I do sees it.)
Re: enhanced Livejournal import
My thought, based on the implementation that I proposed, was to build it as a client, because things like the data-link-conversion tag I imagined would require mucking with entries on both the source and destination journals.
I see that XML-RPC was the latest and greatest technology back when Livejournal went live, but yeah, in software years, it's antiquated. (And "old enough to vote" gave me a giggle.) But although it's an archaic interface, it's not a horribly complicated one. On the other hand, my Perl skills are negligible, so it might not be something I'd know how to do. On the other-other hand, if the client to do the job is substantial, learning Perl properly might be easier than coding the whole works in Java.
I am sort of envisioning being able to approve a small batch, or something.
SVG images
(Anonymous) 2018-03-28 08:22 pm (UTC)
I know there's an HTML cleaner that strips all HTML that can be abused. Would an SVG cleaner that passes only white-listed SVG elements be a substantial task? I know I don't use all that many different SVG elements for my vector images, even if we count all the headers and helpers that Inkscape adds as things I use.
Probably obvious, but:
Any kind of cleaner would have to be based on a white-list rather than a black-list, because a black-list can't reasonably be expected to strip every future element with abuse potential, while a white-list that strips safe elements can be updated if enough people are affected to make revising it worthwhile.
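As a toy illustration of the white-list idea (the element names and tree shape are invented; a real sanitizer must also filter attributes, since on* event handlers are where much of the danger lives):

```javascript
// Toy white-list filter over an already-parsed element tree: keep an
// element only if its tag is on the allow list, and recurse into
// children. Anything not explicitly allowed is dropped.
const SVG_ALLOW = new Set(["svg", "g", "path", "rect", "circle", "text"]);

function filterTree(node) {
  if (!SVG_ALLOW.has(node.tag)) return null; // strip unknown elements
  return {
    tag: node.tag,
    children: (node.children || []).map(filterTree).filter(Boolean),
  };
}
```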
I found an article about SVG on Wordpress, explaining a little about the problems (which hadn't occurred to me, since I have only ever used the harmless parts of SVG). It points to a sanitizer that might be a starting point for Dreamwidth, though it turns out to be PHP rather than Perl.
https://bjornjohansen.no/svg-in-wordpress
This is a PHP "sanitizer" for SVG, mentioned in the Wordpress article:
https://github.com/darylldoyle/svg-sanitizer
This is a Javascript sanitizer for HTML, MathML, and SVG, which was the basis for the PHP sanitizer:
https://github.com/cure53/DOMPurify
This is another PHP sanitizer, which is a lot simpler. That suggests that it's lighter on server load, and easier to validate, but also that it's more likely to reject non-malicious SVG:
https://github.com/alnorris/SVG-Sanitizer
Also, Mediawiki, used on Wikipedia and other Wikimedia sites, supports SVG. But in spite of some searching for how its SVG support works, all I could find is general information that it can serve SVG to users whose browsers can render it correctly, but normally it automatically converts to PNG for more consistent behavior between browsers.
One of the points about consistent rendering behavior is that if an SVG file includes text, but requests a font that isn't installed on the machine that does the rendering (the server if the server is automatically converting to PNG, or the client if it's delivering SVG directly), the results may not be as intended.
Wikipedia user documentation advises users who generate SVG images that include text that they should upload two versions: one that has all the text vectorized, and one that includes the native text, the latter so that other users can edit the images without trying to reverse engineer the vectorized text. It also includes a link to a long list of fonts that Wikimedia can correctly render, most of which apparently are somewhat obscure free alternatives to more familiar proprietary fonts.
Sorry I don't have links for the Mediawiki articles, but I bounced around a lot gathering that information, and lost track of where I found it all among the abundance of information scattered around on Wikimedia sites.
Re: SVG images
Hmm, reading up on your links a bit, I don't think we could be confident enough in the sanitizers, especially since images in entries can display in a lot of contexts (and with a lot of permissions) and therefore need strict scrutiny. But thank you so much for looking into it more and pointing me to the resources!
Re: SVG images
I'll keep looking, and if I can find something that avoids the abundant risks of SVG (even if at the cost of greatly narrowing the extent of what one can do with it) I'll follow up more.
Also, I came across a more limited vector format that might be a possibility, if I can find it again. That might offer another path to vector image support, particularly if there's a good way to save an SVG as the other format.
no subject
I'm still writing a small browser script to enable the "Repost" functionality, similar to what LJ has.
It is a simple GreaseMonkey/TamperMonkey Javascript tool; no LJ API is used.
At the moment, I have managed to populate Subject and Entry Text, but I seem to be stuck populating "to Community" dropdown, Tags, and "Icon" fields. These fields are apparently JS-rich and heavily depend on user interaction (mouse clicks etc).
Is there a tutorial of how to fill these fields programmatically?
no subject
Yes, my goal is to pre-populate the form so that it only remains to press the "Post" button. I tried to directly populate the fields that are submitted along with the form, but:
1. this does not display on screen
2. if you simply press the "Post" button, some fields (Tags and something else, I don't remember for sure) are not posted to the server upon form submission.
no subject
[…] .selected), Icon dropdown, and Tags string input (after setting its .value). Here's how to do it for Tags:
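A minimal sketch of the usual pattern for JS-driven fields (assuming the page's widgets listen for standard input/change events; the selector is a guess, not DW's actual markup):

```javascript
// Set a form field's value and fire the DOM events a JS-driven widget
// (like a tags autocomplete) typically listens for, so it notices the
// change even though no real keystrokes happened.
function setFieldValue(field, value) {
  field.value = value;
  for (const type of ["input", "change"]) {
    field.dispatchEvent(new Event(type, { bubbles: true }));
  }
}

// In a GreaseMonkey script, something like (selector hypothetical):
// setFieldValue(document.querySelector("#taglist"), "repost, from-lj");
```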
no subject
https://www.dreamwidth.org/update?subject=SUBJECT&event=BODY
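A sketch of building such a pre-filled URL (only the subject and event parameter names are confirmed by the URL above; the rest is my guess):

```javascript
// Build a pre-filled /update URL. URLSearchParams handles the
// form-style encoding (spaces, ampersands, etc.) for us.
function buildUpdateUrl(subject, body) {
  const params = new URLSearchParams({ subject, event: body });
  return `https://www.dreamwidth.org/update?${params}`;
}
```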
no subject
Icons either a) duplicate somehow, replacing other icons, b) load after refreshing the page 3-4 times, or c) won't load at all. Is there something wrong with my databases or mogile, I wonder?
no subject
1- Do icons that have one of problems a-c have that problem only, consistently? Or do the same icons have sometimes a, sometimes b, sometimes c, and (maybe) sometimes no problem at all?
2- What is your %MOGILEFS_CONFIG and @BLOBSTORES in ext/local/etc/config* ? (Sanitize as necessary.)
3- What do your Apache logs show when problems b and c happen?
4- For problem a, is it just the icon that gets replaced, or also the keywords, comment, and description?
no subject
1) These problems are not consistent. Sometimes refreshing a few times fixes it, but not always. The ones that have problem a (duplicated icon) ARE consistent and the only way to fix it is to delete it and upload the correct icon again. Some icons have absolutely no problem at all.
2) Here's my %MOGILEFS_CONFIG:
@BLOBSTORES is nowhere to be found in the config file.
3) Here is part of my apache error log. Currently it's at 209 pages just from the last 24 hours with the same errors, so I only included the first 1-2 pages. The first line confuses me: "NOTE: Google::Checkout::* Perl modules were not found." I'm not sure what that means exactly.
4) Only the picture is replaced. Keywords/description/comments do not change.