Question thread #118
It's time for another question thread!
The rules:
- You may ask any dev-related question you have in a comment. (It doesn't even need to be about Dreamwidth, although if it involves a language/library/framework/database Dreamwidth doesn't use, you will probably get answers pointing that out and suggesting a better place to ask.)
- You may also answer any question, using the guidelines given in To Answer, Or Not To Answer and in this comment thread.

Re: well howdy
I suppose that could also be slow enough to cause an issue; I didn't think about that. I thought I was being rate-limited before, so I really cranked it down.
It is very weird to me too and I'm pretty lost. :/
Re: well howdy
-- We occasionally have issues with the importer where a specific post is broken in some weird way (bad UTF-8, invalid characters, and control sequences that snuck in invisibly are the most common), so that the sync API just chokes and gets stuck on it. If lastsync is always getting stuck on the same post (or around it if you move posts around), maybe take the contents of that entry and dump them into a text editor like BBEdit that will let you "zap gremlins", then paste the cleaned-up text back into the entry body and see if that lets you progress further? (It's also possible that it's choking on a comment to the entry, in which case you'll have to delete it in order to proceed, if it's not one that you made.)
-- I haven't looked at the LJ-to-WP importer, but: is it possible that the stuff it's grabbing with no problem is stuff you imported from LJ and it's choking on stuff you've posted to DW? If it doesn't know how to handle entry properties that we added on DW that LJ doesn't have, that may be why it's dying; if (for instance) the entry it's getting stuck at is one you used Markdown formatting in, that's an entryprop that LJ doesn't have and it might be freaking out at that.
-- Are you pulling comments and entries, or just entries? If you're trying to pull comments, too, and store them with the entries with threading intact, you may need to do a lot of checking the validity of the comment parent/child tree. If the entry it's choking on has any deleted or suspended comments, that may be it. (You wouldn't believe how much error checking we have to do with that in the importer, and it still breaks from time to time and we have to fix things manually.)
-- Is the entry it's choking on (by itemid) one of a long run of entries that are backdated/dated out of order? There's a weird issue I'm only vaguely remembering because there's special-casing for it in the import workflow where if you have too many entries backdated in a row, lastsync will just throw up its hands and die horribly. I think it happens when the number of items you're trying to get from the lastsync point returns no entry that is not backdated -- so like, if you're trying to fetch things 50 items at a time, and you hit a run where there are 51 backdated entries, lastsync gets really badly confused. I think I vaguely remember that the importer uses backdating in some cases, so if you're hitting a run of entries that were imported from LJ, that may be the problem.
Those are the big ones that come to mind; two of them are things that cause problems in the on-site importer all the time, and you might want to take a look at some of the error checking and logging we do. My gut instinct is that the last one I thought of (the backdating thing -- which I thought of when I was looking through the importer code) is most likely to be the culprit, with the "bad characters in the entry it's choking on" in second place and the other two in a distant third.
(If it does turn out to be the backdating thing, the good news is it's fixable; the bad news is it's going to be a pain, because the only real fix is to edit the entries and uncheck the backdated flag.)
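If pasting entries into an editor one at a time gets old, the "zap gremlins" step can be approximated in a script. This is a minimal Python sketch, not anything taken from the importer: the function names are made up for illustration, and the definition of a "gremlin" (Unicode control and format characters other than ordinary whitespace, plus U+FFFD left behind by a failed decode) is an assumption about what is worth flagging.

```python
import unicodedata

# Whitespace control characters that are legitimate in entry text.
ALLOWED_CONTROLS = {"\n", "\r", "\t"}


def decode_entry(raw_bytes):
    """Decode entry bytes as UTF-8, turning invalid sequences into U+FFFD."""
    return raw_bytes.decode("utf-8", errors="replace")


def find_gremlins(text):
    """Return (index, codepoint) pairs for suspicious characters.

    Assumption: "suspicious" means Unicode category Cc (control) or
    Cf (format, e.g. zero-width characters), excluding ordinary
    whitespace, plus U+FFFD from a failed decode.
    """
    hits = []
    for i, ch in enumerate(text):
        if ch in ALLOWED_CONTROLS:
            continue
        if unicodedata.category(ch) in ("Cc", "Cf") or ch == "\ufffd":
            hits.append((i, "U+%04X" % ord(ch)))
    return hits


def zap_gremlins(text):
    """Return text with every character flagged by find_gremlins removed."""
    bad = {i for i, _ in find_gremlins(text)}
    return "".join(ch for i, ch in enumerate(text) if i not in bad)
```

Running `find_gremlins` first and eyeballing the hits is probably safer than zapping blindly, since category Cf also covers characters that can be legitimate, such as soft hyphens and the zero-width joiners inside some emoji.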
Re: well howdy
In order:
-- I've found a few entries with bad UTF-8 and fixed those. That solved a couple of problems I was having early on with this project, but I've run out of them.
-- It's not even looking at the details where it's failing; it's not choking on unknown entryprops. Where it's failing, it's just trying to build the items list.
-- No comment pickup yet except in that we're getting the big list of all items. If I have to throw out comments, so be it.
-- None of the bad dates point directly at an item - it's always a comment. But if I edit the entry that has that comment, the item lastsync gets stuck on changes. After I did that several times, I decided/realised it was systemic and that wasn't much of a solution. :/
-- I am fetching a lot more than 50 entries at a time. This feels dumb, but I'm not seeing a way to tell whether an entry is backdated. At least some of the entries in this timeframe were echoed from my old band blog, though, which would've been direct: I set that up originally to use JournalPress to crosspost to here (as a normal post, with comments enabled), and then use Dreamwidth to crosspost further back to Livejournal with comments (there) disabled and a pointer back here. And I'm pretty sure JournalPress doesn't mark posts backdated, as I wanted them to appear in friendlists. (And posts from my new let's-amalgamate-everything personal blog absolutely do appear; I'm getting comments here on them.)
Re: well howdy
You can tell if an entry is backdated by editing it: if you're using the old update page, it'll be a checkbox near the date item labeled "date out of order" or "don't show on entry pages" (I forget if we renamed it there or not), and if you're using the new entry page it's in the Display Date panel and labeled "don't show on reading pages".
If it's always getting stuck on comments, I wonder if the script you're using just doesn't have good comment mapping error handling. Have you taken a look at jbackup.pl that Pau linked to above? Is there anything like that in the script you're using? I've never looked at it, so I don't know how well it handles weirdnesses. I know you've probably hit the point of "I've put this much effort into it, I'm going to make this work goddammit," but just to rule out issues with the script itself: if you try running jbackup.pl does it get stuck in the same place?
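To make "comment mapping error handling" a bit more concrete, here is a rough Python sketch of the kind of parent/child check a script might run before trying to rebuild threading. It is an illustration only: the input shape (a dict mapping each comment id to its parent id, with `None` for top-level comments) and the function name are assumptions, not what jbackup.pl or the on-site importer actually does.

```python
def check_comment_tree(parents):
    """Classify comments by whether their parent chain is intact.

    parents maps comment id -> parent id (None for top-level comments).
    Returns (ok, orphans, cycles):
      ok      - ids whose chain reaches a top-level comment
      orphans - ids whose chain hits a parent that is not in the data
                (for example, a deleted comment that was not exported)
      cycles  - ids whose chain runs into a loop
    """
    ok, orphans, cycles = set(), set(), set()
    for start in parents:
        seen = []          # ids visited on this walk, in order
        node = start
        while True:
            if node is None or node in ok:
                ok.update(seen)
                break
            if node not in parents or node in orphans:
                orphans.update(seen)
                break
            if node in seen or node in cycles:
                cycles.update(seen)
                break
            seen.append(node)
            node = parents[node]
    return ok, orphans, cycles
```

A script could then thread only the `ok` set and attach everything else directly to the entry (or log and skip it) instead of dying on the first bad parent.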
Re: well howdy
That's not it. We're expecting C- items mixed in with L- items; this is the one big pull of all item metadata.
Also, neither side is getting stuck on a single actual item; the comment identifiers keep changing, as per the debug output I provided in another reply. It's just that they have huge swaths of repeated lastsyncs.
I'm kinda thinking this would be fine at this point if we didn't have the shutdown for repeated pulls of the same lastsync.
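One way to picture a less trigger-happy shutdown: count a pull as stalled only when lastsync fails to advance and the page contains no item ids that haven't been seen before. The Python below is a sketch under stated assumptions, not the importer's actual logic. `fetch_page` is a hypothetical callable standing in for the real sync request, pages are assumed to be lists of `(item_id, time)` tuples with times as sortable `YYYY-MM-DD HH:MM:SS` strings, and an empty page is assumed to mean "done". Whether the server really hands back new items for a repeated lastsync is exactly the open question here, so treat this as a way to test the idea rather than a fix.

```python
def pull_all_items(fetch_page, start="", max_stalls=3):
    """Collect sync items by following lastsync until a page comes back empty.

    fetch_page(lastsync) is a hypothetical callable returning a list of
    (item_id, time) tuples, e.g. ("C-456", "2012-01-01 00:00:05").
    A pull counts as a stall only if lastsync did not advance AND the
    page held no unseen item ids. Raises RuntimeError after max_stalls
    consecutive stalls.
    """
    seen = {}              # item_id -> most recent time reported for it
    lastsync = start
    stalls = 0
    while True:
        page = fetch_page(lastsync)
        if not page:
            return seen
        fresh = [item_id for item_id, _ in page if item_id not in seen]
        for item_id, when in page:
            seen[item_id] = when
        newest = max(when for _, when in page)
        advanced = newest > lastsync
        if advanced:
            lastsync = newest
        if advanced or fresh:
            stalls = 0
        else:
            stalls += 1
            if stalls >= max_stalls:
                raise RuntimeError("lastsync stuck at %r" % lastsync)
```

Keeping the stall counter separate from "lastsync repeated" means a run of items sharing one timestamp does not trip the shutdown, while a genuinely stuck loop still ends after a few tries.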
Re: well howdy
Which mostly tells me that the database on your end is fine as far as your own tools are concerned (which isn't really surprising; I mean, of course it is).
Sadly, from what I've been able to find, there are no tools to import Dreamwidth XML exports to Wordpress. I'm not the first person to try this; I'm just the first person who hasn't bailed and copied everything over by hand. (At least, that's what I was finding in searches before I started trying to adapt the Livejournal importer to Dreamwidth.) As far as I've been able to find, few people have tried to make this work, and so far none have succeeded.
I wonder if there's something different in Wordpress's PHP implementation of the XMLRPC libraries? But now I'm just in "whelp I dunno" territory and am guessing.