pauamma: Cartooney crab wearing hot pink and acid green facemask holding drink with straw (Default)
Res facta quae tamen fingi potuit ([personal profile] pauamma) wrote in [site community profile] dw_dev 2023-02-01 01:57 am

Question thread #118

It's time for another question thread!

The rules:

- You may ask any dev-related question you have in a comment. (It doesn't even need to be about Dreamwidth, although if it involves a language/library/framework/database Dreamwidth doesn't use, you will probably get answers pointing that out and suggesting a better place to ask.)
- You may also answer any question, using the guidelines given in To Answer, Or Not To Answer and in this comment thread.
solarbird: (Default)

Re: well howdy

[personal profile] solarbird 2023-02-06 09:21 am (UTC)(link)
Right now I have it slowed down pretty far because I'm pausing 2 seconds every authentication and I'm fine with that. The metadata elements are sent back in a big package and I'd realised that before writing that last post so I think I'm actually fine, authentication happens every batch so it's like... only a couple of calls per second at most.
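A fixed pause like that can be sketched as a small throttle that enforces a minimum gap between calls. This is only an illustration in Python, not code from the importer: the `Throttle` class and its parameters are hypothetical names, and the 2-second figure is just the value mentioned above.

```python
import time


class Throttle:
    """Enforce a minimum gap between successive calls (hypothetical helper)."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock    # injectable so the class can be tested without waiting
        self._sleep = sleep
        self._last = None      # time of the previous call, if any

    def wait(self):
        """Sleep just long enough to keep calls min_interval seconds apart."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Calling `throttle.wait()` right before each authentication request would give the same effect as the fixed pause, but without adding delay when the previous call already took longer than the interval.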

I suppose that could also be slow enough to cause an issue, I didn't think about that. I was thinking I was being rate-limited before so I really cranked it down.

It is very weird to me too and I'm pretty lost. :/
denise: Image: Me, facing away from camera, on top of the Castel Sant'Angelo in Rome (Default)

Re: well howdy

[staff profile] denise 2023-02-07 09:28 pm (UTC)(link)
I'm also kind of stumped, but here's me thinking out loud about some reasons why lastsync might be getting stuck:

-- We occasionally have issues with the importer where a specific post is broken in some weird way (bad UTF-8, invalid characters, and control sequences that snuck invisibly in are the most common) that the sync API just chokes and gets stuck on. If lastsync is always getting stuck on the same post (or around it if you move posts around), maybe take the contents of that entry and dump them into a text editor like BBEdit that will let you "zap gremlins", then paste the cleaned up text back into the entry body and see if that lets you progress further? (It's also possible that it's choking on a comment to the entry, in which case you'll have to delete it in order to proceed, if it's not one that you made.)

-- I haven't looked at the LJ-to-WP importer, but: is it possible that the stuff it's grabbing with no problem is stuff you imported from LJ and it's choking on stuff you've posted to DW? If it doesn't know how to handle entry properties that we added on DW that LJ doesn't have, that may be why it's dying; if (for instance) the entry it's getting stuck at is one you used Markdown formatting in, that's an entryprop that LJ doesn't have and it might be freaking out at that.

-- Are you pulling comments and entries, or just entries? If you're trying to pull comments, too, and store them with the entries with threading intact, you may need to do a lot of checking the validity of the comment parent/child tree. If the entry it's choking on has any deleted or suspended comments, that may be it. (You wouldn't believe how much error checking we have to do with that in the importer, and it still breaks from time to time and we have to fix things manually.)

-- Is the entry it's choking on (by itemid) one of a long run of entries that are backdated/dated out of order? There's a weird issue I'm only vaguely remembering because there's special-casing for it in the import workflow where if you have too many entries backdated in a row, lastsync will just throw up its hands and die horribly. I think it happens when the number of items you're trying to get from the lastsync point returns no entry that is not backdated -- so like, if you're trying to fetch things 50 items at a time, and you hit a run where there are 51 backdated entries, lastsync gets really badly confused. I think I vaguely remember that the importer uses backdating in some cases, so if you're hitting a run of entries that were imported from LJ, that may be the problem.
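For the bad-characters case in the first item above, a scripted version of "zap gremlins" might look roughly like the Python sketch below. It is an illustration only, not what the importer does: `zap_gremlins` is a hypothetical name, and stripping all format characters (category `Cf`) is an assumption that may be too aggressive for text relying on zero-width joiners.

```python
import unicodedata

# Control characters that are legitimate in entry text.
ALLOWED_CONTROLS = {"\t", "\n", "\r"}


def zap_gremlins(raw: bytes) -> str:
    """Decode bytes as UTF-8 and strip characters that tend to break parsers."""
    # Invalid byte sequences become U+FFFD instead of raising an error.
    text = raw.decode("utf-8", errors="replace")
    cleaned = []
    for ch in text:
        if ch in ALLOWED_CONTROLS:
            cleaned.append(ch)
        elif unicodedata.category(ch) in ("Cc", "Cf"):
            # Cc = control codes, Cf = invisible format characters
            # (zero-width spaces, direction marks, and similar).
            continue
        else:
            cleaned.append(ch)
    return "".join(cleaned)
```

Any U+FFFD left in the result marks a spot where the original bytes were not valid UTF-8, which makes the broken passages easy to search for.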

Those are the big ones that come to mind; two of them are things that cause problems in the on-site importer all the time, and you might want to take a look at some of the error checking and logging we do. My gut instinct is that the last one I thought of (the backdating thing -- which I thought of when I was looking through the importer code) is most likely to be the culprit, with the "bad characters in the entry it's choking on" in second place and the other two in a distant third.

(If it does turn out to be the backdating thing, the good news is it's fixable; the bad news is it's going to be a pain, because the only real fix is to edit the entries and uncheck the backdated flag.)
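If you already have the entries pulled down locally, looking for suspiciously long backdated runs could be sketched like this in Python. Several things here are assumptions rather than facts about the API: that each entry is a dict with an `itemid` and a `props` mapping, that backdating shows up there as `opt_backdated`, and that a run at least as long as the batch size is the thing to worry about. The function names are hypothetical.

```python
def is_backdated(entry):
    """True if the entry carries the (assumed) opt_backdated property."""
    props = entry.get("props") or {}
    # XML-RPC clients may hand the flag back as an int or as a string.
    return props.get("opt_backdated") in (1, "1")


def long_backdated_runs(entries, batch_size):
    """Return (first_itemid, last_itemid, length) for each run of
    consecutive backdated entries that could fill a whole batch."""
    runs = []
    run = []
    for entry in entries:
        if is_backdated(entry):
            run.append(entry["itemid"])
            continue
        if len(run) >= batch_size:
            runs.append((run[0], run[-1], len(run)))
        run = []
    # A run may also extend to the very end of the list.
    if len(run) >= batch_size:
        runs.append((run[0], run[-1], len(run)))
    return runs
```

The entries need to be passed in the same order the sync walks them for the run lengths to mean anything.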
solarbird: (Default)

Re: well howdy

[personal profile] solarbird 2023-02-07 09:48 pm (UTC)(link)
Hey, Denise ^_^ -

In order:

-- I've found a few entries with bad UTF-8 and fixed those. That solved a couple of problems I was having early on with this project, but I've run out of them.

-- It's not even looking at the details where it's failing; it's not choking on unknown entryprops. Where it's failing, it's just trying to build the items list.

-- No comment pickup yet except in that we're getting the big list of all items. If I have to throw out comments, so be it.

-- None of the bad dates point directly at an item - it's always a comment. But if I edit the entry that has that comment, the item lastsync gets stuck on changes. After I did that several times, I decided/realised it was systemic and that wasn't much of a solution. :/

-- I am fetching a lot more than 50 entries at a time. This feels dumb, but I'm not seeing a way to tell whether an entry is backdated. At least some of the entries in this timeframe were echoed from my old band blog, though, which would've been direct - I set that up originally to use JournalPress to crosspost to here (as a normal post, with comments enabled), and then use Dreamwidth to crosspost further back to Livejournal with comments (there) disabled and a pointer back here. And I'm pretty sure JournalPress doesn't mark posts backdated, as I wanted them to appear in friendlists. (And posts from my new let's-amalgamate-everything personal blog absolutely do, I'm getting comments here on them.)
denise: Image: Me, facing away from camera, on top of the Castel Sant'Angelo in Rome (Default)

Re: well howdy

[staff profile] denise 2023-02-07 09:55 pm (UTC)(link)
hmmmmmmm. Yeah, we're way out past the edges of my knowledge here (I'm just working from what we've seen before in the importer).

You can tell if an entry is backdated by editing it: if you're using the old update page, it'll be a checkbox near the date item labeled "date out of order" or "don't show on entry pages" (I forget if we renamed it there or not), and if you're using the new entry page it's in the Display Date panel and labeled "don't show on reading pages".

If it's always getting stuck on comments, I wonder if the script you're using just doesn't have good comment mapping error handling. Have you taken a look at jbackup.pl that Pau linked to above? Is there anything like that in the script you're using? I've never looked at it, so I don't know how well it handles weirdnesses. I know you've probably hit the point of "I've put this much effort into it, I'm going to make this work goddammit," but just to rule out issues with the script itself: if you try running jbackup.pl does it get stuck in the same place?
solarbird: (Default)

Re: well howdy

[personal profile] solarbird 2023-02-07 11:09 pm (UTC)(link)
"If it's always getting stuck on comments, I wonder if the script you're using just doesn't have good comment mapping error handling."

That's not it. We're expecting C- items mixed in with L- items; this is the one big pull of all item metadata.
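To make the mixed list concrete, here is a small Python sketch of splitting such records by their prefix. The record shape (a dict with an `item` string such as `L-12` or `C-340`) is assumed for illustration, and `split_sync_items` is a hypothetical name, not something from the importer.

```python
from collections import defaultdict


def split_sync_items(records):
    """Group item identifiers by their type prefix (L = entry, C = comment)."""
    grouped = defaultdict(list)
    for record in records:
        # "L-12" -> kind "L", number "12"
        kind, _, number = record["item"].partition("-")
        grouped[kind].append(int(number))
    return dict(grouped)
```

With this, `split_sync_items(records).get("L", [])` gives just the entry ids while the comment ids stay available under `"C"`.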

Also, neither side is getting stuck on a single actual item, the comment identifiers keep changing, as per the debug output I provided in another reply. It's just that they have huge swaths of repeated lastsyncs.

I'm kinda thinking this would be fine at this point if we didn't have the shutdown for repeated pulls of the same lastsync.
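One way a client-side guard could tolerate that is to treat a pull as stalled only when it brings back nothing new, rather than whenever lastsync repeats. The Python sketch below shows the idea under stated assumptions: `fetch_page` is a hypothetical callable wrapping the actual sync call, and each page is assumed to be a dict with `syncitems` (records carrying `item` and `time`), `count`, and `total`, with `count == total` meaning there is nothing further to fetch.

```python
def pull_all_items(fetch_page, max_stalls=3):
    """Page through sync results, giving up only after repeated pulls
    that return no previously unseen items."""
    seen = {}          # item identifier -> latest record for it
    lastsync = None    # None means "start from the beginning"
    stalls = 0
    while True:
        page = fetch_page(lastsync)
        records = page.get("syncitems", [])
        fresh = [r for r in records if r["item"] not in seen]
        for r in records:
            seen[r["item"]] = r
        if not records or page["count"] == page["total"]:
            return seen
        if not fresh:
            # Nothing new arrived, whatever lastsync was: count it as a stall.
            stalls += 1
            if stalls >= max_stalls:
                raise RuntimeError(
                    f"no progress after {stalls} pulls at lastsync={lastsync!r}"
                )
        else:
            stalls = 0
        # Timestamps in "YYYY-MM-DD HH:MM:SS" form sort correctly as strings.
        lastsync = max(r["time"] for r in records)
```

A repeated lastsync that still yields different comment identifiers resets the stall counter here, so only a pull that truly goes nowhere ends the run.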
denise: Image: Me, facing away from camera, on top of the Castel Sant'Angelo in Rome (Default)

Re: well howdy

[staff profile] denise 2023-02-09 10:58 pm (UTC)(link)
Have you been able to try jbackup.pl to see if it gets stuck the same way?
solarbird: (Default)

Re: well howdy

[personal profile] solarbird 2023-02-10 05:36 am (UTC)(link)
I haven't - I got busy finishing another project that went well past anticipated work. I'll give it a go tomorrow (Friday).
denise: Image: Me, facing away from camera, on top of the Castel Sant'Angelo in Rome (Default)

Re: well howdy

[staff profile] denise 2023-02-10 07:20 am (UTC)(link)
I don't necessarily expect it to work but it will help rule out issues with the script!
solarbird: (Default)

Re: well howdy

[personal profile] solarbird 2023-02-11 05:35 am (UTC)(link)
Okay. I haven't been eager to dive in because I know literally nothing about Perl. But I found the needed dependencies (via Debian, which was easier than trying to do it in macOS lol) and it ran to completion. After running it again specifying XML (vs. the default, which I guess means raw, which gave me some chonky binary file), I got a 60MB file which on casual inspection seems like it's probably complete.

Which tells me mostly I think that the database on your end is fine as far as your own tools are concerned (which isn't really surprising, I mean, of course it is).

Sadly, from what I've been able to find, there are no tools to import Dreamwidth XML exports to Wordpress - I'm not the first person to try this, I'm just the first person who hasn't bailed and copied everything over by hand. (At least, that's what I was finding in searches before I started trying to adapt the Livejournal importer to Dreamwidth.) As far as I've been able to find, few people have tried to make this work and so far none have succeeded.

I wonder if there's something different in Wordpress's PHP implementation of the XMLRPC libraries? But now I'm just in "whelp I dunno" territory and am guessing.
denise: Image: Me, facing away from camera, on top of the Castel Sant'Angelo in Rome (Default)

Re: well howdy

[staff profile] denise 2023-02-12 12:33 am (UTC)(link)
That ... is really weird and interesting and I was not expecting that result! Okay, yeah, at this point it could be because the PHP XML/RPC libraries may have bitrotted a bit because it's not a very popular format anymore or Wordpress is using a bad version or something. We're way out past the far, far edge of my knowledge too, but my suggestion would be a) check what version of the library Wordpress is using and see if there's a newer one (and that the newer one doesn't break Wordpress, heh), or b) step the module backwards a few versions and keep trying after each regression to see if an older version of it works. Basically, playing around with the versions and seeing if any of them fix the problem. But I don't know PHP or Wordpress, so this is just me guessing too. (We've had a few instances where we have to pin to a particular version of a Perl module for a while because a newer one breaks something, etc.)