pauamma: Cartooney crab wearing hot pink and acid green facemask holding drink with straw (Default)
Res facta quae tamen fingi potuit ([personal profile] pauamma) wrote in [site community profile] dw_dev 2023-02-01 01:57 am

Question thread #118

It's time for another question thread!

The rules:

- You may ask any dev-related question you have in a comment. (It doesn't even need to be about Dreamwidth, although if it involves a language/library/framework/database Dreamwidth doesn't use, you will probably get answers pointing that out and suggesting a better place to ask.)
- You may also answer any question, using the guidelines given in To Answer, Or Not To Answer and in this comment thread.
solarbird: (Default)

well howdy

[personal profile] solarbird 2023-02-04 08:31 am (UTC)(link)
Hi, been a while, threw you some code a few years ago, never left the site but haven't been doing much code lately. Been here for ages, no plans to leave.

What's good behaviour for remote API calls? What kind of delay or rate per hour is okay? I didn't see anything in the FAQ (apologies if I missed it), and nothing turned up when I did some searching.

The reason I'm asking is because while I'm not planning to leave, running a Mastodon server has reminded me that it's nice to have my own copies of my stuff, and since we already do a lot of Wordpress, that seemed like a good way. So I've mildly modified their Livejournal importer to import Dreamwidth, and got through about 130 entries before getting an API rate shutdown.

That's fine, I don't mind. But I have a lot of entries and that's gonna be a lot of halts. I'd rather add some delay in the importer to avoid that.

So. What's a good amount of time between, say, authentications - the easiest place to do it in this code - to slow down and wait?

Thanks!

eta: the specific error is "Client error: Client is making repeated requests. Perhaps it's broken?" in case that's different. This code does make a lot of requests.
Edited 2023-02-04 08:34 (UTC)

Re: well howdy

[personal profile] solarbird 2023-02-04 09:04 pm (UTC)(link)
Is that lastsync?

This is mostly just the extant Livejournal to Wordpress import code so I don't know it very well. But I think the problem is that this code supports retrying after transient failure, and the above is the error it gets when trying to start back up. I think it is requesting the same item twice: it iterates through to find out where it needs to pick back up, and when it finds that point it throws back to the normal import code, which means repeating the value.

But it's a lot of code and I literally started working on this yesterday so I'm not sure. Don't quote me on it. xD

Anyway avoiding that is why I'm trying to find out how much I need to slow this down so I don't get an actual rate block. I've got it really slow right now and did a complete start-over of import and so far it's okay? But I have a lot of entries on my blog here. I don't think the odds of getting through all of them in one pass are high. xD

Re: well howdy

[personal profile] solarbird 2023-02-06 05:30 am (UTC)(link)
I have done a lot of work on this thing at this point and here's a thing that happens:

After getting a little over 4100 metadata elements (of over 11,000), Dreamwidth starts returning the same lastsync every time. [ETA: I'm slightly wrong - it is definitely returning mass repeats of lastsync, but if I let it go long enough it starts incrementing them by a second or a few. I have an absolute boatload of debug output now.]

If I edit a post made on the day pointed to by the lastsync returned, which lastsync I get changes, but I get no additional lastsync entries.

Is this an API call rate response?

If so, how much do I have to slow down to fix it? Because I am clearly not slowing down enough.

If not, what is causing this?

(The most recent version is 2011-11-22 08:03:50.000000 if that's of use.)
Edited 2023-02-06 07:09 (UTC)

Re: well howdy

[personal profile] solarbird 2023-02-06 09:21 am (UTC)(link)
Right now I have it slowed down pretty far because I'm pausing 2 seconds every authentication and I'm fine with that. The metadata elements are sent back in a big package and I'd realised that before writing that last post so I think I'm actually fine, authentication happens every batch so it's like... only a couple of calls per second at most.

I suppose that could also be slow enough to cause an issue, I didn't think about that. I was thinking I was being rate-limited before so I really cranked it down.

It is very weird to me too and I'm pretty lost. :/
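
The two-second pause before each authentication described above could be sketched roughly like this. It is in Python rather than the importer's PHP, and `Throttle` is a hypothetical helper, not part of the Wordpress importer or of any Dreamwidth client. The clock and sleep functions are injectable only so the behaviour can be checked without actually waiting.

```python
import time


class Throttle:
    """Enforce a minimum gap between calls (hypothetical helper, not importer code)."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call; None before the first one

    def wait(self):
        """Sleep just long enough that calls are at least min_interval apart.

        Returns the number of seconds slept (0.0 if no sleep was needed).
        """
        now = self._clock()
        slept = 0.0
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                slept = remaining
                now = self._clock()
        self._last = now
        return slept
```

Calling `Throttle(2.0).wait()` right before each authentication would give the spacing mentioned here. As comes up elsewhere in the thread, spacing alone does not help if the same request is being repeated.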
denise: Image: Me, facing away from camera, on top of the Castel Sant'Angelo in Rome (Default)

Re: well howdy

[staff profile] denise 2023-02-07 09:28 pm (UTC)(link)
I'm also kind of stumped, but here's me thinking out loud about some reasons why lastsync might be getting stuck:

-- We occasionally have issues with the importer where a specific post is broken in some weird way (bad UTF-8, invalid characters, and control sequences that snuck invisibly in are the most common) that the sync API just chokes and gets stuck on. If lastsync is always getting stuck on the same post (or around it if you move posts around), maybe take the contents of that entry and dump them into a text editor like BBEdit that will let you "zap gremlins", then paste the cleaned up text back into the entry body and see if that lets you progress further? (It's also possible that it's choking on a comment to the entry, in which case you'll have to delete it in order to proceed, if it's not one that you made.)

-- I haven't looked at the LJ-to-WP importer, but: is it possible that the stuff it's grabbing with no problem is stuff you imported from LJ and it's choking on stuff you've posted to DW? If it doesn't know how to handle entry properties that we added on DW that LJ doesn't have, that may be why it's dying; if (for instance) the entry it's getting stuck at is one you used Markdown formatting in, that's an entryprop that LJ doesn't have and it might be freaking out at that.

-- Are you pulling comments and entries, or just entries? If you're trying to pull comments, too, and store them with the entries with threading intact, you may need to do a lot of checking the validity of the comment parent/child tree. If the entry it's choking on has any deleted or suspended comments, that may be it. (You wouldn't believe how much error checking we have to do with that in the importer, and it still breaks from time to time and we have to fix things manually.)

-- Is the entry it's choking on (by itemid) one of a long run of entries that are backdated/dated out of order? There's a weird issue I'm only vaguely remembering because there's special-casing for it in the import workflow where if you have too many entries backdated in a row, lastsync will just throw up its hands and die horribly. I think it happens when the number of items you're trying to get from the lastsync point returns no entry that is not backdated -- so like, if you're trying to fetch things 50 items at a time, and you hit a run where there are 51 backdated entries, lastsync gets really badly confused. I think I vaguely remember that the importer uses backdating in some cases, so if you're hitting a run of entries that were imported from LJ, that may be the problem.
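
For the first item in the list above (bad UTF-8 and invisible control characters), a rough Python sketch of a "zap gremlins" pass might look like the following. This is not what BBEdit or the Dreamwidth importer does internally. The function names are made up for illustration, and the choice to flag Unicode categories Cc and Cf is an assumption. Cf also covers characters such as the zero-width joiner used in some emoji, so it is worth reviewing what gets flagged before stripping anything.

```python
import unicodedata

# Control characters that are normal in entry text and should not be flagged.
ALLOWED_CONTROLS = {"\n", "\r", "\t"}


def find_gremlins(raw: bytes):
    """Return (problems, text): a list of (offset, description) and the decoded text.

    Invalid UTF-8 bytes are decoded to U+FFFD so they show up in the scan.
    """
    text = raw.decode("utf-8", errors="replace")
    problems = []
    for offset, ch in enumerate(text):
        if ch == "\ufffd":
            problems.append((offset, "invalid UTF-8 (or a literal U+FFFD)"))
        elif unicodedata.category(ch) in ("Cc", "Cf") and ch not in ALLOWED_CONTROLS:
            problems.append((offset, f"invisible character U+{ord(ch):04X}"))
    return problems, text


def zap_gremlins(raw: bytes) -> str:
    """Return the decoded text with every flagged character removed."""
    problems, text = find_gremlins(raw)
    bad_offsets = {offset for offset, _ in problems}
    return "".join(ch for i, ch in enumerate(text) if i not in bad_offsets)
```

Running `find_gremlins` on the raw bytes of the entry that lastsync gets stuck near would at least show whether there is anything invisible in it before editing the entry by hand.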

Those are the big ones that come to mind; two of them are things that cause problems in the on-site importer all the time, and you might want to take a look at some of the error checking and logging we do. My gut instinct is that the last one I thought of (the backdating thing -- which I thought of when I was looking through the importer code) is most likely to be the culprit, with the "bad characters in the entry it's choking on" in second place and the other two in a distant third.

(If it does turn out to be the backdating thing, the good news is it's fixable; the bad news is it's going to be a pain, because the only real fix is to edit the entries and uncheck the backdated flag.)
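
To check the backdating theory before editing entries one by one, something like this Python sketch could flag runs of backdated entries that are at least as long as the fetch batch. The `"backdated"` key is a hypothetical flag on already-fetched entry records, not a field name taken from the Dreamwidth API. The `>=` threshold is one reading of the "no entry that is not backdated" description above, not a confirmed rule.

```python
def longest_backdated_run(entries):
    """Length of the longest consecutive run of backdated entries.

    `entries` is an iterable of dicts in sync order; the "backdated" key is a
    hypothetical flag, not a field name taken from the Dreamwidth API.
    """
    longest = current = 0
    for entry in entries:
        if entry.get("backdated"):
            current += 1
            longest = max(longest, current)
        else:
            current = 0
    return longest


def batch_may_stall(entries, batch_size):
    """True if some batch of `batch_size` entries could be entirely backdated."""
    return longest_backdated_run(entries) >= batch_size
```

With a batch size of 50 and a run of 51 backdated entries, as in the example above, `batch_may_stall` returns True.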

Re: well howdy

[personal profile] solarbird 2023-02-07 09:48 pm (UTC)(link)
Hey, Denise ^_^ -

In order:

-- I've found a few entries with bad UTF-8 and fixed those. That solved a couple of problems I was having early on with this project, but I've run out of them.

-- It's not even looking at the details where it's failing; it's not choking on unknown entryprops. Where it's failing, it's just trying to build the items list.

-- No comment pickup yet except in that we're getting the big list of all items. If I have to throw out comments, so be it.

-- None of the bad dates point directly at an item - it's always a comment. But if I edit the entry that has that comment, the item lastsync gets stuck on changes. After I did that several times, I decided/realised it was systemic and that wasn't much of a solution. :/

-- I am fetching a lot more than 50 entries at a time. This feels dumb, but I'm not seeing a way to tell whether an entry is backdated. At least some of the entries in this timeframe were echoed from my old band blog, which would've been direct - I set that up originally to use JournalPress to crosspost to here (as a normal post, with comments enabled), and then use Dreamwidth to crosspost further back to Livejournal with comments (there) disabled and a pointer back here. And I'm pretty sure JournalPress doesn't mark posts backdated, as I wanted them to appear in friendlists. (And posts from my new let's-amalgamate-everything personal blog absolutely do, I'm getting comments here on them.)

Re: well howdy

[staff profile] denise 2023-02-07 09:55 pm (UTC)(link)
hmmmmmmm. Yeah, we're way out past the edges of my knowledge here (I'm just working from what we've seen before in the importer).

You can tell if an entry is backdated by editing it: if you're using the old update page, it'll be a checkbox near the date item labeled "date out of order" or "don't show on entry pages" (I forget if we renamed it there or not), and if you're using the new entry page it's in the Display Date panel and labeled "don't show on reading pages".

If it's always getting stuck on comments, I wonder if the script you're using just doesn't have good comment mapping error handling. Have you taken a look at jbackup.pl that Pau linked to above? Is there anything like that in the script you're using? I've never looked at it, so I don't know how well it handles weirdnesses. I know you've probably hit the point of "I've put this much effort into it, I'm going to make this work goddammit," but just to rule out issues with the script itself: if you try running jbackup.pl does it get stuck in the same place?

Re: well howdy

[personal profile] solarbird 2023-02-07 11:09 pm (UTC)(link)
"If it's always getting stuck on comments, I wonder if the script you're using just doesn't have good comment mapping error handling."

That's not it. We're expecting C- items mixed in with L- items; this is the one big pull of all item metadata.

Also, neither side is getting stuck on a single actual item, the comment identifiers keep changing, as per the debug output I provided in another reply. It's just that they have huge swaths of repeated lastsyncs.

I'm kinda thinking this would be fine at this point if we didn't have the shutdown for repeated pulls of the same lastsync.

Re: well howdy

[staff profile] denise 2023-02-09 10:58 pm (UTC)(link)
Have you been able to try jbackup.pl to see if it gets stuck the same way?

Re: well howdy

[personal profile] solarbird 2023-02-10 05:36 am (UTC)(link)
I haven't - I got busy finishing another project that went well past anticipated work. I'll give it a go tomorrow (Friday).

Re: well howdy

[staff profile] denise 2023-02-10 07:20 am (UTC)(link)
I don't necessarily expect it to work but it will help rule out issues with the script!

Re: well howdy

[personal profile] solarbird 2023-02-11 05:35 am (UTC)(link)
Okay. I haven't been eager to dive in because I know literally nothing about Perl. But I found the needed dependencies (via Debian, which was easier than trying to do it in macOS lol) and it ran to completion. After running it again specifying XML (vs. the default, which I guess means raw, and which gave me some chonky binary file), I got a 60MB file which on casual inspection seems like it's probably complete.

Which tells me mostly I think that the database on your end is fine as far as your own tools are concerned (which isn't really surprising, I mean, of course it is).

Sadly, from what I've been able to find, there are not tools to import Dreamwidth XML exports to Wordpress - I'm not the first person to try this, I'm just the first person who hasn't bailed and copied everything over by hand. (At least, that's what I was finding in searches before I started trying to adapt the Livejournal importer to Dreamwidth.) As far as I've been able to find, few people have tried to make this work and so far none have succeeded.

I wonder if there's something different in Wordpress's PHP implementation of the XMLRPC libraries? But now I'm just in "whelp I dunno" territory and am guessing.

Re: well howdy

[staff profile] denise 2023-02-12 12:33 am (UTC)(link)
That ... is really weird and interesting and I was not expecting that result! Okay, yeah, at this point it could be because the PHP XML/RPC libraries may have bitrotted a bit because it's not a very popular format anymore or Wordpress is using a bad version or something. We're way out past the far, far edge of my knowledge too, but my suggestion would be a) check what version of the library Wordpress is using and see if there's a newer one (and that the newer one doesn't break Wordpress, heh), or b) step the module backwards a few versions and keep trying after each regression to see if an older version of it works. Basically, playing around with the versions and seeing if any of them fix the problem. But I don't know PHP or Wordpress, so this is just me guessing too. (We've had a few instances where we have to pin to a particular version of a Perl module for a while because a newer one breaks something, etc.)
alierak: (Default)

Re: well howdy

[personal profile] alierak 2023-02-07 07:50 pm (UTC)(link)
Requesting syncitems, for the same journal, with the same lastsync, 3 times in an hour will get you the "repeated requests" error. There is no point in slowing down if you're repeating the same request. As to why you'd be stuck at a particular lastsync value (other than the most recent), I'm not really sure.

ETA: I mean to say, repeating the same request is normal and expected, if you're just occasionally checking for new posts or something. It's a problem if you've found yourself doing that in the middle of iterating, and the rate doesn't matter if you're making zero progress.
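
A client could keep its own count of how often it has sent a given lastsync, so it stops before the server does. The following is a minimal Python sketch under stated assumptions: `RepeatGuard` is a hypothetical helper, the "3 times in an hour" figure is taken from the description above, and how exactly the server counts (per request type, per account, sliding or fixed hour) is not something this thread confirms.

```python
import time
from collections import defaultdict, deque


class RepeatGuard:
    """Client-side check for repeated (journal, lastsync) requests.

    The limit of 3 per hour comes from the comment above; where exactly the
    server draws the line is an assumption here.
    """

    def __init__(self, limit=3, window=3600.0, clock=time.monotonic):
        self.limit = limit
        self.window = window
        self._clock = clock
        self._seen = defaultdict(deque)  # (journal, lastsync) -> request times

    def allow(self, journal, lastsync):
        """Return True and record the request if it stays under the limit.

        Returns False, without recording anything, if sending this request
        would make it the limit-th identical one inside the window.
        """
        now = self._clock()
        times = self._seen[(journal, lastsync)]
        # Drop requests that have aged out of the window.
        while times and now - times[0] >= self.window:
            times.popleft()
        if len(times) + 1 >= self.limit:
            return False
        times.append(now)
        return True
```

If `allow()` returns False in the middle of an iteration, that is a sign the loop has stopped making progress, which is the case where slowing down does not help.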
Edited 2023-02-07 19:58 (UTC)

Re: well howdy

[personal profile] solarbird 2023-02-07 09:10 pm (UTC)(link)
For context, what I'm doing is taking the existing official Wordpress-maintained Livejournal importer and making it work against Dreamwidth. I thought this would be like four lines of code changes to point to proper servers. I was very wrong.

I'm not handing Dreamwidth repeated lastsyncs. That is not a thing I am doing. This was happening, but I have coded around it. This code is fugly but that's not important right now, it works.

It turned out eventually that the reason I was having a problem with repeated lastsyncs is:

Dreamwidth is giving me the same lastsync over and over again.

The importer works as you'd expect for the first 4125 entries, then I start getting the same lastsync from Dreamwidth when trying to iterate through all items.

If it matters, they're also all C- items, not L- items. I'm wondering if this has anything to do with when I imported everything from Livejournal to here, though that's kind of a guess and I don't even remember when I bailed on LJ.

The last L- item I get is 2011-11-22 08:03:50.000000, then it's all C- items with repeated lastsync numbers.

If I let my "let's not trigger repeat detection" code run and start switching up my requests on it (and that's some stupid code let me tell you but it works) then eventually it starts incrementing lastsync again - by between one and a small number of seconds. They're still all C- items.

If I let _that_ run long enough to try to chew through all the C- items, Dreamwidth stops responding to authentication before I get to the end of total items.

Excerpted from a massive debug log, here's the last L- item, followed by the "C-" flood starting. As you can see, the item numbers keep changing, but the lastsync doesn't.

[124] => Array ( [action] => update [item] => L-467 [time] => 2011-11-22 08:03:50.000000 )
[125] => Array ( [action] => update [item] => C-481 [time] => 2011-11-26 07:04:24.000000 )
[126] => Array ( [time] => 2011-11-26 07:04:24.000000 [item] => C-473 [action] => update )
[127] => Array ( [action] => update [item] => C-489 [time] => 2011-11-26 07:04:24.000000 )
[128] => Array ( [action] => update [time] => 2011-11-26 07:04:24.000000 [item] => C-483 )
[129] => Array ( [item] => C-482 [time] => 2011-11-26 07:04:24.000000 [action] => update )
[130] => Array ( [action] => update [time] => 2011-11-26 07:04:24.000000 [item] => C-474 )
[131] => Array ( [item] => C-491 [time] => 2011-11-26 07:04:25.000000 [action] => update )
[132] => Array ( [time] => 2011-11-26 07:04:25.000000 [item] => C-499 [action] => update )
[133] => Array ( [action] => update [time] => 2011-11-26 07:04:25.000000 [item] => C-498 )
[134] => Array ( [time] => 2011-11-26 07:04:25.000000 [item] => C-492 [action] => update )
[135] => Array ( [item] => C-495 [time] => 2011-11-26 07:04:25.000000 [action] => update )
[136] => Array ( [action] => update [item] => C-496 [time] => 2011-11-26 07:04:25.000000 )
[137] => Array ( [action] => update [time] => 2011-11-26 07:04:25.000000 [item] => C-500 )
[138] => Array ( [action] => update [item] => C-516 [time] => 2011-11-26 07:04:26.000000 )
[139] => Array ( [action] => update [time] => 2011-11-26 07:04:26.000000 [item] => C-515 )
[140] => Array ( [item] => C-518 [time] => 2011-11-26 07:04:26.000000 [action] => update )
[141] => Array ( [action] => update [item] => C-519 [time] => 2011-11-26 07:04:26.000000 )
[142] => Array ( [action] => update [item] => C-514 [time] => 2011-11-26 07:04:26.000000 )
[143] => Array ( [item] => C-503 [time] => 2011-11-26 07:04:26.000000 [action] => update )
[144] => Array ( [item] => C-517 [time] => 2011-11-26 07:04:26.000000 [action] => update )
[145] => Array ( [action] => update [time] => 2011-11-26 07:04:27.000000 [item] => C-520 )
[146] => Array ( [action] => update [item] => C-524 [time] => 2011-11-26 07:04:27.000000 )
[147] => Array ( [action] => update [item] => C-525 [time] => 2011-11-26 07:04:27.000000 )
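
Debug output like the excerpt above can be summarized with a short Python sketch that counts items per timestamp, which makes runs of identical lastsync values easy to spot. The function names are hypothetical. Taking the newest timestamp in a batch as the next lastsync is an assumption about how a client iterates, not something confirmed in this thread.

```python
import re
from collections import Counter

# The keys appear in varying order in the dump, so each one is matched separately.
ITEM_RE = re.compile(r"\[item\] => (\S+)")
TIME_RE = re.compile(r"\[time\] => (\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})")


def items_per_timestamp(debug_lines):
    """Count items per timestamp in print_r-style lines, whatever the key order."""
    counts = Counter()
    for line in debug_lines:
        item = ITEM_RE.search(line)
        stamp = TIME_RE.search(line)
        if item and stamp:
            counts[stamp.group(1)] += 1
    return counts


def next_lastsync(debug_lines):
    """Newest timestamp in the batch, or None for an empty batch."""
    stamps = [m.group(1) for m in map(TIME_RE.search, debug_lines) if m]
    # "YYYY-MM-DD HH:MM:SS" strings sort chronologically as plain text.
    return max(stamps) if stamps else None
```

For the excerpt above, the comment timestamps 07:04:24 through 07:04:27 carry 6, 7, 7 and 3 C- items respectively, so many distinct items share each one-second value.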

Re: well howdy

[personal profile] solarbird 2023-02-21 01:01 am (UTC)(link)
I don't know if anyone is still following this nonsense but

I did it

I did it by taking the .xml Dreamwidth will hand me and processing that and my code is a complete - I must stress, complete - trashfire, but by god it works.

21 years of lj+dw+my old band blog all imported. 7261 messages and their comments.

Wordpress did not enjoy this experience. We both did some things it will regret, and I learned a bunch about resource allocation in php running under a webserver.

But eventually, it worked. Not as perfectly as I'd like but 99%. Thanks for the help, team - particularly [staff profile] denise. :D

Re: well howdy

[staff profile] denise 2023-02-21 05:13 am (UTC)(link)
Oh YAY I was hoping you'd be able to get it! Was it the XML/RPC libraries?

Re: well howdy

[personal profile] solarbird 2023-02-21 05:38 am (UTC)(link)
Well, once I realised I could just get a big XML chunk I poked at the old approach a little bit more and then just said to hell with it, I'm using the XML I can just, you know, have, and scraped together an importer for that.

Along the way I did start to see... they did some extraordinarily weird things in their comment processing? I mean, what am I looking at?! things?

And I'm wondering now if that's about resource control, because to bring in all my utter nonsense my way I had to up a whole bunch of limits - [wp-home]/wp-includes/functions.php needed to be given a five minute runtime max, everything else I bumped up to 90 seconds (defaults are 30 seconds in all cases); I lifted PHP's ability to allocate ram to an entire gig (though that was almost certainly excessive, it really topped out under 250mb but I was done screwing around with incrementalism lol), and I threw MariaDB's max packet size up also to 1G after reading about that being necessary in other large-database-import environments.
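
On the memory side, a different technique from raising limits is to stream the XML export entry by entry instead of loading it whole. The sketch below is in Python, not the PHP the importer actually used. The `entry`, `subject` and `event` element names are hypothetical placeholders, not the actual layout of the XML that jbackup.pl writes.

```python
import io
import xml.etree.ElementTree as ET


def iter_entries(source, tag="entry"):
    """Yield one dict per <entry> element without building the full tree first.

    The element and child names are hypothetical placeholders, not the actual
    layout of the XML that jbackup.pl writes.
    """
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag:
            yield {child.tag: (child.text or "") for child in elem}
            # Drop this entry's children and text once it has been handed off;
            # the emptied element itself stays attached to the root.
            elem.clear()


if __name__ == "__main__":
    sample = io.BytesIO(b"<journal><entry><subject>hi</subject></entry></journal>")
    for entry in iter_entries(sample):
        print(entry)
```

Whether this would have avoided the limit bumps described above is a guess, since the Wordpress side of the import has its own costs.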

In short, this thing is a complete pig. But it runs - or, well, lumbers to completion - and that's what really matters. xD

Re: well howdy

[staff profile] denise 2023-02-22 02:53 am (UTC)(link)
ooooooof, yeah, running into limits might also have been the case! Anyway, either way, I'm glad you figured it out and got it all working!