Question thread #118
It's time for another question thread!
The rules:
- You may ask any dev-related question you have in a comment. (It doesn't even need to be about Dreamwidth, although if it involves a language/library/framework/database Dreamwidth doesn't use, you will probably get answers pointing that out and suggesting a better place to ask.)
- You may also answer any question, using the guidelines given in To Answer, Or Not To Answer and in this comment thread.

well howdy
What's good behaviour for remote API calls? What kind of delay or rate per hour is okay? I didn't see anything in the FAQ (apologies if I missed it), and I didn't find anything when doing some searching.
The reason I'm asking is because while I'm not planning to leave, running a Mastodon server has reminded me that it's nice to have my own copies of my stuff, and since we already do a lot of Wordpress, that seemed like a good way. So I've mildly modified their Livejournal importer to import Dreamwidth, and got through about 130 entries before getting an API rate shutdown.
That's fine, I don't mind. But I have a lot of entries and that's gonna be a lot of halts. I'd rather add some delay in the importer to avoid that.
So. What's a good amount of time between, say, authentications - the easiest place to do it in this code - to slow down and wait?
Thanks!
eta: the specific error is "Client error: Client is making repeated requests. Perhaps it's broken?" in case that's different. This code does make a lot of requests.
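A minimal sketch of the "add some delay" idea, in Python rather than the importer's PHP. The `Throttle` class and its `min_interval` parameter are illustrative names, and the interval you pass in is a placeholder to tune, not a documented Dreamwidth limit.

```python
import time


class Throttle:
    """Enforce a minimum gap between consecutive calls."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval  # seconds; a guess to tune, not an official limit
        self._clock = clock
        self._sleep = sleep
        self._last = None

    def wait(self):
        """Sleep just long enough to keep calls min_interval seconds apart."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Calling `throttle.wait()` right before each authentication or request spaces them out without sleeping longer than needed when the work in between was already slow.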
Re: well howdy
There's an example (in perl) of how to do it in https://github.com/dreamwidth/dreamwidth/blob/main/src/jbackup/jbackup.pl.
Re: well howdy
This is mostly just the extant Livejournal to Wordpress import code, so I don't know it very well. But I think the problem is that this code supports retrying after transient failure, and the above is the error it gets when trying to start back up. I think it is requesting the same item twice: it iterates through to find out where it needs to pick back up, and when it finds that point it throws back to the normal import code, which means repeating the value.
But it's a lot of code and I literally started working on this yesterday so I'm not sure. Don't quote me on it. xD
Anyway avoiding that is why I'm trying to find out how much I need to slow this down so I don't get an actual rate block. I've got it really slow right now and did a complete start-over of import and so far it's okay? But I have a lot of entries on my blog here. I don't think the odds of getting through all of them in one pass are high. xD
Re: well howdy
After getting a little over 4100 metadata elements (of over 11,000), Dreamwidth starts returning the same lastsync every time. [ETA: I'm slightly wrong - it is definitely returning mass repeats of lastsync, but if I let it go enough it starts incrementing them by a second or a few. I have an absolute boatload of debug output now.]
If I edit a post made on the day pointed to by the lastsync returned, which lastsync I get changes, but I get no additional lastsync entries.
Is this an API call rate response?
If so, how much do I have to slow down to fix it? Because I am clearly not slowing down enough.
If not, what is causing this?
(The most recent version is 2011-11-22 08:03:50.000000 if that's of use.)
Re: well howdy
I suppose that could also be slow enough to cause an issue, I didn't think about that. I was thinking I was being rate-limited before so I really cranked it down.
It is very weird to me too and I'm pretty lost. :/
Re: well howdy
ETA: I mean to say, repeating the same request is normal and expected, if you're just occasionally checking for new posts or something. It's a problem if you've found yourself doing that in the middle of iterating, and the rate doesn't matter if you're making zero progress.
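To make the zero-progress point concrete, here is a rough Python sketch of a pull loop that never sends the same lastsync twice in a row. `fetch` is a hypothetical stand-in for whatever makes the remote call. It is assumed to return a list of dicts with "item" and "time" keys, and nothing here is confirmed about how the real API treats the lastsync boundary or batch sizes.

```python
def pull_all_items(fetch, max_rounds=1000):
    """Collect sync items, stopping as soon as the cursor stops advancing.

    `fetch(lastsync)` is a hypothetical callable, not a real client API.
    Timestamps are assumed to be "YYYY-MM-DD HH:MM:SS" strings, which
    sort correctly as plain strings.
    """
    seen = {}
    lastsync = None
    for _ in range(max_rounds):
        batch = fetch(lastsync)
        for entry in batch:
            seen.setdefault(entry["item"], entry["time"])
        newest = max((entry["time"] for entry in batch), default=lastsync)
        if newest == lastsync:
            break  # cursor did not move; asking again would repeat the request
        lastsync = newest
    return seen
```

If the loop exits with fewer items than the server says exist, that is a stall to investigate, and slowing down would not change it.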
Re: well howdy
I'm not handing Dreamwidth repeated lastsyncs. That is not a thing I am doing. This was happening, but I have coded around it. This code is fugly but that's not important right now, it works.
It turned out eventually that the reason I was having a problem with repeated lastsyncs is:
Dreamwidth is giving me the same lastsync over and over again.
The importer works as you'd expect for the first 4125 entries, then I start getting the same lastsync from Dreamwidth when trying to iterate through all items.
If it matters, they're also all C- items, not L- items. I'm wondering if this has anything to do with when I imported everything from Livejournal to here, though that's kind of a guess and I don't even remember when I bailed on LJ.
The last L- item I get is 2011-11-22 08:03:50.000000, then it's all C- items with repeated lastsync numbers.
If I let my "let's not trigger repeat detection" code run and start switching up my requests (and that's some stupid code let me tell you but it works) then eventually it starts incrementing lastsync again - by between one and a small number of seconds. They're still all C- items.
If I let _that_ run long enough to try to chew through all the C- items, Dreamwidth stops responding to authentication before I get to the end of total items.
Excerpted from a massive debug log, here's the last L- item, followed by the "C-" flood starting. As you can see, the item numbers keep changing, but the lastsync doesn't.
[124] => Array ( [action] => update [item] => L-467 [time] => 2011-11-22 08:03:50.000000 )
[125] => Array ( [action] => update [item] => C-481 [time] => 2011-11-26 07:04:24.000000 )
[126] => Array ( [time] => 2011-11-26 07:04:24.000000 [item] => C-473 [action] => update )
[127] => Array ( [action] => update [item] => C-489 [time] => 2011-11-26 07:04:24.000000 )
[128] => Array ( [action] => update [time] => 2011-11-26 07:04:24.000000 [item] => C-483 )
[129] => Array ( [item] => C-482 [time] => 2011-11-26 07:04:24.000000 [action] => update )
[130] => Array ( [action] => update [time] => 2011-11-26 07:04:24.000000 [item] => C-474 )
[131] => Array ( [item] => C-491 [time] => 2011-11-26 07:04:25.000000 [action] => update )
[132] => Array ( [time] => 2011-11-26 07:04:25.000000 [item] => C-499 [action] => update )
[133] => Array ( [action] => update [time] => 2011-11-26 07:04:25.000000 [item] => C-498 )
[134] => Array ( [time] => 2011-11-26 07:04:25.000000 [item] => C-492 [action] => update )
[135] => Array ( [item] => C-495 [time] => 2011-11-26 07:04:25.000000 [action] => update )
[136] => Array ( [action] => update [item] => C-496 [time] => 2011-11-26 07:04:25.000000 )
[137] => Array ( [action] => update [time] => 2011-11-26 07:04:25.000000 [item] => C-500 )
[138] => Array ( [action] => update [item] => C-516 [time] => 2011-11-26 07:04:26.000000 )
[139] => Array ( [action] => update [time] => 2011-11-26 07:04:26.000000 [item] => C-515 )
[140] => Array ( [item] => C-518 [time] => 2011-11-26 07:04:26.000000 [action] => update )
[141] => Array ( [action] => update [item] => C-519 [time] => 2011-11-26 07:04:26.000000 )
[142] => Array ( [action] => update [item] => C-514 [time] => 2011-11-26 07:04:26.000000 )
[143] => Array ( [item] => C-503 [time] => 2011-11-26 07:04:26.000000 [action] => update )
[144] => Array ( [item] => C-517 [time] => 2011-11-26 07:04:26.000000 [action] => update )
[145] => Array ( [action] => update [time] => 2011-11-26 07:04:27.000000 [item] => C-520 )
[146] => Array ( [action] => update [item] => C-524 [time] => 2011-11-26 07:04:27.000000 )
[147] => Array ( [action] => update [item] => C-525 [time] => 2011-11-26 07:04:27.000000 )
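A quick way to see the pattern in a dump like the one above is to count items per timestamp. The Python sketch below parses lines in this print_r-style layout regardless of key order. The regular expressions and the `summarize` helper are illustrative and assume the exact line shape shown here.

```python
import re
from collections import Counter

# Patterns match the "[item] => C-481" and "[time] => 2011-11-26 07:04:24" fields.
ITEM_RE = re.compile(r"\[item\] => ([A-Z])-(\d+)")
TIME_RE = re.compile(r"\[time\] => (\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})")


def summarize(lines):
    """Count items per (timestamp, item type), ignoring key order."""
    counts = Counter()
    for line in lines:
        item = ITEM_RE.search(line)
        stamp = TIME_RE.search(line)
        if item and stamp:
            counts[(stamp.group(1), item.group(1))] += 1
    return counts


sample = [
    "[124] => Array ( [action] => update [item] => L-467 [time] => 2011-11-22 08:03:50.000000 )",
    "[125] => Array ( [action] => update [item] => C-481 [time] => 2011-11-26 07:04:24.000000 )",
    "[126] => Array ( [time] => 2011-11-26 07:04:24.000000 [item] => C-473 [action] => update )",
]
for (stamp, kind), n in sorted(summarize(sample).items()):
    print(stamp, kind, n)
```

Run over the full log, this shows how many C- items share each one-second timestamp, which is the number that matters when comparing against the batch size of a single request.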
Re: well howdy
-- We occasionally have issues with the importer where a specific post is broken in some weird way (bad UTF-8, invalid characters, and control sequences that snuck invisibly in are the most common) that the sync API just chokes and gets stuck on. If lastsync is always getting stuck on the same post (or around it if you move posts around), maybe take the contents of that entry and dump them into a text editor like BBEdit that will let you "zap gremlins", then paste the cleaned up text back into the entry body and see if that lets you progress further? (It's also possible that it's choking on a comment to the entry, in which case you'll have to delete it in order to proceed, if it's not one that you made.)
-- I haven't looked at the LJ-to-WP importer, but: is it possible that the stuff it's grabbing with no problem is stuff you imported from LJ and it's choking on stuff you've posted to DW? If it doesn't know how to handle entry properties that we added on DW that LJ doesn't have, that may be why it's dying; if (for instance) the entry it's getting stuck at is one you used Markdown formatting in, that's an entryprop that LJ doesn't have and it might be freaking out at that.
-- Are you pulling comments and entries, or just entries? If you're trying to pull comments, too, and store them with the entries with threading intact, you may need to do a lot of checking the validity of the comment parent/child tree. If the entry it's choking on has any deleted or suspended comments, that may be it. (You wouldn't believe how much error checking we have to do with that in the importer, and it still breaks from time to time and we have to fix things manually.)
-- Is the entry it's choking on (by itemid) one of a long run of entries that are backdated/dated out of order? There's a weird issue I'm only vaguely remembering because there's special-casing for it in the import workflow where if you have too many entries backdated in a row, lastsync will just throw up its hands and die horribly. I think it happens when the number of items you're trying to get from the lastsync point returns no entry that is not backdated -- so like, if you're trying to fetch things 50 items at a time, and you hit a run where there are 51 backdated entries, lastsync gets really badly confused. I think I vaguely remember that the importer uses backdating in some cases, so if you're hitting a run of entries that were imported from LJ, that may be the problem.
Those are the big ones that come to mind; two of them are things that cause problems in the on-site importer all the time, and you might want to take a look at some of the error checking and logging we do. My gut instinct is that the last one I thought of (the backdating thing -- which I thought of when I was looking through the importer code) is most likely to be the culprit, with the "bad characters in the entry it's choking on" in second place and the other two in a distant third.
(If it does turn out to be the backdating thing, the good news is it's fixable; the bad news is it's going to be a pain, because the only real fix is to edit the entries and uncheck the backdated flag.)
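For the "bad characters" case in the list above, a scripted version of the gremlin-zapping step might look like the Python sketch below. It is an illustration of the general idea, not what BBEdit or the on-site importer does. Both function names are made up, and it only strips Unicode control characters (category Cc) while keeping tabs and line breaks.

```python
import unicodedata

KEEP = {"\t", "\n", "\r"}


def decode_leniently(raw):
    """Decode bytes as UTF-8, replacing invalid sequences with U+FFFD."""
    return raw.decode("utf-8", errors="replace")


def zap_gremlins(text):
    """Drop control characters (Unicode category Cc) except tab, LF, and CR."""
    return "".join(
        ch for ch in text
        if ch in KEEP or unicodedata.category(ch) != "Cc"
    )
```

Searching the decoded text for U+FFFD afterwards is a cheap way to find which entries had invalid bytes in the first place.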
Re: well howdy
In order:
-- I've found a few entries with bad UTF-8 and fixed those. That solved a couple of problems I was having early on with this project, but I've run out of them.
-- It's not even looking at the details where it's failing; it's not choking on unknown entryprops. Where it's failing, it's just trying to build the items list.
-- No comment pickup yet except in that we're getting the big list of all items. If I have to throw out comments, so be it.
-- None of the bad dates point directly at an item - it's always a comment. But if I edit the entry that has that comment, the item lastsync gets stuck on changes. After I did that several times, I decided/realised it was systemic and that wasn't much of a solution. :/
-- I am fetching a lot more than 50 entries at a time. This feels dumb, but I'm not seeing a way to tell whether an entry is backdated. But at least some of the entries in this timeframe were echoed from my old band blog, which would've been direct - I set that up originally to use JournalPress to crosspost to here (as a normal post, with comments enabled), and then use Dreamwidth to crosspost further back to Livejournal with comments (there) disabled and a pointer back here. And I'm pretty sure JournalPress doesn't mark posts backdated, as I wanted them to appear in friendlists. (And posts from my new let's-amalgamate-everything personal blog absolutely do, I'm getting comments here on them.)
Re: well howdy
You can tell if an entry is backdated by editing it: if you're using the old update page, it'll be a checkbox near the date item labeled "date out of order" or "don't show on entry pages" (I forget if we renamed it there or not), and if you're using the new entry page it's in the Display Date panel and labeled "don't show on reading pages".
If it's always getting stuck on comments, I wonder if the script you're using just doesn't have good comment mapping error handling. Have you taken a look at jbackup.pl that Pau linked to above? Is there anything like that in the script you're using? I've never looked at it, so I don't know how well it handles weirdnesses. I know you've probably hit the point of "I've put this much effort into it, I'm going to make this work goddammit," but just to rule out issues with the script itself: if you try running jbackup.pl does it get stuck in the same place?
Re: well howdy
That's not it. We're expecting C- items mixed in with L- items; this is the one big pull of all item metadata.
Also, neither side is getting stuck on a single actual item, the comment identifiers keep changing, as per the debug output I provided in another reply. It's just that they have huge swaths of repeated lastsyncs.
I'm kinda thinking this would be fine at this point if we didn't have the shutdown for repeated pulls of the same lastsync.
Re: well howdy
Which tells me mostly I think that the database on your end is fine as far as your own tools are concerned (which isn't really surprising, I mean, of course it is).
Sadly, from what I've been able to find, there are not tools to import Dreamwidth XML exports to Wordpress - I'm not the first person to try this, I'm just the first person who hasn't bailed and copied everything over by hand. (At least, that's what I was finding in searches before I started trying to adapt the Livejournal importer to Dreamwidth.) As far as I've been able to find, few people have tried to make this work and so far none have succeeded.
I wonder if there's something different in Wordpress's PHP implementation of the XMLRPC libraries? But now I'm just in "whelp I dunno" territory and am guessing.
Re: well howdy
I did it
I did it by taking the .xml Dreamwidth will hand me and processing that and my code is a complete - I must stress, complete - trashfire, but by god it works.
21 years of lj+dw+my old band blog all imported. 7261 messages and their comments.
Wordpress did not enjoy this experience. We both did some things it will regret, and I learned a bunch about resource allocation in php running under a webserver.
But eventually, it worked. Not as perfectly as I'd like but 99%. Thanks for the help, team - particularly
Re: well howdy
Along the way I did start to see... they did some extraordinarily weird things in their comment processing? I mean, what am I looking at?! things?
And I'm wondering now if that's about resource control, because to bring in all my utter nonsense my way I had to up a whole bunch of limits - [wp-home]/wp-includes/functions.php needed to be given a five minute runtime max, everything else I bumped up to 90 seconds (defaults are 30 seconds in all cases); I lifted PHP's ability to allocate ram to an entire gig (though that was almost certainly excessive, it really topped out under 250mb but I was done screwing around with incrementalism lol), and I threw MariaDB's max packet size up also to 1G after reading about that being necessary in other large-database-import environments.
In short, this thing is a complete pig. But it runs - or, well, lumbers to completion - and that's what really matters. xD
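As an aside on the memory side of this, one common alternative to raising limits is to stream a large XML file instead of loading it whole. The Python sketch below shows that technique with the standard library. It is not what the importer described here does, and the `entry` tag and its child element names are placeholders, since the actual export layout isn't shown in this thread.

```python
import io
import xml.etree.ElementTree as ET


def iter_entries(source, tag="entry"):
    """Yield one dict per <tag> element, clearing each element after use.

    `tag` and the child names are assumptions about the file layout.
    """
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag:
            yield {child.tag: (child.text or "") for child in elem}
            elem.clear()  # release the element's children and text


demo = io.BytesIO(
    b"<export><entry><subject>a</subject><body>x</body></entry>"
    b"<entry><subject>b</subject><body/></entry></export>"
)
for row in iter_entries(demo):
    print(row)
```

Processing one entry at a time like this keeps peak memory closer to the size of the largest single entry than to the size of the whole export.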
Re: well howdy