Nick Eff ([personal profile] roadrunnertwice) wrote in [site community profile] dw_dev, 2020-06-05 02:27 pm

Named Markup Formats

I've been working on a thing, and I could use some feedback on the implementation. It might take a little explaining (because there's a fair amount of backstory), but I'll try and be as concise as possible.

Backstory: Raw and transformed text

DW stores the text of entries and comments raw, exactly as the user entered it. Then, whenever we need to display that text, we transform it to produce nice legible HTML. Those transformations include:

  • Turning <user> tags (which aren't real HTML) into [profile] user, which is really like a span plus an image plus a link.
  • Handling <cut> tags.
  • Several other things, etc., not important right now.

Most of those get applied to everything we display. But there are also some OPTIONAL transformations:

  • Turning normal line breaks into HTML <br> tags.
  • Turning bare URLs into clickable links.

Those get applied by default, but we've always had a "don't autoformat" checkbox (inherited from LJ) that could be used to disable them for an entry or comment. (BTW, under the hood the RTE saves entries as "don't autoformat.")

Then, later, DW added some other optional transforms, which had their own special enabling conditions:

  • Turning Markdown into HTML. (For entries that start with a special !markdown glyph, or comments submitted by email.)
  • Turning @mentions into user tags. (Currently applies to everything except "don't autoformat," but gets suppressed within certain HTML elements or their Markdown equivalents.)

ALL of these transforms get handled by something called the "html cleaner," at LJ::CleanHTML. At this point "cleaner" is kind of a misnomer; in actual fact, it's the central place where we handle all transformations of raw user-entered text into a fragment of display HTML.
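To make the "optional transforms" idea concrete, here's a minimal Python sketch of the two original ones plus the "don't autoformat" switch. This is only an illustration: the real cleaner is Perl and does far more, and the function names and the URL pattern below are assumptions invented for this example. It also skips the always-on transforms like <user> and <cut> entirely.

```python
import re

# Hypothetical, deliberately simple pattern for a "bare" URL: an http(s) URL
# that is not already inside an attribute value or directly after a tag.
BARE_URL = re.compile(r'(?<!["\'=>])\bhttps?://[^\s<>"]+')


def autolink(text: str) -> str:
    """Wrap bare URLs in anchor tags."""
    return BARE_URL.sub(lambda m: f'<a href="{m.group(0)}">{m.group(0)}</a>', text)


def linebreaks_to_br(text: str) -> str:
    """Turn each newline into an HTML line break."""
    return text.replace("\n", "<br>\n")


def render(raw: str, dont_autoformat: bool = False) -> str:
    """Apply the optional transforms unless the author opted out."""
    if dont_autoformat:
        return raw
    return linebreaks_to_br(autolink(raw))
```

The stored text never changes; only the output of something like `render` does, which is why changing a transform later affects every old post at once.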

The problems

In my understanding, the current state of affairs has two main problems:

  • The interface for choosing text transforms is incoherent. That happened gradually; we've added new transformations over time, and changed the interactions between them, and now it's weird:
    • Half of the interfaces for entering entry/comment text don't even include the "don't autoformat" checkbox anymore.
    • The way you enable Markdown has always been a mystery. For example, I want to use Markdown in comments (because typing HTML angle brackets on a telephone is bullshit), but currently it's impossible except when responding via email.
  • Introducing new text transforms is dangerous and chaotic. In mid-2019, we enabled @mentions for HTML-formatted content (previously they only worked in Markdown content), and about 40% of hell broke loose:
    • Current content suffered because we didn't have a good way to beta-test @mentions, so we didn't have a chance to learn about bugs and edge cases from our users (who are more ingenious at doing weird textual shit than we are) before enabling them globally.
    • Old posts weren't written to expect @mentions, so we ended up totally vandalizing any historical post that ever discussed CSS code (@media (min-width: etc...)), Perl or Ruby or Objective-C code, or a wide variety of other things that involve @ signs.
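Here's a small Python sketch of why old code-heavy posts got mangled. The regex is a deliberately naive assumption (the real @mention handling has suppression rules and is not this simple), and the exact shape of the emitted user tag is also assumed for illustration.

```python
import re

# Hypothetical naive rule: any "@" followed by word characters is a mention.
NAIVE_MENTION = re.compile(r"@(\w+)")


def naive_mentions(text: str) -> str:
    """Rewrite every @word as a user tag, with no context checks at all."""
    return NAIVE_MENTION.sub(r'<user name="\1">', text)


css = "@media (min-width: 600px) { body { margin: 0 } }"
perl = "my @items = (1, 2);"
print(naive_mentions(css))
print(naive_mentions(perl))
```

Both inputs come out with a user tag where the author meant a CSS at-rule or a Perl array, and nothing in the stored post says which behavior the author expected.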

Questions: Does anyone disagree with those two problems or my characterization of them? Does anyone see any closely related problems that I'm not recognizing here?

The solution, maybe

I've got a pull request up for discussion right now that tries to solve these immediate problems, as well as several other problems I can see coming down the road at us.

In brief (lol, sorry, I'm trying!!), here's what I'm proposing we do:

  • Instead of a set of disconnected text transform options that can be independently twiddled, we codify a set of named markup formats. Each entry or comment will specify exactly one format.
    • To start with, this will include all of the valid combinations of the current formatting options, which ends up being fewer than you might think:
      • markdown0 - Markdown via the Text::Markdown module, plus our own @mentions.
      • html_casual1 - Classic HTML-but-it-respects-your-linebreaks, plus @mentions.
      • html_casual0 - Like casual 1, except without @mentions. Old content was written to expect this, but we don't respect that anymore. Imported and syndicated content still uses this, though.
      • html_raw0 - "Don't autoformat." No @mentions.
      • html_extra_raw - Syndicated content that we know will never use special DW tags like <user> or <cut>.
    • (For existing posts that don't specify a format, we use the existing independent options to guess a format. We also take the post date into account, so we can restore casual 0 on posts that were originally written in it.)
  • When formatting text for display, the specified format determines all of the transforms we use.
    • No more independent twiddlies. The format captures the entirety of the user's intent for how their markup should be handled.
  • Formats are versioned.
    • If we want to introduce new formatting behavior (like @mentions) or make changes that might have non-linear consequences (like switching from Text::Markdown to a different Markdown processor), we have to implement it as a new format (e.g. markdown1).
    • Old versions stick around in the HTML cleaner FOR ALL ETERNITY, so that we can continue to display content as it was originally intended.
    • This doesn't preclude making small changes for safety or consistency -- for example, putting a user tag inside a link always used to generate illegal link-within-a-link markup that we'd just ignore and let your browser sort out, but I recently made it so we strip the inner link. But anything that changes what a user would expect to get from their markup should cut a new format.
  • When posting or editing a comment or entry in the web UI, the form includes a drop-down for choosing which format you want to use.
    • (With a descriptive name, not weird IDs like html_casual1.)
    • We only offer users the "active" formats -- although we keep around legacy formats for displaying old content, we don't clutter the UI with them. Today, the active formats for non-syndicated posts would be markdown0, html_casual1, and html_raw0.
      • The exception is if you're editing older content that uses a legacy format. In that case, the menu includes the active formats PLUS the post's existing format, so you don't have to update all formatting within the post just to make a minor change.
    • We remember the last format you chose when making a new post, and use that as your default.
    • (new detail) If you deliberately change the format to something other than your current default, we give you the option to set that as your new default after saving your post/edit.
    • If we're introducing a new format, we might set it as active but mark it as beta, and keep the format it replaces for a while. When posting with a beta format, you're accepting the possibility that the markup behavior might retroactively change for that post, until the beta ends.
  • The selected format in a comment/entry form might change other behavior of the form.
    • Specifically, I'm thinking of markup helper buttons for the most common text styles, which is another back-burner project I'm working on -- the bold/italic/link buttons would add HTML tags for the HTML formats, and Markdown styles for Markdown.
    • Also thinking of the future replacement RTE here -- RTE would get its own named format (even though practically speaking it will probably emit something that looks like html_raw0), and switching your format to "Rich" would initialize the editor and its controls. Switching your format to something non-rich would serialize the editor's buffer as HTML and take you back to a plain textarea; we wouldn't need a separate control for switching to the RTE beyond the normal format drop-down that everything else uses.
    • ...I guess we could also add syntax highlighting or something for HTML and Markdown, but I don't think anyone's planning on it and I doubt anyone's desperate for it.
  • (new detail) When posting via email, we'll add a new post header for formatting. You can specify either a format ID (like html_casual0), or a shortcut ID to choose the most recent versions of casual HTML or Markdown. If you omit the header, it posts with the current global default for that type of content, ignoring your user default.
    • Why: Email is sort of halfway between a UI and an API. People who are automating email posts want to choose a stable and predictable format (so use IDs), but people who are writing in their real mail client probably want their posts to act like the version of the web form they're accustomed to. We'll ignore user defaults because your email client doesn't give you any feedback about what's going to happen, so we don't want it to depend on hidden settings state.
  • (new detail) When posting via the new API, you can specify a format ID or fall back to a global default. User default is ignored.
  • (new detail) Email and API posts aren't limited to the currently active formats; they can specify any format they want, including obsolete ones. This should help us avoid breaking automation that people build.
  • (new detail) When posting via the old XMLRPC API (so, old LJ clients like Semagic), you cannot select a named format. You'll be limited to today's formatting options (the "don't autoformat" checkbox if supported by the client, and the !markdown secret glyph), which will forever behave exactly like they do in mid 2020 (resulting in either markdown0, html_casual1, or html_raw0).
    • Why: Old client programs want things to keep working the way they've always worked... so that's what they'll get. And we don't want to add more secret glyphs, because they make things harder to maintain and understand.
  • (new detail) If a browser extension happened to mangle the web forms to re-enable obsolete formats, we wouldn't care. From our perspective, that's the same thing as an API client posting with a weird format, which is fine. (Removing obsolete formats from the menu is just about keeping things simple for most users.)
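The list above can be sketched as a small registry in Python. This is not the code in the pull request: the field names, the "family" shortcut IDs, the function names, and the global default value are all assumptions for illustration. Only the format IDs, their @mention behavior, and which ones are active come from the list.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class MarkupFormat:
    id: str         # format ID as listed above
    family: str     # hypothetical shortcut ID grouping versions of one format
    version: int
    mentions: bool  # whether @mentions become user tags
    active: bool    # offered in the drop-down for new content


_ALL = [
    MarkupFormat("markdown0", "markdown", 0, mentions=True, active=True),
    MarkupFormat("html_casual0", "html_casual", 0, mentions=False, active=False),
    MarkupFormat("html_casual1", "html_casual", 1, mentions=True, active=True),
    MarkupFormat("html_raw0", "html_raw", 0, mentions=False, active=True),
    # Unversioned in the list above; version 0 and mentions=False are assumptions.
    MarkupFormat("html_extra_raw", "html_extra_raw", 0, mentions=False, active=False),
]
FORMATS = {f.id: f for f in _ALL}


def menu_formats(existing_format: Optional[str] = None) -> List[str]:
    """Active formats, plus the post's own legacy format when editing."""
    ids = [f.id for f in FORMATS.values() if f.active]
    if existing_format in FORMATS and existing_format not in ids:
        ids.append(existing_format)
    return ids


def resolve_requested(value: Optional[str], global_default: str = "html_casual1") -> str:
    """Resolve an email/API format request: exact ID, shortcut, or default.

    Any known ID is accepted, including obsolete ones. The user's own
    default is deliberately not consulted. The default value is assumed.
    """
    if value is None:
        return global_default
    if value in FORMATS:
        return value
    versions = [f for f in FORMATS.values() if f.family == value]
    if versions:
        return max(versions, key=lambda f: f.version).id
    raise ValueError(f"unknown markup format: {value!r}")
```

For example, `menu_formats("html_casual0")` returns the three active formats plus `html_casual0`, and `resolve_requested("html_casual")` picks `html_casual1` today but would pick a later version once one exists.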

I THINK this will free us up to be much more nimble about modernizing the way we process text, and will increase user choice while ALSO making the site less confusing to use, which is a rare combo.

Questions: Does the approach make sense? Can you think of anything it would sabotage or prevent us from doing in the future? Can you see anything excessively complicated in the code itself? (By which I mean, can we make this simpler while still solving those two problems.)

Non-linear consequences

Here are the things I can think of that might be affected in weird ways by this change:

  • External clients and integrations. We'll be expecting new entries and comments to arrive with a format specified (using the new-ish editor property), so anything that posts content without that will be locked into the "guessed" format that fits their other options. (That's raw 0 if they set "don't autoformat," and casual 1 if they didn't.) Over time, that'll drift more and more out of date; when we add some new @mention-like thing that forces us to cut a casual 2, we're going to keep guessing casual 1 for posts with no metadata, because all of OUR stuff properly sets the format type.
    • TBH, I think this is 100% fine. Feel free to make a case for otherwise, tho.
  • All the documentation/FAQ pages about formatting stuff will be immediately out of date.
    • On it, don't worry.
  • Once we finally make Markdown discoverable, people might start requesting we add other oddball lightweight markup formats, like textile or RST.
    • I think the answer to that is basically "no?" It should be "no."
  • Another thing that might happen is that we start getting pressure to add some of the niceties of more modern Markdown implementations, like fenced code blocks and ascii-art tables.
    • Well, at least we'd be able to do that safely, by cutting a markdown1 and leaving existing content on markdown0. It's just a question of what we actually want our Markdown to act like, and THAT is going to be an exciting conversation.
  • Doubling down on the importance and centrality of the HTML cleaner, which is a highly complex thicket of code that not a lot of people feel comfortable working with.
    • I'm not aware of any plans to replace the cleaner, but this would definitely require them to be rethought. That said, we're always going to need SOME central thing that governs text transformations, and we could move the implementation to a different spot in the code if need be.
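The "guessed" format from the first bullet of this list (and the date-based fallback for old posts described in the proposal) could look roughly like this in Python. The cutoff date is a placeholder, since the post only says "mid-2019", and the precedence between the !markdown glyph and "don't autoformat" is an assumption; the function name is made up.

```python
from datetime import date
from typing import Optional

# Hypothetical cutoff: the exact day @mentions reached HTML content is not
# given in the post, so this value is only a placeholder.
MENTIONS_IN_HTML_SINCE = date(2019, 7, 1)


def guess_format(text: str, dont_autoformat: bool, posted: Optional[date] = None) -> str:
    """Pick a named format for content that arrived without one."""
    if text.startswith("!markdown"):
        return "markdown0"  # Markdown content has always had @mentions
    if dont_autoformat:
        return "html_raw0"
    if posted is not None and posted < MENTIONS_IN_HTML_SINCE:
        return "html_casual0"  # written before @mentions applied to HTML
    return "html_casual1"
```

Note that the last line never changes even after a casual 2 exists, which is exactly the drift described in that bullet.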

Please suggest more of these if you think of them.

[personal profile] azurelunatic 2020-06-05 10:47 pm (UTC)
This sounds like it will retain the obnoxious bug-in-practice that goes like:

* User goes to start an entry
* User picks a format at semi-random
* User posts
* User discovers that it was not the intended format
* User edits to pick intended format
[some time passes]
* User makes new entry
* Format of new entry retains the original, incorrect, randomly selected format
* User believes that last entry was posted with correct format, is annoyed that this one isn't
* User edits most recent entry
* Cycle repeats

[personal profile] azurelunatic 2020-06-06 05:19 pm (UTC)
I wonder if there are people who (nearly) always use it.

Stats on that sound like a Mark question.

[personal profile] fred_mouse 2020-06-07 09:25 am (UTC)
Before I discovered the markdown tag and @ mentions, I almost always used raw html. Because I'm that kind of old, who started doing their web pages in raw html, and still writes that way even when there is a wysiwyg editor. So it might be that introducing an option that makes explicit the options for markdown might decrease the frequency with which raw html gets used?

RTE

[personal profile] dennisgorelik 2020-06-06 03:40 am (UTC)
What is "RTE"?

Re: RTE

[personal profile] dennisgorelik 2020-06-06 04:59 am (UTC)
> ongoing project to replace it.

What do you plan to replace "Rich text editor" with?
Would it make sense to just delete it?

[personal profile] pauamma 2020-06-06 04:57 am (UTC)
I believe this (or perhaps a simplified version) should be posted to a user-centric community instead of/in addition to here, so that end users can opine.

[personal profile] chalcedony_starlings 2020-06-06 10:41 am (UTC)

I would like to mention tangentially, from a relatively technical end-user perspective, that the inconsistency in places is horribly grating even when I do know how to and often prefer to write raw HTML. Specifically, though I'm not sure where this fits into the above, the amount of <!--, newline, --> in our profile text because there's no way to disable having a <br> jammed into the middle of what would otherwise be a void between list items or something.

Also, I've gotten into the irritating habit of never writing newlines in comments (I'm doing it right now, in fact), because I have an Aversion to having </p><p> be rendered as <br><br>, and the “no, don't give me this comment form, give me the one which allows me to do that but also doesn't show context in a remotely ergonomic way etc.” alternative got too grating to use over time. I take it the Markdown conversion does better at this?

[personal profile] alexseanchai 2020-06-06 11:39 pm (UTC)
agreed on wanting p tags dividing paragraphs instead of br tags

Reduce complexity first?

[personal profile] dennisgorelik 2020-06-06 04:57 am (UTC)
Would it be possible to delete some features before adding new features?
E.g., is it possible to delete "Disable Auto-Formatting" option?

"don't autoformat," checkbox

[personal profile] dennisgorelik 2020-06-06 06:01 am (UTC)
> If you mean the checkbox labeled "don't autoformat,"

Where can I see "don't autoformat," checkbox?

Here:
https://www.dreamwidth.org/update.bml
I see "Disable Auto-Formatting" checkbox.

Re: "don't autoformat," checkbox

[personal profile] alyndra 2020-06-06 02:41 pm (UTC)
If you click reply to any comment and then click More Options, you will see “don’t auto-format.” But it really means the same thing as “disable auto-formatting.”

But to answer your original questions about if it is possible to get rid of some features and not have the RTE anymore, you might really like the Beta Create Entries page. You can toggle it on or off at dreamwidth.org/beta and if you click the Settings wheel there you can choose which options you want visible on the page when you’re making posts.
Edited 2020-06-06 14:46 (UTC)

Re: "don't autoformat," checkbox

[personal profile] dennisgorelik 2020-06-06 03:07 pm (UTC)
Why does Dreamwidth use different labels for naming the same thing?
The label on such checkbox should be "Don't auto-format:" or "Disable Auto-Formatting", but not both, right?
Or better yet - do not have that checkbox at all and activate such option through some form of markup, because such option is needed only for [rare] advanced usages.

Re: "don't autoformat," checkbox

[personal profile] azurelunatic 2020-06-06 05:13 pm (UTC)
They definitely should become consistent.

"Create Entries" Beta page

[personal profile] dennisgorelik 2020-06-06 03:19 pm (UTC)
I opened:
https://www.dreamwidth.org/beta
and clicked "Turn ON beta testing" button in "New Create Entries Page" section.
My first impression is that the Beta version of:
https://www.dreamwidth.org/entry/new
looks better than the standard version.
It is cleaner and customizable.
I still need to actually use it (post blog entry) to decide if it is, actually, an improvement.

"Create Entries" Beta page may lose changes if I collapse "Settings" without saving. Which is a little bit counterintuitive. But it is a nitpick at this time.

Thank you.

Further cleanup of "Create Entries" Beta page

[personal profile] dennisgorelik 2020-06-06 03:52 pm (UTC)
"Spell check" button is no longer needed, because browsers' support for spellchecking is good enough already.
If a significant number of Dreamwidth users really need "Dreamwidth spellchecking", make the "Spell check" button "hideable"/optional.

This is not a critical change, of course, but as with any other cleanup, fewer features make the product easier to use and easier to maintain.

Re: Further cleanup of "Create Entries" Beta page

[personal profile] azurelunatic 2020-06-06 05:17 pm (UTC)
Spell check is planned for removal for all the good reasons you mentioned.

Re: Further cleanup of "Create Entries" Beta page

[personal profile] dennisgorelik 2020-06-06 05:30 pm (UTC)
Cool.
I also generally prioritize feature removal over feature addition, because removing features makes the codebase much easier to operate.
"Delete feature A and then add feature B" is easier than "Add feature B and then delete feature A".
Edited 2020-06-06 17:31 (UTC)

Re: Further cleanup of "Create Entries" Beta page

[personal profile] azurelunatic 2020-06-06 05:43 pm (UTC)
From the perspective of maintaining the code, definitely.

From the perspective of having a working website, the deletion of some things can't go to production until the replacement is ready.

Re: Further cleanup of "Create Entries" Beta page

[personal profile] dennisgorelik 2020-06-06 06:28 pm (UTC)
> until the replacement is ready

"Spellchecker replacement" was ready and deployed many years ago.

Re: Further cleanup of "Create Entries" Beta page

[personal profile] alexseanchai 2020-06-06 08:08 pm (UTC)
Create Entries replacement was not ready any length of time ago, because the rich text editor is not ready yet.

Re: Further cleanup of "Create Entries" Beta page

[personal profile] dennisgorelik 2020-06-06 08:18 pm (UTC)
I meant that "Spellchecker" feature, probably, should have been deleted earlier.
I did not mean that Create Entries replacement should be implemented after deleting old version of Create Entries page.
I find the current approach, with both versions active during testing and transition, reasonable.

Re: Reduce complexity first?

[personal profile] dennisgorelik 2020-06-06 06:05 am (UTC)
> If you mean taking away the ability to do raw HTML without getting your linebreaks mangled

I do not mean that.
Advanced formatting (including a command to preserve raw HTML) could be done with advanced syntax in the post text.

[staff profile] mark 2020-06-06 07:15 am (UTC)
This is thorough and thought provoking.

As the person primarily responsible for much of the mess we're in today, the driving goal behind Markdown and @-mentions is to make Dreamwidth easier to use and more accessible (as a technology). Having to learn to write HTML is not something one should _have_ to do in order to express oneself, and so I strongly feel that we should be moving in directions to make it easier.

That said, we are of course built on the idea of customization and being able to really go deep in what kinds of things you can do to your content. That has many positive things, but ultimately, LJ::CleanHTML is the kind of thing you end up with when you want to make real ultimate cosmic power but also combine it with a reading page that has to combine _my_ journal customization insanity with _your_ entry customization insanity. God help us all.

So here we are. You're untangling my hacks, and I appreciate that.

I think basically I agree with your proposal and what you're thinking here, and I concur that we should draw the line at Markdown (and possibly some additions to it, since the Perl Markdown renderer is pretty minimal).

Ultimately tho, it doesn't look like this will really _change_ user behavior/experience? They will see a new dropdown/UI change that lets them explicitly select their formatting, which will save itself (but I saw the comments above, which is a good point to try to make sure we don't confuse/annoy users by changing their defaults unexpectedly).

One thing that wasn't mentioned (probably because it's obvious, but I'll say it anyway so it's documented): crossposting doesn't change here, right? We just render the entry out as HTML, pass it raw to the target site, and ask them not to mangle it. That should work just fine.

(As an aside, something I've been pondering as I've looked at LJ::CleanHTML in the past is if we could drastically simplify it with some modern technologies. It does a lot of work to try to block things like scripts, but possibly we can just use CSP or such to block those by policy and then simplify? I suppose this doesn't fix a lot of the "HTML fixing" we do though, so ultimately it probably doesn't actually simplify the cleaner by any material amount. Oh well.)
Edited (yes I got savaged by my own feature why) 2020-06-06 07:15 (UTC)

[personal profile] alexseanchai 2020-06-06 05:25 pm (UTC)
> if we, e.g., ended up with an RTE that had its own integrated "view source" mode, we could end up in a nested hierarchy, where I chose rich formatting but I'm viewing/editing HTML code as part of the implementation of "rich."

Why so? Like, on Tumblr, I can compose a post in rich text mode complete with (say) italics, switch to HTML mode and find the <i> tags already there, add (say) blockquote tags around a couple paragraphs, and switch back to rich text, and the blockquotes will persist. Might be mangled some, especially if I switch back and forth several times. (no, Tumblr, I want a single blockquote containing multiple paragraphs! this is a thing you have done before! why are we breaking the blockquote up by paragraph with <div> tags??) And I don't have the first idea how any such thing would be implemented on Dreamwidth, or to what extent its implementation on Tumblr might rely on other Tumblr functionality that Dreamwidth doesn't and doesn't plan to have. But it can be done.

[personal profile] pinterface 2020-06-06 05:56 pm (UTC)

I agree, [personal profile] roadrunnertwice has done a really nice job thinking this through.

> [LJ::CleanHTML] does a lot of work to try to block things like scripts, but possibly we can just use CSP or such to block those by policy and then simplify?

It might just be my age and remembering a time before CSP, but I'd pretty strongly recommend against relying on CSP alone. CSP is great as a backup for when something gets through, but there are a number of cases where even a perfectly-specified CSP is not going to work very well:

  • Any browser that doesn't understand CSP. (Older browsers, at least.)
  • Cross-posting. Gotta clean stuff before shipping it off, both so DW doesn't look like an attacker to any automated systems, and because relying on the other end to be secure seems like a not-great idea.
  • Non-HTML contexts. Presumably you'd still want to clean things before they end up in RSS, for instance, so JS doesn't end up in somebody's feed reader.

More generally, I'd argue that relying on CSP would move security issues from DW's control to third-party control (browser vendors, cross-post targets, e-mail clients, feed readers), which is a lot of faith to put in other people getting security right. Yes, use CSP! But in conjunction with existing security measures, not as a replacement.

It might be possible to simplify things by layering them—one layer to do input transformations, and another for security transformations, say—but that would require very careful design if input transformations need to produce things that would be rejected by the security layer (so no reserved HTML classes or ids, at least without shenanigans like using a random nonce or signed token that gets replaced with the reserved value).
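As a rough illustration of that layering idea, here is a minimal Python sketch with one input-transformation layer feeding a separate allowlist layer. Everything in it is an assumption made up for this example (the tag allowlist, the class and function names, the single line-break transform). It is not how LJ::CleanHTML works and it is nowhere near a complete sanitizer.

```python
from html import escape
from html.parser import HTMLParser

ALLOWED_TAGS = {"a", "b", "i", "br", "p"}  # hypothetical allowlist


class AllowlistSanitizer(HTMLParser):
    """Security layer: keep allowlisted tags, drop every other tag."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag not in ALLOWED_TAGS:
            return
        kept = ""
        if tag == "a":
            # Only href survives, and only with an http(s) scheme.
            for name, value in attrs:
                if name == "href" and value and value.startswith(("http://", "https://")):
                    kept = f' href="{escape(value, quote=True)}"'
        self.out.append(f"<{tag}{kept}>")

    def handle_endtag(self, tag):
        if tag in ALLOWED_TAGS and tag != "br":
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        # Text inside disallowed elements is kept, but only as escaped text.
        self.out.append(escape(data, quote=False))


def input_layer(raw: str) -> str:
    """Input transformation layer: here, just line breaks."""
    return raw.replace("\n", "<br>\n")


def security_layer(html_text: str) -> str:
    parser = AllowlistSanitizer()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)


def render(raw: str) -> str:
    return security_layer(input_layer(raw))
```

The design constraint shows up directly: anything the input layer emits has to be something the security layer would also accept, so `input_layer` here can only produce tags that are in `ALLOWED_TAGS`.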

[personal profile] hitchhiker 2020-06-06 09:28 pm (UTC)
i took a look at LJ::CleanHTML and the code was way nicer than i was expecting from these comments :) then again, i just read it rather than trying to modify it - are there specific things that people find hard to work with in practice?

[personal profile] azurelunatic 2020-06-09 10:51 pm (UTC)
(as an observer only)

Part of the cognitive load is the knowledge that if you screw up, there's the potential for both known and as yet unknown nastiness to get through.

[personal profile] alyndra 2020-06-06 09:08 am (UTC)
When it sets your default for you, would that apply to posts or comments you email in, too? Or would you still have to identify if you want markdown at the top of your email?

I think editing a post to a different format should also set that format as your default, if we’re going with using a format sets it as default. I think if anything there should be a ticky (possibly on an interstitial page) asking to check if the user doesn’t want it set as the new default.

Otherwise just don’t set it automatically at all anywhere, and always include a ticky for making it your default. Those are the two options I see. I don’t see a compelling reason for the editing page to behave differently than post or comment.

[personal profile] annathecrow 2020-06-06 01:46 pm (UTC)
As for "any closely related problems", I'm wondering how it would behave when you switch between formats? Currently, when switching between RTE and raw HTML, it sometimes "bakes in" line breaks, adding <br> tags where you didn't want them, even repeatedly. (I don't have specific reproducible steps but I'm fairly certain it wasn't just a user error.)

Otherwise, I think this is a very neat solution to the problem and can't see any obvious issues with it.

(Also, yay for markdown in comment forms! I'm not very taken with the current DW markdown flavor, but it still beats writing HTML markup by hand.)

html_casual0

[personal profile] eftychia 2020-06-06 07:51 pm (UTC)
I might be the only one affected by this, but ... "only offer users the "active" formats -- although we keep around legacy formats for displaying old content, we don't clutter the UI with them. Today, the active formats for non-syndicated posts would be markdown0, html_casual1, and html_raw0." -- I use a shell script to crosspost between DJ, IJ, & DW (I was using a command-line client for all three but it stopped working with DW so I switched to the script using post-by-mail for here), which means I'm basically using coding without @mentions, and have been using @ to mark Twitter usernames in attributions for quotations. So the ability to select html_casual0, especially in post-by-email, would be useful to me. (I've currently got the usual way Twitter IDs show up in my entries special-cased in the script
:%s/(@\([^ ]*\))/(@\1.twitter)/g

but sometimes they're not in quite that context and I have to hand-edit my entry after it's posted.)
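For readers who don't use Vim, the substitution quoted above is roughly equivalent to the following Python. This is only a translation of that one command for illustration, not the actual shell script, and the function name is made up.

```python
import re


def tag_twitter_handles(text: str) -> str:
    """Rewrite "(@name)" as "(@name.twitter)", like the Vim command above.

    Only handles wrapped in parentheses are touched, which is why handles
    in other contexts still need hand-editing.
    """
    return re.sub(r"\(@([^ ]*)\)", r"(@\1.twitter)", text)


print(tag_twitter_handles("a good quote (@someone)"))
```

A bare "@someone" outside parentheses is left alone, so under a format with @mentions it would still be treated as a local user.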

Basically, I'm trying to post as though it were syndicated, despite not using a syndication API. So it'd be nice if I had access to a syndication-friendly input format.

(Also, is there a way to specify "do not autoformat" in post-by-email? My script currently tries to reformat from html_casual0 into something suitable for allow-autoformat.)

The other reason I'd like to keep an html_casual0 option is that it's easier for me to edit in HTML with line breaks in my source, but specify presentation linebreaks with <p> and <br />, which is what I used to get with "clive -p". So my script'd be more reliable if I can strip out the bit where it attempts to remove existing linebreaks and replace <p>..</p> and <br /> with new linebreaks.

I have no idea how many other people are still using command-line clients to post entries though. (BTW, if anybody's hacked Clive to work with DW again like it used to, and are willing to share your diffs, I'd be grateful. I was using Clive 0.4.9 when the DW API changed just enough to break it, but looking at the Changelog for Clive 0.4.10 it doesn't look like any DW support was added to it.)
runpunkrun: Pride flag based on Gilbert Baker's 1978 rainbow flag with hot pink, red, orange, yellow, sage, turquoise, blue, and purple stripes. (Default)

[personal profile] runpunkrun 2020-06-06 10:23 pm (UTC)(link)
I read this, and understood some. Here's my contribution: Because I am a monster, and because DW allows it, I'll often specify !markdown in a post and then go ahead and use an unholy mix of HTML and Markdown. Would that mixed usage--perhaps for older posts only--be covered by one of your format modules?

Also, I use Semagic to post to DW, so I wouldn't be able to select a format for each post. Instead of having the site guess what format to use, would it be possible to create a "!markdown"-like command to put at the head of a post to indicate the format? Or maybe have an option in settings where a user can choose a default format for new posts.
runpunkrun: Pride flag based on Gilbert Baker's 1978 rainbow flag with hot pink, red, orange, yellow, sage, turquoise, blue, and purple stripes. (Default)

[personal profile] runpunkrun 2020-06-06 11:21 pm (UTC)(link)
Got it! Thanks for the response!
fred_mouse: line drawing of sheep coloured in queer flag colours with dream bubble reading 'dreamwidth' (Default)

[personal profile] fred_mouse 2020-06-07 09:22 am (UTC)(link)
I've read this (and bits of the git repository you linked to) and while I have not noticed any issues, I'm not likely to, because I don't understand enough of the code I'm reading! I am, however, very willing to do some testing and see if I can break things, if there is a way for me to do so.
fred_mouse: line drawing of sheep coloured in queer flag colours with dream bubble reading 'dreamwidth' (Default)

[personal profile] fred_mouse 2020-06-11 11:52 am (UTC)(link)
Cool! Thanks. I've managed to at least get the site to load (mumble mumble safari mumble won't open http mumble mumble); will have to investigate further when I find a spoon or two.
jducoeur: (Default)

[personal profile] jducoeur 2020-06-07 06:25 pm (UTC)(link)
I don't know the code (haven't touched Perl seriously in 15 years), but as a techie user who wrestles with similar problems on my own site: LGTM. In particular, it sounds like it gets me what I *personally* want, which is to be able to use Markdown consistently without needing the glyph...
squirrelitude: (Default)

[personal profile] squirrelitude 2020-06-26 09:52 pm (UTC)(link)
This looks great! Couple of questions:

----

1) What happens if you find a security issue in an older formatter and need to overhaul or replace the implementation? (Perhaps the implementation is so broken it *can't* be fixed, and needs replacement.)

Can DW reasonably do a search to sample for posts and comments where the change makes a meaningful difference in output? And what happens if there's a discrepancy?

Technically, your proposal wouldn't change this situation all that much, since DW *already* has this problem. But it might offer more solutions. For example, you could "clone" markdown1 as markdown1_pre_20210101, set old markdown1 posts to use it, and fix markdown1 in place. If new posts are *forbidden* from using markdown1_pre_20210101, and editing an old post bearing that formatter switches it to markdown1 so that people can't use them to host exploits, then you could have your cake and eat it too...

This would be more complexity, and I'm not suggesting it's a great idea, but I think it's worth considering how that might play out. Old libraries sometimes have to be entirely replaced, and those corner cases can be *hard* to replicate.
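
To make the "clone, freeze, fix in place" idea concrete, here is a minimal sketch. Everything in it is hypothetical (the `Format` record, the registry dict, the helper names, posts as plain dicts); it says nothing about how DW's format modules are actually structured.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Format:
    render: Callable[[str], str]
    allowed_for_new: bool = True      # may new content select this format?
    successor: Optional[str] = None   # format an edited post gets moved to


def freeze_and_fix(formats: Dict[str, Format], name: str, frozen_name: str,
                   fixed_render: Callable[[str], str],
                   posts: List[dict]) -> None:
    """Keep the old renderer under frozen_name and fix `name` in place."""
    old = formats[name]
    formats[frozen_name] = Format(old.render, allowed_for_new=False,
                                  successor=name)
    formats[name] = Format(fixed_render)
    # Existing content keeps rendering exactly as before.
    for post in posts:
        if post["format"] == name:
            post["format"] = frozen_name


def format_after_edit(formats: Dict[str, Format], current: str) -> str:
    """Editing a post on a frozen format bumps it to the fixed one."""
    successor = formats[current].successor
    return successor if successor is not None else current
```

With this shape, a frozen format can still display old posts but can never be chosen for new text, and any edit moves the post off it.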

(Another possibility is rendering all the old content to HTML in order to freeze it, and storing that alongside the source markup.)

----

2) You write about the scenario where someone edits an old post:

« The exception is if you're editing older content that uses a legacy format. In that case, the menu includes the active formats PLUS the post's existing format, so you don't have to update all formatting within the post just to make a minor change. »

What happens if I edit an old post and change the format selector to a newer format, and realize I've made a mistake and actually need it to go back to the deprecated format? Will the selector still show the *previous/original* format?
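
For reference, this is how I'm reading the menu rule in that quote, as a small sketch. The function name is made up, the active list is the one named in the post, and I'm assuming the menu only knows about the format currently saved on the post.

```python
# Active formats for non-syndicated posts, as listed in the post.
ACTIVE_FORMATS = ["markdown0", "html_casual1", "html_raw0"]


def format_menu(current_format=None):
    """Active formats, plus the post's existing format if it is a legacy one."""
    menu = list(ACTIVE_FORMATS)
    if current_format is not None and current_format not in menu:
        menu.append(current_format)
    return menu


if __name__ == "__main__":
    print(format_menu("html_casual0"))  # legacy format is still offered
    print(format_menu("markdown0"))     # after switching and saving, it is gone
```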
squirrelitude: (Default)

[personal profile] squirrelitude 2020-06-29 12:07 pm (UTC)(link)
« As written (in the newly updated version of this PR), no, you'd be outta luck. [...] The consequences of oopsing it on one post won't be very large. [...] Anyway, I think I feel fairly comfortable leaving that as a problem to solve if-and-when. »

Reasonable!

« Are you trying to curdle all the milk in my fridge, or what. 😅 »

Haha, I'm adding this to the "quotes" section of my résumé. ;-)

« But I think the most likely scenario would be that, as a matter of ops policy, mark and co would want to get vulnerable code OUT of the site in as complete a way as possible, even if we were pretty sure there were no sploits hiding in old posts. »

And you might have to, e.g. if a software package becomes no longer feasibly available for the server environment.

Mismatch detection could be done with a dark launch. Run both formatter versions on the source, only use the output of the old one, and log the IDs of posts and comments where the mismatch occurred (and whether exceptions were caught on the new one). No need for a full DB scan, at least at first.
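
A minimal sketch of that dark launch, assuming both formatters are plain functions from source text to HTML. The function name and the in-memory mismatch list are placeholders; a real version would presumably write to whatever logging DW already has.

```python
import logging

log = logging.getLogger("formatter_dark_launch")


def render_with_shadow(item_id, source, old_render, new_render, mismatches):
    """Serve the old formatter's output; record where the new one differs."""
    output = old_render(source)
    try:
        shadow = new_render(source)
    except Exception as exc:
        # The new formatter must never break page rendering.
        mismatches.append((item_id, "exception"))
        log.warning("new formatter raised on %s: %r", item_id, exc)
        return output
    if shadow != output:
        mismatches.append((item_id, "mismatch"))
    return output
```

Only IDs and a reason are recorded, so the log itself contains no post content.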

Tangent: For random sampling, I'm imagining you could even do a privacy-preserving transform on source and output HTML, where you replace all letters with "a" and all digits with "1" to allow you to look at non-public content where a mismatch was detected. (Sufficient? Acceptable by DW internal standards? No idea!)
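
Roughly what I have in mind for the masking, as a sketch (the function name is made up):

```python
def mask(text: str) -> str:
    """Replace every letter with "a" and every digit with "1".

    Punctuation, whitespace, and therefore the overall shape of the
    markup are left untouched.
    """
    return "".join(
        "a" if ch.isalpha() else "1" if ch.isdigit() else ch
        for ch in text
    )


if __name__ == "__main__":
    print(mask("<b>Hi 42!</b>"))  # <a>aa 11!</a>
```

Note that tag and attribute names get masked too, so a real version might want to exempt them in order to keep the diff readable.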

Anyway, sounds like there are plenty of options and no decision is needed ahead of time. Agree that fix-forward is probably fine.